WS-(un)ReliableMessaging and the US Postal Service

Sunday, November 24, 2013
posted by daveb
Reliable messaging - sorry but it's not

Reliable messaging is the concept that I can send you a message, and even though the channel over which I’m sending it (let’s say, for argument’s sake, the US Postal Service) is not 100% reliable (the odd letter goes missing), I can still say what I need to say to you, and you can say what you need to say to me.

How it works is that when I send you a letter, you send one back to me as soon as you get it – a receipt if you like. I might send you another 2 or 3 letters, but I’m always keeping track of the receipts you send back, and if I find I’m missing a receipt, after a while, I’ll re-send that letter. I keep a copy of all letters I’ve sent you, just in case one goes missing and I need to re-send it. And with the re-sent letter, I’ll similarly track the receipt, and even re-send again, until you get it.
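The letter-and-receipt routine above can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions: the function name `send_with_receipts`, the `loss_rate` parameter, and the idea that both letters and receipts are lost with the same probability are all made up for illustration.

```python
import random


def send_with_receipts(letters, loss_rate=0.3, max_rounds=50, seed=1):
    """Re-send each letter until its receipt comes back (toy simulation)."""
    rng = random.Random(seed)            # seeded so a run is repeatable
    unacked = dict(enumerate(letters))   # copies I keep until receipted
    delivered = {}                       # what you actually hold
    rounds = 0
    while unacked and rounds < max_rounds:
        rounds += 1
        for number, text in list(unacked.items()):
            if rng.random() < loss_rate:
                continue                 # the letter went missing
            delivered[number] = text     # you got it (possibly again)
            if rng.random() < loss_rate:
                continue                 # the receipt went missing
            del unacked[number]          # receipt arrived, drop my copy
    return delivered, unacked, rounds
```

Note that a lost receipt causes a letter you already hold to be sent again, so the recipient has to tolerate duplicates.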

This, by the way, is essentially what goes on in TCP, the transport protocol of the TCP/IP suite and one of the key protocols of the Internet. And it works very well. We have an unreliable network (the Internet), and yet we can give some level of assurance that data gets to where it needs to go if we use TCP/IP.

The key difference between the way TCP/IP works and the way the letter exchange example works is this: in the case of TCP/IP, we have a protocol stack that consists of layers on top of other layers, each unaware of the others, and each having a distinct responsibility. TCP is a “transport” layer protocol: its job is to transport data between two endpoints, and conceal all the tricky ACKs and re-transmission stuff from the upper layers of the protocol stack. It delivers a useful service to the upper protocol layers.

This all sounds good, until you realise that TCP/IP is not sufficient to guarantee message delivery between applications over the network. Why not? Let’s say my application sends your application an electronic message over TCP/IP. Your TCP/IP stack gets my data and ACKs it, I get your ACK and, as far as TCP is concerned, its job is done. We can even close the connection. Then the unthinkable happens: your application crashes and loses my message before it has acted on it. I will never know you’ve lost the data, and yet I have the receipt, giving me a completely false sense of security that the data got to you.

What went wrong here? Why did the seemingly robust process break down? To answer this we have to go back to the letter exchange example. But now, instead of us being aware that there is a protocol required to overcome the US Postal Service’s inherent unreliability, we instead use “Reliable Post Inc”, a competitor to the US Postal Service. “Reliable Post Inc” implements a re-transmission and receipting mechanism for us, to take away all that annoying copying, receipting and re-transmission stuff. Now let’s say “Reliable Post Inc” arrives at your mailbox with the letter, delivers it, then issues me the receipt. But in an unfortunate accident your letterbox is burnt to the ground in a Guy Fawkes prank before you can collect your mail. You never get the letter. I have your receipt (because “Reliable Post Inc” sent it to me once they had dropped the letter into your box), so I have this false sense of security that you have it. What “Reliable Post Inc” should have done is wait for you to tell them that you had received and read the message. Only then should they have sent me the receipt. But this is annoying and involves you in the receipting process, and the whole idea of outsourcing delivery to “Reliable Post Inc” was so that we didn’t have to think about that.

So now we’ve brought the letter analogy back in line with TCP/IP, and what we’ve discovered is that, if we really want 100% reliability, we cannot simply outsource it to another party, because as soon as we do that, there’s this weak point where the message is handed over between us and the underlying service. True reliability is just something you and I are going to have to be aware of, and have a protocol in place to deal with.
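To make the hand-over weak point concrete, here is a rough sketch in which the sender ignores the transport-level receipt and only trusts a receipt that the recipient issues after it has actually processed the message. The `Recipient` class, its method names, and `send_until_app_ack` are hypothetical names invented for this illustration; nothing here is a real library API.

```python
class Recipient:
    """Hypothetical receiving application with a letterbox that can burn."""

    def __init__(self):
        self.inbox = []       # dropped off by the carrier, not yet read
        self.processed = []   # messages the application has acted on

    def transport_deliver(self, message):
        self.inbox.append(message)
        return "transport-ack"        # all that the carrier can promise

    def crash(self):
        self.inbox.clear()            # unread mail is lost

    def process_next(self):
        if not self.inbox:
            return None
        message = self.inbox.pop(0)
        self.processed.append(message)
        return ("app-ack", message)   # issued only after the work is done


def send_until_app_ack(recipient, message, crash_on_first_try=True, max_tries=5):
    """Re-send until the application itself confirms; return tries used."""
    for attempt in range(1, max_tries + 1):
        recipient.transport_deliver(message)   # transport-ack is ignored
        if crash_on_first_try and attempt == 1:
            recipient.crash()                  # lost after the carrier's receipt
        if recipient.process_next() == ("app-ack", message):
            return attempt
    raise RuntimeError("no application-level receipt received")
```

In this sketch the first delivery is receipted by the carrier and then lost in the crash; only the second attempt yields an application-level receipt.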

Enter WS-ReliableMessaging. I won’t explain how it works, because it’s basically like TCP/IP, but at a higher layer up the WS-* (SOAP/XML based) protocol stack. Which raises the immediate question: if TCP/IP (over which most SOAP messages ultimately find themselves being transmitted) didn’t give us reliable messaging, how exactly is another layer, which does exactly the same thing, going to achieve it?

Of course the answer is, it doesn’t, for the exact same reason TCP/IP doesn’t: you can’t completely outsource reliability to another party.

WS-ReliableMessaging can improve reliability for a certain class of message failure, but don’t let it fool you into thinking you have 100% reliability: you’re still going to have to develop your own protocol to deal with failure after the message has been receipted. This makes reliability a bona fide business concern.

WS-ReliableMessaging makes the following claims which it calls Delivery Assurances:

  1. Messages arrive at least once
  2. Messages arrive at most once
  3. Messages arrive exactly once
  4. Messages arrive in the order they were sent

Item 1 can be better resolved as a business concern, as we have seen. Item 2 can be handled at the business level by making message interactions idempotent. Item 3 is simply the intersection of 1 and 2.
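As a minimal sketch of what “idempotent at the business level” might look like, the receiver below remembers the identifiers of messages it has already applied and treats repeats as no-ops. The `Account` class and the `message_id` field are assumptions made for the example, not part of any WS-* specification.

```python
class Account:
    """Hypothetical business object that tolerates duplicate messages."""

    def __init__(self):
        self.balance = 0
        self.seen_ids = set()   # identifiers of messages already applied

    def apply_deposit(self, message_id, amount):
        """Apply a deposit once; a repeated message_id changes nothing."""
        if message_id in self.seen_ids:
            return False        # duplicate delivery, safely ignored
        self.seen_ids.add(message_id)
        self.balance += amount
        return True
```

With a handler like this, the sender is free to re-send whenever it is in doubt (at least once), and the business effect still happens only once.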

Item 4 is about order, and this is interesting. Marc de Graauw explains the relationship between the order of operations and the business layer in his InfoQ article:

The first strange thing is that apparently the order is a property of messages which is important to the business layer. So if it is important to the business layer, why isn’t there a sequence number in the business message itself? We have a message, with its own business-level semantics, and the order is important: so why isn’t there some element or attribute in the message, on a business level, which indicates the order?

- source: http://www.infoq.com/articles/no-reliable-messaging
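A sequence number carried in the business message itself could be used along the following lines. This is an illustrative sketch only: the `seq` field, the convention that numbering starts at 1, and the function name `find_gaps_and_order` are my assumptions, not something taken from the InfoQ article.

```python
def find_gaps_and_order(messages):
    """Return (messages sorted by seq, list of missing seq numbers)."""
    by_seq = {m["seq"]: m for m in messages}   # a later duplicate replaces an earlier one
    if not by_seq:
        return [], []
    highest = max(by_seq)
    # Anything between 1 and the highest number seen should have arrived.
    missing = [n for n in range(1, highest + 1) if n not in by_seq]
    ordered = [by_seq[n] for n in sorted(by_seq)]
    return ordered, missing
```

A receiver like this can restore the order and chase up the gaps itself, although it cannot tell whether messages numbered above the highest one seen are still outstanding.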

To back up my point that true reliability is a business as opposed to a protocol concern, I thought I’d share a transcript of an interview between Carl Franklin, Richard Campbell (from the Dot Net Rocks podcast) and their guest Jim Webber. Jim Webber is a contributor to some of the WS-* standards that come out of OASIS. This is from 29th April 2008, but still very relevant today.

Jim Webber: The reliable messaging stuff is actually relatively straightforward too. It’s a protocol where we tag message sequence metadata into messages and recipients of messages may notice when they are missing a sequence number or two and they can ask for retransmission. The protocol itself is relatively straightforward. Look for gaps in numbers and ask those gaps to be filled in if you find they’re missing. There are some subtleties around how to build that. For example if I’m the sender of the message, I have to hold onto that message until I’m quite sure that it has been ACK’ed by the recipient because I may be asked at any point until I have been ACK’ed to retransmit. But ultimately this stuff is not really too dissimilar from the kind of stuff that goes on way down the stack in TCP.

Richard Campbell: I mean the concepts are pretty straightforward. It’s just how we’re going to recover from a message that never showed up.

Jim Webber: Absolutely. So the irony of reliable messaging is that it is not reliable messaging. It can cover up glitches.

Richard Campbell: It is recoverable messaging.

Jim Webber: It is somewhat recoverable. So it would cover the odd glitch where a message or two goes missing but the minute that a meteorite smashes through your data center no amounts of reliable messaging on the planet is going to help you recover immediately from that catastrophe.

Richard Campbell: But the outside world is going to get a clear notice that their messages didn’t get delivered.

Jim Webber: Absolutely. Although I think WS-ReliableMessaging and friends have some validity, I actually think a much more robust pattern is to make your services aware of the protocols to which they conform. I know that sounds really lofty but what I actually mean is if you write your services so that they are message centric and so that they understand that message A is followed by message B or followed by message C or message D then those services know when something has gone wrong and they can be programmed robustly to react to the failure of a message. The problem with the WS-ReliableMessaging and forgive me to my friends who are involved writing some of those specs, but the problem is they encourage some sloppy thinking on the part of the service developer again.

If you take WSDL and WS-ReliableMessaging the appealing thought is OK I’m reliable now. I don’t need to worry about the protocol that my service works with. I just can do this RPC style thing and the reliable messaging protocol will take care of any glitches, which is only true up to a point and when you get an actual programmatic failure which WS-ReliableMessaging can’t mask, it leaks at a really inopportune moment and cripples your service. Although I can actually see the utility [of WS-ReliableMessaging], when I’m building services I tend to avoid it because I want my services to know when there should be a message there for them and to take proactive action, as it is more robust that way, to chase down those messages when they don’t arrive.

The full transcript of the interview is available, along with the audio from Dot Net Rocks.

References:

Nobody needs reliable messaging

END-TO-END ARGUMENTS IN SYSTEM DESIGN (Saltzer, Reed and Clark)

WS-ReliableMessaging Wikipedia article

WS-ReliableMessaging OASIS Standard

 



One Response to “WS-(un)ReliableMessaging and the US Postal Service”

  1. Evan says:

    This is an excellent summary of reliable messaging. Great work.
