Archive for November, 2013

WS-(un)ReliableMessaging and the US Postal Service

Sunday, November 24, 2013
posted by daveb

Reliable messaging – sorry but it’s not

Reliable messaging is the concept that I can send you a message, and even though the channel over which I’m sending it (let’s say for argument’s sake, the US Postal Service) is not 100% reliable (the odd letter goes missing), I can still say what I need to say to you, and you can say what you need to say to me.

How it works is that when I send you a letter, you send one back to me as soon as you get it – a receipt if you like. I might send you another 2 or 3 letters, but I’m always keeping track of the receipts you send back, and if I find I’m missing a receipt, after a while, I’ll re-send that letter. I keep a copy of all letters I’ve sent you, just in case one goes missing and I need to re-send it. And with the re-sent letter, I’ll similarly track the receipt, and even re-send again, until you get it.

This, by the way, is exactly what goes on in the TCP/IP protocol, one of the key protocols of the Internet. And it works very well. We have an unreliable network (the Internet), and yet we can give some level of assurance that data gets to where it needs to go if we use the TCP/IP protocol.
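The receipt-and-re-send scheme described above can be sketched in a few lines of Python. This is only an illustration of the letter exchange, not of how TCP is actually implemented: the `LossyPost` class, the `send_reliably` function and the 30% loss rate are all made up for the example.

```python
import random


class LossyPost:
    """Hypothetical unreliable channel: each item is lost with probability `loss`."""

    def __init__(self, loss, seed=0):
        self.loss = loss
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def deliver(self, item):
        # Returns the item, or None if it went missing in transit.
        return None if self.rng.random() < self.loss else item


def send_reliably(letters, post, max_rounds=50):
    """Re-send every letter until its receipt comes back (or we give up)."""
    unreceipted = dict(enumerate(letters))  # copies kept until receipted
    received = {}                           # the recipient's pile, by letter number
    rounds = 0
    while unreceipted and rounds < max_rounds:
        rounds += 1
        for number, text in list(unreceipted.items()):
            arrived = post.deliver((number, text))
            if arrived is None:
                continue                    # letter lost; try again next round
            received[arrived[0]] = arrived[1]   # a duplicate just overwrites itself
            receipt = post.deliver(arrived[0])  # the receipt can go missing too
            if receipt is not None:
                del unreceipted[receipt]    # receipt in hand: discard our copy
    return [received[n] for n in sorted(received)], rounds


delivered, rounds = send_reliably(
    ["hello", "how are you?", "goodbye"], LossyPost(loss=0.3)
)
print(delivered, "after", rounds, "round(s)")
```

Note that the recipient may see the same letter more than once when a receipt is lost, which is why the numbering matters.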

The key difference between the way TCP/IP works and the way the letter exchange example works is this: in the case of TCP/IP, we have a protocol stack that consists of layers on top of other layers, each unaware of the others, and each having a distinct responsibility. TCP is a “transport” layer protocol – its job is to transport data between 2 endpoints, and conceal all the tricky ACK’s and re-transmission stuff from the upper layers of the protocol stack. It delivers a useful service to the upper protocol layers.

This all sounds good, until you realise that TCP/IP is not sufficient to guarantee message delivery between applications over the network. Why not? Let’s say my application sends your application an electronic message over TCP/IP. Your TCP/IP stack gets my data and ACK’s it, I get your ACK and, as far as TCP is concerned, its job is done. We can even close the connection. Then the unthinkable happens: your application crashes and loses my message. I will never know you’ve lost the data, and yet I have the receipt, giving me a completely false sense of security that the data got to you.

What went wrong here? Why did the seemingly robust process break down? To answer this we have to go back to the letter exchange example. But now, instead of us being aware that there is a protocol required to overcome the US Postal Service’s inherent unreliability, we instead use “Reliable Post Inc”, a competitor to the US Postal Service. “Reliable Post Inc” implements a re-transmission and receipting mechanism for us, to take away all that annoying copying, receipting and re-transmission stuff. Now let’s say “Reliable Post Inc” arrives at your mailbox with the letter, delivers it, then issues me the receipt. But in an unfortunate accident your letterbox is burnt to the ground in a Guy Fawkes prank before you can get your mail. You never get the letter. I have your receipt (because “Reliable Post Inc” sent it to me once they had dropped the letter into your box) so I have this false sense of security that you have it. What “Reliable Post Inc” should have done is wait for you to tell them that you had received and read the message. Only then should they send me the receipt. But this is annoying and involves you in the receipting process, and the whole idea of outsourcing delivery to “Reliable Post Inc” was so that we didn’t have to think about that.

So now we’ve brought the letter analogy back in line with TCP/IP, and what we’ve discovered is that, if we really want 100% reliability, we cannot simply outsource it to another party, because as soon as we do that, there’s this weak point where the message is handed over between us and the underlying service. True reliability is just something you and I are going to have to be aware of, and have a protocol in place to deal with.
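To make the hand-over weak point concrete, here is a small Python sketch that separates the two kinds of receipt. The `Mailbox` class, the `crash_before_read` flag and the `send` function are hypothetical names invented for this illustration; they do not correspond to any real transport API.

```python
class Mailbox:
    """Hypothetical recipient; `crash_before_read` simulates the burnt letterbox."""

    def __init__(self, crash_before_read=False):
        self.crash_before_read = crash_before_read
        self.processed = []

    def transport_deliver(self, message):
        # Transport-level receipt: issued as soon as the message lands,
        # before the application has done anything with it.
        transport_receipt = True
        if self.crash_before_read:
            # The message is gone, but the transport receipt was already issued.
            return transport_receipt, False
        self.processed.append(message)  # the application reads and stores it
        application_receipt = True      # only now can the application confirm
        return transport_receipt, application_receipt


def send(message, mailbox):
    transport_ok, app_ok = mailbox.transport_deliver(message)
    return {"transport_receipt": transport_ok, "application_receipt": app_ok}


print(send("pay invoice 17", Mailbox()))
print(send("pay invoice 17", Mailbox(crash_before_read=True)))
```

In the second call the sender holds a transport receipt for a message the application never processed. Only the application-level receipt tells the sender what it actually wants to know, and producing it requires the receiving application to take part in the protocol.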

Enter WS-ReliableMessaging. I won’t explain how it works, because it’s basically like TCP/IP, but at a higher layer up the WS-* (SOAP/XML based) protocol stack. Which raises the immediate question: if TCP/IP (over which most SOAP messages ultimately find themselves being transmitted) didn’t give us reliable messaging, how exactly is another layer, which does exactly the same thing, going to achieve it?

Of course the answer is, it doesn’t, for the exact same reason TCP/IP doesn’t: you can’t completely outsource reliability to another party.

In terms of WS-ReliableMessaging, it can improve reliability for a certain class of message failure, but don’t let it fool you into thinking you have 100% reliability – you’re still going to have to develop your own protocol to deal with failure after the message has been receipted. This makes reliability a bona fide business concern.

WS-ReliableMessaging makes the following claims which it calls Delivery Assurances:

  1. Messages arrive at least once
  2. Messages arrive at most once
  3. Messages arrive exactly once
  4. Messages arrive in the order they were sent

Item 1 can be better resolved as a business concern, as we have seen. Item 2 can be handled at the business level by making message interactions idempotent. Item 3 is simply the intersection of 1 and 2.
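As a rough illustration of the idempotency idea for item 2, the sketch below has the receiver remember which message IDs it has already handled, so a re-sent message has no further effect. The `PaymentService` class and the idea of a caller-supplied `message_id` are assumptions made for the example.

```python
class PaymentService:
    """Hypothetical receiver that tolerates duplicate deliveries."""

    def __init__(self):
        self.balance = 0
        self.seen = {}  # message id -> result produced the first time

    def handle(self, message_id, amount):
        if message_id in self.seen:
            # Duplicate delivery: replay the earlier result, change nothing.
            return self.seen[message_id]
        self.balance += amount
        self.seen[message_id] = self.balance
        return self.balance


service = PaymentService()
service.handle("msg-1", 10)
service.handle("msg-1", 10)  # re-sent because the receipt went missing
service.handle("msg-2", 5)
print(service.balance)
```

With a sender that retries until it hears back (at least once) and a receiver like this one (duplicates are harmless), the business effect happens once per message even though the message itself may arrive several times.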

Item 4 is about order, and this is interesting. Marc de Graauw explains the relationship between the order of operations and the business layer in his InfoQ article:

The first strange thing is that apparently the order is a property of messages which is important to the business layer. So if it is important to the business layer, why isn’t there a sequence number in the business message itself? We have a message, with its own business-level semantics, and the order is important: so why isn’t there some element or attribute in the message, on a business level, which indicates the order?

– source: Marc de Graauw, InfoQ
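A sequence number carried in the business message itself might be handled along these lines. This is a sketch under assumptions: the `sequence` and `body` fields and the `OrderedInbox` class are invented for the illustration and are not taken from the InfoQ article or from any specification.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedInbox:
    """Applies messages in business order, holding back any that arrive early."""

    next_expected: int = 1
    pending: dict = field(default_factory=dict)
    applied: list = field(default_factory=list)

    def receive(self, message):
        seq = message["sequence"]  # hypothetical business-level field
        if seq < self.next_expected or seq in self.pending:
            return  # duplicate: already applied or already waiting
        self.pending[seq] = message
        # Apply every message that is now next in line.
        while self.next_expected in self.pending:
            self.applied.append(self.pending.pop(self.next_expected)["body"])
            self.next_expected += 1

    def missing(self):
        """Sequence numbers we know must exist but have not seen yet."""
        if not self.pending:
            return []
        return [n for n in range(self.next_expected, max(self.pending))
                if n not in self.pending]


inbox = OrderedInbox()
inbox.receive({"sequence": 1, "body": "open account"})
inbox.receive({"sequence": 3, "body": "withdraw 20"})
print(inbox.applied, inbox.missing())
inbox.receive({"sequence": 2, "body": "deposit 50"})
print(inbox.applied, inbox.missing())
```

Because the ordering lives in the message, the receiving application can see for itself that something is missing and decide what to do about it.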

To back up my point that true reliability is a business as opposed to a protocol concern, I thought I’d share a transcript of an interview between Carl Franklin, Richard Campbell (from the Dot Net Rocks podcast) and their guest Jim Webber. Jim Webber is a contributor to some of the WS-* standards that come out of OASIS. This is from 29th April 2008, but still very relevant today.

Jim Webber: The reliable messaging stuff is actually relatively straightforward too. It’s a protocol where we tag message sequence metadata into messages and recipients of messages may notice when they are missing a sequence number or two and they can ask for retransmission. The protocol itself is relatively straightforward. Look for gaps in numbers and ask those gaps to be filled in if you find they’re missing. There are some subtleties around how to build that. For example if I’m the sender of the message, I have to hold onto that message until I’m quite sure that it has been ACK’ed by the recipient because I may be asked at any point until I have been ACK’ed to retransmit. But ultimately this stuff is not really too dissimilar from the kind of stuff that goes on way down the stack in TCP.

Richard Campbell: I mean the concepts are pretty straightforward. It’s just how we’re going to recover from a message that never showed up.

Jim Webber: Absolutely. So the irony of reliable messaging is that it is not reliable messaging. It can cover up glitches.

Richard Campbell: It is recoverable messaging.

Jim Webber: It is somewhat recoverable. So it would cover the odd glitch where a message or two goes missing but the minute that a meteorite smashes through your data center no amounts of reliable messaging on the planet is going to help you recover immediately from that catastrophe.

Richard Campbell: But the outside world is going to get a clear notice that their messages didn’t get delivered.

Jim Webber: Absolutely. Although I think WS-ReliableMessaging and friends have some validity, I actually think a much more robust pattern is to make your services aware of the protocols to which they conform. I know that sounds really lofty but what I actually mean is if you write your services so that they are message centric and so that they understand that message A is followed by message B or followed by message C or message D then those services know when something has gone wrong and they can be programmed robustly to react to the failure of a message. The problem with the WS-ReliableMessaging and forgive me to my friends who are involved writing some of those specs, but the problem is they encourage some sloppy thinking on the part of the service developer again.

If you take WSDL and WS-ReliableMessaging the appealing thought is OK I’m reliable now. I don’t need to worry about the protocol that my service works with. I just can do this RPC style thing and the reliable messaging protocol will take care of any glitches, which is only true up to a point and when you get an actual programmatic failure which WS-ReliableMessaging can’t mask, it leaks at a really inopportune moment and cripples your service. Although I can actually see the utility [of WS-ReliableMessaging], when I’m building services I tend to avoid it because I want my services to know when there should be a message there for them and to take proactive action, as it is more robust that way, to chase down those messages when they don’t arrive.
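One way to read Jim Webber’s suggestion is as a small state machine inside the service: it knows which messages may follow which, and it notices when an expected message is overdue. The sketch below is my own interpretation, not code from the interview; the message kinds (`order`, `payment`, `cancel`, `shipment`), the `Conversation` class and the timeout value are all hypothetical.

```python
# Hypothetical conversation: after "order" we expect "payment" or "cancel";
# after "payment" we expect "shipment"; "cancel" and "shipment" end it.
EXPECTED_NEXT = {
    "order": {"payment", "cancel"},
    "payment": {"shipment"},
    "cancel": set(),
    "shipment": set(),
}


class Conversation:
    """Tracks where one message exchange is up to and what should come next."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_kind = None
        self.last_time = None

    def receive(self, kind, now):
        allowed = {"order"} if self.last_kind is None else EXPECTED_NEXT[self.last_kind]
        if kind not in allowed:
            raise ValueError(f"unexpected {kind!r} after {self.last_kind!r}")
        self.last_kind, self.last_time = kind, now

    def overdue(self, now):
        """Message kinds the service should chase, if it has waited too long."""
        if self.last_kind is None:
            return set()
        waiting_for = EXPECTED_NEXT[self.last_kind]
        if waiting_for and now - self.last_time > self.timeout:
            return waiting_for
        return set()


conversation = Conversation(timeout_seconds=30)
conversation.receive("order", now=0)
print(conversation.overdue(now=10))  # still within the timeout
print(conversation.overdue(now=45))  # time to chase the missing message
```

Here the service itself decides that a message is late and can go looking for it, rather than relying on a lower layer to hide the problem.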

The full transcript of the interview is available, along with the audio from Dot Net Rocks.


Nobody needs reliable messaging


WS-ReliableMessaging Wikipedia article

WS-ReliableMessaging OASIS Standard


Enterprise IT is broken (part 1)

Monday, November 18, 2013
posted by daveb

Broken windows in a St Petersburg abandoned cinema


You may have heard of Conway’s law:

“.. organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations”

Conway’s law has proven to be true for every software project I have ever been involved in. Take the client-server application where the client was developed in C++ and the server in C. The client side developers were young and hip, and into OO. The server side developers were ex-mainframe developers in their 50s. It’s fair to say the two parties did not see eye to eye about much when it came to software design. The friction between the parties came to a head in the shared communication libraries for client-server communication, which they co-developed. The libraries were littered with pre-processor definitions and macros that seemed to be at war with one another. The arguments between teams carried over into the version control comments. The shared communication libraries were some of the most un-maintainable, un-readable and bug-ridden code imaginable, even though the purely client-side and purely server-side code were reasonably tidy on their own.

It was around 2008 when I began my adventures into this thing we call “the enterprise”. I was going to be an “enterprise developer” and take on development of a key part of the infrastructure – a credit card transaction processing interface. I understood that enterprise development meant you had to be lean – unlike a software company selling software products or software as a service, there were no economies of scale – you only build one instance of enterprise software.

As I began to find my way round some of the custom developed services and applications, a few questions started to emerge – like what version control system do you use here? Answer: “yes we have been meaning to get to that”. Ok, so there had been only one developer on that part of the project previously, and he was a bit green, so I decided I shouldn’t pre-judge the organization based on that alone.

More questions started to appear as my predecessor walked me through the operations side of things. He showed me how they had left an older version of the credit card processing API in production because there were a number of external clients using that interface, and they could not force them to upgrade to the new interface. Fair enough. I asked where the source code was for the old version, in case I needed to go back to it should a bug need to be fixed. Answer: “… well there shouldn’t really be any bugs, because it’s been there for years now”.

It turned out that work had started on “version 2″ without any kind of version control branch or label or even so much as a simple zip backup of “version 1″ source code. They had lost the intellectual property investment of the stable “version 1″, and had re-written large chunks of it to create “version 2″, which was not backward compatible, and was considerably less stable than the previous version. Unbelievable.

“Version 2″ had been 18 months in development, and had only very recently been deployed to production. Therefore, no business value had been delivered for 18 months. Business stakeholders had lost patience, and almost lost complete confidence in the development team.

Since the recent “version 2″ update, the phone had been ringing hot, and my predecessor would have an often lengthy discussion with an upset customer who had lost sales due to downtime and bugs with the service. I was now supposed to take these calls, and be an apologist for this mess.

At this point, things were looking so bad I was seriously considering walking out the door before I was even two weeks into the job.

However, I resolved to take on the project as a challenge, and that is the only thing that kept me there. I enquired about the testing approach: unit testing, integration testing, user acceptance testing and so on. In short:

Unit Testing: “what’s that exactly?”

Integration Testing: a spreadsheet containing a single sheet with dozens of columns and a hundred or so rows, each representing a test scenario. It was un-printable, un-readable, inaccurate and was the un-loved product of something the boss had instructed the developers to do. The developers didn’t feel their job was to test the product, and instead of resolving this dispute with the boss, they had yielded to his pressure, but then done the worst possible job on it, to make the point that developers can’t test! This communication breakdown, and many other examples like it, had almost eroded all value from the services being developed.

User Acceptance Testing: none

As we delved into architecture there were more surprises waiting for me. Like the messaging interface that used a database table as a queue, and had a service polling the table for new records every 500ms. This, I later discovered, would occasionally deadlock itself with another transaction and become the deadlock victim, meaning someone’s credit card transaction failed. The reason for using a table as a queue: the solution architect was a relational database guy and insisted this solution be adopted when the developers had hit some problems with the message queuing technology they were using.
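For readers who haven’t met the table-as-queue pattern, here is a minimal sketch of the general shape using Python’s built-in `sqlite3` module. It is not the system described above: the table layout, the `claimed` column and the function names are assumptions for illustration, and a single in-memory SQLite connection will not reproduce the deadlocks that occurred in the real system.

```python
import sqlite3


def make_queue():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE queue ("
        "id INTEGER PRIMARY KEY, body TEXT, claimed INTEGER DEFAULT 0)"
    )
    return db


def enqueue(db, body):
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO queue (body) VALUES (?)", (body,))


def poll_once(db):
    """Claim and return the oldest unclaimed row, or None if there is none."""
    with db:
        row = db.execute(
            "SELECT id, body FROM queue WHERE claimed = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Read-then-update: on a busy server database this step competes
        # for locks with whatever else is touching the same table.
        db.execute("UPDATE queue SET claimed = 1 WHERE id = ?", (row[0],))
        return row[1]


db = make_queue()
enqueue(db, "charge card A")
enqueue(db, "charge card B")
print(poll_once(db), poll_once(db), poll_once(db))
```

A polling service would call something like `poll_once` on a timer (every 500ms in the case above), whether or not there is any work waiting.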

Turns out there were more surprises in store

What is unbelievable is not that this dysfunctional situation could exist, but that project management and project stakeholders had no idea that these problems existed in the development practices and system architecture. They knew something was wrong, but lacked the ability to see any deeper inside the box. Nor did they have any notion that the industry has long since found best practices that address the very problems that were slowly destroying all value in the services they were providing.

At first I thought this organization was an anomaly, and that I would be unlikely to see anything this bad again. But then I started hearing about others who had seen similar things. And then I saw inside other organizations. I started to realize that what I’d seen was not an anomaly at all, it was practically commonplace. Sure, some were better than others, but I had yet to see inside an enterprise that had anything even remotely approaching a quality development team, with a solid set of practices that was able to deliver business value.

Conway’s law seemed to be holding true. Frictions between personalities and departments led directly to confusing and inconsistent software architectures. In fact Conway’s law can even be used in reverse – where there exist strange inconsistencies in a software architecture, you get an insight into the friction between different personalities or departments involved in its design.

If you want to assess your development team, and you aren’t a developer, just use the Joel Test. It goes like this:

  1. Do you use source control?
  2. Can you make a build in one step?
  3. Do you make daily builds?
  4. Do you have a bug database?
  5. Do you fix bugs before writing new code?
  6. Do you have an up-to-date schedule?
  7. Do you have a spec?
  8. Do programmers have quiet working conditions?
  9. Do you use the best tools money can buy?
  10. Do you have testers?
  11. Do new candidates write code during their interview?
  12. Do you do hallway usability testing?

Add one point for every “yes” answer and there’s your score. As Joel himself says:

“The truth is that most software organizations are running with a score of 2 or 3, and they need serious help, because companies like Microsoft run at 12 full-time.”