Error Handling in Distributed Systems: Exception Handling Within Components

Crashed Arrivals Screen in an Airport. The screen has crashed; it's not a screen of arrivals that have crashed!

I'm on a bit of a roll when it comes to exception handling tips, so here's another technique that's worked well in the last few systems I've had a hand in, this time for coordinating exception handling within a component of a distributed system.

A component in a distributed system receives requests from remote components and, before sending back a response, reacts by making requests of its own to the remote components it depends on. By "request" I mean either a client-server style request/response interaction or an asynchronous event received from a message broker. Because its clients and dependencies are out of its control, the component cannot guarantee that the requests it receives are correct or that the services it depends on are available when it must service a request.

In Java terms, because both bad requests and failed dependencies are out of the component's control, they should be reported by throwing checked exceptions. However, the way they should be handled is significantly different. A bad request should never be retried, but could be logged for manual repair and replay if that makes sense for the system. The failure of a dependency is (hopefully) temporary, and so the request can be retried later.

Apart from RMI, Java frameworks for writing distributed systems don't make this distinction in the exceptions they throw or handle. Therefore, when building distributed systems, among the first things I write are two base exception classes: BadRequestException and EnvironmentException. Depending on the communication protocol, the application will handle these in different ways:

| Protocol | Bad Request Exception | Environment Exception |
|---|---|---|
| HTTP (e.g. Servlets) | Return a 4xx response code | Return a 5xx response code |
| JMS (e.g. Message Driven Beans) | Move message to a hospital queue and commit the transaction | Roll back the transaction, leaving the message on the input queue for later redelivery |
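As a rough illustration, the two base classes might look something like the sketch below. Only the class names come from the text above; the constructors and comments are my assumptions about a reasonable minimal shape, and both classes extend Exception so that they are checked.

```java
/**
 * Thrown when the incoming request itself is invalid.
 * Retrying the same request cannot succeed, so handlers should
 * reject it (or park it for manual repair and replay).
 */
class BadRequestException extends Exception {
    BadRequestException(String message) {
        super(message);
    }

    BadRequestException(String message, Throwable cause) {
        super(message, cause);
    }
}

/**
 * Thrown when something the component depends on is unavailable.
 * The request may well succeed if it is retried later.
 */
class EnvironmentException extends Exception {
    EnvironmentException(String message) {
        super(message);
    }

    EnvironmentException(String message, Throwable cause) {
        super(message, cause);
    }
}
```

More specific failures can then subclass one or the other, so that protocol-level code only ever needs to catch these two types.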

Because frameworks don't make the distinction between Bad Requests and Environment Exceptions, I keep the framework code - servlet or MDB class, for example - as thin as possible, doing little more than delegating the request to an interface that throws BadRequestException or EnvironmentException and handling each kind of error as appropriate.
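Here is a minimal, framework-neutral sketch of that thin layer for the HTTP case. To keep it self-contained it does not use the real Servlet API: RequestHandler, Response and ThinEndpoint are hypothetical names, and the specific codes 400 and 503 are just my picks from the 4xx and 5xx ranges.

```java
// Minimal versions of the two base exceptions, repeated so the sketch compiles alone.
class BadRequestException extends Exception {
    BadRequestException(String message) { super(message); }
}

class EnvironmentException extends Exception {
    EnvironmentException(String message) { super(message); }
}

/** Hypothetical interface that the application logic implements. */
interface RequestHandler {
    String handle(String requestBody) throws BadRequestException, EnvironmentException;
}

/** Hypothetical stand-in for whatever response type the framework provides. */
final class Response {
    final int status;
    final String body;

    Response(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

/** The thin protocol layer: delegate, then map each kind of failure to a status. */
final class ThinEndpoint {
    private final RequestHandler handler;

    ThinEndpoint(RequestHandler handler) {
        this.handler = handler;
    }

    Response service(String requestBody) {
        try {
            return new Response(200, handler.handle(requestBody));
        } catch (BadRequestException e) {
            // The client's fault: retrying the same request will not help.
            return new Response(400, e.getMessage());
        } catch (EnvironmentException e) {
            // A dependency's fault: the client may retry later.
            return new Response(503, e.getMessage());
        }
    }
}
```

A JMS listener would have the same shape, with the two catch blocks moving the message to a hospital queue or rolling back the transaction instead of choosing a status code.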

The client-side code of a synchronous remote call needs to translate the status it receives appropriately. If the returned status indicates that the client made a bad request (e.g. an HTTP 4xx code), it should throw a RuntimeException to indicate that a programming error has been detected. If the status indicates an environment exception (e.g. an HTTP 5xx code), it should throw a checked exception so the compiler ensures that the exception is handled. I usually wrap that logic up in a convenient proxy object.
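A proxy along those lines could be sketched as follows. The transport is modelled as a plain function from request to status and body so the example needs no HTTP library; RemoteServiceProxy, RawResponse and RejectedRequestException are hypothetical names, and treating any status outside 2xx, 4xx and 5xx as an unexpected programming error is my assumption rather than part of the technique.

```java
import java.util.function.Function;

/** Checked: the remote service failed, so the caller must decide how to recover. */
class EnvironmentException extends Exception {
    EnvironmentException(String message) { super(message); }
}

/** Unchecked: the remote service says our request was wrong, i.e. a bug on our side. */
class RejectedRequestException extends RuntimeException {
    RejectedRequestException(String message) { super(message); }
}

/** Hypothetical status-plus-body pair returned by the transport. */
final class RawResponse {
    final int status;
    final String body;

    RawResponse(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

final class RemoteServiceProxy {
    private final Function<String, RawResponse> transport;

    RemoteServiceProxy(Function<String, RawResponse> transport) {
        this.transport = transport;
    }

    String call(String request) throws EnvironmentException {
        RawResponse response = transport.apply(request);
        int status = response.status;
        if (status >= 200 && status < 300) {
            return response.body;
        }
        if (status >= 400 && status < 500) {
            // We sent a bad request: report it as a programming error.
            throw new RejectedRequestException("Remote rejected request: " + status);
        }
        if (status >= 500 && status < 600) {
            // The remote side failed: force the caller to handle it.
            throw new EnvironmentException("Remote service failed: " + status);
        }
        // Assumption: any other status is not expected by this simple sketch.
        throw new IllegalStateException("Unexpected status: " + status);
    }
}
```

Callers then only see a method that returns a result or throws EnvironmentException, which they can let propagate up to the thin protocol layer of their own component.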

Photo by Ian Hughes used under the Creative Commons Attribution license.

Mistaeks I Hav Made
