RabbitMQ Delayed Retry Approaches (That Work)

Problem: You want either a fixed or an exponential back-off for your retries. There are a few ways of making this work, and all of them use the dead letter exchange functionality. The best I have seen is the NServiceBus approach. Unfortunately, there are also a few blog posts out there with just plain bad advice. In this post we'll look at the bad advice, some per-application solutions and the NServiceBus method, which works as a central retry infrastructure for an entire virtual host.

Simple Wait Exchange and Queue Pattern (Doesn't work)

The common pattern that I have seen on the web is to set up a wait exchange and queue where you send messages for retries. You set a TTL on each message and set the dead letter exchange of your wait queue to your principal exchange. So if you set the TTL on your message to 5 minutes, the message sits in the wait queue for 5 minutes and then gets dead lettered back to your application's exchange to be consumed again.

Do a search in Google and you'll find plenty of examples of blog posts that describe this pattern but they don't take into account a critical feature of dead letter behaviour.

Expired messages are only dead lettered once they reach the head of the queue. This means you cannot use a single wait queue for any back-off strategy that mixes different TTLs. If you have a message with a TTL of 10 minutes at the head of the queue and a message with a TTL of 1 minute behind it, the second message will also wait for ten minutes.
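To make the problem concrete, here is a minimal simulation of that behaviour in plain Python. It does not talk to a broker; the function name and the model (all messages published at the same moment, only the head of the queue allowed to expire) are assumptions made for illustration.

```python
from collections import deque


def dead_letter_times(messages):
    """Simulate one wait queue where only the head message can expire.

    `messages` is a list of (name, ttl_seconds) pairs, all published at t=0
    in the given order. Returns {name: seconds until it is dead lettered}.
    """
    queue = deque(messages)
    clock = 0
    released = {}
    while queue:
        name, ttl = queue.popleft()
        # A message leaves only when it is at the head AND its TTL has passed,
        # so it can never leave earlier than the message in front of it.
        clock = max(clock, ttl)
        released[name] = clock
    return released
```

Calling dead_letter_times([("slow", 600), ("fast", 60)]) reports 600 seconds for both messages, while swapping the order lets "fast" leave after 60 seconds. The delay you actually get depends on what happens to be in front of you.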

This pattern does not work. Thankfully, in each of those blog posts I see commenters pointing this out and correcting the poster.

The Multiple Wait Queues Per Application Approach

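Create multiple wait queues and set a message TTL on the queues themselves. Let's say that 1, 5 and 15 minutes are your back-off time periods. Create three exchanges and queues with these TTLs and make each queue dead letter to your application's exchange. Because every message in a given queue has the same TTL, messages expire in the order they arrived and the head-of-queue problem goes away.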
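As a rough sketch of what those declarations could look like, the snippet below only builds the queue names and arguments. "x-message-ttl" and "x-dead-letter-exchange" are the standard RabbitMQ queue arguments; the naming scheme, the "orders" exchange in the usage note and both function names are made up for the example. With a client such as pika you would pass each entry to something like channel.queue_declare(queue=..., arguments=...).

```python
def wait_queue_specs(app_exchange, backoff_minutes):
    """Build one wait queue declaration per back-off period."""
    specs = []
    for minutes in backoff_minutes:
        specs.append({
            # Hypothetical naming convention: <app exchange>.wait.<N>m
            "queue": f"{app_exchange}.wait.{minutes}m",
            "arguments": {
                # Queue-level TTL in milliseconds: every message in this
                # queue waits the same time, so they expire in order.
                "x-message-ttl": minutes * 60 * 1000,
                # Expired messages are dead lettered back to the application.
                "x-dead-letter-exchange": app_exchange,
            },
        })
    return specs


def wait_queue_for_attempt(specs, attempt):
    """Pick the wait queue for a 1-based retry attempt, capped at the longest."""
    if attempt < 1:
        raise ValueError("attempt must be 1 or greater")
    index = min(attempt, len(specs)) - 1
    return specs[index]["queue"]
```

For example, wait_queue_specs("orders", [1, 5, 15]) gives three queues with TTLs of 60000, 300000 and 900000 milliseconds, and any attempt beyond the third keeps using the 15 minute queue.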

The downsides of this approach are that:

  • you need to set up this wait infrastructure for each application

  • you have limited flexibility on the back-off periods, although you can at least choose them per application.

Shared Multiple Wait Queues Approach

Similar to the above but you just have one set of queues that are reused by different applications. The question now is how do you make sure that each message is routed back to your application's exchange?

Answer: via a topic exchange and routing key. Just make sure that each wait queue has a "Dispatch" exchange as its dead letter exchange. Then make sure that your application's exchange or queue has a binding to that Dispatch exchange with a binding key that uniquely identifies it.

When you send a message to the wait infrastructure, set the routing key that will match your application's exchange/queue.

The good thing is that you don't need per-application wait exchanges and queues.

The bad things are that:

  • If the original message had a routing key, then you've just replaced it. However, if you really need that original routing key, just add it as a message header so you can extract it on the retries.

  • You are limited in the back-off periods as all applications share the same wait infrastructure. You could mitigate that problem by creating many wait queues to cover a large range of wait periods.
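Here is one possible way to handle the routing key swap, sketched without a broker client. The header name "x-original-routing-key" and both function names are assumptions, not a RabbitMQ convention. The sketch relies on the fact that a dead lettered message keeps its routing key unless the queue sets "x-dead-letter-routing-key".

```python
ORIGINAL_KEY_HEADER = "x-original-routing-key"  # hypothetical header name


def build_retry_publish(app_binding_key, delivered_routing_key, headers, wait_exchange):
    """Describe how to publish a failed message to the shared wait queues."""
    retry_headers = dict(headers or {})
    # Keep the first routing key ever seen. On later retries the delivered
    # key is already the application's binding key, so never overwrite.
    retry_headers.setdefault(ORIGINAL_KEY_HEADER, delivered_routing_key)
    return {
        "exchange": wait_exchange,
        # This key travels through the wait queue and the Dispatch exchange
        # and must match the application's binding.
        "routing_key": app_binding_key,
        "headers": retry_headers,
    }


def recover_routing_key(delivered_routing_key, headers):
    """Return the routing key the message was first published with."""
    return (headers or {}).get(ORIGINAL_KEY_HEADER, delivered_routing_key)
```

On the consuming side you would call recover_routing_key before any logic that depends on the key. A message that has never been retried has no such header and simply falls back to the delivered key.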

The NServiceBus Approach

The NServiceBus way is truly genius. I don't know if they came up with it, but it is really cool.

You can read up about it on the NServiceBus website here, or see my visual description of it in my earlier blog post here.

It allows for any period of waiting, at second resolution, from one second up to a few years, by using a series of 27 exchanges and queues that dead letter to the level below depending on the routing key.
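The arithmetic behind this can be sketched as a binary decomposition of the delay. This assumes that level n holds a message for 2**n seconds and that a message only waits at the levels whose bit is set in the delay. The level count of 27 is taken from the text above, and the function name is made up. The exact exchange names and routing key format that NServiceBus uses are deliberately not reproduced here.

```python
def delay_levels(delay_seconds, level_count=27):
    """Split a delay into power-of-two waits, highest level first.

    Assumes level n holds a message for 2**n seconds. Returns a list of
    (level, wait_seconds) pairs for the levels the message would wait in.
    """
    max_delay = 2 ** level_count - 1
    if not 0 <= delay_seconds <= max_delay:
        raise ValueError(f"delay must be between 0 and {max_delay} seconds")
    visits = []
    for level in range(level_count - 1, -1, -1):
        if delay_seconds & (1 << level):  # bit set: wait at this level
            visits.append((level, 1 << level))
    return visits
```

A 10 second delay is 1010 in binary, so delay_levels(10) returns [(3, 8), (1, 2)]: 8 seconds at level 3, then 2 seconds at level 1. Under these assumptions the longest delay is 2**27 - 1 seconds, a little over four years.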

The benefit is that you only need one set of exchanges and queues for your retry infrastructure and it supports any wait period you want.

The downside is that if your message had an original routing key, then that is lost. But, again, you could always add it as a message header by convention, so that your application can get it if it needs it.

Message Expiry and the Dead Letter Approaches

So far, all the approaches rely on the dead letter capability of RabbitMQ. This has a drawback that affects message expiry. When a message is dead lettered, its "expiration" property is removed and the old value is recorded as "original-expiration" in the message's "x-death" header.

If you have a message that is only valid for, say, 1 hour, and you use a dead letter retry approach, then you risk the message being processed after the original expiry has elapsed. The original "expiration" property is removed after the first dead lettering, and so the message could sit in the application's queue for hours or days and still get processed in the end.

This can be solved by adding a custom expiration header that your application respects and copies onto the retry messages. When your application receives a message, it should add the custom expiration period to the original message timestamp, compare the result against the current time, and discard the message if the expiration period has passed.
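A minimal sketch of such a check is below. The header name "x-expires-at" and the function names are hypothetical. As a small variation, it stores an absolute deadline in Unix seconds rather than a period, so the consumer does not need the original timestamp to do the comparison.

```python
import time

EXPIRES_AT_HEADER = "x-expires-at"  # hypothetical custom header, Unix seconds


def stamp_expiry(headers, published_at, valid_for_seconds):
    """Add an absolute deadline once, when the message is first published."""
    stamped = dict(headers or {})
    # setdefault keeps the first deadline, so retries never extend it.
    stamped.setdefault(EXPIRES_AT_HEADER, published_at + valid_for_seconds)
    return stamped


def is_expired(headers, now=None):
    """Return True if the custom deadline has passed.

    A message without the header is treated as never expiring.
    """
    deadline = (headers or {}).get(EXPIRES_AT_HEADER)
    if deadline is None:
        return False
    current = time.time() if now is None else now
    return current >= deadline
```

The consumer would call is_expired(headers) before processing and drop the message if it returns True. The retry publisher would pass the same headers through stamp_expiry, which leaves an existing deadline untouched.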

Any Other Patterns and Gotchas?

If anyone knows of any other patterns and any other gotchas (like the message expiry problem) then leave a comment or email me please!