Retries with backoff in Azure Service Bus

In my current project, we receive messages from Azure Service Bus that we need to deliver to edge devices that are located all over the country. The connections to these devices are not always stable, and sometimes we need to retry message delivery for up to 24 hours. We are using Azure Functions for the actual message delivery, which due to their stateless nature are not well suited for such long running retries.

We hoped that Azure Service Bus would provide exponential backoff functionality out of the box, but unfortunately this is not the case. We ended up implementing a workaround using scheduled messages, and created our own library effectively adding retries with backoff to the Service Bus trigger for Node.js Azure Functions. In this blog, I will explain our solution. Or if you prefer, you can skip straight to the library code on GitHub.

Why not just use a different message broker?

In our project we have a preference for Azure-managed services. The only Azure-managed messaging service that provides the kind of retries we are looking for out of the box is Event Grid. Unfortunately, our functions run in VNETs where inbound connections are not allowed. Event Grid needs to establish a connection to its subscribers, which means that it is not an option for us. If you have no such networking restrictions however, you might be better off using Event Grid if you need long running retries.

Service Bus default behavior, and why it does not suffice

Azure Service Bus does offer some retry functionality. The queue client even offers exponential backoff. At first glance this seems perfect, and you would not be the first to think so. Unfortunately, this retry policy is only applied when an error occurs while connecting to Service Bus. It notably does nothing when the consumer of a message abandons the message or throws an error during message processing.

When a consumer of a Service Bus queue experiences a problem while processing a message, it can choose to abandon the message. If this happens while the max delivery count has not been reached, the message becomes available again on the queue immediately. It immediately gets reprocessed and probably errors again, repeating the cycle. This continues until the max delivery count configured on the queue has been reached, at which point the message is sent to the dead-letter queue (DLQ).

[Diagram: default Service Bus behavior, where an abandoned message is redelivered immediately until it is dead-lettered]

These retries are likely to do more harm than good. When a retriable error occurs, it is typically due to some downstream dependency having issues. That means that we need to give this dependency time to recover, but this functionality is doing the exact opposite. The dependency gets bombarded with retries, which is likely to just exacerbate the problem.

I should note that there is currently a GitHub issue aimed at improving this functionality by offering an ‘abandon with delay’ option. When that gets completed, I intend to update this article with an improved solution making use of this feature.

Solution Sketch

While we cannot specify a delay when abandoning a message, it is possible to schedule a message for a specific time when sending it to Service Bus. So when we want to abandon a message, we can instead:

  • Schedule a copy of the message to become available at a suitable time
  • Complete the original message so it gets removed from Service Bus

Actually, it is not quite that simple. Our consumer will need to know if a message is a retry, and if so how many times it has been retried. It might also want to know some additional information about the original message, such as the original message timestamp and expiry time. This means we need to add a message wrapper containing this information when we reschedule the message on Service Bus. We also need to check if the expiry time or max try count have been reached.
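To make the wrapper idea concrete, here is a minimal sketch of what such an envelope could look like. The type and function names (RetryEnvelope, unwrap, nextAttempt) and the field names are assumptions made up for this illustration; they are not the format our library actually uses.

```typescript
// Hypothetical envelope that travels with every rescheduled copy of a message.
interface RetryEnvelope<T> {
  body: T;                          // the original message body, unchanged
  publishCount: number;             // 1 for the first delivery, 2 for the first retry, ...
  originalMessageId: string;
  originalEnqueuedTimeUtc: string;  // ISO 8601 timestamp of the first publish
  originalExpiresAtUtc: string;     // ISO 8601 expiry of the original message
}

// Metadata of the message as delivered by Service Bus (illustrative subset).
interface DeliveryMetadata {
  messageId: string;
  enqueuedTimeUtc: string;
  expiresAtUtc: string;
}

// Naive envelope detection. A real implementation would more likely mark
// retries with an application property instead of inspecting the body.
function isEnvelope<T>(value: unknown): value is RetryEnvelope<T> {
  return typeof value === "object" && value !== null
    && "publishCount" in value && "body" in value;
}

// First deliveries arrive as bare bodies, retries arrive wrapped.
// Either way the consumer gets one uniform shape to work with.
function unwrap<T>(incoming: T | RetryEnvelope<T>, metadata: DeliveryMetadata): RetryEnvelope<T> {
  if (isEnvelope<T>(incoming)) {
    return incoming;
  }
  return {
    body: incoming as T,
    publishCount: 1,
    originalMessageId: metadata.messageId,
    originalEnqueuedTimeUtc: metadata.enqueuedTimeUtc,
    originalExpiresAtUtc: metadata.expiresAtUtc,
  };
}

// Envelope for the copy that gets scheduled after a failed attempt.
function nextAttempt<T>(current: RetryEnvelope<T>): RetryEnvelope<T> {
  return { ...current, publishCount: current.publishCount + 1 };
}
```

Because the original metadata is copied into the envelope once and then carried along unchanged, every retry can still see when the message was first published and when it expires, even though Service Bus treats each rescheduled copy as a brand new message.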

Note that since we handle retries in the function logic, we should set the max delivery count (maxDeliveryCount) to 1 on the Service Bus queue, effectively disabling retries on the Service Bus level. With all that, our retry mechanism looks like this:

[Diagram: the retry mechanism with the rescheduling logic inside the consumer]
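The decision at the heart of this flow can be sketched in a few lines. This is an illustration under assumptions, not the code of our library: RetryPolicy, backoffSeconds and decideOnFailure are hypothetical names, and the capped exponential delay is just one possible backoff strategy.

```typescript
// Hypothetical retry settings for this sketch.
interface RetryPolicy {
  maxRetries: number;        // retries allowed after the first attempt
  baseDelaySeconds: number;  // delay before the first retry
  maxDelaySeconds: number;   // upper bound for the growing delay
}

type RetryDecision =
  | { action: "reschedule"; scheduledEnqueueTimeUtc: Date; publishCount: number }
  | { action: "dead-letter"; reason: "max-retries-reached" | "message-expired" };

// Capped exponential backoff: base, 2x base, 4x base, ... up to the cap.
function backoffSeconds(publishCount: number, policy: RetryPolicy): number {
  const exponential = policy.baseDelaySeconds * 2 ** (publishCount - 1);
  return Math.min(exponential, policy.maxDelaySeconds);
}

// Called when processing failed. publishCount is the number of attempts made
// so far, so the number of retries already used is publishCount - 1.
function decideOnFailure(
  publishCount: number,
  expiresAtUtc: Date,
  now: Date,
  policy: RetryPolicy,
): RetryDecision {
  if (publishCount - 1 >= policy.maxRetries) {
    return { action: "dead-letter", reason: "max-retries-reached" };
  }
  const delayMs = backoffSeconds(publishCount, policy) * 1000;
  const scheduled = new Date(now.getTime() + delayMs);
  // No point in scheduling a copy that would only arrive after expiry.
  if (scheduled.getTime() >= expiresAtUtc.getTime()) {
    return { action: "dead-letter", reason: "message-expired" };
  }
  return {
    action: "reschedule",
    scheduledEnqueueTimeUtc: scheduled,
    publishCount: publishCount + 1,
  };
}
```

On a "reschedule" decision, the consumer would send the wrapped copy with the scheduleMessages method of a ServiceBusSender from @azure/service-bus, passing scheduledEnqueueTimeUtc as the enqueue time, and then let the original message complete. On a "dead-letter" decision it would let the failure go through, so that Service Bus dead-letters the message.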

While this works, our consumer is now cluttered with a bunch of logic related to just the retry mechanism. Let’s extract the retry logic to a library instead.

[Diagram: the same retry mechanism with the rescheduling logic extracted into a library]

That is much better. We still need to implement the retry logic, but this way we only need to implement it once.

Solution limitations

Our solution relies on republishing messages as new messages with a delay. That inherently means that we can never give any message ordering guarantees. In particular, it means that we cannot support message sessions.

Solution Implementation: Node.js Azure Functions (programming model v4)

First, a disclaimer. I am the only person working on this library, and maintaining it is not a top priority for me. I recommend you copy my code or build your own library inspired by my code rather than relying on it directly for production scenarios.

Our library is basically an extension of the @azure/functions app.serviceBusQueue() function that is used to create Service Bus Queue triggers. The normal Service Bus Queue trigger looks like this (example taken from Azure Service Bus trigger for Azure Functions | Microsoft Learn):

const { app } = require('@azure/functions');
app.serviceBusQueue('serviceBusQueueTrigger1', {
    connection: 'MyServiceBusConnection',
    queueName: 'testqueue',
    handler: (message, context) => {
        context.log('Service bus queue function processed message:', message);
        context.log('EnqueuedTimeUtc =', context.triggerMetadata.enqueuedTimeUtc);
        context.log('DeliveryCount =', context.triggerMetadata.deliveryCount);
        context.log('MessageId =', context.triggerMetadata.messageId);
    },
});

Below is an example use of the Service Bus Queue trigger using our library. Note that in addition to the retry configuration, we added type parameters for extra ease of use in TypeScript. I’m using ESM style imports here, but the library supports CommonJS as well.

import { serviceBusQueueWithRetries, type ServiceBusRetryInvocationContext } from '@joost_lambregts/azure-functions-servicebus-retries'
serviceBusQueueWithRetries('serviceBusQueueTrigger1', {
  queueName: 'my-queue-name',
  connection: 'MyServiceBusConnection',
  handler: handleMessage,
  retryConfiguration: {
    maxRetries: 15,
    delaySeconds: 60,
    sendConnectionString: 'Endpoint=sb://some-namespace.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=some-key;EntityPath=some-queue',
  },
  messageExpiryStrategy: 'ignore'
})
export async function handleMessage(message: MyMessageType, context: ServiceBusRetryInvocationContext): Promise<void> {
  // message has the same structure as when it was originally published to Service Bus
  context.info(`${message.someProperty}`)
  // Some retry information is added to the context object
  if (context.publishCount > 1) {
    context.info(`Publish count: ${context.publishCount}`)
    context.info(`Original messageId: ${context.originalBindingData?.messageId}`)
    context.info(`Original publish time: ${context.originalBindingData?.enqueuedTimeUtc}`)
    context.info(`original expiration time: ${context.originalBindingData?.expiresAtUtc}`)
  }
}

As mentioned before, this function is basically just a wrapper around the original app.serviceBusQueue function that handles all the retry logic. The details of how my library works and can be used can be found in the GitHub repo, so I won’t go into too much detail here. The most interesting detail about our implementation is the fact that we can extend the Service Bus queue trigger at all. This is possible because in programming model V4, function triggers are created using a function call instead of a function.json file. It doesn’t actually matter where the function call takes place, as long as it gets called at function startup. That means that we can wrap the actual call to app.serviceBusQueue() in a function that provides the extra retry logic, and put it in a library. I really like this technique, and it has many interesting applications. To name a few:

  • You can put entire functions in a library, for instance a health check function
  • You can dynamically create multiple functions by looping over a list in an environment variable
  • You can create a wrapper around a function trigger that prefills some default configuration
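As a rough illustration of the last two ideas, the sketch below wraps a trigger registration function. To keep it runnable without the Functions runtime, the registration function is passed in as a parameter; in a real app you would pass something like (name, options) => app.serviceBusQueue(name, options). All names here (registerWithDefaults, registerForQueues, the process- prefix) are hypothetical and the option types are a simplified stand-in for the real ones.

```typescript
// Simplified stand-ins for the trigger options of @azure/functions.
type Handler = (message: unknown) => Promise<void> | void;

interface QueueTriggerOptions {
  connection: string;
  queueName: string;
  handler: Handler;
}

type RegisterQueueTrigger = (name: string, options: QueueTriggerOptions) => void;

// Wrapper that prefills a default connection setting name, so that callers
// only have to provide what differs per function.
function registerWithDefaults(
  register: RegisterQueueTrigger,
  name: string,
  options: { queueName: string; handler: Handler; connection?: string },
): void {
  register(name, {
    connection: options.connection ?? "MyServiceBusConnection",
    queueName: options.queueName,
    handler: options.handler,
  });
}

// Creates one trigger per queue in a comma-separated list, for instance the
// value of an environment variable. Returns the function names it registered.
function registerForQueues(
  register: RegisterQueueTrigger,
  queueList: string,
  handler: Handler,
): string[] {
  const queues = queueList
    .split(",")
    .map((queue) => queue.trim())
    .filter((queue) => queue.length > 0);
  const names: string[] = [];
  for (const queueName of queues) {
    const name = `process-${queueName}`;
    registerWithDefaults(register, name, { queueName, handler });
    names.push(name);
  }
  return names;
}
```

The only requirement is that these calls run while the function app starts up, for example from the module that your app's entry point loads.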

If you want to know more details about our library implementation, I encourage you to look at the GitHub repo. If you have any questions or remarks about it, feel free to leave a comment.

Feedback is welcome!

This is my first blog, and relates to my first public NPM package and first public GitHub project. I expect that on all of these subjects, there is room for improvement. If you have suggestions on any of these topics, or just want to let me know that this blog helped you, please leave a comment!
