Event Hub triggered Azure Functions: A bug and a workaround

The context

In my current project, we make extensive use of Event Hubs and Azure Functions, which we are running on Linux App Service Plans. These services have performed well for us, but we have been running into one severe problem: Whenever any of our Event Hubs-triggered Azure Functions restart, it results in two minutes of downtime. Or to be more precise: the function restarts quickly but doesn’t consume any messages from Event Hubs for two minutes.

Two minutes of downtime is already an issue during planned restarts, such as deployments or app setting updates. We have, however, also experienced restarts as a result of unannounced platform maintenance by Microsoft. These unpredictable episodes of downtime turned this from an annoyance into a serious problem.

We initially raised an issue with Microsoft support, but they were not of much help. After this we started digging ourselves, and eventually concluded that the root cause was a bug in the Event Hub trigger for Azure Functions on Linux App Service Plans. Fortunately, we found a workaround as well.

All of our functions run on dedicated Linux App Service Plans. I did some tests on Windows App Service Plans as well, which did not reveal any issues. I have not checked Consumption or Elastic Premium plans.

The bug

Event Hubs has an architecture similar to Kafka. Each Event Hub consists of one or more partitions, and each message is published to one of them. Consumers are organized in consumer groups, and each partition is consumed by exactly one member of each consumer group.

One way in which Event Hubs differs from Kafka is how the assignment of partitions to members of a consumer group works.1 In Event Hubs, clients keep track of partition ownership in a shared checkpoint store. When a consumer shuts down gracefully, it is supposed to release its partitions. When this doesn’t happen, other consumers will have to wait until the partition lease expires before overwriting it. That brings us to the problem, or rather, the combination of problems:

  1. When a function instance stops, even if it shuts down gracefully, it does not release its partition ownership. This problem only seems to occur on Linux App Service Plans, not on Windows App Service Plans.
  2. After a restart, the new function instances will have new client IDs, and as such cannot reuse the existing partition leases from before the restart.
  3. Partition ownership expiry is fixed at two minutes and is not configurable.

You can in fact observe this behavior yourself. For Azure Functions, the checkpoint store is just a blob container called ‘azure-webjobs-eventhub’ on the Storage Account associated with the Function. The partition leases are blobs with a metadata field called ‘ownerid’. When you restart your function, what happens next depends on whether you are on a Windows or Linux App Service Plan:

  • On Windows, the ownerId will become null, and then quickly get a new value. This is the expected behavior.
  • On a Linux App Service Plan, the ownerId will not become null but keep its old value for two minutes. At that point the lease expires and the ownerId will finally change.
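
If you want to verify this for your own functions, a small script is enough. Below is a minimal sketch that lists the lease blobs and prints their ownerid metadata; the connection string is a placeholder, and it assumes the azure-storage-blob package is installed. Run it before and after a restart to see whether the owner is released immediately (Windows) or kept until the lease expires (Linux).

# Minimal sketch: list the partition lease blobs and print their owners.
# The connection string is a placeholder for the storage account that is
# associated with the Function App.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="azure-webjobs-eventhub",
)

# Each partition lease is a blob; the owning client id is stored in the
# 'ownerid' metadata field.
for blob in container.list_blobs(include=["metadata"]):
    owner = (blob.metadata or {}).get("ownerid")
    print(f"{blob.name}: ownerid={owner}")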

So to summarize: the function instance restarts quickly, but it does not release its partition ownership on shutdown. After the restart it can no longer tell which instance owns which partition, because every instance gets a new Event Hub client with a new ID. The instances therefore have to wait for the old ownership to expire, which takes two minutes.

What Microsoft should do

The fact that partition ownership doesn't get released on a graceful shutdown is simply a bug, and Microsoft should fix it. Even if that gets fixed, though, there might still be cases where a function shuts down unexpectedly and cannot update the checkpoint store. The better option would be to have the function instances use static Event Hub client IDs. That way, a function could reuse its existing partition leases after a restart.

I created this GitHub issue which quickly got some attention. It turned out though that I created the issue on the wrong GitHub repo, and the issue got transferred here. The second issue unfortunately hasn’t received any activity since it was created two months ago, so I don’t expect this to get solved anytime soon.

Two workarounds

Since our bug involves the Event Hubs trigger on Linux App Service Plans, we have two obvious solutions: not using Linux or not using Event Hub triggers.

Solution 1: Use a Kafka Trigger

Event Hubs offer a Kafka compliant interface, which means we can just use the Kafka trigger instead. Here is the documentation on the Event Hubs Kafka endpoint, and here is documentation on the Kafka trigger for Azure Functions. The most valuable piece of documentation though is the GitHub repo for the Azure Functions Kafka Extension, which provides samples for multiple programming languages, including samples for Event Hubs.
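
To make the Kafka-compatible endpoint a bit more concrete, here is a minimal sketch of a plain Kafka client (using the confluent-kafka Python package) connecting to an Event Hubs namespace; the namespace, event hub and consumer group names are placeholders. The Kafka trigger needs the same ingredients, supplied through its binding configuration and app settings; the samples in the extension repo show the exact binding fields per language.

# Minimal sketch of a plain Kafka consumer reading from Event Hubs via its
# Kafka-compatible endpoint. Namespace, hub and consumer group names are
# placeholders; requires the confluent-kafka package.
from confluent_kafka import Consumer

consumer = Consumer({
    # Event Hubs exposes its Kafka endpoint on port 9093.
    "bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    # For Event Hubs, the SASL username is literally "$ConnectionString" and
    # the password is the Event Hubs connection string.
    "sasl.username": "$ConnectionString",
    "sasl.password": "<event-hubs-connection-string>",
    "group.id": "<consumer-group>",
    # The same consumer setting that the host.json snippet below tunes for
    # the Kafka trigger.
    "session.timeout.ms": 10000,
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["<event-hub-name>"])  # the topic name is the event hub name
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.value())
consumer.close()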

The Kafka trigger is already an improvement with the default configuration, but we can tune it to do a bit better still. The configuration field session.timeout.ms determines how long a consumer is allowed to be inactive before triggering a rebalancing of the partitions over the consumer group members. If this value is high, partitions will remain claimed longer after a function instance has gone down. If the value is too low however, temporary hiccups could cause unwanted rebalancing. The default value is 45 seconds, which I find too high for most use-cases. I typically settle at 10 seconds for Azure Functions, and this has so far worked out well. Restarts are fast, and we don’t experience a lot of unwanted rebalancing. The session.timeout.ms can be set by adding the following to your host.json:

"extensions": {
    "kafka": {
      "SessionTimeoutMs": 10000
    }
  }

I find the Kafka trigger a bit less convenient to work with than the Event Hub trigger, so we have decided not to switch for some of our less critical functions. For our most critical functions, though, this workaround is more than worth it.

Solution 2: Use Windows App Service Plans

Using Windows App Service Plans is at least a partial solution. As mentioned before, downtime should be minimal, at least for graceful shutdowns. We only found this out after we had already implemented the Kafka trigger workaround, but I do not regret choosing that solution. We are still not sure how Windows App Service Plans behave in the case of unexpected restarts due to platform maintenance, and I would rather not take the risk.

An aside on debugging Azure managed services

Managed services like Event Hubs and Azure Functions take operational load away from developers, allowing them to focus on delivering business value. This works really well, until the moment these products do not work the way they are supposed to. Their black-box nature means we have to rely on Microsoft support to investigate the issue, and this reliance can sometimes be frustrating. In the case of the two-minute downtime for the Event Hub trigger, we had a weeks-long mail thread that eventually ended in them refusing to acknowledge that there even was a problem. The downtime was deemed expected behavior: after all, the function restarted quickly, and Event Hubs does not have an SLA on event delivery latency.

Support is vital not only for customer satisfaction but also for identifying product issues. When support dismisses valid problems as ‘within SLA boundaries,’ both users and Microsoft lose visibility into issues that deserve attention.

I am not arguing that you shouldn't use managed services. I in fact believe that for a lot of use-cases, the increased productivity and reduced operational load are worth the vendor lock-in and reduced control. It is, however, good to know that Microsoft support sometimes disappoints, and that there are other resources available that can help when you encounter a problem.

Many Azure components have their own GitHub repos. These often have more in-depth documentation than Microsoft Learn, and of course you can submit issues. In the case of the Event Hub trigger bug, a number of components are involved, each with its own repo.

Often when I run into an issue, it turns out that there was some problem with my own code or configuration, based on a wrong assumption about how some Azure component works. Even in these situations, the GitHub repos can be really valuable. For instance, the samples on the GitHub repo for the Kafka extension helped me a lot when I was struggling with the Kafka trigger. And when you believe there really is a problem with a Microsoft product, you can create a GitHub issue. The response isn't always fast, but it is typically more helpful than Microsoft Support.

If you have any questions, or just want to let me know that this article helped you, feel free to leave a comment!

  1. Here is a good explanation of how partition assignment works in Kafka, and here is one for how it works in Event Hubs.
