Since Lambda added SQS as an event source, there has been a misconception that SQS is now a “push-based” service. This seems true from the perspective of your function because you no longer have to poll SQS yourself. However, SQS itself hasn’t changed – it is still very much a “poll-based” service. The difference is that the Lambda service is managing the pollers (and paying for them!) on your behalf.
Although these managed pollers are invisible to us, they still introduce failure modes that affect our application. Indeed, others have written extensively about using SQS with Lambda – architecture patterns, use cases, and so on. I want to instead look at common failure modes we need to consider when using SQS with Lambda.
The most common mistake when using SQS is not configuring a dead letter queue (DLQ). It would expose you to the poison message problem where an invalid message is retrieved repeatedly and causes the SQS function to err. These messages are incredibly difficult to track down and debug and were the reason for introducing DLQs to SQS.
As a rule of thumb, you should always configure a DLQ for every SQS queue.
Misconfigured visibility timeout
Another common mistake is to not align the SQS visibility timeout (which can be overridden for each message) with the Lambda timeout. The SQS visibility timeout should always be greater than the Lambda function’s timeout.
Even if you expect a message should be processed within a few seconds, many AWS users would give SQS functions higher timeout “just in case”. This is understandable. You want to give the function the best chance to succeed, even if things occasionally take longer. Databases and other systems can experience performance problems under load. Even AWS SDK’s built-in retry and exponential backoff can be problematic at times too. The DynamoDB client for Node.js auto-retries up to 10 times with a max delay of 25.6s between retries (on the last retry)!
However, the default visibility timeout for an SQS queue is 30 seconds. If the SQS function has a higher timeout value then in-flight messages can be available again and processed more than once.
But it gets worse.
SQS functions receive messages in batches, and the messages are deleted from the queue only after the function completes successfully. It means that even if only one message was slow to process, the entire batch can elapse their visibility timeout and become available again in the queue.
Because messages are deleted only after the SQS function completes successfully, it means batches always succeed or fail together. This makes handling partial failures an important consideration when using SQS and Lambda.
Here are some solutions for handling partial failures:
Use a batchSize of 1. This eliminates the problem of partial failures altogether but limits the throughput of how quickly you are able to process messages. In low traffic scenarios, it can be a simple and yet effective solution.
Ensure idempotency. This allows messages to be safely processed more than once. But in processing a message, if you have to update multiple systems then it’s difficult to achieve idempotency. You might find yourself reaching out for solutions such as saga patterns, which adds much complexity to the system.
My personal preference is to:
- Catch errors when processing each message.
- At the end of the batch, check if there are any partial failures.
- If no then do nothing. The poller would delete the messages from the queue for me.
- If there are partial failures, then:
- call the DeleteMessage API on the successful messages to manually delete them from the queue.
- throw an aggregate error with information about the message IDs that failed and their corresponding errors. This leaves the erroneous messages on the queue, so they can be processed again after the visibility timeout.
Lambda auto-scales the number of pollers based on traffic. This is great until the SQS function uses up too much of the available concurrent executions in the region. When the regional concurrency limit is breached, any Lambda invocations can be throttled. This includes functions that are serving user-facing API requests or processing payments, which would negatively impact the user experience.
This is an undesirable scenario because SQS functions are usually less critical parts of a system. And yet, they are capable of taking down other more critical components.
Fortunately, there are mitigation strategies you can adopt.
Increase the concurrency limit to give you more headroom for spikes. This is by far the simplest solution, but it’s not bullet-proof. If a surge of 1000 msg/s won’t take you down, a surge of 2000 msg/s might. Still, your chances of survival have just gone up significantly with little effort.
Set reserved concurrency on the SQS function. Another simple solution, but exposes you to the SQS overpulling problem discussed below. That said, if there isn’t a huge difference between the reserved concurrency and the no. of pollers (you have no control of this) then this solution would likely have minimal impact. In the worst case, at least it limits the blast radius to just the SQS function.
Implement backpressure control in front of the SQS queue. If you control the publisher then rate limit how many messages you publish to the queue. Otherwise, consider putting API Gateway in front of the SQS queue. This lets you leverage API Gateway’s built-in rate-limiting capability. If you have multiple publishers, then also consider using API Keys and Usage Plans. They give you fine control over the rate limits and quotas for each publisher. The drawback is that you have to introduce another moving part, and pay for API Gateway usage, and potentially have to write VTL code! If you use the Serverless framework, the serverless-apigateway-service-proxy plugin makes this a snap.
To better understand this particular problem, you should read this and this article by Zac Charles. The heart of the problem is that Lambda scales the number of pollers irrespective of the reserved concurrency on the SQS function.
When there are more pollers than concurrent executions of the SQS function then invocation requests (from the poller to the SQS function) can be throttled. The messages in the throttled requests go back to the queue after the visibility timeout. This can happen to the same messages multiple times. If you have a DLQ (and you should) configured then in unlikely events messages can end up there without ever being processed.
So that’s it, a rundown of the common failure modes you should consider when using SQS with Lambda. This is not an exhaustive list, and if I have missed some common failure modes please let us know in the comments.
One thing I would like you to take away from this post is that, even when AWS is autoscaling our SQS functions and their pollers, we still have to understand the scalability needs of our own systems. And equally, we need to understand the failure scenarios and how we can best mitigate their impact on our users. Embrace your curiosity and ask “I wonder what would happen to my system if X happens”. The overhead of running these learning experiments in AWS is very low, and the reward is immense.