Imagine you have a function with its concurrency limit set to
3. You create a new queue and set it up as a trigger for your function with a batch size of
10 (the maximum number of messages Lambda should take off the queue and give to a single function execution). There are now
5 parallel connections long-polling the queue.
When a large flood of messages is sent to the queue, you can expect a batch of
10 messages to be picked up by each of the
5 polling connections (50 messages in total). Lambda will then try to invoke your function for each batch, but will only succeed for
3 out of 5. The other two invocations will be throttled, and once Lambda's retries within the visibility timeout are exhausted, their messages return to the queue with their receive counts incremented.
This process keeps repeating. The next time one of the previously throttled messages is picked up, it might be lucky and get processed. On the other hand, it could get throttled again. If you have a redrive policy that sends messages to a dead-letter queue, it’s possible some will end up there.
Since Lambda polls the queue for messages before it knows whether it will successfully invoke your function to process them, this can technically happen at any scale, with any batch size, and with any concurrency limit. It’s just much more significant at the low end of concurrency (around 1 to 30).
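To make the numbers above concrete, here's a minimal sketch of a single polling round. The function name and structure are mine for illustration; only the defaults (5 pollers, batch size 10, concurrency 3) come from the scenario described.

```python
def simulate_round(pollers=5, batch_size=10, concurrency=3):
    """One polling round: returns (processed, throttled) message counts."""
    batches = [batch_size] * pollers      # each poller takes a full batch off the queue
    invoked = batches[:concurrency]       # only `concurrency` invocations succeed
    throttled = batches[concurrency:]     # the rest are throttled
    return sum(invoked), sum(throttled)

processed, throttled = simulate_round()
# 3 of the 5 batches (30 messages) process; the other 20 go back on the queue
```

In each round, 20 of the 50 polled messages burn a receive attempt without being processed, which is how messages can creep toward the dead-letter queue.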
Unfortunately, there’s no silver bullet right now. However, the problem can be mitigated through the following actions recommended by AWS:
- Set the queue’s visibility timeout to at least 6 times the timeout that you configure on your function.
The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.
- Set the maxReceiveCount on the queue’s redrive policy to at least 5.
This will help avoid sending messages to the dead-letter queue due to throttling.
- Configure the dead-letter queue to retain failed messages long enough that you can move them back later to be reprocessed.
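The first two recommendations above can be captured in a small helper. The function name and return shape are my own; the multipliers (6x the function timeout, maxReceiveCount of 5) are the values AWS recommends in the list above.

```python
def recommended_queue_settings(function_timeout_seconds):
    """Sketch of AWS's recommended SQS settings for a throttled Lambda consumer."""
    return {
        "VisibilityTimeout": 6 * function_timeout_seconds,  # at least 6x the function timeout
        "MaxReceiveCount": 5,  # redrive to the DLQ only after 5 receive attempts
    }

settings = recommended_queue_settings(30)
# For a 30-second function: {'VisibilityTimeout': 180, 'MaxReceiveCount': 5}
```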
A few months ago, the guidance was 3 times the function timeout and a maxReceiveCount of at least 3. That the advice keeps changing is evidence these are mitigations, not a fix.
It has worked for us, though.
Assuming your goal is rate limiting, there are a few other serverless options. Mostly they involve using (and perhaps misusing) other AWS services.
For example, if you put your messages into a Kinesis Data Stream and configure the stream as the Lambda trigger, you could use the number of shards and the batch size to control concurrency and the message processing rate.
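As a back-of-the-envelope sketch of that idea: Lambda runs at most one invocation per shard at a time, so the shard count effectively caps concurrency, and together with batch size and processing time it bounds your throughput. The function and the example numbers below are illustrative assumptions, not measured figures.

```python
def max_messages_per_second(shards, batch_size, avg_batch_duration_s):
    """Rough throughput ceiling for a Kinesis-triggered Lambda.

    Concurrency equals the shard count; each invocation handles one batch.
    """
    return shards * batch_size / avg_batch_duration_s

rate = max_messages_per_second(shards=3, batch_size=10, avg_batch_duration_s=2.0)
# 3 shards * 10 messages per batch / 2s per batch = 15 messages/second
```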
Another service that offers rate limiting is API Gateway. You could have your messages come in through a Lambda-backed API and turn on rate limiting. SNS could send messages to the API, but security would be a pain.
You could pretend the SQS trigger doesn’t exist and use a recursive function to process SQS messages. It’s got far more moving parts and costs more, but at least you can control concurrency properly.
Lastly, you could try to remove the reason you need to set the function concurrency so low in the first place. Can you scale up that weak downstream API or service? Or provision your DynamoDB table with more capacity, or switch it to on-demand mode?
This is a challenging issue that is probably best solved by AWS. Other than avoiding the requirement entirely, I don’t really recommend the above solutions (except perhaps Kinesis).
Do you know of any other serverless solutions? Let me know!