During the last few months, I’ve heard multiple people ask more or less the same question:
How do I run thousands of AWS Lambda functions in parallel?
Recently I found a similar question on Reddit, and instead of responding in the thread I decided to take some time and write this article, which describes how I did it on one of my past projects, with some examples and code.
Let’s solve the problem posed by the Reddit post I mentioned above. The poster was building an “uptime health check” service that makes a request to a number of websites. The service checks the response code and fires an alert in case of any issues, for example high latency or a 5xx response code. Let’s also say that our hypothetical service has 1,000 clients, and each of them has 10 checks configured to fire every minute.
What does that mean? Our “application” is expected to make 10,000 calls to external APIs, wait for the responses, and save the results (response code, maybe the response body). And it has to do this every minute.
If you don’t see a challenge here, let’s consider a simple solution. Going the “traditional” way, we would create some workers that make these calls and trigger them using a cron job or something similar. The problem is that each of these calls can take anywhere from a few milliseconds to many seconds. Because it won’t be easy to reuse one worker for multiple calls (we can’t be sure a call will finish within 60 seconds), we’ll need to design a system with 10,000 workers to handle our load concurrently.
Let’s take a look at my proposed solution. There is a “new and fancy” way to run many functions in parallel without having to own the infrastructure. It’s called “Serverless”. Let’s try AWS Lambda and see if it can handle 10,000 workers.
But how exactly are we going to start (or “invoke”, in AWS parlance) 10,000 Lambdas every minute? Well, let’s write some simple Python code to trigger all the functions that we need, and let’s pretend it’s a real app by passing them some “URLs” to process.
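For illustration, such an “invoker” might look something like the sketch below. This is an assumption of what the code could look like, not the article’s exact code: the worker name testLambdaWorker comes from the text, but the URL list and payload shape are placeholders, and boto3 is imported lazily so the payload helper can be exercised without AWS credentials.

```python
import json

def make_payload(url):
    # The payload each worker receives; just a URL in our pretend app.
    return json.dumps({"url": url})

def invoker_handler(event, context):
    # boto3 is imported here so the module loads without AWS access.
    import boto3
    client = boto3.client("lambda")
    urls = ["https://example.com/%d" % i for i in range(10000)]
    for url in urls:
        # InvocationType="Event" means asynchronous: we only wait for
        # AWS to confirm the job was queued, not for the worker to finish.
        client.invoke(
            FunctionName="testLambdaWorker",
            InvocationType="Event",
            Payload=make_payload(url),
        )
    return {"invoked": len(urls)}
```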
Also, let’s create a second function called testLambdaWorker — our worker function, which in a real application would be responsible for making a “health check” request. We’ll simulate a “real life” scenario and pretend that our worker is doing something for, let’s say, 0.5 seconds on average. Let’s assume that’s enough time to handle everything we have to do in the worker function, including “saving results” to the database.
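A stand-in worker can be as simple as the following sketch, where the sleep simulates the health-check request plus the database write. The handler name and the payload shape are assumptions for illustration.

```python
import random
import time

def worker_handler(event, context):
    url = event.get("url", "")
    # Pretend to do the health check and save results: ~0.5s on average.
    time.sleep(random.uniform(0.3, 0.7))
    # In a real app we would perform the HTTP request here and persist
    # the response code and latency to a database.
    return {"url": url, "status": 200}
```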
Since we’ll have to run our “invoker” every minute, over and over again, we set an execution limit of 60 seconds. Also, to make sure we’re running as fast as possible, let’s allocate the maximum available memory for the “invoker” — 3008 MB (the more memory you allocate, the more CPU you receive). Let’s run it!
Well, that’s not exactly what we expected. We were able to trigger only 1,049 “workers” in 60 seconds — not even close to the 10,000 functions we needed to schedule. To clarify: when we invoke a Lambda function with InvocationType=Event, we don’t actually wait for the function to finish; we only wait for AWS to confirm it scheduled the job.
Also, please note that even though we allocated 3008 MB of memory to our function, Lambda used only 63 MB, so we were not bound by CPU and/or memory. That said, if we had allocated only 128 MB, our function would have been even slower. Let’s think about another, more scalable solution.
On average, we were able to invoke 17.5 Lambda functions each second. We can try a “fan-out” pattern: instead of invoking all of them in a single loop, we create 100 batches of 100 URLs (10,000 in total, exactly what we need) and add an intermediary Lambda function that triggers individual workers for each URL.
It might be easier to understand this by looking at the visualization on the left. We have a “main” function that reads the configuration of all the tasks we have to execute. It divides them into batches of 100 URLs and invokes “triggerer” Lambda functions, passing one batch to each of them. In our case, it will invoke the “triggerer” 100 times, with 100 URLs each. The “triggerer” function receives a batch of URLs and invokes the “worker” function for each of them. Since each “triggerer” has to invoke only up to 100 “workers”, it should take 5–10 seconds, based on our experience with solution #1. Each worker is responsible for handling only one URL, so it won’t block other processes. Let’s put everything together and see how our solution performs now. Here is the final version, which is also available as a gist on GitHub:
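The fan-out described above can be sketched roughly like this. It is a simplified illustration, not the gist itself: the batch size of 100 and the worker name testLambdaWorker come from the text, while testLambdaTriggerer, the URL list, and the payload shapes are assumptions.

```python
import json

def chunk(items, size):
    # Split a list into consecutive batches of at most `size` items.
    return [items[i:i + size] for i in range(0, len(items), size)]

def main_handler(event, context):
    import boto3
    client = boto3.client("lambda")
    urls = ["https://example.com/%d" % i for i in range(10000)]
    # 100 batches of 100 URLs; one "triggerer" invocation per batch.
    for batch in chunk(urls, 100):
        client.invoke(
            FunctionName="testLambdaTriggerer",
            InvocationType="Event",
            Payload=json.dumps({"urls": batch}),
        )

def triggerer_handler(event, context):
    import boto3
    client = boto3.client("lambda")
    # Each triggerer fans out to one worker per URL in its batch.
    for url in event["urls"]:
        client.invoke(
            FunctionName="testLambdaWorker",
            InvocationType="Event",
            Payload=json.dumps({"url": url}),
        )
```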
We’re trying to optimize costs now, so the “triggerer” function is configured with only 256 MB of memory and a 60s limit, the same as the “main” function. Since the “worker” function will be making a network request and waiting for a response, memory and CPU won’t be the limiting factor there. Also, since we’ll have a lot of “workers” running every minute, let’s provision the minimum available memory, 128 MB, for this function. Now we can start the main Lambda function, go to CloudWatch, and see how many invocations our “worker” has. After a few seconds, we can see that the “worker” was invoked 10,000 times. Let’s “productionize” this example and configure it to run every minute. For this, we can create a CloudWatch rule that invokes our “main” function once a minute.
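The once-a-minute schedule can be set up in the console, or with a few boto3 calls along these lines. The rule name and function ARN here are placeholders; you would also need lambda add_permission so CloudWatch Events is allowed to invoke the function.

```python
def schedule_rule_kwargs(rule_name):
    # CloudWatch Events (EventBridge) rule that fires once a minute.
    return {
        "Name": rule_name,
        "ScheduleExpression": "rate(1 minute)",
        "State": "ENABLED",
    }

def schedule_main(rule_name, main_function_arn):
    import boto3
    events = boto3.client("events")
    events.put_rule(**schedule_rule_kwargs(rule_name))
    # Point the rule at our "main" function.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "main", "Arn": main_function_arn}],
    )
```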
After an hour, let’s check CloudWatch for our “worker” function again:
We can see that there were 50,000 invocations at every data point (the view is grouped into 5-minute buckets, so that’s 10,000 each minute). The average duration was ~513 ms, the maximum duration sometimes peaked at 1.8 s, and there were no errors. We can also see that we usually get 50–100 throttles per 5 minutes. Since we’re invoking “Event”-type Lambda functions, AWS’s default behavior is to retry a function when it gets throttled, and since we’re not seeing any DeadLetterErrors or other errors, every invocation eventually succeeds. Of course, for a production workload we should either start the “worker” functions more gradually (so not all “workers” are invoked at the same time) or request an increase of our Lambda concurrency limit, which is set to 1,000 by default for new accounts. If you provide a valid use case, AWS Support is usually happy to increase your limits.
And finally, let’s check how much this solution would cost us. We have to pay $0.000000208 per 100 ms of execution time at 128 MB for our workers. We’ll need 10,000 (invocations) * 60 (minutes) * 24 (hours) * 30 (days) = 432,000,000 invocations/month, and each invocation will last ~600 ms (six 100 ms units), so:
Duration: $0.000000208 * 432,000,000 * 6 = $539 / month
Requests: $0.0000002 * 432,000,000 = $86.40 / month
Our cost of running the workers will be about $625 / month on the AWS Lambda bill.
Also, we’ll have to take into account the cost of the “triggerers” (1 main function and 100 actual triggerers): 101 (invocations) * 60 (minutes) * 24 (hours) * 30 (days) = 4,363,200 invocations, each taking about 10 seconds (one hundred 100 ms units) to complete. We multiply the duration cost by 2 because we’re provisioning 256 MB of memory instead of 128 MB:
Duration: $0.000000208 * 4,363,200 * 100 * 2 = $181 / month
Requests: $0.0000002 * 4,363,200 = $0.87 / month
Our cost of the “triggerers” will be about $182 / month on the AWS Lambda bill.
Total price — a little bit over $800/month.
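As a sanity check, the arithmetic above can be reproduced in a few lines of Python, using the per-100 ms rate for 128 MB quoted earlier (256 MB costs twice that):

```python
DURATION_PRICE_128MB = 0.000000208  # $ per 100 ms at 128 MB
REQUEST_PRICE = 0.0000002           # $ per invocation

# Workers: 10,000 invocations/minute, ~600 ms (six 100 ms units) each.
worker_invocations = 10000 * 60 * 24 * 30        # 432,000,000 per month
worker_cost = (DURATION_PRICE_128MB * worker_invocations * 6
               + REQUEST_PRICE * worker_invocations)

# Triggerers: 101 invocations/minute, ~10 s each, at 256 MB (2x the rate).
triggerer_invocations = 101 * 60 * 24 * 30       # 4,363,200 per month
triggerer_cost = (DURATION_PRICE_128MB * triggerer_invocations * 100 * 2
                  + REQUEST_PRICE * triggerer_invocations)

total = worker_cost + triggerer_cost  # a little over $800/month
```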
Of course, this is only one of the possible solutions. You could probably achieve the same results using multiple threads on a single machine, but in my opinion that would make the solution a little more complex. It’s also possible that you’d be close to hitting the limits of a single EC2 instance anyway — network, CPU, memory, or some combination of them. And for $800/month you’d be able to afford only a couple of decently sized instances. You’d also have to make sure your instances stay healthy (an Auto Scaling group will help with that).
I’m not implying that you can’t solve this use case without AWS Lambda. I’m just trying to show that it’s really easy to implement a solution using Lambda, in a very scalable way. Using AWS Lambda you can have clean code executed on hundreds or even thousands of machines in parallel, and this solution can easily scale 10x with no changes to the code.