Blazing Fast Microservice with Go and Lambda

By Remy Chantenay

As you may know, Amazon recently announced support for Go 1.x on AWS Lambda. To be fair, I was slightly surprised it took that long, but better late than never.

We released an article early this year about how we, at Travelex, pivoted from containers to Lambda functions for some of our services. Since then, we’ve learned some valuable information about how Lambda works under the hood, particularly regarding concurrency mechanisms.

This article is a follow-up.

Despite what most people think, using a JVM-based language (at least Java and Kotlin) on Lambda is fine. Sure, performance-wise, it’s not optimal (watch out for those cold starts), but it has other advantages.

That being said, I was curious to see how we could improve performance and cost by using more of a “lightweight” language like Go.

I decided to use the knowledge we gathered on our serverless journey as well as Go’s recent support to rewrite one of our Lambda functions from Kotlin to Go and eventually benchmark against the original.

Go, or “golang”, is a compiled language, just like C, C++ or Haskell.

Google originally developed Go to improve their own infrastructure and processes, helping them solve two major issues:

  • Incredibly long compile time
  • The need for a highly concurrent and performant language

Go does just that — it’s simple, clean, brings nothing particularly new or fancy, but it compiles really fast and provides a built-in concurrency mechanism thanks to goroutines and channels.
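To give an idea of what goroutines and channels look like in practice, here is a minimal sketch. The names worker and sumOfSquares are made up for this illustration and have nothing to do with our service: a few goroutines read numbers from one channel, square them, and push the results to another.

```go
package main

import (
	"fmt"
	"sync"
)

// worker reads numbers from in and sends their squares to out.
// (Hypothetical helper, for illustration only.)
func worker(in <-chan int, out chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	for n := range in {
		out <- n * n
	}
}

// sumOfSquares fans the inputs out to a few goroutines and sums the results.
func sumOfSquares(nums []int, workers int) int {
	in := make(chan int)
	out := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go worker(in, out, &wg)
	}

	// Feed the inputs, then close the channel so the workers stop.
	go func() {
		for _, n := range nums {
			in <- n
		}
		close(in)
	}()

	// Close out once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	total := 0
	for sq := range out {
		total += sq
	}
	return total
}

func main() {
	fmt.Println(sumOfSquares([]int{1, 2, 3, 4}, 2)) // prints 30
}
```

Starting a goroutine is just the go keyword in front of a function call, and the channels take care of the synchronisation.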

Simply put, it’s a performance-focused language.

For this experiment, I chose to rewrite our tourist service, which is pretty much a CRUD (Create, Retrieve, Update and Delete) service. The service communicates with a PostgreSQL database (on RDS) and has its own schema.

The Lambda function sits within a VPC and is accessible via an API Gateway.

If you came across my previous articles, by now you’ll know that I’m a big advocate of Uncle Bob’s Clean Architecture, because of the clean segregation of the different layers.

We implemented this architecture for our services, and the tourist service is no exception.

Clean Architecture with different data sources.

Before getting down to the nitty-gritty, let’s define the concept of performance in Lambda functions.

When it comes to performance, execution time is likely to be the first metric that comes to mind.

User experience aside, optimising execution time is essential because of Lambda’s pay-per-use pricing model.

On Lambda, it’s expressed in milliseconds and billed per 100ms.

There are two different considerations with Lambda — memory allocation and memory footprint.

The former relies heavily on the use case: whether you’re running some Map/Reduce jobs (big data) or a simple CRUD service, your requirements won’t be the same.

It also depends on which language you use — for a given job, Go and Kotlin won’t require the same amount of memory.

In theory, you will have to pay approximately 24 times more per 100ms if you are allocating 3GB (max) of memory instead of 128MB (min).

Here’s the catch: the more memory you allocate to your function, the stronger the vCPU will be, and the more you will pay per 100ms. The flip side is that your function will also run faster, hence potentially costing less than expected.

Slide from Amazon comparing memory allocation settings for the same computation.

As you can see on the slide above, higher memory allocation doesn’t automatically mean a higher cost.
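The trade-off can be sketched with a bit of arithmetic. The snippet below assumes a simplified cost model: allocated memory multiplied by the billed duration, with the duration rounded up to the next 100ms. The function names and the two durations are made up for illustration; they are not measurements from our benchmark, and the result is a unitless figure, not a price.

```go
package main

import "fmt"

// billedMs rounds a duration up to the next 100ms increment.
func billedMs(durationMs int) int {
	const increment = 100
	return (durationMs + increment - 1) / increment * increment
}

// relativeCost returns a unitless figure: allocated memory (MB) times
// billed duration (ms). Simplified model, assumed for this sketch.
func relativeCost(memoryMB, durationMs int) int {
	return memoryMB * billedMs(durationMs)
}

func main() {
	// Hypothetical durations for the same job at two memory settings.
	fmt.Println(billedMs(78))             // prints 100
	fmt.Println(relativeCost(128, 2400))  // prints 307200
	fmt.Println(relativeCost(1024, 190))  // prints 204800
}
```

In this made-up case, the 1024MB setting is eight times more expensive per 100ms but finishes more than eight times sooner, so it ends up cheaper overall.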

Finding the optimal setting for a given job is almost an art.

Thanks to Alex Casalboni, there is a solution to this problem:

The memory footprint is the amount of memory that your function will actually use (regardless of the defined allocation). This is really important, particularly if your function sits in a VPC, which is most likely the case.

How you use the language is as important as what language you’re using.

Much like C and C++, Go allows you to manipulate pointers instead of values. A pointer is a reference to the memory address of a specific value.

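When calling a function, you can pass an argument either by value or by reference (strictly speaking, Go always passes by value; passing a pointer copies the address rather than the struct it points to).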

By value:

func ChangeName(t Tourist) {
    t.Name = "Jane Doe" // only changes the local copy
}

tourist := Tourist{Name: "John Doe"}
fmt.Println("Name is", tourist.Name) // prints "Name is John Doe"
ChangeName(tourist)                  // passes a copy of tourist
fmt.Println("Name is", tourist.Name) // still prints "Name is John Doe"

When we pass the value of tourist to the function, Go basically copies the value somewhere else in memory, so the name will only be changed within the ChangeName function.

By reference:

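func ChangeName(t *Tourist) {
    t.Name = "Jane Doe" // changes the value the pointer refers to
}

tourist := Tourist{Name: "John Doe"}
fmt.Println("Name is", tourist.Name) // prints "Name is John Doe"
ChangeName(&tourist)                 // passes the address of tourist
fmt.Println("Name is", tourist.Name) // now prints "Name is Jane Doe"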

In the snippet above, the value is not copied. Whatever is changed in the ChangeName function will have an impact on the original value.

If you are used to C or C++, this is not new to you.

What does that mean for us?

Well, as you can imagine, copying values all over the place can end up being pretty expensive in terms of memory consumption.

However, it’s not always worth passing the reference as there is a bit more complexity to it — Go’s compiler doesn’t exactly work like C’s.

I try to keep my articles as simple as possible, but if you like technical details, I highly recommend this article about escape analysis with the Go compiler.

There are a couple of rules for this:

  • If the variable shouldn’t be modified: pass the value
  • If the variable is a large struct (e.g. API Gateway request): pass the pointer
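Here is a small, self-contained sketch of those two rules. The Request type is a made-up stand-in for a large payload (it is not the real API Gateway event type), and Greeting and Rename are hypothetical helpers.

```go
package main

import "fmt"

// Tourist is a small struct; copying it is cheap.
type Tourist struct {
	Name string
}

// Request is a hypothetical stand-in for a large payload,
// such as an API Gateway request.
type Request struct {
	Body    [64 * 1024]byte
	Tourist Tourist
}

// Greeting only reads the tourist, so it takes a copy.
func Greeting(t Tourist) string {
	return "Hello, " + t.Name
}

// Rename takes a pointer: the 64KB Body array is not copied,
// and the change is visible to the caller.
func Rename(r *Request, name string) {
	r.Tourist.Name = name
}

func main() {
	req := &Request{Tourist: Tourist{Name: "John Doe"}}
	fmt.Println(Greeting(req.Tourist)) // prints "Hello, John Doe"
	Rename(req, "Jane Doe")
	fmt.Println(Greeting(req.Tourist)) // prints "Hello, Jane Doe"
}
```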

Now, for the interesting part.

The tests were simple — adding a tourist in the database with both functions in the following situations:

With the tests above, we are covering both cold and warm invocations, with and without concurrency.

I’ve used hey for the load testing.

As I already mentioned, the database used is PostgreSQL on Amazon RDS.

A new database has been set up specially for the benchmark and is shared across both functions.

The chosen instance is db.t2.micro (1 vCPU and 1 GB of memory).

No surprise here, while our Kotlin function shows decent results once warm, it’s clearly not the case when cold.

On the other hand, the Go function ends up being blazing fast and extremely cheap when warmed.

The difference in terms of memory consumption is also significant — Kotlin consumes 3 times more memory. That’s no surprise.

However, take the results above with a pinch of salt. Not every application is highly concurrent, and Kotlin fares better without concurrency:

Kotlin is definitely doing a better job, whether cold or warm. Meanwhile, the Go function is still significantly faster, with an average of 0.078ms, yes milliseconds! Unfortunately, we are still paying for 100ms…

When benchmarking response times, percentile values are a way better indicator than average, fastest or slowest responses.

As displayed above, the Go function shows far better results in every circumstance.
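To illustrate why percentiles tell you more than the average, here is a minimal sketch using the nearest-rank method. The sample latencies are made up (nine warm invocations and one cold start); they are not taken from the benchmark.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// mean returns the arithmetic average of the samples.
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// percentile returns the p-th percentile (0 < p <= 100)
// using the nearest-rank method.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...) // leave the input untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical latencies in ms: nine warm invocations, one cold start.
	latencies := []float64{10, 10, 10, 10, 10, 10, 10, 10, 10, 1010}

	fmt.Printf("mean=%.0f p50=%.0f p99=%.0f\n",
		mean(latencies), percentile(latencies, 50), percentile(latencies, 99))
	// prints "mean=110 p50=10 p99=1010"
}
```

The average of 110ms describes none of the requests: most took 10ms and one took over a second. The percentiles show both facts.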

In the end, this is an unfair comparison — JVM based languages have always been “heavier” and that’s absolutely normal.

Due to the ephemeral lifecycle of Lambda, the JVM runtime needs to be initialised every time the function is cold, which is not ideal.

I am also aware that the set of tests might seem “light.” In my opinion, it’s enough to get a pretty good idea of the differences and allows us to measure the impact on response time and cost.

Because of VPC throttling and how the Elastic Network Interfaces (ENIs) are assigned to Lambda workers, another major factor to take into account is the subnet’s capacity.

The number of network interfaces attributed to a given function relies on two factors — concurrent executions and memory allocated to the function.

Luckily, there is a simple formula to approximately determine how many ENIs a function will need.

For instance, if we assume that there will never be more than 100 concurrent executions and the allocated memory is 128MB (3008MB being the maximum allocated per function):

100 * (128 / 3008) ≈ 4.26

In this situation, only 5 ENIs will be assigned to my function. Now, what if, for the same number of executions, we’re running a memory hog (let’s be crazy, a Hadoop node in Java for instance)?

100 * (1024 / 3008) = 34.04

This is why trying to lower the memory consumption of your function is crucial.
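The two calculations above can be wrapped in a small helper. This is only a sketch of the approximation used in this article (concurrent executions multiplied by allocated memory over the 3008MB maximum, rounded up); the function name estimateENIs is my own.

```go
package main

import (
	"fmt"
	"math"
)

// maxMemoryMB is the maximum memory that can be allocated per function.
const maxMemoryMB = 3008

// estimateENIs approximates the number of ENIs a function needs:
// concurrent executions * (allocated memory / maximum memory), rounded up.
func estimateENIs(concurrentExecutions, memoryMB int) int {
	ratio := float64(memoryMB) / maxMemoryMB
	return int(math.Ceil(float64(concurrentExecutions) * ratio))
}

func main() {
	fmt.Println(estimateENIs(100, 128))  // prints 5
	fmt.Println(estimateENIs(100, 1024)) // prints 35
}
```

For the same 100 concurrent executions, going from 128MB to 1024MB takes the estimate from 5 to 35 ENIs.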

Creating and attributing ENIs won’t just add seconds to the response time. More importantly, if the VPC doesn’t have enough capacity and thus can’t scale up, it will cause failures (HTTP 500 errors).

You also need to ensure that the subnets within your VPC have sufficient IP addresses for those ENIs. Setting up dedicated subnets is a good idea.

It’s also worth noting that the ENI limit per region is 350.

Update: during re:Invent 2018, AWS announced that it is working on a solution to improve this by uncoupling ENIs from the Lambda workers, allowing faster scaling and shorter cold starts.

This should also remove the risk of running out of IP addresses, as ENIs will be shared across workers.

We can only imagine that cold starts won’t be that much of an issue in the near future. This specific enhancement is planned for 2019.

Before diving into the benchmark breakdown, here are some findings that are worth mentioning.

As expected, using a traditional database can be an issue with Lambda functions.

There are two options to handle database connections — the first is to open and close a connection for each and every invocation. The second option is to share and reuse the same connection across multiple invocations in the same container.

The former is not very efficient, to say the least — a 3-way TCP handshake + TLS handshake for each invocation would be absurdly expensive…

While the second option allows us to avoid this overhead, one could argue that this isn’t a perfect solution either. Because we have no feedback whatsoever on the container lifecycle, we won’t be able to close the connections in a clean manner, which will effectively lead to zombie connections.

Unfortunately, there is no simple solution for this. However, if you still want to use a relational database in a highly concurrent environment, there are a couple of things to help you mitigate the issue.

Make sure your database connection is in the global state of the function and not inside the handler.
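Here is a rough sketch of that pattern. To keep it runnable without a database driver, the conn type is a made-up stand-in for *sql.DB, and the loop in main simulates successive invocations of a warm container. In a real function, getDB would call sql.Open and the handler would be handed to lambda.Start from the aws-lambda-go library.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for *sql.DB so the sketch
// runs without a database driver.
type conn struct{ id int }

var (
	db     *conn     // package-level: survives between warm invocations
	dbOnce sync.Once // guards the one-time initialisation
	opened int       // counts how many connections were opened
)

// getDB opens the connection on first use and reuses it afterwards.
func getDB() *conn {
	dbOnce.Do(func() {
		opened++
		db = &conn{id: opened}
	})
	return db
}

// handler plays the role of the Lambda handler. It never opens
// a connection itself; it only asks for the shared one.
func handler(name string) string {
	c := getDB()
	return fmt.Sprintf("stored %s using connection #%d", name, c.id)
}

func main() {
	// Simulate three invocations hitting the same warm container.
	for _, name := range []string{"John Doe", "Jane Doe", "Joe Bloggs"} {
		fmt.Println(handler(name))
	}
	fmt.Println("connections opened:", opened) // prints "connections opened: 1"
}
```

Had the connection been created inside handler, the counter would read 3 instead of 1.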

By running 100 concurrent requests, you will end up opening ~100 database connections.

Since the maximum number of connections varies from one RDS instance class to another, depending on your needs you may want to increase this value before you reach that limit and start experiencing failures.

I would strongly recommend setting one or more CloudWatch alarms that trigger before the limit is reached, so you quickly become aware of significant spikes.

If you have many functions sharing the same database, consider having separate databases per service. This allows you to spread the load — imagine half a dozen services with the same concurrency load hitting the same database… that would be a disaster.

If you have the opportunity to choose your database you may want to consider Aurora. Aurora has been created specifically with high concurrency in mind — the throughput increases with the number of connections.

Lambda allows you to configure a concurrency limit per function. This comes in handy if your VPC subnet’s capacity isn’t big enough to keep up with the account-level limit (default). Additionally, it can help avoid smashing your database.

Writing a microservice in Go has been an absolute pleasure. The language has matured a lot since the last time I played with it (3 years ago).

Now, should we re-write some of our most-used services?

That’s worth considering, but as I mentioned above, using a JVM language offers other advantages such as a deep level of traceability and debugging.

Regardless of the language used, if for some reason we failed to meet the level of quality expected by our customers (e.g. requests too slow) and the language is the cause of the issue, we shouldn’t be resistant to switching to a language that does the job better.

Thanks to Heitor Lessa (Serverless Specialist Solutions Architect) for sharing some really cool information on Lambda, VPCs and throttling.