The following is a guest post by Janusz Jezowicz, CEO of Speedchecker. The Speedchecker team runs a global distributed measurement network and offer speed test solutions using the Cloudflare platform.
Software companies contemplating offering a public API to 3rd party developers have many options to choose from for how to offer their API securely with high reliability and with fast performance. When it comes to cost though, commercial solutions are expensive and open-source solutions require a lot of time managing servers and the synchronization between them. This blog post describes how we successfully moved our API gateway to Cloudflare Workers and slashed our costs by a factor of 10.
Our original solution based on the Kong open-source API gateway
When we built our measurement network API for cost reasons we opted for open-source solution Kong. Kong is a great solution which has a vibrant community of users and plug-in developers who extend and maintain the platform. Kong is a good alternative to commercial solutions from companies such as Apigee or Mulesoft whose solutions are really catering for larger businesses who can afford them. Kong is free and it works. On the other hand, if your business has complex needs for API management - e.g. powerful analytics, access control, user-friendly administration then you will need plug-ins. Kong Plug-ins are often not free and you end up with costs approaching commercial solutions.
At Speedchecker we already have a lot of metrics and access control logic within the API itself so the basic functionality of Kong suited us. Here you can see a simplified architecture diagram of our API gateway:
Two Kong instances are, of course, a must if we want to provide a reliable service to our customers. Kong offers flexibility in what database it uses for its API management engine. We have been experimenting with MySQL and moved to PostgreSQL for its better support of replication using Bucardo.
We have been running this solution for over a year and during production we have learned the hard way following drawbacks in our architecture:
- While our Azure Cloud Service is scalable, Kong instance is not. With increased loads we were worried that the instance might fail if we don’t anticipate the traffic increase and do not scale the VM correctly
- Replication setup was quite complex and we had incidents where it failed and we spent days trying to work out why and how to repair it. During those times we were exposed to one live instance and, if it went down, our API would not work
- We had at least one incident where a rogue actor launched a DDOS on our customer-facing website (not API). If the attacker launched on our Kong instance endpoints we would not be able to protect our API
- API management got more complex and not easier which is not how should be once you integrate the API gateway. Our API works using apikey authentication and using one apikey user can access all of our APIs. The quotas per user are not based on the number of API calls but are based on the number of measurement results that we execute on the user’s behalf. Each API call can have a different number of measurement results and, therefore, the complex quota logic and billing calculations need to be done on Azure API and not in Kong. All of this means that the central repository for user apikeys and their quotas lies in the Azure API and we need to make sure synchronization happens between Azure API and Kong. For those reasons, many of the plug-ins make less sense for us to use since we have done all the work on the Azure API side.
- While we saved money on a commercial licence for API gateway, we were spending more man hours on server administration and the monitoring of our system
New API solution using Cloudflare Workers
After Cloudflare announced the Workers feature we have been following it closely and started experimenting with its functionality.
A few of the things that originally stood out for us were:
- We have been using the Cloudflare platform for other parts of our infrastructure for years: we like their platform for its features, performance, cost-effectiveness and reliability. It's always better to use your existing vendors than start exploring new ones we have no experience with.
- Attractive per request pricing. With our 30 million API requests per month, Workers would cost us $25. To compare Azure API management would cost us $300 for 2 Basic instances and Apigee would cost $2500 for the Business plan.
- Powerful DDOS protection. Cloudflare has one of the best DDOS protections available for small-businesses included in the price.
- No need for separate DNS failover and health monitoring
- Extensible platform which we can leverage for any custom logic in the future should our requirements change
On the downside we knew Cloudflare Workers were still in Beta and we would need to spend some time coding the logic instead of using an out of the box solution. After brainstorming with developers, we realized that, for our situation, Cloudflare workers are a good choice. Since most of our API management logic is already in Azure we really need a simple and cost-effective way to protect our origin API. Also, we need to make sure that the new solution is 100% compatible with our Kong solution. I believe this situation is common for all API providers when they are contemplating changing the API gateway infrastructure. You never want to get into a situation where during migration or after you realize some API users cannot access the API and they need to update their own code to work with your new API gateway For that reason, it was important for us that no endpoint changes and no authentication changes are necessary and that our new solution will work seamlessly with just a DNS change.
After one week of development we were ready with our first proof of concept to prepare for migration. The architecture of our new solution looks like the attached diagram.
Typical API call is handled in following way:
- User uses apikey in HTTP headers or querystring of their HTTP request (GET or POST) to query the endpoint hosted on CloudFlare workers
- Worker examines apikey and looks it up in local cache. There are a few different mechanisms in Workers for storing data tables. For our purposes we picked cache in global memory. Since the data table contains only a list of apikeys it’s not very big and the restriction of 128MB does not cause issues. Also, each Cloudflare POP has a different cache which can be problematic for some use cases. In our case though it isn’t - if apikey is not in cache, it can be quickly retrieved from our Azure API.
- If apikey is not found in the local cache, Worker does HTTP subrequest to Azure API and retrieves information about whether apikey is valid. The response is then stored locally in the Global memory cache so that subsequent apikey requests are saving the round-trip to Azure API and do not overload our Origin.
- If apikey is invalid, Worker responds to the API user message about an invalid apikey and Origin is not being hit with a request.
- If apikey is valid, Worker forwards API user request to our Origin and responds back to the API user when it gets a response. In this step we also include any custom redirection logic as some API calls have different Origin endpoints. Using Workers we can easily specify custom logic in which API calls use different endpoints.
As described above, we wanted the change of the architecture to be done seamlessly without requiring users to update any of their API code. For that reason, we devised the following approach to migrate to Cloudflare workers.
- Enable Cloudflare Workers on a staging domain. Perform all tests to production API endpoint and the same tests to staging domain with Workers enabled. The API endpoints should behave in the same way.
- Enable Workers on the production domain with Origin IP still pointing to Kong live instance. Using Workers Routes settings we make sure the Workers code is not being executed on any of the live API endpoints.
- Using Worker Routes we bring online instantly method by method the API calls to the Workers. In case of any problems we can quickly revert by modifying the Routes.
- We monitor Worker Analytics screens, number of API calls and status codes of sub requests to ensure no calls are failing.
During our live migration process everything went smoothly until we started seeing some errors with some of our customers. We had not realized that the Cloudflare firewall has some rate limiting in place and was preventing our API users who queried more than 2000 requests per minute from the same IP. After raising the ticket with Cloudflare Support we got the limit lifted and errors stopped happening.
We believe Cloudflare Workers are a good alternative to existing API gateway solutions. For companies who already have an existing codebase for authentication and analytics there is not a compelling reason to use commercial API gateway package when Cloudflare Workers can add a protection layer to your API at the fraction of the cost of alternatives. While Workers are a relatively new product from Cloudflare, we already feel comfortable using it in production. We encourage you to explore Workers for your new projects. Also, if you are wanting to save costs or to make your self-hosted solution more robust, Workers are a good alternative which can be deployed with no impact on your API users or business.