Clojure at Scale: Why Python Just Wasn't Enough for AppsFlyer | OverOps Blog

By Alex Zhitnitsky

A first-hand experience and an introduction to Clojure at scale

Still considered a bit of an esoteric language, Clojure is one of the JVM languages that get us excited. There aren’t many stories around about why companies start using Clojure or how they use it to build systems at scale. We were lucky to hear an excellent example during a demo of the OverOps error viewer for Clojure at AppsFlyer’s office, where we learned about the architecture that powers their mobile app measurement and tracking platform.

In this post we’re sharing with you the experience of our new friends from AppsFlyer: Adi Shacham-Shavit, who manages the R&D department, and Ron Klein, a senior backend developer. First things first, a huge thanks to Ron and Adi, who treated us to a behind-the-scenes look at Clojure at AppsFlyer! If you have any questions for them or are interested in learning more, please feel free to use the comments section below.

Here’s their story:

Let’s get started with some numbers

  • 2 Billion events per day
  • Traffic doubled in the last 3 months
  • Hundreds of instances
  • The company grew from 6 to 50 people over the past year
  • 10 Clojure developers
  • Technologies – Redis, Kafka, Couchbase, CouchDB, Neo4j, ElasticSearch, RabbitMQ, Consul, Docker, Mesos, MongoDB, Riemann, Hadoop, Secor, Cascalog, AWS

The Pains of Scaling Up

At AppsFlyer we actually started our code base in Python. Two years later this wasn’t enough to handle the growing number of users and requests. We started to encounter issues like one of the critical Python processes taking too long to digest the incoming messages, caused mainly by string manipulations and Python’s own memory management system. Even partitioning the messages amongst several processes and servers could not overcome this. This eventually killed the process and caused data loss – the first ‘Python victim’ was the reporting service.

Taking the functional approach

As these kinds of difficulties accumulated, we had to choose between 2 options:

  1. Rewrite some of our services in C (great performance, but less fun to code) and wrap it with Python interop code (easy to do)
  2. Rewrite some of our services in a programming language more suitable for data processing

It is important to mention at this point that we took the asynchronous, event-driven approach to handling incoming messages, which allows the system to scale easily as traffic grows.

We’d been toying with the idea of introducing Functional Programming into the company for some time before the rogue reporting service started failing. It’s a good fit with our way of thinking and our architecture, so it was logical to make the change, especially since the reporting service failures pushed us to finally make the call. After deciding to go with it, we faced the second hurdle: which language should we choose?

Scala Vs. OCaml Vs. Haskell Vs. Clojure

Scala was out of the picture because it’s a hybrid of Object Oriented Programming & Functional Programming and leans more towards OOP. OCaml was discarded because of its relatively small community and its global runtime lock, which, like Python’s Global Interpreter Lock (GIL), allows only one thread to execute at a time, even on multicore machines (this was also a problem for us in Python). Monads in Haskell made us cringe in fear, so we were left with Clojure.

But elimination is not the only reason we chose this path. Clojure won for 2 major reasons. First, it runs on the JVM, and second, it’s a functional language with easy access to mutable state if you need it, even in a heavily concurrent environment.

Clojure is a dialect of the Lisp programming language by Rich Hickey. It’s a general-purpose programming language with an emphasis on functional programming. Like other Lisps, Clojure treats code as data and has a macro system. At its center are immutable values and explicit progression-of-time constructs that are intended to facilitate the development of more robust programs, particularly multithreaded ones.

Micro-services architecture

The server side of AppsFlyer’s system is designed to continuously receive messages (events), process them, store them, and sometimes invoke additional web requests to external endpoints based on them. This “stream” of events led us to make some architectural decisions that helped us scale as needed. One of the main decisions was to think of the system as a collection of services that communicate mainly through message queues (formerly via Redis’ pub/sub and currently via Kafka). This made our services independent and loosely coupled.

The flow of events

Let’s take a simplified example: the event of “Application Installed” is published to the entire system through a Kafka topic (queue) named “Installs”. Our Reports service listens to that topic so that it can store this piece of data for the relevant reports. In addition, our Postbacks service listens to that very same topic and decides, based on its own rules, whether or not to invoke a web request and to which endpoint.
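
The flow above can be sketched without a real broker. This is only an illustration, not AppsFlyer’s code: the in-memory subscribers registry, the subscribe! and publish! helpers, the event keys and the example URL are all hypothetical stand-ins for a Kafka topic and its client library.

```clojure
;; Hypothetical in-memory stand-in for a topic; a real system would use a Kafka client.
(def subscribers (atom {}))   ; topic name -> vector of handler functions

(defn subscribe! [topic handler]
  (swap! subscribers update topic (fnil conj []) handler))

(defn publish! [topic event]
  ;; Every handler registered on the topic sees every event.
  (doseq [handler (get @subscribers topic)]
    (handler event)))

(def report-rows (atom []))      ; what the "Reports" service has stored
(def postbacks-sent (atom []))   ; endpoints the "Postbacks" service would call

;; Reports service: stores every install.
(subscribe! "Installs" (fn [event] (swap! report-rows conj event)))

;; Postbacks service: applies its own rule (assumed here: only events that name an endpoint).
(subscribe! "Installs"
            (fn [event]
              (when-let [url (:postback-url event)]
                (swap! postbacks-sent conj url))))

(publish! "Installs" {:app-id "com.example.app" :postback-url "https://partner.example/cb"})
(publish! "Installs" {:app-id "com.example.other"})
```

Both services receive both events, but only one event leads to a postback. Neither service knows the other exists, which is what keeps them loosely coupled.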

Since the entire system is based on micro-services that consume messages from (and publish messages to) a common pipeline, it’s easy to rewrite them in any programming language, assuming that it has a decent client library to the common pipeline. Kafka is used as the main backbone, with RabbitMQ for the real-time channel.

Concurrency in Clojure

Clojure provides its own approach to concurrency and it might take some time to adjust to it. However, once the mindset is there, it’s much easier to achieve tasks in Clojure than when taking the “conventional” approach. In most cases, writing code that deals with concurrency in Clojure doesn’t include lock statements at all. This is a huge advantage: coding is more focused on the logic itself, rather than the plumbing around locks.

Clojure also has mechanisms that guard data from being corrupted. This, of course, comes with a trade-off: a value that thread A has already read is a snapshot, so there’s a very low probability that it does not contain all the changes made by thread B. Generally speaking, Clojure provides a nice mechanism of immutable data structures, ensuring data integrity while somewhat sacrificing consistency. Clojure has access to almost everything the JVM can provide, so you can still use traditional locks. However, if the system you build is based on statistics and you can tolerate minor data loss, as with the analytics system we have at AppsFlyer, then Clojure is way more than enough.
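
For the cases where a traditional lock is still wanted, here is a minimal sketch using Clojure’s locking macro around a mutable java.util.HashMap. The names lock, hits and record-hit! are made up for this illustration.

```clojure
;; Hypothetical names: a plain Java object used only as a monitor, plus a mutable Java map.
(def lock (Object.))
(def ^java.util.HashMap hits (java.util.HashMap.))

(defn record-hit! [k]
  ;; `locking` holds the object's monitor for the body, like a synchronized block in Java.
  (locking lock
    (.put hits k (inc (.getOrDefault hits k 0)))))

;; Four plain JVM threads, each recording 1000 hits.
(let [threads (doall (for [_ (range 4)]
                       (Thread. (fn [] (dotimes [_ 1000] (record-hit! "install"))))))]
  (doseq [^Thread t threads] (.start t))
  (doseq [^Thread t threads] (.join t)))

(.get hits "install") ;; 4000
```

Without the lock, the unsynchronized read-then-write on the HashMap could lose updates. With it, every increment is counted.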

A real life example

Say we have a service that holds its state in a key-value data structure, a map. The map is initially defined at the module level as empty (this example is simplified for clarity, so the code is not written to be fully reusable):

 (def my-map {}) ;; Don't panic, you'll get used to the parentheses... 

The statement above creates an empty map, accessible by the name my-map.

The first thing that strikes most newcomers to Clojure programming, after the parentheses-heavy syntax, is the freedom in naming variables. Clojure allows some interesting characters in variable names, such as “-”, “?”, “!” etc. Think about the simplicity behind a function named contains? used to check whether a collection contains a given key.
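
A small sketch of these naming conventions. The names installs-by-app and no-installs? are invented for this example, while contains? is part of clojure.core.

```clojure
;; Hypothetical data: "-" is an ordinary character in a name.
(def installs-by-app {"com.example.app" 3})

;; A trailing "?" conventionally marks a predicate (a function returning true or false).
(defn no-installs? [m]
  (empty? m))

(contains? installs-by-app "com.example.app") ;; true
(contains? installs-by-app "unknown")         ;; false
(no-installs? installs-by-app)                ;; false
```

Note that contains? looks for a key, so on a map it answers “is there an entry under this key?” rather than searching the values.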

The basic code to add a key “k” with a value “v” to a given map is:

 (assoc some-map "k" "v") 

This code does not update the original map. Clojure keeps its data structures as immutable as possible. Instead, the statement above returns a new map that holds the entries of the original plus the new key and the new value. Behind the scenes, Clojure doesn’t duplicate the entire map. Instead, the new map shares most of its internal structure with the original (a technique known as structural sharing), so only the parts that differ are created. Smart, eh?!
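
To see that the original map is left untouched, here is a short sketch. The initial contents of some-map and the name bigger-map are assumptions made for the example.

```clojure
;; Assumed starting point: a map with a single entry.
(def some-map {"a" 1})

;; assoc returns a new map; some-map itself is not modified.
(def bigger-map (assoc some-map "k" "v"))

some-map    ;; {"a" 1}
bigger-map  ;; {"a" 1, "k" "v"}
```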

Back to my-map. We’ll have to modify our statement so that it’s ready for concurrency:

 (def my-map (atom {})) 

That little atom is almost all we need to go the concurrent way. So now, when a running thread “updates” my-map (read: creates a new revision of it) so that it also contains the key “my-key” with the value 42, the code looks like this:

 (swap! my-map assoc "my-key" 42) 

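This statement atomically replaces the value held by my-map with a new version of it: swap! applies assoc to the current map and stores the result back in the atom.

So far, we have a thread “updating” my-map. Going back to the earlier plain-map example, reading a value from a map in Clojure looks like this: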
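
As a rough illustration of lock-free updates from several threads, the sketch below counts events in an atom. The names counts and record-install! and the app id are hypothetical. If two threads collide, swap! simply retries the update function against the newer value, which is why that function should be free of side effects.

```clojure
;; Hypothetical counter service: app id -> number of installs seen.
(def counts (atom {}))

(defn record-install! [app-id]
  ;; (fnil inc 0) treats a missing key as 0 before incrementing.
  (swap! counts update app-id (fnil inc 0)))

;; Eight concurrent workers, 1000 updates each, with no explicit locks.
(let [workers (doall (for [_ (range 8)]
                       (future (dotimes [_ 1000]
                                 (record-install! "com.example.app")))))]
  (doseq [w workers] (deref w)))   ; wait for all workers to finish

(get @counts "com.example.app") ;; 8000
```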

 (get some-map "k") 

Assuming some-map already contains the key “k” with the value “v”, the statement above returns “v”. When working with Clojure’s atom, the following code can be executed when a thread reads a value from my-map:

 (get @my-map "my-key") 

The only difference is that little “@” before my-map. What it says is something like, “Hey Clojure, give me the latest revision you have for my-map.” As stated above, the revision a thread gets might not contain all the changes that have been made to our map so far, but the returned value is always safe in terms of data integrity (i.e. not corrupted).
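
A minimal sketch of what “latest revision” means in practice. The name snapshot and the starting value are assumptions made for the example.

```clojure
(def my-map (atom {"my-key" 42}))

;; Dereferencing yields an immutable value: a snapshot of the atom at that moment.
(def snapshot @my-map)

;; A later update (possibly made by another thread) does not touch the snapshot.
(swap! my-map assoc "my-key" 43)

(get snapshot "my-key") ;; 42
(get @my-map "my-key")  ;; 43
```

The snapshot may be out of date, but it is never half-written, which is the integrity guarantee described above.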

Conclusion

Clojure has its own mindset: immutable objects, Lispy syntax, etc. Its major advantage is its approach to concurrency, which lets you focus on an application’s logic and reduces the overhead of locking mechanisms. This post covers just a tiny bit of Clojure’s way of handling concurrency. We experienced a significant performance boost when we moved AppsFlyer to Clojure. In addition, using functional programming allows us to have a really small code base, with only a few hundred lines of code for each service. Working in Clojure has dramatically shortened our development time and allows us to create a new service in days.