In-process caching in Go: scaling lakeFS to 100k requests/second

By Barak Amar

This is a first in a series of posts describing our journey of scaling lakeFS. In this post we describe how adding an in-process cache to our Go server sped up our authorization flow.

Background

lakeFS is an open-source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, atomic, and versioned data lake operations – from complex ETL jobs to data science and analytics.

The Pain

The main flow of requests in lakeFS consists of 3 layers:

  1. Authentication and authorization
  2. Business logic
  3. Optionally, interacting with an underlying object store such as S3 or Google Cloud Storage.

Our requirement was to make sure we can scale our service to handle 100,000 requests per second. This means all 3 layers need to be able to support that. We chose this number based on input from our largest design partners.

Initially we could only get a few thousand requests per second on a decently sized installation. To see where our bottleneck was, we isolated and tested each layer independently.

We started by creating an empty http.Handler, consisting of exactly 0 layers.

Running the service locally, we naturally got throughput way above our requirements.

As we enabled the first layer – authentication, and authorization – we found our first bottleneck: on every request we load the requesting user’s policies and verify the operation is allowed. This incurs a roundtrip to a PostgreSQL database.

The query itself was simple, and rather optimal – a single index lookup. Connection pooling was enabled and we saw no contention on the pool itself, meaning that no requests were waiting in the queue to get a free connection. It seems like we are saturating the throughput given by a single PostgreSQL server.

We mapped the following options:

  1. Scale PostgreSQL vertically – choose a bigger instance size
  2. Add a DB replica to increase available read throughput
  3. Choose a different DB that can scale reads horizontally
  4. Add a caching layer, external to the database

Choosing a solution

  1. Vertical scaling – could easily get expensive, as this is the most expensive part of the lakeFS stack. This is also realistically capped by the maximum available throughput given by managed services like RDS, which many of our users use. 
  2. DB Replication – allowing the database to scale horizontally as load increases would be beneficial, but the operational cost will go up as well. Managing multiple instances of a stateless service is easier (and again, cheaper) than doing the same to a stateful one.
  3. Different DB – same as replication, the operational cost is higher since we now need to maintain another DB. This also increases the complexity of both the code base and the runtime environment.
  4. Cache – a middle ground between complexity and performance, as it adds something to both. If we can cache in-process (as opposed to an external caching server) the complexity overhead happens during development and maintenance of the codebase – not during runtime by the operator, which we prefer (being an open source project, lakeFS is installed and maintained by the user).

Evaluating a cache solution

Listing the pros and cons we decided on using an in-process cache.

Pros

  1. Performance – access to the main memory vs network is orders of magnitude faster, depending on your network.
  2. Access pattern – authorization data tends to be relatively small with a high hit rate. User permissions are read on every request for authorization but almost never change. This makes it a great candidate for caching.
  3. Maintainability – as an open-source project, ease of use for ops is key. With in-process caching we do not require additional service(s) that need to be run or maintained. Basic configuration will probably be good for 90% of the use-cases.

Cons

  1. Code complexity – extra code layer used as part of the authorization check.
  2. Cache invalidation – can be seen as the minimum time to apply changes. This makes our authentication system eventually consistent (which we are fine with). Using a centralized cache service can help alleviate this to a degree, but again – at a high operational cost.
  3. Cache efficiency effort – caching requires thought and tuning: choosing the right eviction algorithm, size and implementation details. These are necessary steps if we want to hit our throughput goals.

Caching in lakeFS

Thread safety and LRU eviction are two basic things we wanted from a cache system. As we require an open-source solution – golang-lru  looked like a good fit. It’s an open source cache library which implements a fixed-size thread-safe LRU cache with an expiry feature, based on HashiCorp’s golang-lru.

We mapped two concerns we wanted to address, building on top of golang-lru:

1. Concurrent misses (i.e. “Thundering Herd”)

If we miss the cache we must hit the database

Since we know many concurrent requests will likely be made by the same user, this means a cache miss will trigger many concurrent DB round-trips, adversely affecting our overall tail latency.

We wanted to reduce the load on the database in this case, so we implemented a locking mechanism where for a given key, only one Goroutine will acquire the data from the database while the rest will simply block and wait for that request to return and populate the cache.

Example:

var ( lruCache lru.Cache acquireLocker ChanLocker ErrCacheItemNotFound = errors.New("cache item not found")
) func GetUser(id string) (interface{}, error) { if v, ok := lruCache.Get(id); ok { return v, nil } acquired := acquireLocker.Lock(id, func() { v, err := expensiveDBLookup(id) if err != nil { return } lruCache.Add(id, v) }) if acquired { return v, err } if v, ok := lruCache.Get(id); ok { return v, nil } return nil, ErrCacheItemNotFound
}

The ChanLocker is the interesting part. Using a sync.Map, we make sure that the first goroutine to ask for a key will add a channel that all the other goroutines will wait on:

type ChanLocker struct { sync.Map
} func(w *ChanLocker) Lock(key interface{}, acquireFn func()) bool { waitCh := make(chan struct{}) actual, locked := w.LoadOrStore(key, waitCh) if !locked { acquireFn() w.Delete(key) close(waitCh) return true } <-actual.(chan struct{}) return false
}

Risks we mapped and considered as a non-issue for our usage:

  1. When DB lookup fails or returns a Not Found Error, we do not set any state into the cache, which will make the rest of the routines report that the item was not found.
  2. Cache thrashing: If the LRU cache is too small, there’s a race condition here: it is possible to successfully add a key to the cache that will be immediately evicted, causing the subsequent lookup to fail.

2. Eviction time

Running in a shared-nothing architecture, lakeFS instances cache independently without communicating. If a user suddenly generates high load on the system, we don’t want a fixed expiry duration since all servers are likely to evict that user at the same time, which will in turn cause a spike in DB reads (see “thundering herd” above).
The lru fork we use adds expiry support while adding new items to the cache using a configured eviction time. We added a jitter that adds some randomness to space timeouts out, reducing the “thundering herd”.

Example:

const ( expiryDuration = 20 * time.Second jitterDuration = 3 * time.Second
) expire := expireDuration + time.Duration(rand.Intn(int(jitterDuration)))
lru.AddEx(key, v, expire)

Results

Writing a basic benchmark test for our Authorization service

func BenchmarkDBAuthService_ListEffectivePolicies(b *testing.B) { serviceWithoutCache := setupNewDBAuthService(authparams.ServiceCache{ Enabled: false, }) serviceWithCache := setupNewDBAuthService(authparams.ServiceCache{ Enabled: true, Size: 1024, TTL: 20 * time.Second, EvictionJitter: 3 * time.Second, }) userName := setupUserWithPolicies(b, serviceWithoutCache, userPoliciesForTesting) b.Run("with_cache", func(b *testing.B) { benchmarkListEffectivePolicies(b, serviceWithCache, userName) }) b.Run("without_cache", func(b *testing.B) { benchmarkListEffectivePolicies(b, serviceWithoutCache, userName) })
} func benchmarkListEffectivePolicies(b *testing.B, s *auth.DBAuthService, userName string) { b.ResetTimer() for n := 0; n < b.N; n++ { _, _, _ = s.ListEffectivePolicies(userName, &model.PaginationParams{Amount: -1}) }
}
Running the benchmarks
% go test -bench 'BenchmarkDBAuthService_ListEffectivePolicies' ./auth
goos: darwin
goarch: amd64
pkg: github.com/treeverse/lakefs/auth
BenchmarkDBAuthService_ListEffectivePolicies/with_cache-16 3814321 311 ns/op
BenchmarkDBAuthService_ListEffectivePolicies/without_cache-16 1168 1068078 ns/op
PASS
ok github.com/treeverse/lakefs/auth 8.177s

The output shows that the cache version ran 3814321 times at a speed of 17.8 ns per loop vs 1168 times at 1068078 ns.

Of course, this is a synthetic test, which proves that a memory lookup is much faster than a network + IO operation with a database server. However similar results also appeared in our stress testing environment with actual servers connected to a real database.

You can check out the full implementation, or read more about lakeFS here.

In a future post, we’ll go deeper into our database layer and explain how we managed to get tens of thousands of requests per second from a moderately sized PostgreSQL instance.

Check out the project’s GitHub repository to learn more.
Photo by Ahmed Zayan on Unsplash