Geospatial indexing on Hilbert curves

By Antoine Sinton

Zenly s2 covering

Zenly is an app that lets our users share their location in realtime with their best friends. Contextual and relevant information is also added on top by processing and analyzing the raw location data. This means we use geographical and location data everywhere.

Applications range from users’ location points to analytics over geographically defined regions, as well as in-house reverse geocoding and machine learning powered features. In essence, we have a lot of location and geographical data and we need to be able to handle it in realtime and at scale with high throughput. More specifically, we need ways of indexing and querying these data efficiently.

As an example, a machine learning algorithm detects users’ places, that is their home and/or their school or workplace. One of the model’s features is, quite naturally, whether the user spends a significant amount of time within the polygon of a school or university campus. Our solution has allowed us to efficiently query hundreds of millions of points every day against about a million school polygons for the past two years.

A user in a school

When it comes to indexing geographical data (or more generally n-dimensional data where n>1) there are two rather common strategies to choose from: r-tree type indices and quadtree type indices (or, in n dimensions, 2ⁿ-tree).

They both have their advantages and drawbacks; a lot of ink has been spilled on the subject already, and it is not within the scope of this post. We use both kinds of indices for different applications.

The scope of this post is to describe one of the techniques we use to index polygon data within a quadtree structure and to query points against these approximated polygons with extremely fast lookups.

We’re going to describe how we can leverage Google’s s2 library (which we use in lots of applications) and especially how we exploit its usage of the Hilbert space filling curve to index any set of s2 coverings in an efficient tree structure.

example of a quadtree from https://en.wikipedia.org/wiki/Quadtree

For those who are not familiar with using quadtree indices to represent geographical data, the idea is rather straightforward. Take a square and divide it into four equally-sized squares (called cells) that partition the initial square. Subdivide each child square into four more children cells, and so on. If you imagine the original square is a representation of the entire Earth, after only 30 subdivisions you will be down to the centimeter level. In a typical representation, a single cell is known by three coordinates (z,i,j) describing the depth and the position of the cell in the tree. Some more on this here: https://en.wikipedia.org/wiki/Quadtree.
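
As a toy illustration (in Python, not taken from s2), here is one way a point in the unit square can be mapped to its (z,i,j) cell coordinates at a given depth; the function name is made up for the example:

```python
def cell_coordinates(x, y, depth):
    """Map a point (x, y) in the unit square to its quadtree cell (z, i, j).

    z is the depth, and (i, j) is the position of the cell in the 2^z by 2^z
    grid that partitions the square at that depth.
    """
    n = 1 << depth                   # number of cells per axis at this depth
    i = min(int(x * n), n - 1)       # clamp so that x == 1.0 stays in the last cell
    j = min(int(y * n), n - 1)
    return depth, i, j

# The same point falls into a nested sequence of cells as the depth increases:
for z in range(1, 5):
    print(cell_coordinates(0.3, 0.7, z))
# (1, 0, 1)  (2, 1, 2)  (3, 2, 5)  (4, 4, 11)
```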

Common variations include Geohash (see https://en.wikipedia.org/wiki/Geohash), Geohex (it’s not a quadtree, but uses a similar schema, see http://www.geohex.org/), Google s2 (see https://s2geometry.io/).

For a number of reasons (speed, completeness, implementation quality etc) we use s2 extensively.

One important feature of quadtree implementations is that they can be described using certain space filling curves. More on this later.

s2 covering of the USA

The s2 geographical library provided by Google is a rather well-known take on a quadtree-backed representation, with many built-in features. There is a dedicated site that goes into a number of details: https://s2geometry.io/.

One of the most interesting features for us is the library’s ability to provide a “covering” for any polygon. An s2 covering for a polygon is a list of cells that completely cover this polygon, an s2 cell having a definition similar to that of a regular quadtree cell. That is, we can approximate the region covered by any polygon using a “covering” (i.e. a list) of cells. The approximation can be made as precise as one desires. Examples can be seen in the images above in this post.
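
To make this concrete, here is a small sketch using the s2sphere Python package (the Python port of s2 that also appears in the benchmark further down); the rectangle and the coverer parameters are arbitrary choices for the example:

```python
import s2sphere

# Region to approximate: a lat/lng rectangle, roughly around Paris (arbitrary example).
region = s2sphere.LatLngRect.from_point_pair(
    s2sphere.LatLng.from_degrees(48.80, 2.25),
    s2sphere.LatLng.from_degrees(48.91, 2.42),
)

# Ask for a covering of at most 16 cells, between levels 8 and 14.
coverer = s2sphere.RegionCoverer()
coverer.min_level = 8
coverer.max_level = 14
coverer.max_cells = 16

covering = coverer.get_covering(region)   # a list of CellIds that fully cover the region
for cell_id in covering:
    print(cell_id.level(), cell_id.to_token())
```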

Furthermore, the s2 implementation of the quadtree approach is rather interesting in that it tries to alleviate, or at least lessen, one of the annoying artifacts of regular quadtrees: the physical dimensions (i.e. width and height) of a cell depend on its latitude.

Why is this? Remember that we start from a square representation of the Earth. To do this we need to project the sphere onto a square. A typical way of doing this is using the Mercator projection which tends to stretch things out the closer you get to the poles. See https://en.wikipedia.org/wiki/Mercator_projection for more details.

How does s2 deal with this? Instead of projecting the Earth directly onto a single square, it projects the Earth onto the 6 faces of a cube enclosing the Earth and then applies an extra non-linear transformation to further reduce the distortion. Each cell in s2 is in fact part of one of six quadtrees that together describe the whole planet.

Finally, a linearisation using the space filling Hilbert curve is applied to transform the coordinates (f,z,i,j) into a single number called the CellId. All this is really well described on this page https://s2geometry.io/devguide/s2cell_hierarchy and here https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view.

This is where things start getting interesting for us. Let’s look at the Hilbert curve mapping in a little more detail.

A space filling curve “is a curve whose range contains the entire 2-dimensional unit square” (according to Wikipedia, https://en.wikipedia.org/wiki/Space-filling_curve). That means that every point within the square (2 dimensional) can be mapped onto a point on the curve (1 dimensional) and back again. To do this, the curve needs to be of infinite length. Although this is quite fun mathematically, it doesn’t really help us in understanding our problem.

A lot on Hilbert curves (and space filling curves in general) can be found in this great post: http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves. Space filling curves, in this context, are mostly used as a way of representing quadtree structures using a single number scheme.

In practice, a space filling curve is built recursively by way of successive approximations. Each approximation is commonly called an order of the curve. For example, here are orders one through six of the Hilbert curve:

orders one through six of the Hilbert curve (source: https://fr.wikipedia.org/wiki/Courbe_de_Hilbert)

How does this help us?

Let’s take a closer look at the first two orders of the Hilbert curve and overlay the grid of the corresponding quadtree.

Adding the value along the Hilbert curve finally gives us what we are looking for:

That is, the first order Hilbert curve (i.e. level 1 in the quadtree) allows for 4 values, from 0 to 3. The second order Hilbert curve allows for 16 values, from 0 to 15. We have a one-to-one correspondence between the value along the Hilbert curve of order o and the quadtree cell at level o.
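
To illustrate that correspondence in code, here is the classic conversion from a position along the Hilbert curve of a given order to the (i,j) coordinates of the matching cell (the standard textbook algorithm, not s2’s exact convention):

```python
def hilbert_position_to_cell(order, d):
    """Convert a position d along the Hilbert curve of the given order into the
    (i, j) coordinates of the corresponding quadtree cell at level `order`."""
    i = j = 0
    t = d
    s = 1
    while s < (1 << order):
        ri = 1 & (t // 2)
        rj = 1 & (t ^ ri)
        if rj == 0:                      # rotate the quadrant if needed
            if ri == 1:
                i = s - 1 - i
                j = s - 1 - j
            i, j = j, i
        i += s * ri
        j += s * rj
        t //= 4
        s *= 2
    return i, j

# The 4 cells of the order-1 curve and the 16 cells of the order-2 curve, in curve order:
print([hilbert_position_to_cell(1, d) for d in range(4)])
# [(0, 0), (0, 1), (1, 1), (1, 0)]
print([hilbert_position_to_cell(2, d) for d in range(16)])
```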

If you look more closely at the binary representation of the value of the index along the Hilbert curve associated with each cell of the quadtree, we have, for the first order: 00, 01, 10 and 11. For the second order, the first four cells from 0 to 3 can be represented as 0000, 0001, 0010, 0011. The next four cells from 4 to 7 have binary representation 0100, 0101, 0110, 0111 and so on.

We can see that the binary representation of the position along the second order Hilbert curve associated with a child cell at level 2 of the quadtree has the same 2 leading bits as the position along the first order Hilbert curve of its parent cell at level 1.

This holds at all levels. If you take the binary representation of a cell N on the Hilbert curve of order o, the 4 children of N on the Hilbert curve of order o+1 will have the same leading bits, equal to N, up to the last two, which define each child cell uniquely.

For example, let’s take the binary representation for a random cell N on the curve of order o: Hₒ(N)=1010110011011. The four children n₀, n₁, n₂, n₃ of Hₒ(N) of order o+1 will have the representations:

Hₒ₊₁(n₀) = Hₒ(N) · 00,

Hₒ₊₁(n₁) = Hₒ(N) · 01,

Hₒ₊₁(n₂) = Hₒ(N) · 10,

Hₒ₊₁(n₃) = Hₒ(N) · 11

where · represents concatenation.
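
Stripped of the s2 specifics, this parent/child relation is just a two-bit shift; a quick sketch in Python:

```python
def hilbert_children(parent_position):
    """The four children of a cell on the next-order curve are obtained by
    appending two bits to the parent's position, i.e. 4*N + k for k in 0..3."""
    return [(parent_position << 2) | k for k in range(4)]

N = 0b1010110011011
for child in hilbert_children(N):
    print(format(child, "015b"))
# 101011001101100
# 101011001101101
# 101011001101110
# 101011001101111
```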

This is the representation that gives us what we are looking for, that is a programmable representation that can be used to efficiently index and query elements of the quadtree. Indeed, for every s2 cell we can build the branch of a tree, starting at a root node, that follows the binary representation of the cell along the Hilbert curve. For a list of cells, we can merge these branches into a single tree structure that is queryable using only binary operations.

Compare what we just said to the structure of the s2 CellID:

We can immediately see that the binary representation of the CellID is split into 3 parts:

  1. 3 bits for face
  2. 60 bits reserved for the value of the position along the Hilbert curve representing the quadtree cell (2 bits per level, for up to 30 levels)
  3. 1 termination bit
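
To make the bit layout concrete, here is a small sketch that pulls the face and level back out of a raw 64-bit CellId (s2sphere is only used to produce an example id, and the helper is made up for illustration):

```python
import s2sphere

def decompose(cell_id_int):
    """Split a raw 64-bit s2 CellId into its face and level.

    Layout: 3 face bits, then 2 bits of Hilbert-curve position per level,
    then a single '1' termination bit followed by zeros.
    """
    face = cell_id_int >> 61                     # the 3 leading bits
    lsb = cell_id_int & -cell_id_int             # isolate the termination bit
    level = 30 - (lsb.bit_length() - 1) // 2     # termination bit index gives the level
    return face, level

# Example: the level-12 ancestor of the leaf cell containing an arbitrary point.
cid = s2sphere.CellId.from_lat_lng(
    s2sphere.LatLng.from_degrees(48.8566, 2.3522)
).parent(12)
print(decompose(cid.id()))            # should match...
print(cid.face(), cid.level())        # ...what the library reports
```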

Based on what we saw concerning the s2 cell definition and the Hilbert curve, s2 cells can naturally be organized into a tree structure, as follows:

  • 6 root nodes (one per face);
  • every node has exactly four children;
  • each sub-tree has a maximum depth of 30 levels.

So it’s actually a set of 6 identical quadtree structures merged into a single representation.

As said earlier, we are mostly interested in indexing coverings of cells; the idea is thus to index these coverings and have a mechanism to query and retrieve the relevant information. To that end, here is a description of the algorithm we propose:

1. For each cell to index:

  • read the first 3 bits to determine which sub-tree (face) we are iterating on
  • read the position of the termination bit, which gives the level of the cell to index
  • loop over every pair of bits to read the corresponding child
  • if the child node doesn’t already exist, create it and insert a reference to it into the current node
  • make the child the current node and loop until termination
  • at termination, attach the value to the final node’s value array

2. Query:

  • read the first 3 bits to determine which sub-tree (face) we are iterating on
  • read the position of the termination bit, which gives the level of the query cell
  • loop over every pair of bits to read the corresponding child
  • along the way down to termination, accumulate the values stored in the nodes’ value arrays
  • stop when you have reached the end of a branch in the tree or the termination of the query cell

In a simple Python implementation, it can look like this:
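
(The sketch below is illustrative rather than the exact production code: a pure-Python tree keyed on the raw 64-bit CellId, with s2sphere used only to build an example covering; the class and method names are made up.)

```python
import s2sphere


class Node:
    __slots__ = ("children", "values")

    def __init__(self):
        self.children = [None, None, None, None]   # one slot per 2-bit child
        self.values = []                           # values attached at this cell


class CellTree:
    """Index s2 coverings by walking CellId bits along the Hilbert curve."""

    MAX_LEVEL = 30

    def __init__(self):
        self.roots = [Node() for _ in range(6)]    # one root per cube face

    @staticmethod
    def _face_and_path(cell_id):
        face = cell_id >> 61                       # 3 leading face bits
        lsb = cell_id & -cell_id                   # termination bit gives the level
        level = CellTree.MAX_LEVEL - (lsb.bit_length() - 1) // 2
        # Two bits per level, most significant pair (level 1) first.
        path = [(cell_id >> (61 - 2 * l)) & 0b11 for l in range(1, level + 1)]
        return face, path

    def insert(self, cell_id, value):
        """Index one covering cell, creating the branch down to its level."""
        face, path = self._face_and_path(cell_id)
        node = self.roots[face]
        for child in path:
            if node.children[child] is None:
                node.children[child] = Node()
            node = node.children[child]
        node.values.append(value)

    def query(self, cell_id):
        """Return the values of all indexed cells that contain `cell_id`."""
        face, path = self._face_and_path(cell_id)
        node = self.roots[face]
        results = list(node.values)
        for child in path:
            node = node.children[child]
            if node is None:                       # reached the end of the branch
                break
            results.extend(node.values)
        return results


# Usage sketch: index the covering of a rectangle under a label, then query a point.
coverer = s2sphere.RegionCoverer()
coverer.max_cells = 16
rect = s2sphere.LatLngRect.from_point_pair(
    s2sphere.LatLng.from_degrees(48.80, 2.25),
    s2sphere.LatLng.from_degrees(48.91, 2.42),
)

tree = CellTree()
for cell in coverer.get_covering(rect):
    tree.insert(cell.id(), "paris_rect")

leaf = s2sphere.CellId.from_lat_lng(s2sphere.LatLng.from_degrees(48.85, 2.35))
print(tree.query(leaf.id()))                       # -> ['paris_rect']
```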

As an example test, s2 coverings for all 50 US states have been generated.

Loading the tree into the above implementation and querying 1M random points worldwide to see which state they fall in (if any) takes 4.5 seconds on a 13″ MacBook Pro (2017 model, 3.5GHz, 16GB RAM). That is once everything is loaded, and not counting the time it takes to convert the points from lat/lon to s2, which is slow in s2sphere.

At Zenly, we use a Scala implementation to query points in various polygons within Spark. We build s2 coverings for various categories of polygons like timezones, countries, etc. A benchmark testing 14M points against a set of 50k polygons shows that our tree is almost two times faster than GeoSpark.

One of the main caveats at the moment with our implementation at Zenly is that it lives 100% in memory, but it should be feasible to use common quadtree storage structures on top of it to build running services.

Also, like all index structures based on quadtrees, there is some imprecision at the borders due to the way the covering works. If you need that level of detail, you’ll need an extra step to precisely validate your point in polygon (PIP); this can be handled by any number of PIP algorithms that exist in the wild.
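
For reference, here is a minimal ray-casting PIP check, one classic choice among those algorithms (the sketch below treats coordinates as planar, which is only a reasonable assumption for small polygons):

```python
def point_in_polygon(x, y, polygon):
    """Even-odd (ray casting) point-in-polygon test on a planar polygon.

    `polygon` is a list of (x, y) vertices; the closing edge from the last
    vertex back to the first is implied. This is the kind of exact refinement
    step you can run after the covering-based pre-filter.
    """
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):                   # edge crosses the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                        # crossing lies to the right of the point
                inside = not inside
    return inside

# Quick check on a unit square:
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(point_in_polygon(0.5, 0.5, square))   # True
print(point_in_polygon(1.5, 0.5, square))   # False
```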

In this article, we describe using the Hilbert curve (since that is what s2 builds on), but the approach generalizes to any space filling curve that can describe a quadtree (like the Z-order curve, for example).

This is only one example of the many ways we work with location data. There are many different aspects to what we do, and so there are many different ways of handling these data depending on the usage. Location data can also be grouped into spatial or spatio-temporal clusters, treated as sequential data with seasonality, or mixed with other sources of information to feed more complete models. These feed applications like position prediction and correction, timelines of visited places, attendance in points of interest…

Coupled with a social graph and other data sources, location data can provide interesting and fun insights to users and their friends. It powers many current Zenly features, and will continue driving more and more of them in the future.

To try Zenly out → https://zen.ly/app.