As with other Social Networks, Strava’s athletes interact with each other in a variety of ways. They can follow one another, view each other’s activities and recognize achievements with kudos. Perhaps the richest interaction is when athletes combine socialization with sport — by participating in real world activities together and briefly tracing out adjacent paths through space and time. Unlike a kudo, this interaction isn’t something we explicitly collect. We must infer these events by carefully processing raw spatiotemporal data and detecting exactly when and where any athletes in our global dataset have spent time together.

There are three major ways that we surface group activity data into our product:

- Grouped activities in a feed: In your Strava feed, activities of users that you follow are all shown together in a single set of activities if they were matched together in a group.
- Grouped activities with a single activity: For every activity, on its detail page, Strava will show a list of all activities that grouped with that activity. This feature only works for non-private activities, and can be opted out of on a per activity basis, or limited so that only your followers can see when you have matched with other activities.
- Flyby: For every activity, Strava will determine all other activities that crossed paths (flew by!) with you (in addition to fully matched activities). You can view a list of these matched activities, and play back your own activities along with others. This feature only works for non-private activities, and can be opted out of entirely via privacy settings entirely.

Our definition of a group activity comes from this social origin: For activities to be in a group, the two athletes must have been close together in distance for a significant portion of the elapsed time of the activities. We generally use a threshold of 30%. This balances a trade off between false positives (activities of others who by chance were near you for some of the time), and of false negatives (actual group activities that might not be matched because of a long journey to the activity meetup spot, or unintentional separation during the group portion).

To learn more, please see Strava’s support page on Group activities:

Working on this problem has been one of my favorite technical problems at Strava. During peak traffic, over 100 activities per second are uploaded to Strava. The average Strava activity crosses paths with over 40 other Strava activities. In peak events, such as Ride London or the NYC Marathon, we will see over 10,000 activities that each cross paths with a majority of all the other 10,000 activities.

This blog post is meant to give insights into the challenges of implementing these features, show how approaches have evolved as we have continued to scale, and explain how our latest system can continuously and effortlessly uncover this amazing graph of data.

In 2010, the v1 implementation of activity grouping was released. It determined similarity of activities entirely using segment match start/finish times. As a reminder, a segment is a section of road or trail that Strava athletes can compare times on. To be considered grouped together, a pair of activities must have completed the same segments during overlapping times. The similarity of two activities was based on what percentage of the segment matches occurred concurrently. An obvious limitation of this system was that activities needed to have matched segments for it to work.

In 2013, I introduced a v2 of activity grouping that no longer depended on segment matches. This algorithm was introduced shortly after we built a new spatiotemporal index of activity data that was originally meant to solve a different problem (determining which activities would match a newly created segment). For each activity, the v2 algorithm gathered a set of match candidates using a query on this index. Match candidates are any activities that cross paths at some point with the main activity (intersecting in any single spatiotemporal tile). These sets of match candidates were then persisted and exposed in the Strava Labs Flyby feature, which was also released at a similar time.

To compute the similarity of the query activity and a match candidate, we load the full raw (latitude, longitude, time) data of both activities (at Strava, we call this data *streams*). We then iterate over time in the main activity, measuring the distance from the main activity to the position of the candidate activity at the same time. The final similarity score is the percentage of time that the candidate activity was within a certain distance of the main activity. To be a group activity, the candidate must have been near your activity more than 30% of the time.

By summing over time, we could correctly represent that the social nature of activities often does not involve moving. For example, visiting a coffee shop or bakery might be the entire social part of a ride. If you were recording during a stop, the v2 grouping algorithm would notice, whereas v1 would only take into account efforts you made on segments. This version of grouping also allowed stationary activities and activities in an area without any segments to group correctly.

From 2016 onward the worst case for v2 grouping began to cause real problems: In very large group activities such as Ride London, we might see 10,000 activities all intersecting a single spatiotemporal tile, generating as many as 100,000,000 match candidate pairs to process. In these events we typically had to delay computation of group activities, because the latency of grouping would rise by more than 10x and would stress other internal services that grouping depended on, such as our stream storage service.

By late 2017, this worst case scenario became an everyday occurrence with the rise of the Zwift virtual cycling platform. Zwift users ride stationary bike trainers anywhere in the world, but generate synthetic GPS data in a single area that has been chosen by Zwift as a fake location of the virtual cycling course. Activities from Zwift are processed by Strava just like any other activity, so location based features will use this synthetic data. These virtual courses can have thousands of concurrent users and are always active, a feat which only the largest real events can match. In effect, for the infrastructure team, every day had become Ride London.

By January 2018, Zwift activities represented a small portion of all Strava activities but accounted for 90% of resources spent on group match processing. Soon after we were forced to disable activity grouping for virtual activities.

By August 2018, a completely new, v3 version of activity grouping was deployed. The new version ultimately lead to a 50x reduction in CPU resources required to perform activity grouping. With v3 activity grouping we can once again process all Zwift activities and will be able to scale grouping for the foreseeable future.

This latest version of activity grouping leverages *MinHashing*. Instead of calculating pairwise activity similarity exactly using the full spatiotemporal stream data, we compute a *MinHash signature* of every activity. These signatures roughly describe in a fixed amount of bytes the spatiotemporal character of an activity. Given the signatures of any two activities, we can estimate their similarity in very few CPU cycles. The rest of this blog post is a detailed explanation of the technique of MinHashing and an overview of how Strava is now leveraging it to solve the important problem of activity grouping.

MinHashing is an efficient technique to estimate the similarity of a pair of sets. A set is often a representation of some underlying object. In a commonly used example, objects are documents, and a set defining each document is the set of words in each document. The notion of similarity is Jaccard similarity — the number of elements the sets have in common divided by the number of total distinct elements: |A∩B| / |A∪B|.

The MinHash of a set X, given some hash function H is defined as the smallest value of H when applied over the elements of that set — min(H(X)). The purpose of this can be better understood if you think of H as representing a random permutation (or a strict ordering — we assume there are no collisions in H) of all possible elements that may appear in the sets. The MinHash is just selecting the leftmost item in X according to this ordering: x = argmin(H(X)) is the minimum member of X according to H, and H(x)=min(H(X)) is the “MinHash value” of X given H.

Let’s first consider the MinHash of A∪B, the union of A and B where A and B are any non-empty sets. Let z=argmin(H(A∪B)). We now consider whether or not z is also in A∩B. In both cases, we will use a general and obvious claim about sets to prove something about the MinHash of A and B: The argmin of a set is equal to the argmin of a subset of that set if and only if the subset contains the argmin of the original set. Applying this claim to the two cases:

- If z is also a member of A∩B: In this case z is in A and in B, so it must be the argmin for those sets also, thus z = argmin(H(A)) = argmin(H(B)).
- If z is not a member of A∩B: WLOG, suppose z is in A but not in B. Then z = argmin(H(A)), but z != argmin(H(B)), thus argmin(H(A)) != argmin(H(B)).

So we have used a hash function to select an arbitrary element of A∪B. We have shown that this element is also in A∩B if and only if the MinHash of A and the MinHash of B are equal. Thus the chance that the MinHash of A equals the MinHash of B is exactly equal to |A∩B| / |A∪B| — which is the definition of the Jaccard similarity of A and B.

In this way, we can compare the MinHash values of two sets to get a single bit of information about their similarity. This will prove useful because we can get information about our original arbitrarily large sets using only finite length hash values.

Instead of a single hash function, it is common to use M pairwise independent hash functions H_1, …, H_M. We will call the resultant M hash values the *MinHash signature* of a set — min(H_1(X)), …, min(H_M(X)). By checking the equality of each of the M components of MinHash signatures of two sets, we get M (independent) bits of information about the similarity of the two sets. We know that the expected value of a single bit is equal to the Jaccard similarity of the two sets, so by increasing M we can decrease the variance of our similarity measure as needed. If N of the M components have equal hash values, then N/M is our best estimate of the Jaccard similarity.

This similarity estimate can be computed in O(M) time and memory, where M could be much smaller than the original set sizes |A| or |B|. A direct algorithm for getting the Jaccard Similarity of two sets is at least O(|A|+|B|). Additionally, instead of a complicated similarity function likely involving building a hash table for an exact calculation of set similarity, with MinHash you merely have to check for pairwise equality of each element in two arrays of length M. This is an incredibly fast operation for computers — for 64 bit hash values and M=100 several million checks per second are possible per CPU core.

I was inspired to look at MinHash due to a recent paper/algorithm called SuperMinHash. SuperMinHash gives a clever and asymptotic speed improvement to the (now 20 years old!) MinHashing technique, while simultaneously increasing its accuracy. The full paper is a great read, but roughly the technique reduces the time complexity for computing a MinHash signature from O(NM) where N is the size of the input set to O(N + M*log(M)). This is a big deal, because values of M can typically be 100–10,000. It also gives a decrease in the expected variance of the similarity estimate compared to naive MinHash. But aside from the generation of signatures, SuperMinHash and MinHash can be used and discussed in exactly the same way.

In our application of MinHash, an object is a Strava activity — a single run or ride defined by an array of floating point latitude, longitude, and time measurements for the duration of an activity (streams). To invoke MinHashing for activities, we need to find a way to represent this data as a set of discrete elements. Moreover, to be suitable for activity grouping, this representation should also have the property that the Jaccard similarity of such sets correlate closely with the similarity our old v2 algorithm.

After many experiments we found a scheme that worked well. We convert the stream to a sequence of discrete cubic tiles in the 3 dimensional space of (latitude, longitude, time). The sequence of tiles is continuous — we use Bresenham’s algorithm to enforce this. Each tile is an element in the set used as input to MinHashing.

We then repeat this a few more times using slightly different tile sizes as well as a rotation in latitude/longitude subspace to avoid bad edge cases. For example, if a road running exactly north happened to be exactly on a tile boundary, a slight GPS difference between you and your running partner might lead to zero tile overlap. The multiple tiling schemes help to lower the impact of this kind of edge case. It also helps on the edge case of very small activities that might otherwise have a small number of tiles relative to the number of MinHash components M. The final set is simply the collection of distinct tiles from each of the different tiling coordinate schemes.

Our tiles are approximately 80 meters by 80 meters by 120 seconds. This roughly means that you need to be within 80 meters + 120 seconds of someone you are running or riding with or you run the risk of your activities not matching.

The signature generation is fairly cheap due to SuperMinHash’s speedup and only needs to be done once per activity. We are able to use a signature of length M=100 and a 64 bit hash for our applications. With these parameters, the full MinHash signature of an activity is 800 bytes. Our testing showed that compared to v2 similarity, our approach to v3 had less than a 10% false positive rate and less than a 10% false negative rate, but we chose stricter parameters to reduce false positives at the expense of false negative.

Another application of MinHash signatures allows for efficient retrieval of all similar objects to a query object, assuming we have already computed and indexed the signatures for all objects. For two MinHash signatures to have a non zero similarity, they must have the same value in at least one component of their MinHash signatures. We leverage this fact in the following table structure:

*Lookup Table*: a map from activityId to MinHash Signature of that activity, where each hash function value is an Array[Bit]:

getSig(activityId: Long): Array[Array[Bit]]

*Index Table*: a map from each MinHash signature component value to the set of activity ids that have signatures with the same value for that one component:

getIds(hashValue: Array[Bit], componentIndex: Int): Set[Long]

Each activity appears once in the lookup table and once for each hash value in its signature in the index table (M times). We can query these tables to return the set of all activities that have a non-zero similarity with a target activity:

def candidateQuery(activityId: Long): Set[Long] =

getSig(activityId)

.zipWithIndex

.flatMap{ case(hashValue, componentIndex) =>

getIds(hashValue, componentIndex)

}

The query gets the signature for the input activity, then looks up candidate activities in the index for each component of that signature, and finally collects their union.

This full index table is expensive to query and store. Also, the candidate query pattern looks like it could be expensive — especially for activities in popular areas such that the set size in the index might be large. There are however several optimizations which can make this lookup incredibly fast.

If we are ultimately only interested in activities that are very similar to each other, and do not care about missing a small percentage of similar activities, we do not need to store the entire index. Suppose we only care about pairs of activities with over 50% similarity. With a probability of at least 1/2, a pair of such activities will have a collision in any single signature component. The probability of a single pair not having a single collision in the first K components is roughly (1/2)^K. For example for K=1 and a similarity threshold of 50% we would have a false negative rate of ½. For K=20 we would have a false negative rate of roughly 10^-6.

We calculate and store MinHash signatures of length M in the lookup table, but only store K elements of the signatures in the index table. We choose M according to the desired variance in pairwise similarity values, and then we choose an K<=M according to a desired similarity threshold and false negative rate for finding candidate activities. If M=100, then K=20 gives an index 1/5th the size at the cost of a 0.0001% false negative error.

To compute similarity of two activities, we might naively fetch the full signatures (800 bytes) from the lookup table, then compute their similarity on the client machine. With the right database, we can instead compute the similarity on the database storage machine, and only return the similarity estimate to the client (a single double value), thus greatly reducing network traffic between database and client. This would reduce the network IO needed for the result from 2*800 bytes to 8 bytes.

If the database is distributed, this trick still works as long as our query is for the similarity of one main activity with a large set of candidate activities. In this case, we look up the signature of the main activity and broadcast it to all database nodes where the lookup table has records for candidate activities. This is a savings of network transfer roughly as long as the number of database nodes is smaller than the number of candidate activities.

In production we are using Cassandra and achieving this with Cassandra UDFs. This might look like the following, where the function *similarityUDF* takes a signature and a set of *activityIds*, and returns the similarity of the signature with the signature of each *activityId*, calculating it remotely.

def candidateSimilarities(activityId: Long): Map[Long, Double] = {

signature: Array[Array[Bit]] = getSig(activityId)

candidateIds: Set[Long] = candidateQuery(activityId)

similarityUDF(signature, candidateIds)

}

For index table keys, we want our hash values to be globally robust against hash collisions (false positives). With a billion activities, there are at most 100 billion (2³⁶) distinct keys in the table. We desire that the expected number of false positives for a lookup in a single activity (*candidateQuery*) is not very large. To achieve this, we just need the number of hash function values to be large compared to the number of items. The 2³⁶ values of a 64 bit hash function is more than enough. Note that this is a lower standard than preventing any single collision in the database, which would require a hash function with twice as many bits.

When using MinHash signatures only to determine the similarity of a single pair of activities, we have a much lower requirement for false positives. We only need to prevent giving too high of a similarity estimate for our chosen pair. If our M MinHash values were each truncated to just B bits, we would expect to see M/(2^B) components give false positives due to this truncation. Using just a single byte (B=8) only adds an error in the expected value of the similarity of half a percent.

With these optimizations, we arrive at the following scheme: For optimization #1, in the index table, store only the first K components as full 8 byte hash values as keys. In the main lookup table, also store full hash values for the first K components. To implement optimization #3, in a different (reduced) lookup table, store M components reduced to a single byte each. Our implementation of *similarityUDF* can use the signatures in this reduced lookup table.

Another application for MinHash activity signatures is serving activity groups in athlete feeds. A feed is a collection of recent activities belonging to athletes that the feed owner is following. An activity group in a feed is a subset of activities in the feed that also all grouped together.

In the prior system, for each activity in a feed we selected all activities it matched with. This query needed to look at the global set of Strava activities, even considering matches with activities that the feed owner does not follow. We then filtered those results to just other activities that are in the feed. Finally, we constructed clusters of grouped activities from this set of matches.

If we have F activities in the feed and each activity is matched with a set of M other activities, this approach requires O(FM) records to be loaded. It is not possible to have an index that reduces the complexity of this query, at least not without denormalizing the activity matches into every feed. Suppose the feed owner follows 200 athletes who participated in a very large social activity, such as Ride London. Ride London generates over 10,000 uploads to Strava, each of which might typically be grouped with 1,000 other activities. For this example, we must process 200 * 1,000 database rows to find the correct group activity set in the feed. This is clearly not a scalable implementation, and has actually been responsible for prior site outages. As a safety valve against this problem, we limited the number of activity match rows we would return, leading to incorrect groupings in the feed.

With MinHash signatures we can improve this process: Specifically, we fetch the MinHash signature for each activity in a feed. We then (naively) evaluate the similarity of all pairs of activities, and from that construct connected sets of activities. This would take O(F²) comparisons, but a feed never has more than 200 activities, and 200² comparisons can be done in less than 10ms. A slightly more sophisticated approach needs to check asymptotically fewer pairs than this. The advantage of this technique is that all lookups are local to the set of activities in the feed, unlike the prior technique which needed to first lookup the global set of matches for each activity.

One shortcoming of Jaccard similarity is that it is symmetric. Ideally, activity similarity would be asymmetric: we want to be able to represent that A rode with B but that B did not ride with A. In terms of set cardinality, the natural asymmetric similarity measure we would like to have of A with respect to B is |A∩B| / |A|. Given the symmetric Jaccard similarity J = |A∩B| / |A∪B| and the original set cardinalities |A| and |B| we can derive: |A∩B| / |A| = J * (|A| + |B|) / (|A| * (J + 1)). This expression is useful as long as |A|/|B| is not very small, as it gets smaller the variance of our estimate blows up and we should just not consider it a match. This improvement is modest, but is cheaply implemented — we only have to store the cardinality of each MinHash set in the lookup table (along with the Signature).

Our implementation allows for easily storing multiple hashing schemes with different hash function parameters. For example, we can create a tiling scheme with the same spatial resolution, but with much less temporal resolution of the tiles — say to a single day. The notion of similarity in this scheme will be much more flexible with time but strict with space. It could allow for automatic detection of all races and mass events, where athletes ride the same course but at different times throughout a day.

One caveat in this scheme is that recurrence of the same tile coordinate is not appropriately weighted, since MinHash Similarity ultimately operates on sets and can’t represent an item multiple times. For example a criterium bicycle race running the same loop multiple times would only be counted once. One way to address this would be BagMinHash — another algorithm from the same author as SuperMinHash. A simpler way to address this is to just count each return visit to a given tile as a distinct tile.

Pushing this technique even further we can omit the time dimension entirely, purely considering spatial similarity of activities. We can even use this spatial similarity metric and a fast and distributed clustering algorithm to cluster global datasets of hundreds of millions of activities into common route clusters.

Activity Grouping is an immense technical challenge to implement at scale and during rapid growth. Successive innovations have improved the performance and quality of matching over the years. This engineering effort has produced an entirely novel component of the social graph of athletes that would otherwise be inaccessible. Finally, there is still much potential for future applications of these techniques towards new applications and discoveries.