**Suppose** that you're a doctor who studies human sexual behavior. You want to
run a study with all the patients you can find, but volunteers are scarce: you
only end up with about 40 subjects.

After you've run your study and collected the data, you want to share it with other researchers. You look at the attributes, and deduce that ZIP code and age are the ones most likely to be used in reidentification attacks. To share the data safely, you're thinking of \(k\)-anonymity.

When trying to find a strategy to achieve \(k\)-anonymity, you find out that you would have to lose a lot of information. For \(k=10\), a rather small value, you end up with buckets like \(20\le age\lt 50\). That makes sense: you have only a few people in your database, so you have to bundle together very different age values.
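
The \(k\)-anonymity requirement can be sketched in a few lines of Python: every combination of quasi-identifier values must appear at least \(k\) times in the dataset itself. The data below is hypothetical, and the function name is mine.

```python
from collections import Counter

def satisfies_k_anonymity(rows, k):
    """True if every quasi-identifier tuple appears at least k times."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())

# Hypothetical study of ~40 subjects, generalized into two wide age buckets.
coarse = [("20-50",)] * 25 + [("50-80",)] * 15
print(satisfies_k_anonymity(coarse, 10))  # True: both buckets hold >= 10 people

# With finer buckets, some groups fall below k = 10.
fine = [("20-30",)] * 5 + [("30-40",)] * 20 + [("40-80",)] * 15
print(satisfies_k_anonymity(fine, 10))  # False: the first bucket has only 5
```

This illustrates the trade-off above: with only 40 subjects, reaching \(k=10\) forces very coarse buckets.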

But when you think about it, you start questioning whether you really need
\(k\)-anonymity. Who are the attackers, in your scenario? The researchers with
whom you share the data, and possibly unknown parties if the data ever leaks.
None of these people have background information about who is in the dataset.
Thus, the attacker can't simply distinguish between records: they have to link a
record to a *real identity* using only its values. This attacker has
significantly weaker capabilities than the one assumed by \(k\)-anonymity!

Let's look at two different rows in this database.

ZIP code | age |
---|---|
85535 | 79 |
60629 | 42 |

At first glance, the amount of information for these two individuals seems to be the same. But let's take a look at the values…

- 85535 corresponds to a place in Arizona named Eden. Approximately 20 people live in this ZIP code. How many people do you think are exactly 79 years old in this particular ZIP code? Probably only one.
- 60629 corresponds to a part of the Chicago metropolitan area. More than 100,000 people live there. How many of them are 42 years old? A thousand, at least, and probably more!

It seems that it would be very easy to reidentify the first row, but that we don't have enough information to reidentify the second row. But according to \(k\)-anonymity, both rows might be completely unique in the dataset.

Obviously, \(k\)-anonymity doesn't fit this use case. We need a different definition: that's where \(k\)-map comes in.

## Definition

Just like \(k\)-anonymity, \(k\)-map requires you to determine
which columns of your database are *quasi-identifiers*. This answers the
question: what can your attacker use to reidentify their target?

But this information alone is not enough to compute \(k\)-map. In the example
above, we assumed that the attacker doesn't know whether their target is in the
dataset. So what are they comparing a given row with? With all other individuals
sharing the same values in a larger, sometimes implicit, dataset. For the
previous example, this could be "everybody living in the US", if you assume the
attacker has no idea who took part in your study. Let's call this larger
table the *reidentification dataset*.

Once you've picked the quasi-identifiers and the reidentification dataset, the
definition is straightforward. Your data satisfies \(k\)-map if every combination
of values for the quasi-identifiers appears at least \(k\) times *in the
reidentification dataset*.
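
Written as code, the check is simple; the hard part in practice is obtaining the reidentification dataset, as we'll see below. Here is a minimal sketch with toy, made-up data:

```python
from collections import Counter

def satisfies_k_map(released, reidentification, k):
    """k-map: every quasi-identifier tuple in the released data must
    appear at least k times in the reidentification dataset."""
    population = Counter(reidentification)
    return all(population[row] >= k for row in released)

# Toy "population": one released row matches a single person in it.
released = [("85535", 79), ("60629", 42)]
population = [("85535", 79)] + [("60629", 42)] * 1000

print(satisfies_k_map(released, population, 2))  # False: first row is unique
print(satisfies_k_map(released, population, 1))  # True: every row matches someone
```

Note that the counting happens entirely in the reidentification dataset, not in the released data: that's the key difference with \(k\)-anonymity.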

In our example, this corresponds to counting the number of people in the US who share the quasi-identifier values of each row in your dataset. Consider our tiny dataset above:

ZIP code | age |
---|---|
85535 | 79 |
60629 | 42 |

We said earlier that the values of the first row matched only one person in the US. Thus, this dataset does not satisfy \(k\)-map for any value of \(k\ge 2\).

How do we get a larger \(k\)? We could generalize the first value like this:

ZIP code | age |
---|---|
85*** | 79 |
60629 | 42 |

ZIP codes between 85000 and 85999 include the entire city of Phoenix. There are 36,000+ people between 75 and 84 years old in Phoenix, according to some old stats. It's probably safe to assume that more than 1,000 people match the quasi-identifier values of the first row. We saw earlier that the second row also matched 1,000+ people. So this generalized dataset satisfies \(1000\)-map.
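
The ZIP generalization used above is just a string operation. A minimal sketch (the function name and parameter are my own):

```python
def generalize_zip(zip_code, keep=2):
    """Keep the first `keep` digits and mask the rest, e.g. 85535 -> 85***."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_zip("85535"))          # 85***
print(generalize_zip("60629", keep=5))  # 60629 (unchanged)
```

The second row didn't need any generalization, since it already matched a large crowd.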

Wait a second, why does this feel like cheating? What happened there, to give us
such a generous number so easily? This comes from the generous assumptions we
made in our attack model. We assumed that the attacker had *zero* information on
their target, except that they live in the US (which is implied by the presence
of ZIP codes). And with only the information (ZIP code, age), you don't need a
lot of generalization to make each row of your dataset blend in a large crowd.

To make this attack model stronger, you could assume that the attacker will use
a *smaller* reidentification database. For example, suppose that the condition
you're studying requires regular hospital check-ups. The attacker could
restrict their search only to people who have visited a hospital in the last
year. The number of possible "suspects" for each value tuple gets smaller, so
the \(k\) of \(k\)-map decreases too^{1}.

\(k\)-map is inherently a *weak* model. So when choosing the quasi-identifiers and
reidentification dataset, you have to think hard about what an attacker could do.
If your attacker doesn't have lots of resources, it can be reasonable to assume
that they won't get more data than, say, the voter files from your state. But if
they can figure out more about your users, and you don't really know which
reidentification dataset they could use, maybe \(k\)-anonymity is a safer
bet^{2}.

## And now, some practice

OK, enough theory. Let's learn how to compute \(k\)-map in practice, and how to anonymize your datasets so that they satisfy the definition!

… There's one slight problem, though.

It's usually impossible.

Choosing the reidentification dataset is already a difficult exercise. Maybe you can afford to make generous assumptions, and assume the attacker doesn't know much. At best, you think, they'll buy voter files, or a commercial database, which contains everyone in your state, or in the US. But… then what?

To compute the maximum \(k\) such that your dataset satisfies \(k\)-map, you would first need to get the reidentification dataset yourself. But commercial databases are expensive. Voter files might not be legal for you to obtain (even though an evil attacker could break the law to get them).

So, most of the time, you can't actually check whether your data satisfies \(k\)-map. And if it's impossible to check, it's also impossible to know which strategy to adopt to make your dataset satisfy the definition.

#### Exception 1: secret sample

Suppose you're not releasing all your data, but only a *subset* (or *sample*) of
a bigger dataset that you own. Then, you can compute the \(k\)-map value of the
sample with regard to the original, bigger dataset. In this case, choosing
\(k\)-map over \(k\)-anonymity is relatively safe.

Indeed, your original dataset is almost certainly *smaller* than the
reidentification dataset used by the attacker. As we saw above, a smaller
reidentification dataset means smaller counts, and thus a smaller \(k\): the value
you compute is a *lower bound* on the true \(k\). Essentially, you're being
pessimistic, which keeps you on the safe side.

Even if the attacker has access to the original dataset, they won't know which records are in the sample. So if the original dataset is secret, or if you've chosen the sample in a secret way, \(k\)-map is a reasonable definition to use, and you can compute a pessimistic approximation of it.
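
Under this "secret sample" assumption, the pessimistic \(k\) is straightforward to compute: take, over the rows of the sample, the smallest count of matching rows in the full dataset. A sketch with hypothetical, already-generalized data:

```python
from collections import Counter

def k_map_lower_bound(sample, full_dataset):
    """Largest k such that the sample satisfies k-map with respect to the
    full dataset. Since the full dataset is smaller than any realistic
    reidentification dataset, this is a pessimistic lower bound."""
    counts = Counter(full_dataset)
    return min(counts[row] for row in sample)

# Hypothetical full dataset and a released sample of it.
full = [("60***", "40-50")] * 12 + [("85***", "70-80")] * 5
sample = [("60***", "40-50"), ("85***", "70-80")]

print(k_map_lower_bound(sample, full))  # 5: the rarest sample row matches 5 rows
```

If this lower bound is already large enough for your purposes, you're done, without ever touching an external dataset.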

#### Exception 2: representative distribution

This case is slightly different. Suppose that you can make the assumption that
your data is a *representative* (or *unbiased*) sample of a larger
dataset. This might be a good approximation if you selected people (uniformly)
at random to build your dataset, or if it was gathered by a polling
organization.

In this case, you can compute an estimate of the \(k\)-map value for your data, even without the reidentification dataset. The statistical properties which enable this, and the methods you can use, are pretty complicated: I won't explain them in detail here. They are mentioned and compared in this paper, which has references to the original versions of each of them.

#### Exception 3: using humans

For the case of our doctor earlier, if the dataset is small enough, a motivated data owner could actually do the job of an attacker "by hand". Go through each record, and try to map it to a real person, or estimate the chances of it being possible. We pretty much did that in this article!

This is very approximate, and obviously doesn't scale. But for our imaginary doctor, it might be a reasonable solution!

#### Implementations

ARX implements the methods from exceptions 1 and 2. Documentation for the
first one can be found here. Instructions to estimate the number of
*unique* values assuming uniformity can be found here. Originally,
μ-ARGUS was the first software with this feature, but I couldn't run
it on my machine, so I can't say much about it.

## Conclusion

You might wonder why I wrote an entire article on a definition that is hardly used because of how impractical it is. In addition to the unique problems that we talked about in this article, the limitations of \(k\)-anonymity also apply. It's difficult to choose \(k\), non-trivial to pick the quasi-identifiers, and even trickier to model the reidentification database.

The definition also didn't get a lot of attention from academics. Historically,
\(k\)-anonymity came first^{4}. Then, people showed that \(k\)-anonymity was
sometimes not sufficient to protect sensitive data, and tried to find *stronger*
definitions to fix it. Weaker definitions were, of course, less interesting.

Nonetheless, I find that it's an interesting relaxation of \(k\)-anonymity. It surfaces one of \(k\)-anonymity's implicit assumptions: the attacker knows that their target belongs to the dataset. This assumption is sometimes too pessimistic, so it can be worth considering alternative definitions.

Choosing a privacy model is all about modeling the attacker correctly. Learning to question implicit assumptions can only help!