Suppose that you're a doctor who studies human sexual behavior. You want to run a study with all the patients that you can find, but you don't find a lot of volunteers. You only end up with about 40 subjects.
After you've ran your study and collected data, you want to share this data with other researchers. You look at the attributes, and deduce that ZIP code and age are likely to be used in reidentification attacks. To share it in a safe way, you're thinking of \(k\)-anonymity.
When trying to find a strategy to obtain \(k\)-anonymity, you find out that you would have to lose a lot of information. For \(k=10\), a rather small value, you end up with buckets like \(20\le age\lt 50\). That makes sense: you have only few people in your database, so you have to bundle together very different age values.
But when you think about it, you start questioning whether you really need \(k\)-anonymity. Who are the attackers, in your scenario? The researchers with whom you share the data, and possibly unknown parties if the data ever leaks. None of these people have background information about who is in the dataset. Thus, the attacker doesn't just have to distinguish between different records, but to actually find the real identity of a record based on its information. This attacker has significantly weaker capabilities than for \(k\)-anonymity!
Let's look at two different rows in this database.
At first glance, the amount of information for this two individuals seems to be the same. But let's take a look at the values…
- 85535 corresponds to a place in Arizona named Eden. Approximately 20 people live in this ZIP code. How many people do you think are exactly 79 years old in this particular ZIP code? Probably only one.
- 60629 corresponds to a part of the Chicago metropolitan area. More than 100,000 people live there. How many of them are 42 years old? A thousand, at least, and probably more!
It seems that it would be very easy to reidentify the first row, but that we don't have enough information to reidentify the second row. But according to \(k\)-anonymity, both rows might be completely unique in the dataset.
Obviously, \(k\)-anonymity doesn't fit this use case. We need a different definition: that's where \(k\)-map comes in.
Just like \(k\)-anonymity, \(k\)-map requires you to determine which columns of your database are quasi-identifiers. This answers the question: what can your attacker use to reidentify their target?
But this information alone is not enough to compute \(k\)-map. In the example above, we assumed that the attacker doesn't know whether their target is in the dataset. So what are they comparing a given row with? With all other individuals sharing the same values in a larger, sometimes implicit, dataset. For the previous example, this could be "everybody living in the US", if you assume the attacker has no idea who could have this genetic disease. Let's call this larger table the reidentification dataset.
Once you picked the quasi-identifiers and the reidentification dataset, the definition is straightforward. Your data satisfies \(k\)-map if every combination of values for the quasi-identifiers appears at least \(k\) times in the reidentification dataset.
In our example, this corresponds to counting the number of people in the US who share the quasi-identifier values of each row in your dataset. Consider our tiny dataset above:
We said earlier than the values of the first row matched only one person in the US. Thus, this dataset does not satisfy \(k\)-map for any value of \(k\ge 2\).
How do we get a larger \(k\)? We could generalize the first value like this:
ZIP codes between 85000 and 85999 include the entire city of Phoenix. There are 36,000+ people between 75 and 84 years old in Phoenix, according to some old stats. It's probably safe to assume that there are more than 1,000 people who match the quasi-identifiers values of the first row. We saw earlier that the second row also matched 1,000+ people. So this generalized dataset satisfies 1000-map.
Wait a second, why does this feel like cheating? What happened there, to give us such a generous number so easily? This comes from the generous assumptions we made in our attack model. We assumed that the attacker had zero information on their target, except that they live in the US (which is implied by the presence of ZIP codes). And with only the information (ZIP code, age), you don't need a lot of generalization to make each row of your dataset blend in a large crowd.
To make this attack model stronger, you could assume that the attacker will use a smaller reidentification database. For example, suppose that your genetic disease you're studying requires regular hospital check-ups. The attacker could restrict their search only to people who have visited a hospital in the last year. The number of possible "suspects" for each value tuple gets smaller, so the \(k\) of \(k\)-map decreases too1.
\(k\)-map is inherently a weak model. So when choosing the quasi-identifiers and reidentification dataset, you have to think hard at what an attacker could do. If your attacker doesn't have lots of resources, it can be reasonable to assume that they won't get more data than, say, the voter files from your state. But if they can figure out more about your users, and you don't really know which reidentification dataset they could use, maybe \(k\)-anonymity is a safer bet2.
And now, some practice
OK, enough theory. Let's learn how to compute \(k\)-map in practice, and anonymize your datasets to make them verify the definition!
… There's one slight problem, though.
It's usually impossible.
Choosing the reidentification dataset is already a difficult exercise. Maybe you can afford to make generous assumptions, and assume the attacker doesn't know much. At best, you think, they'll buy voter files, or a commercial database, which contains everyone in your state, or in the US. But… then what?
To compute the maximum \(k\) such as your dataset verifies \(k\)-map, you would first need to get the reidentification dataset yourself. But commercial databases are expensive. Voter files might not be legal for you to obtain (even though an evil attacker could break the law to get them).
So, most of the time, you can't actually check whether your data satisfies \(k\)-map. If it's impossible to check, it's also impossible to know exactly which strategy to adopt to make your dataset verify the definition.
Exception 1: secret sample
Suppose you're not releasing all your data, but only a subset (or sample) of a bigger dataset that you own. Then, you can compute the \(k\)-map value of the sample with regard to the original, bigger dataset. In this case, choosing \(k\)-map over \(k\)-anonymity is relatively safe.
Indeed, your original dataset is certainly smaller than the reidentification dataset used by the attacker. Using the same argument as above, this means that you will obtain a lower bound on the value of \(k\). Essentially, you're being pessimistic, which means that you're on the safe side.
Even if the attacker has access to the original dataset, they won't know which records are in the sample. So if the original dataset is secret, or if you've chosen the sample in a secret way, \(k\)-map is a reasonable definition to use, and you can compute a pessimistic approximation of it.
Exception 2: representative distribution
This case is slightly different. Suppose that you can make the assumption that your data is a representative (or unbiaised) sample of a larger dataset. This might be a good approximation if you selected people (uniformly) at random to build your dataset, or if it was gathered by a polling organization.
In this case, you can compute an estimate of the \(k\)-map value for your data, even without the reidentification dataset. The statistical properties which enable this, and the methods you can use, are pretty complicated: I won't explain them in detail here. They are mentioned and compared in this paper, which has references to the original versions of each of them.
Exception 3: using humans
For the case of our doctor earlier, if the dataset is small enough, a motivated data owner could actually do the job of an attacker "by hand". Go through each record, and try to map it to a real person, or estimate the chances of it being possible. We pretty much did that in this article!
This is very approximative, and obviously not scalable. But for our imaginary doctor, it might be a reasonable solution!
ARX implements the methods from exceptions 1 and 2. Documentation for the first one can be found here. Instructions to estimate the number of unique values assuming uniformity can be found here. Originally, μ-ARGUS was the first software with this feature, but I couldn't run it on my machine, so I can't say much about it.
You might wonder why I wrote an entire article on a definition that is hardly used because of how impractical it is. In addition to the unique problems that we talked about in this article, the limitations of \(k\)-anonymity also apply. It's difficult to choose \(k\), non-trivial to pick the quasi-identifiers, and even trickier to model the reidentification database.
The definition also didn't get a lot of attention from academics. Historically, \(k\)-anonymity came first4. Then, people showed that \(k\)-anonymity was sometimes not sufficient to protect sensitive data, and tried to find stronger definitions to fix it. Weaker definitions were, of course, less interesting.
Nonetheless, I find that it's an interesting relaxation of \(k\)-anonymity. It shows one of its implicit assumptions: the attacker knows that their target belongs to the dataset. This assumption is sometimes too pessimistic: it might be worth considering alternate definitions.
Choosing a privacy model is all about modeling the attacker correctly. Learning to question implicit assumptions can only help!