 desfontain.es Go back Open original

# δ-presence, for when being in the dataset is sensitive - Ted is writing things

Remember $$k$$-map? We used this definition when the attacker didn't know who was in the dataset. Let's go back to this setting, with a slightly different scenario. You're no longer a doctor studying human sexual behavior. You're still a doctor, but this time, you're specialized in treating a particular chronic disease. Instead of running a survey, you're running a clinical trial for a new drug to treat this disease. Similarly, you want to share the data with other people.

At first glance, these two settings look similar — but there is a crucial difference. Which information is sensitive, exactly? For the survey, the answers of each participant are sensitive, as they reveal intimate details. But for the clinical study, being in the dataset is the sensitive information. If someone figures out that you've taken part in the study, they learn that you suffer from this disease.

So, what does it change in practice? Suppose that your dataset contains the following records:

ZIP code age
85535 10
85535 12
85535 13
85535 13
85535 16
85535 43

You do a little research on who lives in ZIP code 85535. You learn that in this ZIP code:

• 5 people have ages between 10 and 19;
• 5 people have ages between 20 and 29;
• 10 people have ages between 30 and 39;
• 10 people have ages between 40 and 49;
• and 20 people are 50 or older.

Transforming this part of your dataset to have it satisfy $$5$$-map is easy:

ZIP code age
85535 10-19
85535 10-19
85535 10-19
85535 10-19
85535 10-19
85535 40-49

… But what has gone wrong there?

An attacker, using only public data, knows that there are 5 people aged between 10 and 19 in ZIP code 85535. Then, by looking at your de-identified dataset, the attacker can figure out that all of them are part of your data. Thus, they all have this specific disease. The attacker learned something sensitive about individuals, without re-identifying any record. Just like in the example of $$l$$-diversity!

We need yet another definition. Introducing… $$\delta$$-presence!

## Definition

Remember what we counted for our previous privacy definitions? For each combination of quasi-identifier attributes:

• for $$k$$-anonymity, we counted the number of records in the dataset;
• and for $$k$$-map, we counted the number of records in the larger population.

What went wrong in our leading example? For certain attributes, these numbers were equal. To detect this, we now compute the ratio between those two numbers. Then, the $$\delta$$ in $$\delta$$-presence is the largest ratio across the dataset.

Consider the dataset above. The ratio for the records (85535, 10-19) is $$5/5=1$$, and the ratio for the records (85535, 40-49) is $$1/10=0.1$$. Thus, since we defined $$\delta$$ as the greatest ratio, we have $$\delta=1$$. Since the $$k$$ of $$k$$-map is always larger than the $$k$$ of $$k$$-anonymity, this is the maximum possible value of $$\delta$$. Saying that a dataset satisfies $$1$$-presence gives zero guarantees.

Whether $$\delta=1$$ is not the only interesting thing. We also want this value to be small. The lower, the better. Consider what it means if $$\delta=0.95$$. The attacker might learn that their target has a 95% chance of being in the dataset. It's not quite a 100% certainty, but it still can be problematic. For example, it might be more than enough for an insurance company to deny you coverage…

How do we get to a lower $$\delta$$ in our previous example? One solution would be to remove the age completely:

ZIP code age
85535 10-39
85535 10-39
85535 10-39
85535 10-39
85535 10-39
85535 40-49

Then, the ratio for the records (85535, 10-39) becomes $$5/(5+5+10)=0.25$$. The ratio for record (85535, 40-49) is still $$0.1$$, so $$\delta=0.25$$. (Assuming that no other record in the dataset has ZIP code 85535, and all other records have a smaller ratio).

$$\delta$$-presence was first proposed by Nergiz et al. in a 2007 paper (pdf). In this paper, the definition is a bit different. The authors compute not only the largest ratio, but also the smallest one. The $$\delta$$ parameter hides two parameters $$\left(\delta_{\text{min}},\delta_{\text{max}}\right)$$. This was done to protect against the symmetric attack: hiding that someone is not in the dataset. I never encountered a situation where this is a real concern, so I simplified it a bit for this post.

## $$\delta$$-presence in practice

$$\delta$$-presence is computed from the ratios between quantities used in $$k$$-anonymity and $$k$$-map. While $$k$$-anonymity is very easy to compute, $$k$$-map is much harder. As such, $$\delta$$-presence has very similar practical characteristics than $$k$$-map. Since you don't typically have access to the full larger dataset, you can't compute $$\delta$$ exactly. You can use a pessimistic approximation if your data is a sample of a larger dataset that you own. You can also do the work of estimating $$\delta$$-presence by hand.

What about statistical approximations? Nergiz et al. proposed an interesting method in a followup paper (pdf). Unfortunately, two of its requirements make it hardly usable in practical scenarios.

• First, to run the algorithm, you need to "describe your beliefs about the world" (in a statistical sense). Unless you're a statistician, this is not something you can really do.
• Second, computing the algorithm exactly is very expensive. The authors propose a lot of approximations to make it tractable… But then, using them makes the results even more uncertain.

Finally, if you still want to use this algorithm, you would also likely have to implement it yourself. I don't know of any available software that does it for you.

## Conclusion

Like $$k$$-map, in theory, it often makes sense to use $$\delta$$-presence. It's a pity that both definitions are so difficult to use in practice! Having simpler (and more usable) approximation algorithms would be great… Which is why I have done some research work in that direction. And the results of this work will be the topic of a future post! =)