Anonymization

Releasing the data behind a study, or the data that trained a model, is usually good for transparency, but it can just as easily cause harm.

So how can you be open about the data you used without putting the individuals in that data at risk?

The threat model here is an attacker who already knows some of a target's attributes (from public records, social media, or other datasets) and tries to match them to records in the released data.

Data such as age, sex, zip code, etc. are quasi-identifiers: individually they may seem harmless, but in combination they can uniquely identify an individual.

An equivalence group is a set of records that have the same combination of quasi-identifiers.

Broadly we have two methods for weakening quasi-identifiers: generalization, which replaces exact values with coarser ones (an exact age becomes an age range), and suppression, which removes values or whole attributes entirely.
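
A minimal sketch of both operations using pandas (the columns, bin edges, and data here are purely illustrative):

```python
import pandas as pd

# Hypothetical records containing quasi-identifiers.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "F"],
    "age": [22, 50, 25, 35],
    "location": ["New York, NY", "Philadelphia, PA", "Albany, NY", "Pittsburgh, PA"],
})

# Generalization: replace exact values with coarser ones.
df["age"] = pd.cut(df["age"], bins=[20, 30, 50], labels=["20-30", "30-50"])
df["location"] = df["location"].str.split(", ").str[-1]  # keep only the state

# Suppression: drop values or a whole attribute.
df = df.drop(columns=["sex"])
```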

Many anonymization methods fail when we have high-dimensional sparse datasets (everything is a quasi-identifier).

k-anonymity

One problem is record linkage, in which the target can be linked to one or very few records in the dataset.

k-anonymity helps with this: the idea is that every equivalence group in the dataset should contain at least $k$ records, so that any individual is indistinguishable from at least $k - 1$ others.

For instance, if we have the following data:

sex age location
M 22 New York, NY
F 50 Philadelphia, PA
M 25 Albany, NY
F 35 Pittsburgh, PA

And if we want $k=2$, the resulting anonymized dataset might be:

sex age location
M 20-30 NY
F 30-50 PA
M 20-30 NY
F 30-50 PA
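
A rough sketch of how to check the property, assuming a pandas DataFrame whose quasi-identifiers have already been generalized as above (the function name is hypothetical, not a standard API):

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    # Each unique combination of quasi-identifier values is an equivalence group;
    # the dataset is k-anonymous when every group contains at least k records.
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

anonymized = pd.DataFrame({
    "sex": ["M", "F", "M", "F"],
    "age": ["20-30", "30-50", "20-30", "30-50"],
    "location": ["NY", "PA", "NY", "PA"],
})
print(is_k_anonymous(anonymized, ["sex", "age", "location"], k=2))  # True
```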

l-diversity

Another problem is attribute linkage, in which some sensitive values dominate the target's equivalence group: even without pinpointing the exact record, an attacker can infer the target's sensitive attribute with high confidence.

l-diversity extends k-anonymity to address this: within each equivalence group, we want at least $l$ distinct values for each sensitive attribute.
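
A sketch of checking (distinct) l-diversity, assuming a hypothetical sensitive column name such as "diagnosis"; stricter variants like entropy l-diversity are not covered here:

```python
import pandas as pd

def is_l_diverse(df: pd.DataFrame, quasi_identifiers: list[str],
                 sensitive: str, l: int) -> bool:
    # Count the distinct sensitive values inside each equivalence group.
    distinct_per_group = df.groupby(quasi_identifiers)[sensitive].nunique()
    return bool((distinct_per_group >= l).all())

# Example (hypothetical data and column names):
# is_l_diverse(df, ["sex", "age", "location"], "diagnosis", l=2)
```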

t-closeness

With t-closeness we want the distribution of a sensitive attribute within any equivalence group to be within a distance $t$ of that attribute's distribution over the entire dataset.
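
A sketch of a t-closeness check for a categorical sensitive attribute; the original formulation measures closeness with Earth Mover's Distance, and total variation distance is used below only as a simple stand-in:

```python
import pandas as pd

def is_t_close(df: pd.DataFrame, quasi_identifiers: list[str],
               sensitive: str, t: float) -> bool:
    # Distribution of the sensitive attribute over the whole dataset.
    overall = df[sensitive].value_counts(normalize=True)
    for _, group in df.groupby(quasi_identifiers):
        group_dist = group[sensitive].value_counts(normalize=True)
        # Align the group's distribution with the overall one, then compare.
        diff = group_dist.reindex(overall.index, fill_value=0) - overall
        if 0.5 * diff.abs().sum() > t:  # total variation distance
            return False
    return True
```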
