Anonymization

Releasing the data behind studies or that trained algorithms is usually a good thing for transparency, but just as often can be harmful.

So how can you be open about what data you used without putting the individuals that compose that data at risk?

The threat model here is:

we want to release the data itself and not summary statistics
we want to protect all individuals equally
adversaries are resourceful
adversaries may use unknown techniques
adversaries may have access to additional data (e.g. personal knowledge about the target)

Data such as age, sex, zip code, etc, are quasi identifiers in that they can, in combination, uniquely identify an individual.

An equivalence group is a set of records that have the same combination of quasi-identifiers.

Broadly we have two methods:

generalization methods, in which we take a specific value and make it more general, e.g. go from age=25 to age in [20,30].
suppression methods, in which we remove values altogether

Many anonymization methods fail when we have high-dimensional sparse datasets (everything is a quasi-identifier).

k-anonymity

One problem is record linkage, in which the target can be linked to one or very few records in the dataset.

k-anonymity helps with this: the idea of k-anonymity is that we want at least $k$ records for each equivalence group in the dataset.

For instance, if we have the following data:

sex	age	location
M	22	New York, NY
F	50	Philadephia, PA
M	25	Albany, NY
F	35	Pittsburgh, PA

And if we want $k=2$, the resulting anonymized dataset might be:

sex	age	location
M	20-30	NY
F	30-50	PA
M	20-30	NY
F	30-50	PA

l-diversity

Another problem is attribute linkage in which some sensitive values dominate the target's equivalence group.

The idea of l-diversity is that we want at least $l$ distinct values for each sensitive attribute and equivalence group. l-diversity is an extension of k-anonymity.

t-closeness

With t-closeness we want the distribution of a sensitive attribute in any equivalence group to be close to the attribute's distribution in the entire dataset.

References

What every data scientist should know about data anonymization. Katharina Rasch.