Releasing the data behind a study, or the data that trained an algorithm, is usually good for transparency, but it can just as easily cause harm.
So how can you be open about what data you used without putting the individuals that compose that data at risk?
The threat model here is an adversary with background knowledge about a target (say, their age and zip code) who tries to re-identify the target's record, or learn their sensitive attributes, from the released data.
Data such as age, sex, and zip code are quasi-identifiers: individually innocuous attributes that, in combination, can uniquely identify an individual.
An equivalence group is a set of records that have the same combination of quasi-identifiers.
Broadly we have two methods: generalization, in which a value is replaced by a coarser one (e.g., age 22 becomes the range [20,30]), and suppression, in which the value is removed entirely.
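Generalization and suppression can be sketched in a few lines. This is a minimal illustration; the function names, bucket width, and the `"*"` suppression marker are all assumptions, not a standard API:

```python
# Hypothetical sketch of the two basic anonymization operations.

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a coarser bucket,
    e.g. 22 -> "[20,30)"."""
    lo = (age // width) * width
    return f"[{lo},{lo + width})"

def suppress(value):
    """Suppression: remove the value entirely, here shown as "*"."""
    return "*"

print(generalize_age(22))  # [20,30)
print(suppress("10001"))   # *
```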
Many anonymization methods break down on high-dimensional, sparse datasets, where essentially every attribute acts as a quasi-identifier.
One problem is record linkage, in which the target can be linked to one or very few records in the dataset.
k-anonymity helps with this: the idea of k-anonymity is that we want at least k records in every equivalence group, so that any individual is indistinguishable from at least k − 1 others.
For instance, if we have the following data:
| Sex | Age | Location     |
|-----|-----|--------------|
| M   | 22  | New York, NY |

And if we want k = 2, this record forms an equivalence group of size 1, so we would have to generalize its quasi-identifiers (e.g., age 22 → [20,30]) or suppress values until at least one other record shares the same combination.
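A k-anonymity check is just counting equivalence-group sizes. A minimal sketch, where the toy records and quasi-identifier columns are illustrative assumptions:

```python
from collections import Counter

# Toy table of quasi-identifier tuples (sex, age range, location).
records = [
    ("M", "[20,30)", "New York, NY"),
    ("M", "[20,30)", "New York, NY"),
    ("F", "[30,40)", "Boston, MA"),
]

def is_k_anonymous(rows, k):
    """True if every equivalence group (identical combination of
    quasi-identifiers) contains at least k records."""
    groups = Counter(rows)
    return all(size >= k for size in groups.values())

print(is_k_anonymous(records, 2))  # False: the Boston record is unique
```

The unique Boston record is exactly the record-linkage risk: an adversary who knows those quasi-identifiers links the target to a single row.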
Another problem is attribute linkage in which some sensitive values dominate the target's equivalence group.
The idea of l-diversity is that we want at least l well-represented values of the sensitive attribute in every equivalence group, so that no single value dominates.
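The simplest variant, distinct l-diversity, just requires l different sensitive values per group. A minimal sketch with made-up data (the quasi-identifier tuples and diagnoses are assumptions):

```python
from collections import defaultdict

# Each row: (quasi-identifier tuple, sensitive value).
rows = [
    (("M", "[20,30)"), "flu"),
    (("M", "[20,30)"), "cancer"),
    (("F", "[30,40)"), "flu"),
    (("F", "[30,40)"), "flu"),
]

def is_l_diverse(rows, l):
    """True if every equivalence group contains at least l distinct
    sensitive values (the 'distinct l-diversity' variant)."""
    groups = defaultdict(set)
    for qid, sensitive in rows:
        groups[qid].add(sensitive)
    return all(len(vals) >= l for vals in groups.values())

print(is_l_diverse(rows, 2))  # False: the F group only contains "flu"
```

Here the F group is 2-anonymous but not 2-diverse: an adversary who can place the target in that group learns the diagnosis with certainty, which is attribute linkage.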
With t-closeness we want the distribution of a sensitive attribute within any equivalence group to be within distance t of that attribute's distribution over the entire dataset.
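A t-closeness check can be sketched for a categorical sensitive attribute. Note the hedge: the original definition measures distance with Earth Mover's Distance; this sketch uses total variation distance as a simpler stand-in, and the data is illustrative:

```python
from collections import Counter, defaultdict

# Each row: (quasi-identifier tuple, sensitive value). Toy data.
rows = [
    (("M", "[20,30)"), "flu"),
    (("M", "[20,30)"), "flu"),
    (("F", "[30,40)"), "flu"),
    (("F", "[30,40)"), "cancer"),
]

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def is_t_close(rows, t):
    """True if every equivalence group's sensitive-attribute
    distribution is within total variation distance t of the
    whole-table distribution."""
    overall = distribution([s for _, s in rows])
    groups = defaultdict(list)
    for qid, s in rows:
        groups[qid].append(s)
    for vals in groups.values():
        local = distribution(vals)
        tvd = 0.5 * sum(abs(local.get(v, 0) - overall.get(v, 0))
                        for v in set(overall) | set(local))
        if tvd > t:
            return False
    return True

print(is_t_close(rows, 0.3))  # True: both groups are within 0.25 of overall
```

Intuitively, even if a group is diverse, it leaks information when its distribution of diagnoses differs sharply from the population's; t-closeness bounds that gap.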