Privacy laws stipulate protections for “personal information,” which is usually defined as information that could readily identify an individual. This has given rise to various attempts to define “identifying information,” when in fact there is no clear division between types of information that can and cannot identify individuals. Individuals can be identified by unique attributes such as names and official identification numbers, or by a unique combination of other attributes. Effective anonymization requires the capacity to measure the risk that a particular combination of attributes could identify an individual. Conceptualizing identity as a set of dimensions helps to evaluate and reduce this risk by reducing the occurrence of low count records.
As we discussed in our most recent post, “What is ‘Identifying Information’? An Identity Spectrum Model,” identifying information exists on a spectrum from “verinyms,” which provide certainty of an individual’s identity, to anonymous information, which provides no connection to identity at all. Most data fall somewhere in between, identifying subgroups to which an individual belongs. Individuals can be identified by a single attribute, such as a full postal address, or by a combination of attributes: for example, the ages of a person’s children, combined with a racial identifier and a postal code, may point to one or two individuals.
The goal of anonymization, when identification is understood as a spectrum, is to reduce the amount of personal information in a dataset to such a point that it no longer poses a significant risk of re-identifying individuals, and can therefore be shared more freely. The key to effective anonymization is the ability to measure the risk of re-identification in a dataset and apply changes that reduce this risk to an acceptable threshold. Conceptualizing identity as a set of dimensions helps to evaluate this risk and guide the use of anonymization techniques.
Dimensions of Identity
Datasets containing personal information usually include a variety of personal attributes, such as gender, age, address, diagnoses, and so on. Some of these attributes are more closely related to each other than others. Grouping related attributes into dimensions provides significant information about re-identification risk. For example, the location dimension includes a number of identifying attributes: mailing address, postal code, electoral district, city, etc. As more attributes belonging to a given dimension are included in a dataset, and as those attributes become more specific, the likelihood of identification within that dimension rises.
Some key dimensions included in many datasets are:
- Personal demographics (e.g., gender, age, ethnicity, number of children)
- Personal records (e.g., medical or financial information)
- System transactions (e.g., appointments, billing, service and program participation)
The risk of releasing data increases with:
- the level of granularity in each of the dimensions, and
- the uniqueness of any property, or combination of properties, in the dataset.
Sufficient granularity in any dimension can be enough to identify an individual: for example, a full postal address may belong to a single person. Medium to high levels of granularity in several dimensions can also identify an individual: for example, there may be only one individual in a hospital’s database with a particular year of birth, postal code, and gender. The above diagram plots granularity as a function of distance from the core.
Along with granularity it is essential to consider the uniqueness of any property or combination of properties included in a dataset. Some postal codes contain twenty households, and others only one. Some medical diagnoses are very common in a particular age bracket, and unusual in others.
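The effect of combining attributes can be checked by counting how many records share each combination of values. The following is a minimal sketch in Python, not a prescribed method: the records, the field names (`birth_year`, `postal_code`, `gender`), and the helper `smallest_group` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"birth_year": 1982, "postal_code": "A1A", "gender": "F"},
    {"birth_year": 1982, "postal_code": "A1A", "gender": "M"},
    {"birth_year": 1982, "postal_code": "B2B", "gender": "F"},
    {"birth_year": 1982, "postal_code": "B2B", "gender": "F"},
    {"birth_year": 1975, "postal_code": "A1A", "gender": "F"},
    {"birth_year": 1975, "postal_code": "A1A", "gender": "F"},
    {"birth_year": 1975, "postal_code": "B2B", "gender": "M"},
    {"birth_year": 1975, "postal_code": "B2B", "gender": "M"},
]

def smallest_group(rows, fields):
    """Size of the smallest group of rows that share the same values for `fields`."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(counts.values())

# Each added attribute splits the records into smaller groups.
print(smallest_group(records, ["birth_year"]))                           # 4
print(smallest_group(records, ["birth_year", "postal_code"]))            # 2
print(smallest_group(records, ["birth_year", "postal_code", "gender"]))  # 1
```

In this toy data, year of birth alone leaves every person in a group of four, while year of birth, postal code, and gender together single out two individuals.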
The Key to Anonymization: Avoiding Low Count Records
The granularity and uniqueness of data both point to the same fundamental determinant of re-identification risk, which is the existence of low count records: any combination of properties that is unique to a particular individual or a very small group of individuals. The key number for measuring re-identification risk is the number of individuals who share identical records.
For example: A hospital database is being anonymized so that it can be shared with a medical research institute. Patient names and health card numbers have been deleted from the dataset, and dates of birth and death have been generalized to years of birth and death only. Dates of diagnosis have been generalized to monthly intervals. Data fields that remain unchanged are diagnosis and treatment procedures. If, say, only three individuals born in 1982 received a particular diagnosis in March 2014, the risk of re-identification is too high. One option is to delete these records. The other is to apply additional anonymization, perhaps by generalizing years of birth to ten-year intervals (e.g., 1980-1989), or alternatively to ten-year age brackets (e.g., ages 30-39).
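The generalization step in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used by any particular institution: the birth years are invented, the helpers `to_decade` and `low_count_groups` are hypothetical, and the minimum group size of 5 is an arbitrary threshold picked for the example (the text above says only that a group of three is too small).

```python
from collections import Counter

THRESHOLD = 5  # hypothetical minimum acceptable group size

def to_decade(year):
    """Generalize a year to its ten-year interval, e.g. 1982 -> '1980-1989'."""
    start = year - year % 10
    return f"{start}-{start + 9}"

def low_count_groups(keys, threshold=THRESHOLD):
    """Return {value: count} for values shared by fewer than `threshold` records."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n < threshold}

# Hypothetical birth years of patients who share one diagnosis and diagnosis month.
birth_years = [1982, 1982, 1982, 1984, 1986, 1986, 1989]

before = low_count_groups(birth_years)
after = low_count_groups(to_decade(y) for y in birth_years)

print(before)  # {1982: 3, 1984: 1, 1986: 2, 1989: 1}
print(after)   # {}
```

With exact years, every group in this toy data falls below the threshold. After generalizing to decades, all seven records share the value 1980-1989 and no low count group remains.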
The key to anonymization lies not in deleting particular types of data, but in avoiding the occurrence of low count records pertaining to one or a few individuals with a specific set of characteristics. The concept of dimensions of identity provides a starting point towards this goal by helping to break down a dataset and suggest possibilities for anonymization. Dimensions not relevant to a particular purpose can be eliminated from the dataset. Within each of the remaining dimensions, the most specific fields can be deleted, randomized, or generalized. Finally, any remaining low count records can be identified and deleted. When this is accomplished, the risk of re-identification approaches zero, as each individual record is no longer a unique set of attributes but has become one among many identical records.
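The final step, deleting whatever low count records remain after generalization, might look like the sketch below. It rests on the same assumptions as before: the records, the field names, the helper `suppress_low_counts`, and the threshold of 5 are hypothetical, and a real release would set its threshold according to its own risk assessment.

```python
from collections import Counter

def suppress_low_counts(rows, fields, threshold):
    """Keep only rows whose combination of `fields` is shared by at least `threshold` rows."""
    def key(row):
        return tuple(row[f] for f in fields)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] >= threshold]

# Hypothetical records after generalization; field names are illustrative only.
rows = (
    [{"birth_decade": "1980-1989", "diagnosis": "asthma"}] * 6
    + [{"birth_decade": "1990-1999", "diagnosis": "asthma"}] * 2
)

released = suppress_low_counts(rows, ["birth_decade", "diagnosis"], threshold=5)
print(len(rows), len(released))  # 8 6
```

Here the two records in the 1990-1999 group are dropped because too few people share that combination, and the six remaining records are indistinguishable from one another on the released fields.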