Anonymization by Design: Taking the Guesswork out of Anonymization

Anonymization is becoming a necessity for organizations that use and share large personal data holdings. The goal of anonymization is to ensure that personal data used for secondary purposes cannot identify individuals. As we have discussed in our recent posts, this means much more than just deleting names and government identification numbers. As anonymization becomes a common requirement, labour-intensive and error-prone manual anonymization no longer makes sense. How can we move toward “anonymization by design”: integrating anonymization capabilities into the design of information systems?

Organizations of all kinds are creating and sharing data at a rapidly growing pace. Much of this data pertains to individuals and their interactions with public institutions, businesses, non-profit organizations, and many others. Besides its original use in providing services to individuals, this data may be used for numerous purposes, including research, program evaluation, reporting, and consumer analytics. Privacy regulations stipulate that personal data used for such secondary purposes must be anonymized: altered so that it cannot identify individuals. Anonymization is increasingly becoming a part of operations for many organizations that make use of large personal data holdings.

The challenge of anonymization is to ensure that privacy remains protected, regardless of where personal data travels or how it is used. While many organizations still carry out anonymization manually, a growing suite of tools, methods, and guidelines is available to organizations seeking to anonymize data more efficiently and reliably.

Approaches to Anonymization

Many organizations initially attempt to anonymize data manually. Typically, technical staff are assigned to delete or randomize the database fields that pose a high risk of re-identifying individuals, such as names, government identification numbers, and birthdates.

The difficulty with this approach is that, as we have discussed in our recent posts, individuals can potentially be re-identified by any unique combination of properties. For example, the combination of an ethnic identifier, gender, number of children, and postal code could point to one or two individuals who could be identified by someone with access to additional information sources. Organizations can reduce privacy risk by sharing only the data fields relevant to a data recipient’s specific needs. However, this is not usually what is done; entire databases are often released with only the most obviously identifying fields removed. Even at its best, manual anonymization is a time-consuming process that depends on subjective judgments about which data is safe to disclose.

A more sophisticated approach is to purchase a tool to evaluate the risk of re-identification. Several tools are available that calculate the probability that individuals could be identified on the basis of a given dataset. Risk measurement tools can support a more effective and defensible approach to anonymization.
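As a rough illustration of the kind of statistic such tools report, the sketch below groups records by a set of potentially identifying fields and treats each record's risk as one divided by the number of records that share its combination of values. The field names, the sample records, and the choice of this particular measure are assumptions made for the example; actual tools may rely on different and more sophisticated models.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"gender": "F", "children": 2, "postal_code": "K1A"},
    {"gender": "F", "children": 2, "postal_code": "K1A"},
    {"gender": "M", "children": 0, "postal_code": "K1A"},
    {"gender": "M", "children": 3, "postal_code": "V6B"},
]

# Fields assumed to be usable for re-identification when combined.
QUASI_IDENTIFIERS = ("gender", "children", "postal_code")


def record_risks(records, fields):
    """Return one risk value per record: 1 / (number of records sharing its values)."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return [1 / counts[k] for k in keys]


risks = record_risks(RECORDS, QUASI_IDENTIFIERS)
max_risk = max(risks)
average_risk = sum(risks) / len(risks)
print(f"maximum risk: {max_risk:.2f}, average risk: {average_risk:.2f}")
```

In this sample, the two records with a unique combination of values carry a risk of 1, which is the kind of result a decision maker would then need to interpret.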

The challenge in using these tools is understanding the risk statistics they provide and making decisions based on them. Unfortunately, many organizations do not have staff with the expertise to determine which data fields should be anonymized and which anonymization techniques are appropriate. Management often lacks the background knowledge necessary to interpret risk data and make decisions about appropriate risk levels.

Anonymization by Design

A more integrative approach is to design or redesign a database to support anonymization. The ideal scenario for disclosing data to a third party is to create a dataset that includes only the specific data fields relevant to their purpose. However, the reality is that organizations typically attempt to anonymize an existing database for disclosure. This can be very risky if the database is not structured appropriately or if adequate risk measurement techniques are not used.

There are three key steps to designing or redesigning a database for anonymization:

  1. Structural design

    Many databases have structural vulnerabilities that can undermine anonymization. A common problem is the existence of hidden tables or notes that can be overlooked during the anonymization process. The first step toward setting up a database suitable for anonymization is to determine its structural requirements and identify data risks. For example, a key structural requirement is that direct identifiers be kept separate from the rest of the data: records should be indexed by random identification codes rather than direct identifiers such as names or government identification numbers, and only one index table should link identification codes to direct identifiers. Identifying data risks means identifying unusual properties within the database that create a high risk of identification: for example, medical diagnoses that are rare within a particular age bracket, or geographic identifiers pertaining to a small population. These data risks become the initial targets for anonymization.

  2. Functional design

    With appropriate expertise, software functions capable of anonymizing data without reducing its utility can be integrated into a database. Such functions might include shifting the dates of client transactions within a realistic time frame, or replacing individuals’ names with randomly generated fictional names. The capacity to replace personal data with altered or randomly generated data is especially important when information technology staff needs to perform testing and checks, which generally require access to a version of the database that has all of the same fields and functions as the live database.

  3. Risk measurement design

    The key to measuring privacy risk in the context of anonymization is the capacity to locate low count records: any combination of properties that is unique to a particular individual or a very small group of individuals. A database designed for anonymization will include programming code that can identify low count records that should be deleted in order to reduce privacy risk. Eliminating low count records greatly reduces the risk of individuals being re-identified.
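The structural and functional steps above can be sketched together as follows. This is a minimal illustration rather than a production design: the field names, the `pseudonymize` and `mask_for_testing` helpers, the list of fictional names, and the 30-day shift window are all assumptions introduced for the example.

```python
import random
import secrets
from datetime import date, timedelta

# Hypothetical source records containing a direct identifier ("name").
SOURCE = [
    {"name": "Alice Martin", "visit_date": date(2023, 3, 14), "diagnosis": "asthma"},
    {"name": "Bob Chen", "visit_date": date(2023, 6, 2), "diagnosis": "diabetes"},
]

FICTIONAL_NAMES = ["Jordan Lee", "Sam Rivera", "Casey Wong", "Robin Patel"]  # illustrative


def pseudonymize(records):
    """Split records into one index table and a data table keyed by random codes."""
    code_for = {}  # name -> code, so one person always gets the same code
    data_table = []
    for rec in records:
        name = rec["name"]
        if name not in code_for:
            code_for[name] = secrets.token_hex(8)  # random, not derived from the name
        row = {k: v for k, v in rec.items() if k != "name"}
        row["person_code"] = code_for[name]
        data_table.append(row)
    # The index table is the only place where codes are linked to direct identifiers.
    index_table = {code: name for name, code in code_for.items()}
    return index_table, data_table


def mask_for_testing(records, rng, max_shift_days=30):
    """Return a copy with fictional names and shifted dates, keeping every field."""
    masked = []
    for rec in records:
        shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
        masked.append({
            **rec,
            "name": rng.choice(FICTIONAL_NAMES),
            "visit_date": rec["visit_date"] + shift,
        })
    return masked


index_table, data_table = pseudonymize(SOURCE)
test_copy = mask_for_testing(SOURCE, random.Random(42))
```

Here each record is shifted independently. If the intervals between one person's transactions matter to the recipient, applying a single shift per person is one possible variation.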

In practice, these steps may not be precisely linear, but they summarize the most important aspects of an anonymization-by-design strategy. Once a database has appropriate structures, functions, and risk measurement tools for anonymization, preparing the data for release to a particular recipient is a relatively straightforward, efficient, and repeatable process. By taking the guesswork out of anonymization, an anonymization-by-design approach can make data sharing a safe and regular part of operations rather than an exceptional and risky event.
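A release step of the kind described above might look like the sketch below, which keeps only the fields a recipient needs and then removes low count records. The `prepare_release` function, the field names, and the threshold of 3 are hypothetical; an appropriate threshold depends on the data and on the organization's tolerance for risk. The sketch also assumes that every released field could contribute to re-identification.

```python
from collections import Counter

# Hypothetical records with direct identifiers already removed.
RECORDS = [
    {"age_band": "30-39", "region": "North", "diagnosis": "asthma", "income": 52000},
    {"age_band": "30-39", "region": "North", "diagnosis": "asthma", "income": 48000},
    {"age_band": "30-39", "region": "North", "diagnosis": "asthma", "income": 61000},
    {"age_band": "70-79", "region": "North", "diagnosis": "asthma", "income": 30000},
]


def prepare_release(records, fields, min_count=3):
    """Keep only `fields`, then drop rows whose combination of values is rare."""
    projected = [{f: r[f] for f in fields} for r in records]
    counts = Counter(tuple(row[f] for f in fields) for row in projected)
    kept = [
        row for row in projected
        if counts[tuple(row[f] for f in fields)] >= min_count
    ]
    removed = len(projected) - len(kept)
    return kept, removed


release, removed = prepare_release(RECORDS, ("age_band", "region", "diagnosis"))
print(f"released {len(release)} rows, removed {removed} low count rows")
```

Because the function takes the field list and threshold as parameters, the same step can be repeated for each recipient with different settings.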

To receive a more comprehensive summary of our 7 Principles of Anonymization by Design please contact us at



What is “Identifying Information”? An Identity Spectrum Model

Dimensions of Identity: A Risk-based Approach to Anonymization