Cloud Storage and Tokenization: Privacy Solution or Privacy Risk?

Recently, tokenization has been promoted as a privacy solution enabling cloud storage of personal data records. At first glance, tokenization seems a simple, cost-effective, and secure solution, but is this in fact the case?

For many businesses and public organizations that store large volumes of personal data, such as credit card purchase records, healthcare records, or government services databases, cloud storage has proven a tempting opportunity. Why invest in secure servers, which take up space, require maintenance, and need to be replaced every few years, when it is possible, for a modest fee, to store data in a secure online database? Cloud storage seems cost-effective and convenient, and has the added benefit of providing remote data access to employees, partners, and clients.

Most organizations are aware of some of the limitations of cloud storage. Online data storage usually poses a higher security risk than on-site servers. Cloud storage providers’ security practices are often rigid or opaque from the user’s perspective: providers typically have a template service agreement and are not willing to amend it to meet the requirements of a client’s organizational or jurisdictional policies. Further, cloud computing providers themselves pose a privacy risk from a legal perspective. Providers are subject to the laws of the jurisdiction where they are based, which is usually not the user’s jurisdiction. Most cloud storage providers are based in the United States, where the law effectively allows American government agencies to access any data deemed relevant to national security. From the perspective of Canadian organizations, such access is clearly a privacy violation.

A common solution to privacy concerns about cloud storage is tokenization, a system in which personal data is split into two databases. One database contains the raw, individual records, but with direct identifiers such as names, credit card numbers, or health card numbers removed. Each record is labeled with a randomly generated identification number (a ‘token’). The second database contains the identification key: direct identifiers paired with the corresponding identification numbers. The two databases are hosted by different cloud storage providers. When an authorized user issues a request for records, it is processed by both databases. Results from each of the databases are returned to a local machine, which then combines the results and produces the requested data.
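The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the field names, the choice of which fields count as direct identifiers, and the use of in-memory dictionaries in place of two separately hosted databases are all assumptions made for the example.

```python
import secrets

# Assumption: these hypothetical fields are the direct identifiers.
DIRECT_IDENTIFIERS = ("name", "health_card_number")


def tokenize(records):
    """Split records into a de-identified store and an identification key store."""
    data_store = {}  # token -> record with direct identifiers removed
    key_store = {}   # token -> direct identifiers only
    for record in records:
        # The token is random, so it reveals nothing about the record itself.
        token = secrets.token_hex(16)
        key_store[token] = {f: record[f] for f in DIRECT_IDENTIFIERS}
        data_store[token] = {
            f: v for f, v in record.items() if f not in DIRECT_IDENTIFIERS
        }
    return data_store, key_store


def reidentify(token, data_store, key_store):
    """Recombine both halves of a record, as the local machine would."""
    return {**key_store[token], **data_store[token]}


# Illustrative, made-up records.
sample_records = [
    {"name": "A. Patient", "health_card_number": "0000-000-001",
     "date_of_birth": "1970-03-14", "diagnosis": "asthma"},
    {"name": "B. Patient", "health_card_number": "0000-000-002",
     "date_of_birth": "1985-11-02", "diagnosis": "diabetes"},
]

data_store, key_store = tokenize(sample_records)
for token in data_store:
    print(reidentify(token, data_store, key_store))
```

Note that the recombined, fully identifiable record exists only on the machine that calls `reidentify`, and that fields such as `date_of_birth` remain in the de-identified store.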

On the surface, it is a simple and elegant solution. Unless both databases are somehow breached, privacy is assured. However, there are three significant problems with tokenization that warrant a reconsideration of its effectiveness: cost efficiency, vulnerabilities in the results aggregation process, and residual risk due to indirect identifiers.

1. Cost efficiency

The usual rationale for cloud storage is that it is more cost-effective than maintaining secure servers. This is not necessarily true. When the identifier database is hosted internally and the raw data is hosted by a cloud storage provider, maintaining two systems can prove to be more expensive than relying on one internal secure system. When the two datasets are hosted by different cloud storage providers, the expenses are still considerable: besides cloud storage fees, set-up costs can be high. The process of separating data into two databases is time-consuming and needs to be done in a secure environment. The security safeguards offered by cloud storage providers may also need to be supplemented by the organization’s security staff, incurring additional costs.

2. Results aggregation

When tokenization is implemented with adequate security, the point of highest risk is the access point: the computer which matches records from the two databases to create identifiable records. Frontline staff, administrators, and technical support staff will all need to access identifiable records, often in large numbers. For instance, clinical staff often find errors in hospital records and refer the problem to data quality assurance specialists. When this happens, data quality assurance staff usually have to look up multiple records in order to diagnose and fix the problem: since it is often the case that a patient’s data has been entered into the record of another patient with a similar name or the same date of birth, quality assurance staff will search all records with the same last name or date of birth. In this process, all of the identifiable records that quality assurance staff access are recorded on their computers, which therefore have to be secure. This issue cannot be circumvented by allowing technical support staff access only to de-identified data: as this example shows, some technical support teams do need access to identifiable records. In short, tokenization cannot replace the need for a secure internal system.

3. Indirect identifiers

A more fundamental concern with tokenization is its assumption that removing direct identifiers from personal records eliminates privacy risk. Indirect identifiers, such as dates of birth, postal codes, diagnoses, and the dates of significant events (e.g., credit card transactions, medical procedures, hospital admissions and discharges), can be used in combination to identify individuals. The question of how to split data into two databases is therefore more complex than it initially appears: how many data fields need to be removed from records for them to be effectively de-identified? How many data fields can be included in the identification key database before it poses a privacy risk?
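One rough way to see how indirect identifiers combine is to count how many records share each combination of values. The sketch below is an illustration only, using made-up records and hypothetical field names; a record that is the only one with its combination of values can be singled out even though its direct identifiers have been removed. It is not a complete re-identification risk assessment.

```python
from collections import Counter


def group_sizes(records, indirect_identifiers):
    """Count how many records share each combination of indirect identifier values."""
    return Counter(
        tuple(record[field] for field in indirect_identifiers)
        for record in records
    )


def smallest_group_size(records, indirect_identifiers):
    """Size of the smallest group; 1 means at least one record is unique."""
    return min(group_sizes(records, indirect_identifiers).values())


def singled_out(records, indirect_identifiers):
    """Records whose combination of indirect identifier values is unique."""
    sizes = group_sizes(records, indirect_identifiers)
    return [
        record for record in records
        if sizes[tuple(record[field] for field in indirect_identifiers)] == 1
    ]


# Illustrative, made-up de-identified records (direct identifiers already removed).
deidentified = [
    {"date_of_birth": "1970-03-14", "postal_code": "K1A 0B1", "diagnosis": "asthma"},
    {"date_of_birth": "1970-03-14", "postal_code": "K1A 0B1", "diagnosis": "diabetes"},
    {"date_of_birth": "1985-11-02", "postal_code": "M5V 2T6", "diagnosis": "asthma"},
]

fields = ["date_of_birth", "postal_code"]
print(smallest_group_size(deidentified, fields))
print(len(singled_out(deidentified, fields)))
```

In this toy dataset, adding `diagnosis` to the list of fields makes every record unique, which illustrates why the choice of fields left in the de-identified database matters.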

An issue worth noting here is the practice of allowing third-party access to the purportedly de-identified database. Healthcare providers may see tokenization as a way to facilitate data sharing, by providing researchers and administrators with access to the database containing raw data records with direct identifiers removed. Because of the privacy risks just described, third-party data clients should never be given access to live clinical databases. Decisions about data sharing should always take into account which data fields each client actually needs, how much privacy and security protection they can guarantee, and how likely it is that the data could be used to identify individuals. We discuss these issues in detail in our publications on de-identification, cited at the end of this article.

Privacy Solutions

It is fairly clear, at this point, that certain configurations of tokenization may be neither cost-effective nor privacy-protective. Tokenized cloud storage simply is not appropriate for live databases that are accessed in the course of daily operations. For protecting personally identifiable data, there is currently no satisfactory alternative to the use of secure servers.

Cloud storage does, however, have something important to offer businesses and healthcare providers. Where an organization receives frequent third-party requests for a particular dataset, it can be efficient to set up a cloud database containing the de-identified dataset to be accessed by third-party clients. This works well when the dataset is fixed and in demand by numerous clients with the same level of data access. In summary, cloud storage works best for external rather than internal clients, for fixed rather than live databases, and for de-identified rather than identifiable data.


Ki Consulting De-identification Maturity Model.

Ki Consulting Risk-Based Privacy Maturity Model.