Big Data Needs “Use Control”
Dr. Wael Hassan
Big data radically undermines the protection of data privacy. In this changing environment, access control is of limited use as a model for data protection. Instead, we propose a shift away from “access control” to “use control.”
Our privacy online is protected by access control. The system, refined over the past 50 years, has been highly effective in securing data and minimizing data breaches. But the advent of big data has moved the goalposts. Access control cannot protect our metadata. To adapt to this rapidly evolving digital environment, access control should be superseded by use control.
Big Data’s Privacy Leak
Much of our daily activity is regulated by passwords. They safeguard our online lives, giving us access to everything from our smartphones and bank accounts to the many websites where we shop and entertain ourselves. Passwords are keys securing our personal information and property. The security they give us is known as access control.
Yet there’s one type of personal data that passwords cannot protect: the traces we leave every time we use the Internet and phone networks. The details of our activity as network users are known as metadata. We can keep our personal information under cyber-lock and key, but not our metadata. We can erase browser cookies, but the search engine’s log of our browsing patterns and search keywords remains.
As I’ve stated before, big data is consumer data: the records of our interactions with websites, stores, companies, public institutions, social media platforms, and so on. A relatively new concept, big data is still loosely defined. The term is used to describe both unstructured and structured data – usually a combination of personal content and metadata – collected through various channels: online, through Wi-Fi and phone networks, and in the physical world through sensors or cameras via the Internet of Things. Companies have begun harvesting massive amounts of personal data while the norms for respecting privacy in this new paradigm are still undefined.
The ground rules of personal data protection have not changed, despite general confusion about how they apply in rapidly changing contexts. Fair information principles, the bedrock of most privacy legislation, state that organizations should only collect, store, use, and retain personal information for specific purposes to which individuals have consented. Any information, or combination of information, that is detailed enough to potentially identify a person is personal information, and these rules apply.
Yet as larger and larger volumes of data are collected and aggregated by big data initiatives, it’s more and more difficult to define precisely what is considered personal information. “Data lakes” – massive repositories of relatively unstructured data collected from one or several sources, often without a specific purpose in mind – are a valuable asset for companies, providing a wide variety of data for potential future analysis, or for sale to other companies.
Data lakes often contain a mix of metadata and personal content. In combination, these can frequently identify specific individuals. For example, publicly available and searchable databases of Twitter activity show tweets by geographic location – positions so specific that they can show the house a given tweet was sent from. In the commercial realm, big box retailers use customers’ debit and credit card numbers to link their various purchases, and have developed customer sales algorithms so refined that they can identify the purchase patterns of pregnant women and send them coupons for baby products. Personally identifiable data is of far more value to marketers, and sophisticated technologies are harnessed to re-identify anonymous data.
As these examples show, data lakes can contain extremely sensitive personal information. This data is not intended to be viewed by anyone – it is usually processed by computer algorithms – but all too often, anyone with access to a data lake can view personal information about specific individuals. Our current model for ensuring privacy on the Internet is access control. But does access control have any meaning in the unstructured environment of a data lake?
Access Control in a Big Data Context
Access control, the ability to control access to information, is the primary tool of online security. The first access control protocols were developed by the military, creating an electronic identity that would allow specific individuals access and copy permissions for specific data categories.
The great advantage of access control is that it tells you who has access to what information at any given time. Its limitation is that it cannot tell you what the user does to the data, and in what context. Once write access or editorial access is given, an authorized user can in theory do anything with the data – it can be rewritten, deleted, or copied and passed on. Simply put, access control is only effective as long as data doesn’t leave its source realm, or its initiator’s sphere of influence.
But as digitalization and data use continue to evolve, the access control model is becoming outmoded. Data moves far beyond its origination point, often passing rapidly from collector to purchaser to further purchasers in a chain of unregulated data trading. Data brokers buy web and social media traffic, combining it with data such as customer support records, public records, and phone and Internet metadata to create highly detailed profiles of hundreds of millions of consumers. Algorithms sort consumers into multiple categories, passing these on to retailers for targeted advertising.
Individuals usually grant access to their information so that a business may provide services. Often the customer cannot access these services without opting in to information sharing. If you give your email to a clothing store, you can expect to receive advertising related to the products they sell. But in the new world of big data, your information rapidly travels far beyond the company you made your initial agreement with. There’s a world of difference between sharing your data with your favourite store and consenting to its unrestricted use. Yet these are the only options offered by access control. Until we mature that model, data protection will be all or nothing. If you don’t want to share your data, don’t sign up to join Amazon or Google – if you do choose to use these services, your personal data will be unprotected.
By using access control to regulate big data, we’re trying to solve today’s challenges with yesterday’s tools.
The Privacy Gap
Here’s another example of a privacy gap that current access control mechanisms are unable to bridge.
How does this work in practice?
The Use Control Advantage
Simply put, use control allows someone to regulate the ways their data is accessed and used. Access control systems can’t help shoppers know where their information is disclosed, to whom, or for what purpose. However, a use control model could bring big data analytics in line with legal privacy requirements by allowing consumers to take charge of the ways their information is used. A retailer’s consent form, for instance, might offer shoppers options like these:
- My identifying information can be used. (This would include your name and address, and shopping history. It will allow us to send you personalized offers and discounts.)
- My anonymized identifying information can be used. (This will help us improve our products and services.)
- My derivative data can be used. (Your anonymous purchasing patterns will help us analyse what our customers like and what could be improved on.)
- My personal, anonymized spending patterns can be analysed as part of aggregated data. (By understanding purchase volumes per month, or year over year, we can predict client purchase patterns and hence better stock our stores.)
- None of my data can be used. (This will affect our ability to send you discount coupons for your favourite products.)
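The options above could be recorded as a set of permitted purposes per customer, checked before any processing takes place. The following is only a minimal sketch of that idea: the names `Use`, `ConsentRecord`, and `may_process` are hypothetical, and the mapping of each option to a purpose is an assumption made for illustration.

```python
from enum import Enum


class Use(Enum):
    """Hypothetical purposes, one per consent option in the list above."""
    PERSONALIZED_OFFERS = "personalized_offers"      # identifying information
    SERVICE_IMPROVEMENT = "service_improvement"      # anonymized information
    PATTERN_ANALYSIS = "pattern_analysis"            # derivative data
    AGGREGATE_FORECASTING = "aggregate_forecasting"  # aggregated spending patterns


class ConsentRecord:
    """The uses one customer has agreed to. An empty set means 'none of my data'."""

    def __init__(self, customer_id, permitted_uses=()):
        self.customer_id = customer_id
        self.permitted_uses = frozenset(permitted_uses)

    def allows(self, use):
        return use in self.permitted_uses


def may_process(consent, requested_use):
    """Return True only if the customer consented to this specific use."""
    return consent.allows(requested_use)


alice = ConsentRecord("alice", {Use.SERVICE_IMPROVEMENT, Use.AGGREGATE_FORECASTING})
bob = ConsentRecord("bob")  # chose "None of my data can be used"

print(may_process(alice, Use.AGGREGATE_FORECASTING))  # True
print(may_process(alice, Use.PERSONALIZED_OFFERS))    # False
print(may_process(bob, Use.SERVICE_IMPROVEMENT))      # False
```

The point of the sketch is that the decision depends on the purpose of the request, not merely on who is asking.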
While access control only allows the binary choice of opting in or opting out, use control offers a much more sophisticated range of responses. The reality is that most shoppers do want to opt in, but they don’t want to give up their right to privacy by doing so. They also want some control over how their data is to be used.
Paired up with cutting-edge technologies such as the Internet of Things, big data has myriad possible applications, of which business and marketing are but one. Transport for London’s prepaid travel cards allow detailed mapping of the routes of London commuters, on macro and micro levels. This allows the corporation to hone the system’s efficiency in real time, and send service disruption alerts to a commuter’s smartphone. More controversially, data on individual travelers can also be accessed in police investigations. Wirelessly connected glucometers mean your doctor can monitor your blood glucose levels without you leaving your home. Big data analytics and genetic testing allow doctors to predict an individual’s, or a population’s, susceptibility to a given disease. Predictive medicine can save lives; it also raises privacy issues and the risk of people being denied health insurance based on their DNA.
In our current system, we rely on corporate leaders to self-regulate how they use millions of people’s identifying information. As President Obama puts it, “It is not enough for leaders to say: ‘trust us, we won’t abuse the data we collect.’”
In 2012, the US government began a pilot big data project, code-named Neptune. Privacy and security controls are built into this vast data lake. All data is given multiple tags; access is dependent on a variety of factors, including intended use or “need to know,” and clearance level. Types of searches are also strictly regulated. In a sophisticated system, highly specific rules work in combination to ensure that different people can access different information, for different purposes, at different times. An example of use control in action, Project Neptune shows how the advantages of big data can be utilized without sacrificing privacy.
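To make the idea of tags combined with clearance and intended use more concrete, here is a small illustrative sketch. It is not a description of how Project Neptune is implemented; the `Record` and `Request` types, the numeric clearance levels, and the purpose names are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A data item carrying hypothetical policy tags."""
    name: str
    required_clearance: int        # minimum clearance level needed
    permitted_purposes: frozenset  # intended uses this item may serve


@dataclass(frozen=True)
class Request:
    """Who is asking, at what clearance, and for what stated purpose."""
    user: str
    clearance: int
    purpose: str


def is_permitted(request, record):
    """Both rules must hold: sufficient clearance AND an allowed purpose."""
    has_clearance = request.clearance >= record.required_clearance
    has_need_to_know = request.purpose in record.permitted_purposes
    return has_clearance and has_need_to_know


travel_log = Record("travel_log", 2, frozenset({"fraud_investigation"}))

print(is_permitted(Request("analyst", 3, "marketing"), travel_log))                 # False
print(is_permitted(Request("investigator", 3, "fraud_investigation"), travel_log))  # True
print(is_permitted(Request("intern", 1, "fraud_investigation"), travel_log))        # False
```

Under access control alone, the first request would succeed because the analyst holds sufficient clearance; here it fails because the stated purpose is not one the data was tagged for.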
The Next Generation of Data Protection
Use control offers the possibility of data analytics regulation. It can police the new frontier of privacy and security. Access control has been highly effective in meeting security threats – so much so, in fact, that external hacks are rare. Recent major data breaches have been characterized by the involvement of an insider, bought by outside interests. This was the case with both the recent hack of Sony Corp and the 2014 data breach at Toronto’s Rouge Valley Centenary Hospital. In the Rouge Valley incident, employees sold the personal information of pregnant women patients to companies selling education savings plans. Use control would prevent insider-linked data leaks by regulating both unauthorized and authorized access to information.
Say that a privacy engineer is setting up a use control system for a pharmacy chain’s database. To minimize the possibility of re-identifying individual clients by combining different database queries, certain context-specific restrictions could be built in: by blocking column-based searches, for example, results showing gender, ethnicity, or age can be avoided; and filters can prevent any query producing a result of fewer than 500 people. Filters can also block access to chronic disease health status; in this way, when Jean goes to the pharmacy to get his bipolar medication, or his methadone dose, that information cannot be accessed in analysis of what he’s purchased in the store.
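The three restrictions in the pharmacy example (blocked columns, a minimum result size, and a filter on sensitive health data) might be sketched as a single guarded query function. This is a simplified illustration, assuming purchase rows are plain dictionaries; the field names, the `count_customers` function, and the `chronic_disease` category label are hypothetical.

```python
MIN_GROUP_SIZE = 500                             # threshold taken from the example above
BLOCKED_COLUMNS = {"gender", "ethnicity", "age"}
SENSITIVE_CATEGORIES = {"chronic_disease"}       # hypothetical label for filtered purchases


def count_customers(rows, group_by, min_group_size=MIN_GROUP_SIZE):
    """Count distinct customers per group, subject to use control rules."""
    # Rule 1: refuse searches on blocked columns outright.
    if group_by in BLOCKED_COLUMNS:
        raise PermissionError(f"grouping by {group_by!r} is not permitted")

    customers = {}
    for row in rows:
        # Rule 2: sensitive purchases never enter the analysis.
        if row["category"] in SENSITIVE_CATEGORIES:
            continue
        customers.setdefault(row[group_by], set()).add(row["customer_id"])

    # Rule 3: suppress any group smaller than the minimum size.
    return {key: len(ids) for key, ids in customers.items()
            if len(ids) >= min_group_size}


rows = (
    [{"customer_id": i, "store": "north", "category": "vitamins", "age": 40}
     for i in range(600)]
    + [{"customer_id": 1000 + i, "store": "south", "category": "vitamins", "age": 35}
       for i in range(10)]
    + [{"customer_id": 0, "store": "north", "category": "chronic_disease", "age": 40}]
)

print(count_customers(rows, "store"))     # {'north': 600}
print(count_customers(rows, "category"))  # {'vitamins': 610}
```

The small "south" group is suppressed, and the chronic disease purchase never appears in any result, regardless of who runs the query.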
When it comes to audits, use control offers a major benefit. Auditing compliance with privacy norms in an access control system is a considerable challenge. If someone claims that there’s been unauthorized access to their data, the only recourse is to audit all data transactions – an immensely time-consuming task. With use control, there’s no need to look at the data; all you need to do is to review the database controls, to see whether or not they would allow the suspected access.
Faced with a data breach in an access control system, an auditor would need to review transactional data to confirm wrongdoing. The auditor would ask: is there a trace of Alex committing transactions A, B, and E, and in this sequence? (In the case of a banking review by the SEC, for example, this could involve millions of transactions from multiple banks.) In a use control review, the auditor would simply examine the system policies or rules. The query would be: does Alex have individualized access to data, and do the rules allow him to uncover a single individual’s identity, or a certain property? Thus, use control can bring data use practices more in line with the legal requirements of privacy protection.
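A rule-based audit of this kind could look something like the sketch below, which inspects a user's policy rather than any transaction log. The `Policy` structure and the `audit` function are hypothetical and assume rules are expressed as a minimum result size plus a set of queryable columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical use control rules attached to one user."""
    user: str
    min_group_size: int         # smallest result set a query may return
    allowed_columns: frozenset  # properties the user may query


def audit(policy, column):
    """List the ways the rules would permit the suspected access."""
    findings = []
    if policy.min_group_size <= 1:
        findings.append("rules allow results about a single individual")
    if column in policy.allowed_columns:
        findings.append(f"rules allow access to column {column!r}")
    return findings


strict = Policy("alex", 500, frozenset({"store"}))
loose = Policy("alex", 1, frozenset({"store", "diagnosis"}))

print(audit(strict, "diagnosis"))  # []
print(len(audit(loose, "diagnosis")))  # 2
```

An empty list means the rules as written could not have produced the suspected disclosure, so the auditor reaches a conclusion without examining any transactions.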
The advent of big data has pushed information use into uncharted and currently unregulated waters. The fundamental principles of privacy law state that you have the right to know all the information collected about you by a single entity. What does this mean in the era of big data? Much of Canada’s significant privacy legislation (including several provincial Freedom of Information and Protection of Privacy Acts, Alberta’s Health Information Act, and Ontario’s Personal Health Information Protection Act) is over ten years old. When these laws were drafted, big data didn’t exist – at least in its current form – nor did the now-ubiquitous cloud. Both present significant challenges to privacy protection, and undermine the efficacy of current legislation. Given both the quantity and the detail of the data now being gathered and retained, privacy legislation needs to be redrawn to respond to our rapidly shifting technological environment.
Use control deserves further study and development. We need a conceptual model, including a canonical set of instructions that define a use control vocabulary – the technical language to describe and implement database controls. Ideally, such language would be crafted with enough clarity that the same terminology could be used across the board in legal, technical, and practical contexts: by lawmakers, privacy commissioners, privacy engineers, and data protection officers.
I believe use control systems can be the next generation of data protection. Use control has the potential to protect data currently at risk, and to ensure that big data can flourish alongside robust privacy rights. By promoting a culture of transparency in database use and design, and by defining a privacy language that can be used uniformly across all the diverse aspects of the privacy field, use control can strengthen and simplify data privacy and security.