“Reverse Engineering Online Tracking for Privacy, Transparency, and Accountability”
— Summary of Lecture by Arvind Narayanan, Assistant Professor, Princeton University [December 8, 2014]
When we browse the web, data about us is tracked, collected, and put to use in creative ways. But the lack of transparency makes web tracking problematic. In a recent Big Data Lecture at MIT, Narayanan explained several aspects of his research and highlighted the effort underway at WEBTAP. WEBTAP’s research goals are to correct market failures in online privacy, foster accountability on privacy, security, and ethical issues, and enable a more informed public debate.
When it comes to big data, he notes, “It is about taking a big data system – an online tracking ecosystem – that you did not build and do not control, and devising ways to study the inputs and outputs and learn as much as you can about it.” He adds, “This is an emerging field that is growing rapidly; there are far more research problems than researchers to work on them, so it is a good time to get involved.”
Third-party online tracking
When readers visit a webpage such as the front page of the New York Times (NYT), the main content belongs to the NYT, but the images may come from other media sources, and third-party ads are served from multiple locations. It is not always clear who is compiling profiles of users’ browsing history. The scale of operations is impressive; Narayanan points out that, based on studies at Princeton and elsewhere, there are several dozen tracking mechanisms on the typical top-50 web site.
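As a rough illustration of how first-party and third-party content on a page can be told apart, the sketch below groups the requests made by a page according to the site that serves them. It is a minimal sketch, not WEBTAP's tooling: the function names and all URLs are hypothetical, and it assumes that the last two labels of a hostname identify the site, which is wrong for suffixes such as "co.uk" (a real measurement would more likely consult the Public Suffix List).

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Approximate the site as the last two hostname labels (naive assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def third_party_sites(page_url: str, request_urls: list[str]) -> Counter:
    """Count requests per site, excluding the site of the page itself."""
    first_party = site_of(page_url)
    counts = Counter(site_of(u) for u in request_urls)
    counts.pop(first_party, None)  # drop first-party requests
    return counts


# Hypothetical page load: one first-party request, three third-party requests.
page = "https://www.news.example/front"
requests = [
    "https://static.news.example/a.css",
    "https://ads.tracker-one.example/pixel.gif",
    "https://cdn.tracker-one.example/lib.js",
    "https://img.photos.example/1.jpg",
]
print(third_party_sites(page, requests))
```

Counting distinct third-party sites per page in this way is one simple proxy for how many outside parties are in a position to observe a visit.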
Online tracking can be used for a variety of purposes, from serving up ads to refining the types of news that you will see. The possible harms are also diverse. In the case of advertising, websites may vary their prices depending on user information, and it is not clear whether this could lead to unfair price discrimination. In the case of news websites, there are concerns over whether excessive tailoring to interests may lead to a “filter bubble,” where individuals will be offered narrower and narrower streams of content based on past viewing history. Taken to an extreme, this has the potential to lead to a smaller view of the world – a “self-reinforcing bubble” – by providing continual positive feedback and restricting content on the basis of an individual’s existing beliefs and preferences.
“Not only are many users oblivious to this,” says Narayanan, “but many web site operators themselves are unaware of the extent to which third-party tracking is taking place.” The idea of a Panopticon can be used as an ominous analogy: people are being watched, but they do not know the details or purposes of the observation, which creates a fundamental power asymmetry. In the case of online tracking, the concern is that people do not always know what data is being collected or what it is being used for.
In seeking to understand the mechanisms at work on these sites, Narayanan and his team began to explore the environment and compile data about their findings. “Our central thesis is that a single modular platform can enable a variety of experiments to reverse engineer privacy-impacting practices. In these experiments an automated, simulated user (i.e. bots) browse the web; we monitor and analyze flows of personal data and test how sites personalize themselves to these bots.”
While the concerns about privacy are serious, Narayanan asserts, “One of the best things you can do for privacy is simply to measure it. This fixes the fundamental information asymmetry, enables more informed public debate, and, in more egregious cases, will prompt regulation and enforcement of existing laws; measurement and disclosure of findings will help with all of that significantly.”
Canvas fingerprinting and other hidden identification mechanisms
One newly developed browser fingerprinting technique is called “canvas fingerprinting,” whereby third parties are able to uniquely identify and track visitors without the use of browser cookies (which users can delete). In a study of 100,000 web sites, Narayanan’s team at Princeton, working with a team at KU Leuven, discovered that canvas fingerprinting was in use on over 5,500 sites by approximately 20 different providers (reference: https://securehomes.esat.kuleuven.be/~gacar/persistent/index.html). This is a case where publicizing the information made a difference. “When the study came out,” notes Narayanan, “there was a big public backlash and many of the companies decided to stop doing this.”
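Canvas fingerprinting generally works by having a script draw text or shapes on a hidden canvas element and then read the rendered pixels back; small differences in how each browser and device renders the image yield a value that can serve as an identifier. The sketch below shows one plausible detection heuristic under that description: flag any script that both draws text and reads the canvas back. It is an illustrative assumption, not the method used in the study. The `CanvasCall` log format and all URLs are hypothetical, while `fillText`, `strokeText`, `toDataURL`, and `getImageData` are real browser canvas API methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanvasCall:
    script_url: str  # script that made the call (hypothetical log field)
    method: str      # canvas API method name, e.g. "fillText"


DRAW_METHODS = {"fillText", "strokeText"}
READ_METHODS = {"toDataURL", "getImageData"}


def suspected_fingerprinters(calls: list[CanvasCall]) -> set[str]:
    """Return scripts that both draw text to a canvas and read it back."""
    drew, read = set(), set()
    for call in calls:
        if call.method in DRAW_METHODS:
            drew.add(call.script_url)
        elif call.method in READ_METHODS:
            read.add(call.script_url)
    return drew & read


# Hypothetical log: one script draws and reads back, another only draws.
log = [
    CanvasCall("https://fp.example/a.js", "fillText"),
    CanvasCall("https://fp.example/a.js", "toDataURL"),
    CanvasCall("https://chart.example/c.js", "fillText"),
]
print(suspected_fingerprinters(log))
```

A heuristic this simple would also flag some benign scripts, so in practice further checks (for example on canvas size or on what happens to the extracted data) would likely be needed.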
A second area of research on advanced online tracking mechanisms covers the practices of cookie respawning and cookie syncing, both relating to ways that third parties seek to place “permanent” cookies on users’ devices. Narayanan’s work also extends to ID cookie detection and the implications of federated login, for example through Facebook.
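Cookie syncing is commonly understood as one tracker passing its identifier for a user to another tracker, typically inside a request URL. The sketch below illustrates one way such flows might be spotted in recorded traffic: look for a known ID cookie value inside a URL sent to a host that does not belong to the cookie's owner. It is a minimal sketch under stated assumptions; the function names, the minimum-length threshold, the cookie values, and the domains are all hypothetical, and real identifiers may be encoded or hashed in ways that plain substring matching would miss.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, or an empty string."""
    return (urlsplit(url).hostname or "").lower()


def find_cookie_syncs(id_cookies: dict[str, str],
                      request_urls: list[str],
                      min_len: int = 8) -> set[tuple[str, str]]:
    """Return (owner_domain, recipient_host) pairs where an ID leaks.

    id_cookies maps the domain that set a cookie to its identifier value.
    Values shorter than min_len are skipped to limit accidental matches.
    """
    syncs = set()
    for owner, value in id_cookies.items():
        if len(value) < min_len:
            continue
        for url in request_urls:
            host = host_of(url)
            same_owner = host == owner or host.endswith("." + owner)
            if value in url and not same_owner:
                syncs.add((owner, host))
    return syncs


# Hypothetical traffic: tracker-a's ID shows up in a request to tracker-b.
cookies = {"tracker-a.example": "u9f3k2m7q1x8"}
traffic = [
    "https://sync.tracker-b.example/match?partner_uid=u9f3k2m7q1x8",
    "https://cdn.tracker-a.example/p.gif?id=u9f3k2m7q1x8",
    "https://img.photos.example/1.jpg",
]
print(find_cookie_syncs(cookies, traffic))
```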
In addition to serving up targeted ads, online tracking can be used to vary prices of products or services based on a customer’s willingness to pay (price discrimination). While one recent study found price discrimination on some sites (Hannak et al.), this does not appear to be happening on a large scale at this time; recent research focused on online airline ticket prices concludes that there is no clear evidence of systematic price discrimination (Vissers et al.).
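To make the measurement problem concrete, the sketch below shows one way a price comparison across simulated users might be structured. It assumes two identical control profiles, whose disagreement gives a noise estimate for each product, and one treatment profile with a distinct browsing history; a product is flagged only when the treatment price differs from the control price by more than that noise. This design, the function name, and the prices (in cents) are illustrative assumptions and are not claimed to be the method of the papers cited above.

```python
def flagged_products(control_a: dict[str, int],
                     control_b: dict[str, int],
                     treatment: dict[str, int]) -> dict[str, int]:
    """Return products whose treatment price gap exceeds control noise.

    All arguments map a product name to a price in cents. The result maps
    each flagged product to the absolute gap between treatment and control_a.
    """
    flagged = {}
    for product in control_a.keys() & control_b.keys() & treatment.keys():
        noise = abs(control_a[product] - control_b[product])
        gap = abs(treatment[product] - control_a[product])
        if gap > noise:
            flagged[product] = gap
    return flagged


# Hypothetical prices: the flight gap is within noise, the hotel gap is not.
control_a = {"flight": 20000, "hotel": 12000}
control_b = {"flight": 20500, "hotel": 12000}
treatment = {"flight": 20300, "hotel": 13500}
print(flagged_products(control_a, control_b, treatment))
```

A flagged product is only a candidate for closer study; repeated measurements would be needed before attributing a gap to the user's profile rather than to ordinary price fluctuation.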
Challenges and Future Work
One of the challenges facing Narayanan and his team centers on the nature of the web itself – they have found that it resists automation, leading to frequent crashes during experiments. The team has built several layers of abstraction on top of their system that have good error recovery rates and a substantial amount of parallelization. A second issue involves the preservation of statistical rigor; when testing certain types of interactions, different machines must be used, since the sites being evaluated may be developing profiles of individual machines or browsers. In addition, the sites might be conducting their own research on users – a situation that could skew the results of any tests, given a two-way testing scenario.
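The combination of error recovery and parallelization described above can be sketched in a few lines. The example below retries a failed page visit a fixed number of times and spreads visits across a pool of worker threads. It is a minimal sketch, not the team's system: `visit` stands in for whatever drives a real browser, and the function names, the retry count, and the worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def visit_with_retry(visit: Callable[[str], Any], url: str,
                     attempts: int = 3) -> dict:
    """Call visit(url), retrying on any exception up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return {"url": url, "ok": True, "data": visit(url)}
        except Exception as exc:  # a browser crash can surface as anything
            last_error = exc
    return {"url": url, "ok": False, "error": repr(last_error)}


def crawl(visit: Callable[[str], Any], urls: list[str],
          workers: int = 4, attempts: int = 3) -> list[dict]:
    """Visit all URLs in parallel; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda u: visit_with_retry(visit, u, attempts), urls))
```

Recording failures as data instead of letting them abort the run means a single crashing site does not cost the results already collected from the others.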
Ideally, the team would be able to simulate real world interactions, but there is not a way to fully automate such an experiment right now. “This is an area that is begging for understanding,” advises Narayanan, “because these big data techniques exist – there is a billion dollar infrastructure both for first-party and third-party tracking ecosystems. Companies are pitching products that claim to be about personalization and then when you test these products, you may find that they are doing little of that, based on browsing histories.”
Future research will focus on greater use of machine learning to study canvas fingerprinting and other phenomena. There is a strong interest in developing a measurement-driven privacy tool that can block behaviors rather than blocking actors. Work is also underway on producing a broad and comprehensive web privacy census and examining ways to enable first-party accountability. This could result in providing a 1-click tool for publishers to identify security and privacy problems, for example.
Narayanan concludes, “There should be a public debate and a common understanding on how [online] data is being used before people have to decide if they are comfortable with it. At the same time, these ecosystems are growing very fast – so there is a ripe opportunity for technologists to come in and build infrastructure that can act as an independent, external oversight and transparency layer. If we are able to develop this infrastructure, then we can ensure that the ecosystems will evolve towards what society wants, while also addressing the commercial visions that systems builders have.”
The Princeton Web Transparency and Accountability Project (WEBTAP): http://webtap.princeton.edu
“Is Digital Advertising a New Form of Market Manipulation?,” Renee Boucher Ferguson on the MIT Sloan Management Review website: http://sloanreview.mit.edu/article/is-digital-advertising-a-new-form-of-market-manipulation/
“The Web never forgets: Persistent tracking mechanisms in the wild,” Gunes Acar et al., https://securehomes.esat.kuleuven.be/~gacar/persistent/the_web_never_for…
“Measuring Price Discrimination and Steering on E-commerce Web Sites,” Aniko Hannak et al., http://www.ccs.neu.edu/home/cbw/pdf/imc151-hannak.pdf
“Crying Wolf: On the Price Discrimination of Online Airline Tickets,” Thomas Vissers et al., https://www.petsymposium.org/2014/papers/Vissers.pdf
Arvind Narayanan (Ph.D. 2009) studies information privacy and security and maintains a strong interest in technology policy. He is an Assistant Professor of Computer Science at Princeton and leads the Princeton Web Transparency and Accountability project (WEBTAP), which focuses on revealing and analyzing how companies are collecting and using our personal information.
Narayanan is also studying the security and stability of Bitcoin and cryptocurrencies. Narayanan has demonstrated that data anonymization is broken in fundamental ways, for which he jointly received the 2008 Privacy Enhancing Technologies Award. Along with his current affiliation in the Computer Science department at Princeton, he is an affiliated faculty member at the Center for Information Technology Policy there and an affiliate scholar at Stanford Law School’s Center for Internet and Society.
You can follow Arvind Narayanan on Twitter at @random_walker.
Further details on Arvind and his work are available at: http://randomwalker.info/etc/