Attacking Web Browsers with Machine Learning and JavaScript

Cambridge, MA — April 23, 2022

Users today are more privacy-conscious than ever, with topics such as third-party trackers, data brokers, and targeted advertising recently entering mainstream public discourse. At the center of many of these discussions is the idea that your web history should be private.

Unfortunately, the Internet ecosystem is far from reaching this goal. Progress has been made, but a cat-and-mouse game has popped up between advertisers, who continue to deploy better trackers, and researchers, who continue to study stronger defense mechanisms.

In this post, I play both the cat and the mouse. I just completed an 18-month-long research project that introduces a new privacy-violating website-fingerprinting attack. The attack works in all major operating systems and web browsers, and can accurately identify which website you’ve opened in a new tab by “fingerprinting” your browsing environment. I’ll start by telling you all about it. Then, I’ll discuss a mitigation I created that cripples the attack, is highly practical, and can be implemented today in settings where users desire increased security. Let’s get into it.

How to Attack a Web Browser

Our attack falls under the well-studied field of side-channel attacks, in which an attacker takes advantage of contention over a shared resource in order to predict a victim’s activity. For example, let’s say your roommate tells you they have become obsessed with Netflix shows, and says they’re going to take a break for a while. It’s now up to you to figure out if your roommate is using Netflix and going back on their word.

There are two important facts that will help you solve this problem:

  1. Movies are very large files. Downloading one typically creates lots of network activity.
  2. You both share the same Wi-Fi router.

One solution you might try involves periodically downloading a file and measuring how long each download takes. If you notice that a download takes longer than usual, and this pattern holds for some period of time, you might begin to suspect that something is up. Why is the download taking longer? It might be because your roommate is watching Netflix again, creating a ton of network activity and slowing down your traffic. This “attack” works because your router is shared between both of your laptops, and because routers are generally slowed down by increases in network activity.
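
To make the roommate example concrete, here is a minimal sketch of the "is something up?" check. It assumes you have already recorded how long each periodic download took. The function name `sustainedSlowdown`, the thresholds `slowFactor` and `minRun`, and the timings are all made up for illustration.

```javascript
// Hypothetical helper: flag a sustained slowdown in a series of download times.
// slowFactor and minRun are illustrative thresholds, not measured values.
function sustainedSlowdown(durationsMs, baselineMs, slowFactor = 1.5, minRun = 3) {
  let run = 0; // number of consecutive slow downloads seen so far
  for (const d of durationsMs) {
    // A download counts as "slow" if it takes slowFactor times the baseline.
    run = d > baselineMs * slowFactor ? run + 1 : 0;
    if (run >= minRun) return true;
  }
  return false;
}

// Made-up timings (in ms) for the same file, downloaded periodically.
const quiet = [200, 210, 195, 205, 400, 198]; // one blip, then back to normal
const busy = [200, 210, 390, 420, 410, 405];  // slow and stays slow

console.log(sustainedSlowdown(quiet, 200)); // false
console.log(sustainedSlowdown(busy, 200));  // true
```

Requiring several slow downloads in a row is what separates a one-off hiccup from a pattern that "holds for some period of time".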

Now, let’s think more about your web browser. Browsers try really hard to keep tabs separate. In May 2018, Chrome released a feature called site isolation, which defends against side-channel attacks such as Spectre by running each website’s code in a separate process, or “isolating” tabs from each other. Firefox later followed suit. This change reduced the number of resources shared by different tabs, making it more difficult for one of your tabs to pull off a Spectre-like attack.

However, it’s impossible to isolate tabs entirely. You probably have a number of tabs open right now. Each of them probably contains JavaScript code, which gets executed on the same processor, or CPU. Each tab uses the same network card to download data. Once a tab downloads some data, it gets saved to the same CPU cache as data downloaded by your other tabs. This creates several avenues through which one of your tabs can mount a side-channel attack.

Our attack takes advantage of this by measuring changes in computational throughput, or the rate at which work can be done, over time. We measure this throughput every five milliseconds in order to create a counter trace. Here’s some pseudo-code:

loop:
    counter = 0
    start = time()

    loop for 5 milliseconds:
        counter++

    trace[start] = counter
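
One way the pseudo-code might translate to JavaScript is sketched below. This is an illustration, not the exact code used in the project: the function name `collectTrace` and the `numPeriods` parameter are assumptions, and the trace is stored as an array of `{ start, counter }` entries rather than a map keyed by time.

```javascript
// Hypothetical helper: collect a counter trace in the style of the pseudo-code.
// periodMs defaults to the 5 ms period from the text; numPeriods is illustrative.
function collectTrace(periodMs = 5, numPeriods = 100) {
  // Prefer the high-resolution timer when available, fall back to Date.now().
  const now = () =>
    typeof performance !== "undefined" ? performance.now() : Date.now();

  const trace = [];
  for (let i = 0; i < numPeriods; i++) {
    let counter = 0;
    const start = now();
    // Busy-loop until the period ends, counting completed iterations.
    while (now() - start < periodMs) {
      counter++;
    }
    trace.push({ start, counter });
  }
  return trace;
}

// Collect 20 periods (roughly 100 ms) and print the counter values.
const trace = collectTrace(5, 20);
console.log(trace.map((p) => p.counter));
```

The absolute counter values depend on the machine and on how coarse the browser's timer is, so only the shape of a trace over time is meaningful.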

It turns out that these traces meaningfully capture work done by other tabs while the trace was being collected. If you run a JavaScript version of the code above while another tab is loading a website, you get traces that look like these:

In these nine traces, time increases from left to right, and each bar reveals a counter value from a single iteration of the outer loop from the pseudo-code above. Notice that traces from the same website look very similar to each other, while traces from different websites don’t look at all alike. This fact is crucial to our attack’s success. Once you’ve had some time to study these traces, you should be able to evaluate new ones pretty well. Which website do you think was running while I generated this trace?

Correct! This trace was generated while I opened nytimes.com. Let’s try one more.

Hopefully you got it right again: I opened weather.com. You get the idea. This demonstrates a basic version of our attack, in which you can now determine which of three websites I opened. Generating a trace is so simple that you can even try it right now:

We evaluated this attack’s success in both a closed-world and an open-world environment. In the closed-world environment, the attacker needs to pick the correct website out of the 100 most-visited websites in the world. In reality, however, an attacker might not know every website you might browse, which is why we evaluate in an open-world environment as well. In the open-world environment, the attacker needs to decide whether a trace reflects activity from one of the 100 websites on our list, or from none of them.

Classifying traces by hand would quickly get tiring, so we turned to machine learning, mostly because deep neural networks are really good at recognizing patterns. So good, in fact, that out of 100 websites, we achieved at least 91.9% accuracy in every experimental setup we tried with a major web browser, and we often achieved better. In Safari on macOS, our attack’s accuracy reached an average of 96.6%. Even when we used Tor Browser, the leading secure web browser, our model picked the correct website 49.8% of the time, which is pretty good when randomly picking a website out of 100 would result in a success rate of 1%.
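
To give a feel for what "recognizing patterns" in traces means, here is a deliberately simple stand-in for the classifier. It is not the deep neural network from the project: it is a nearest-centroid rule that averages the training traces for each site and picks the closest average. The function names, the `maxDistance` rejection threshold (a crude nod to the open-world setting), and the counter values are all assumptions made up for illustration.

```javascript
// Average the training traces for one label into a single centroid trace.
function centroid(traces) {
  const sum = new Array(traces[0].length).fill(0);
  for (const t of traces) t.forEach((v, i) => (sum[i] += v));
  return sum.map((v) => v / traces.length);
}

// Euclidean distance between two equal-length traces.
function distance(a, b) {
  return Math.sqrt(a.reduce((acc, v, i) => acc + (v - b[i]) ** 2, 0));
}

// Hypothetical classifier: returns the closest label, or "unknown" when
// no centroid is within maxDistance (a crude open-world rejection rule).
function classify(trace, centroids, maxDistance = Infinity) {
  let best = "unknown";
  let bestDist = maxDistance;
  for (const [label, c] of Object.entries(centroids)) {
    const d = distance(trace, c);
    if (d < bestDist) {
      best = label;
      bestDist = d;
    }
  }
  return best;
}

// Made-up counter values, for illustration only.
const centroids = {
  "site-a": centroid([[90, 40, 40, 90], [88, 42, 38, 92]]),
  "site-b": centroid([[50, 50, 90, 90], [52, 48, 88, 92]]),
};

console.log(classify([89, 41, 41, 89], centroids));     // site-a
console.log(classify([10, 10, 10, 10], centroids, 30)); // unknown
```

A rule this simple only works when traces from the same site line up almost exactly, which is part of why a learned model is a better fit for real, noisy traces.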