How to scrape the data behind interactive web graphs

Sometimes we are interested in obtaining the data behind web graphs like the ones here (e.g., produced with highcharts.js or a similar library). Sometimes the data points can be read off by eye, but there are also cases where we need hundreds or thousands of such graphs, or where the data is too fine-grained to read off the chart. In such cases, we want an automatic procedure that scrapes these graphs. Unfortunately, such charts are tricky to scrape, because the data is loaded dynamically in the background.

One trick to obtain the data is to inspect the website using your browser’s built-in developer tools. For example, in Chrome:

  1. Open the website which contains the graph.
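  2. Right-click somewhere on the website and choose “Inspect”.
  3. In the new window, proceed to the “Network” tab. This tab provides an overview of the requests exchanged between your computer and the website. If the list is empty, reload the page, because requests are only recorded while the developer tools are open.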
  4. Look for files with a “.json” ending; these are usually the ones which contain the graph data.
  5. Inspect the file by clicking on it and opening the “Headers” tab. We need the location of the file on the web server (the request URL), which should be listed in the general information.
  6. Now we can pull the data into Python and work with the data right away using:
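import requests
url = "http://pathToJSONfile"  # request URL copied from the “Headers” tab
x = requests.get(url).json()  # parse the JSON response into Python lists/dicts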
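
Once the JSON is loaded, it usually still has to be turned into something tabular. The following is only a sketch of how that might look. The helper names fetch_json and series_to_rows are made up for illustration, and the layout {"series": [{"name": ..., "data": [[x, y], ...]}]} is an assumption; the actual structure differs from site to site, so inspect the file in the “Network” tab first and adapt the keys.

```python
import requests


def fetch_json(url, session=None, timeout=10):
    """Download one JSON file and return the parsed Python object.

    Hypothetical helper: pass a requests.Session to reuse the connection
    when many graphs have to be downloaded.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


def series_to_rows(payload):
    """Flatten the ASSUMED layout
    {"series": [{"name": ..., "data": [[x, y], ...]}]}
    into a list of (series_name, x, y) tuples.
    """
    rows = []
    for series in payload.get("series", []):
        name = series.get("name")
        for point in series.get("data", []):
            x, y = point  # assumes every point is an [x, y] pair
            rows.append((name, x, y))
    return rows
```

With these two helpers, series_to_rows(fetch_json(url)) would give one row per data point. For hundreds or thousands of graphs, the same call can be placed in a loop over the URLs, sharing a single requests.Session() between requests.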

 
