Three hacks for parsing JSON in Python

Transforming data into one coherent format that can be used in statistics software such as Stata or R is a main task in data science.

One of the most popular data formats is JSON (JavaScript Object Notation). For example, data retrieved from APIs or exported from NoSQL databases (e.g., MongoDB) is typically JSON.

At first sight, JSON looks a bit chaotic, like spaghetti. Here’s an example taken from data on software repositories. Each record gives the name of a software repository (repo_name) together with a list of the programming languages used in the repository and the precise number of bytes written in each.
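Since the snippet itself is not reproduced here, the following is a hypothetical record in the same spirit; the repository name and byte counts are made up:

```python
import json

# A made-up record: one repository with its languages and byte counts
raw = '{"repo_name": "example/repo", "language": [{"name": "Python", "bytes": 5421}, {"name": "C", "bytes": 910}]}'
parsed_json = json.loads(raw)
print(parsed_json["repo_name"])  # example/repo
```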


How can we extract the data we need from JSON code? How can we understand its underlying structure? In this post, I want to share three hacks that help you get a better handle on JSON using Python.

1. Visualize, Visualize, Visualize
To get a handle on the spaghetti structure of JSON, use some tools to visualize it. Once you understand the structure, you can more easily decide what you need and how to extract it. There are two main tools I use to make it pretty:
a) JSON Formatter: a little webpage where you can paste your JSON spaghetti and get back a visualized tree structure of the data. It works particularly well for longer JSON documents.
b) print(json.dumps(parsed_json, indent=4, sort_keys=False)): put this command into your Python script to pretty-print the parsed JSON on the console.

See, our example from above has become much more readable.

2. Handle JSON like an array
Here’s my standard procedure for accessing JSON data. Let’s continue with the above example. Say we want to extract the programming languages used in the project. First, we read in the entire JSON file. Second, we iterate over all language elements in the JSON file and write each one into a variable:

json_dta = json.loads(l)                   # parse one line of the JSON file into a Python dict
for lang_element in json_dta["language"]:  # iterate over the language elements
    lang = lang_element["name"]            # write each element's name into a variable
    # do something with the data

3. Always double-check if an element exists
I would highly recommend double-checking whether a JSON element exists before trying to access it. If you try to access an element that does not exist, Python raises a KeyError and stops the entire process. This is quite likely to happen with large or malformed JSON files. Here’s a simple check that makes our above code failsafe:

json_dta = json.loads(l)                       # parse one line of the JSON file into a Python dict
if "language" in json_dta:                     # only proceed if the element exists
    for lang_element in json_dta["language"]:  # iterate over the language elements
        lang = lang_element["name"]            # write each element's name into a variable
        # do something with the data

Ftools – a solution when Stata takes hours to complete

Annoyed that Stata took another two hours to complete your command, only to find out that you missed something and need to rerun? And another two hours are gone…

Experiences like this are typical when working with larger datasets. Processing times for collapse, merge, and egen commands can take forever, or at least it feels that way.

One approach to reduce the frustration is to use the awesome ftools package by Sergio Correia. Ftools is a reimplementation of some of the most popular Stata data processing commands.

Currently, the following commands are available in a revised implementation:

  • egen group (now fegen group)
  • collapse (now fcollapse)
  • merge (now join)
  • levelsof (now flevelsof)

The time savings can be immense. In my experience, merges and collapses take only roughly a third of the time they need with the regular Stata commands, an immense productivity improvement!

Now, how do you get ftools? It is available via SSC, so type the following command into your Stata command window and hit Enter:

ssc install ftools

How to combine CSV files? Here comes a simple trick.

Say we have multiple CSV files and want to merge them into one big CSV file. How can we do this?

Several little software tools exist, but they would be yet another app on your computer, and some cost money. There’s a far simpler and completely free hack that does the job for you, assuming you have a Windows computer:

1. Open the command window:

Press Windows + R to open the Run window. Type “cmd” into the input field and hit Enter.

2. Navigate to your folder
The command window starts in your default folder. Unless this is where your CSV files are located, we need to navigate there.

Type “cd” along with the path to the folder and hit enter.

For example, “cd C:\jens\data” switches to the folder C:\jens\data.

3. Merge

Stay in command window and type:

copy *.csv merged.csv

Et voilà, after hitting enter, Windows combines the files and gives you the results in the file merged.csv.

This is how a simple, half-a-minute hack can combine several files. Note that copy concatenates the files as-is, so if each file has a header row, the repeated headers end up in the output and may need to be removed afterwards.

3 frequent problems with time-series data in Stata–and how to solve them

1. Spell beginnings and ends

When dealing with firm data, we typically do not observe all firms in all periods. Some of them might go out of business, some might not be tracked anymore after falling below a size threshold, and others might simply be missing due to a lack of data.

This is why we are often interested in identifying the beginnings, ends, and lengths of the individual firm spells in the data. But this is not so straightforward. Here’s how to get the beginning of the spell, the end, and the length.

Assume we have firms (firm_id) observed over several years (year):

gen firstyear = .
bysort firm_id (year): replace firstyear = year if _n==1

gen lastyear = .
bysort firm_id (year): replace lastyear = year if _n==_N

bysort firm_id: egen spell_length = count(year)

2. XTSET does not work due to repeated observations of the time variable

Assume we have firms (firm_id) observed over several years (year). We do xtset firm_id year and Stata prompts an error message, indicating that we cannot xtset due to repeated observations in the time variable. What do we do?

Well, first of all, we should inspect the firm-year pairs for duplicates:

 duplicates report firm_id year 

This prints a count of the duplicate firm-year pairs. We might find out that some prior merge of the data went wrong. In that case, we should go back to the original merge and check what led to the huge number of duplicates.

If we are sure that the duplicates are unnecessary, we can drop them right away:

 duplicates drop firm_id year, force 

Or we may find that only a few observations are affected. Then we can flag these duplicates and inspect them in more detail:

bys firm_id year:  gen dup = cond(_N==1,0,_n)

3. Missing observations downward or upward

Sometimes we have missing data in our time series that we want to fill downward from the last non-missing observation. Let’s assume we have firms with a distinct firm_id, and the variable location is only given for some observations:


bys firm_id (year): replace location = location[_n-1] if location==. & location[_n-1]!=.

…or upward, by reversing the sort order so the replacement cascades from the bottom:

gsort firm_id -year
by firm_id: replace location = location[_n-1] if location==. & location[_n-1]!=.

3 Essential Python Tricks for Lean Code

Python is one of the most important programming languages, especially for data scientists.

Sometimes I find myself going through hundreds of lines of code in my projects. So I spent some hours researching how to trim the massive code and make it leaner overall. Here are three tricks I learned.

1. Lean conditional statements

Conditional statements can be really clumsy:

if a == 0:
    print("0")
else:
    print("not 0")

But this costs several lines of code. There’s a leaner way to write conditional statements:

print("0") if a == 0 else print("not 0")
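The same one-line form also works inside assignments, where it is arguably most useful; a tiny sketch:

```python
a = 5
# Conditional expression: pick a value in a single line
label = "zero" if a == 0 else "nonzero"
print(label)  # nonzero
```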

2. Simple String-cutting

I work a lot with time-series data. Some of them are Unix timestamps, which look like this:

date = "1553197926UTC"

Converting the number itself into a date would not be a problem, but the remainder of the timestamp–the ‘UTC’ part–needs to be removed before we can do anything with it. Python offers a straightforward way to get rid of parts of strings (here, the trailing three characters):

date = "1553197926UTC"
date = date[:-3]
# date is now "1553197926"
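Once the suffix is gone, the digits convert straight into a date; here’s a quick sketch, assuming the timestamp counts seconds since the Unix epoch:

```python
from datetime import datetime, timezone

date = "1553197926UTC"
ts = int(date[:-3])  # strip the trailing "UTC" and convert to an integer
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.year)  # 2019
```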

3. Convert a Nested Array into One Array

Sometimes we get a nested array, especially when dealing with JSON responses from APIs:

array = [[1, 2], [3, 4], [5, 6]]

If we want to transform the nested array into one array, here’s a little trick that does it:

import itertools

array = [[1, 2], [3, 4], [5, 6]]
flat = list(itertools.chain.from_iterable(array))
# flat is now [1, 2, 3, 4, 5, 6]
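If you prefer to avoid the import, a nested list comprehension achieves the same result:

```python
array = [[1, 2], [3, 4], [5, 6]]
# Flatten one level of nesting without any imports
flat = [x for sub in array for x in sub]
print(flat)  # [1, 2, 3, 4, 5, 6]
```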


3 little hacks for parsing web content with Python and Beautiful Soup

Over the past two weeks, I made great progress in collecting data for a new research project of mine. I had to deal with substantial amounts of web content and had to parse it in order to use it for some analyses. I typically rely on Python and its library Beautiful Soup for such jobs and the more I use it, the more I appreciate the little things. Here are the top three new hacks:

1. Getting rid of HTML tags

I had to extract raw text from web content I scraped. The content I wanted was hidden in a complete mess of HTML tags like this:

</span></div><br><div class="comment">
<span class="commtext c00">&gt; &quot;the models are 100% explainable&quot;<p>In my experience this is largely illusory. People think they understand what the model is saying but forget that everything is based on assuming the model is a correct description of reality.<p>

Getting the “real” text out of it can be tricky. One way is to use regular expressions, but this can become unmanageable given the variety of HTML tags.

Here comes a little hack: use BeautifulSoup’s built-in text extraction function.

from bs4 import BeautifulSoup

soup = BeautifulSoup(webcontent, "html.parser")  # webcontent holds the scraped HTML string
comment = soup.get_text()


2. No clue what you’re looking for? Prettify your output first

Before I extract anything, I have a look at the web content. Beautiful Soup helps you get through the code salad with a function called “prettify” that makes it readable:

from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")
print(soup.prettify())



3. Extracting URLs from <a> tags

Sometimes you find an <a> tag in the parsed content and want to extract the URL it points to.

Here’s the code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")
for link in soup.find_all("a"):  # find all <a> tags
    print(link.get("href"))      # print each tag's URL


Three tips to process large datasets in Stata or R

The increasing availability of large-scale public datasets is a goldmine for many researchers and data analysts. For example, great potential resides in data from Wikipedia (~300 GB per month), OpenStreet Map (~70 GB), and Reddit (~600 GB). However, getting such large datasets ready for analysis is often difficult. Stata, for example, refuses file inputs that are larger than the available RAM in your computer. Of course, we might use computing services such as Amazon and Google, but this requires a research budget, setting up a customized environment, and a constant Internet connection.

In this blog post, I want to share three best practices for dealing with large datasets and getting them into statistics software like Stata.

1. Work with CSV files

Datasets come in different shapes. Some are JSON, some are XML, and many more. While statistics software lets you import files of various formats, I always recommend converting them into CSV (comma-separated values) files first. The reason is that CSV is probably the leanest file format, as it goes without the various (and potentially duplicate) meta information that JSON or XML files carry. Converting to CSV can considerably reduce the size of the input files. There are various converters available, and I will share some of mine in the next posts.

2. Split the input files

One way to deal with large datasets is to cut them into chunks and then process each chunk in a batch. When working with CSV files, there is a little tool called the Free Huge CSV File Splitter, which does its job perfectly fine for me. For batch processing all files in a directory using Stata, the following code helps:

set obs 1
gen x = .
save "output.dta", replace

cd "folder"
local commits : dir . files "*.csv"
foreach file of local commits {
    import delimited `file', clear
    **do all the processing here
    append using "..\output.dta"
    save "..\output.dta", replace
}

use "..\output.dta", clear
drop x
save "..\output.dta", replace

The code loops over all CSV files in the folder called “folder”, processes each of them, and appends them into one output file (output.dta).

3. Get rid of strings as much as possible

String processing is among the most computation-intensive operations. Try to avoid string data as much as possible, even before importing data into Stata. Many datasets include hash codes or control strings that may be completely unnecessary for you but blow up the size of your dataset. Before importing files into Stata, I use EmEditor to take a first look at the structure of the dataset, drop unnecessary string data, and only then import it into Stata.

How to scrape the data behind interactive web graphs

Sometimes we are interested in obtaining data that is behind web graphs like the ones here (e.g., produced through highcharts.js or something related). Sometimes the data points can be obtained by eyeballing, but there are also cases where we need hundreds or thousands of such graphs or where data is so fine-grained that it is impossible to simply spot it. In such a case, we are interested in an automatic procedure which scrapes these graphs. Unfortunately, such charts are tricky to scrape, because data is loaded dynamically in the background.

One trick to obtain the data is to inspect the website using your browser’s built-in developer tools. For example, in Chrome:

  1. Open the website which contains the graph.
  2. Right-click somewhere on the website and press “Inspect”.
  3. In the new window, proceed to the “Network” tab. This tab provides an overview of network transactions between your computer and the website.
  4. Look out for files with a “.json” ending–these are the ones which contain the graph data.
  5. Inspect the file by clicking on the “Headers” tab. We need the location of the file on the web server, which should be listed in the general information.
  6. Now we can pull the data into Python and work with the data right away using:
import requests

url = "http://pathToJSONfile"
x = requests.get(url).json()


How to do a placebo simulation in difference-in-differences designs (part 1)

Bertrand, Duflo, and Mullainathan’s 2004 article “How Much Should We Trust Differences-in-Differences Estimates?” (published in the QJE) outlines several tests that can be done to assess the robustness of difference-in-differences estimates given concerns about false positives.

One recommendation is to run a placebo simulation in which, in a first step, the treatment indicator is randomly assigned to observations in the dataset and, in a second step, the regressions are run again with the goal of comparing the main estimates with those from the placebo regressions.

I have written a little Stata script that runs such a placebo simulation and compiles an Excel spreadsheet which gives the placebo coefficient estimates along with the confidence interval bounds.

Here’s that script. It assumes a panel dataset in which observations take the form of unit-years (e.g., firm-years). The only thing you need to adjust for your purposes is the set of parameters at the top.

global project_folder = `"C:\Users\path to project"'
global depvar = "dependent variable"
global treatment = "treatment binary"
global post = "time binary which is 1 for observations after the treatment"
global idvar = "unit identifier variable (e.g., id)"
global timevar = "time identifier variable (e.g., years)"
global controls = "list of control variables (e.g., age)"
global seed = "110" //sets the seed for reproducible random number generation
global treatment_groupsize = "number of observations in the treatment group (e.g., 100)"
global numruns = "#runs of the simulation (e.g., 60)"

**set excel headers
putexcel set "$project_folder\placebo_results.xlsx", replace //placebo_results.xlsx is a placeholder filename; adjust as needed
putexcel A1=("DV Coefficient")
putexcel B1=("DV Lower CI")
putexcel C1=("DV Upper CI")
local cellcounter = 3
set seed $seed

*estimate "true" regression
xtset $idvar $timevar
xtreg $depvar i.$treatment##i.$post $controls $timevar, fe robust
putexcel A2=(_b[1.$treatment#1.$post])
putexcel B2=(_b[1.$treatment#1.$post] - invttail(e(df_r),0.025)*_se[1.$treatment#1.$post])
putexcel C2=(_b[1.$treatment#1.$post] + invttail(e(df_r),0.025)*_se[1.$treatment#1.$post])

forvalues i=1/$numruns {
	randomtag if $post == 0, count($treatment_groupsize) gen(r) //randomtag is available via ssc install randomtag; adjust the condition to the period in which the placebo group should be drawn
	bys $idvar: egen placebo = max(r)
	drop r
	tab placebo
	capture xtreg $depvar i.placebo##i.$post $controls $timevar, fe robust
	if _rc != 0 {
		display "Error on run `i'"
	}
	else {
		estimates store result`i'
		putexcel A`cellcounter'=(_b[1.placebo#1.$post])
		putexcel B`cellcounter'=(_b[1.placebo#1.$post] - invttail(e(df_r),0.025)*_se[1.placebo#1.$post])
		putexcel C`cellcounter'=(_b[1.placebo#1.$post] + invttail(e(df_r),0.025)*_se[1.placebo#1.$post])
		local cellcounter = `cellcounter' + 1
	}
	drop placebo
}

In one of the next blog posts, I will show how to use this generated spreadsheet for plots of the placebo confidence intervals or simple tabulation summaries for your papers.

How to make clean difference-in-differences graphs in Stata

Difference-in-differences designs seem to be everywhere now, but some of the papers I read don’t seem to leverage one of their key strengths: visualizing what is going on in the data.

I tend to use the following graph style. It plots the dependent variable over time, here from April to October. The treatment and control groups get different line patterns and colors. Instead of a bulky legend, I denote the groups right next to their lines. The treatment time is marked by two vertical bars which separate the group lines. Instead of a complete grid, the graph relies only on a vertical grid to ease eyeballing the changes in the dependent variable.


Now here is the code for the graph in Stata.

**setup: fill in the blanks
global dv = ""
global timevariable = ""
global graphtitle = "A clean graph"
global line1 = "Treatment"
global line2 = "Control"
global ytitle = "Mean of dependent variable"

**collapse the data into an aggregated time series
collapse (mean) y = $dv (semean) se_y = $dv, by(m treatment)
sort $timevariable
gen yu = y + 1.96*se_y
gen yl = y - 1.96*se_y

label define m 1 "April" 2 "May" 3 "July" 4 "August" ///
               5 "September" 6 "October" 7 "November"
label value m m

twoway (scatter y m if m<=2 & treatment==1, msymbol(S)) ///
       (rcap yu yl m if m>=3 & treatment==1) (line y m if m>=3 & treatment==1) ///
       (scatter y m if m<=2 & treatment==0, msymbol(S)) ///
       (rcap yu yl m if m>=3 & treatment==0) (line y m if m>=3 & treatment==0) ///
       (function y=3.25, range(2.10 2.12) recast(area) color(gs12) base(4.25)) ///
       (function y=3.25, range(2.88 2.90) recast(area) color(gs12) base(4.25)) ///
       , ///
       graphregion(margin(large)) ///
       ylabel(3.25(.25)4.25) ///
       title($graphtitle) ///
       yscale(titlegap(*16)) ///
       xlabel(1(1)7, valuelabel) xtitle(" ") ///
       text(4.3 6.8 $line1) ///
       text(3.7 6.8 $line2) ///
       graphregion(color(white)) bgcolor(white) ///
       ytitle($ytitle) legend(off) scheme(s2mono) ///
       saving("fig\clean_plot", replace)

gr combine "fig\clean_plot.gph", /*
	*/ iscale(.7) xsize(6)
graph export "fig\clean_plot.png", replace width(1600) height(800)