3 little hacks for parsing web content with Python and Beautiful Soup

Over the past two weeks, I made great progress in collecting data for a new research project of mine. I had to deal with substantial amounts of web content and had to parse it in order to use it for some analyses. I typically rely on Python and its library Beautiful Soup for such jobs and the more I use it, the more I appreciate the little things. Here are the top three new hacks:

1. Getting rid of HTML tags

I had to extract raw text from web content I scraped. The content I wanted was hidden in a complete mess of HTML tags like this:

</span></div><br><div class="comment">
<span class="commtext c00">&gt; &quot;the models are 100% explainable&quot;<p>In my experience this is largely illusory. People think they understand what the model is saying but forget that everything is based on assuming the model is a correct description of reality.<p>

Getting the “real” text out of it can be tricky. One way is to use regular expressions, but this can become unmanageable given the variety of HTML tags.

Here comes a little hack: use BeautifulSoup’s built-in text extraction function, get_text().


from bs4 import BeautifulSoup

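# webcontent holds the scraped HTML as a string
soup = BeautifulSoup(webcontent, "html.parser")

# get_text() drops all tags and returns only the text
comment = soup.get_text()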
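
To see what get_text() does to a fragment like the one above, here is a minimal sketch. The shortened sample string is made up for illustration, and the separator and strip arguments are optional extras that the snippet above does not use.

```python
from bs4 import BeautifulSoup

# Shortened, made-up sample in the style of the scraped comment above
webcontent = (
    '<div class="comment"><span class="commtext c00">'
    '&gt; &quot;the models are 100% explainable&quot;'
    '<p>In my experience this is largely illusory.<p>'
    '</span></div>'
)

soup = BeautifulSoup(webcontent, "html.parser")

# Default call: all text pieces are glued together without any separator
raw = soup.get_text()

# separator is placed between text pieces; strip=True trims each piece
# and skips pieces that are empty after trimming
clean = soup.get_text(separator="\n", strip=True)
print(clean)
```

Entities such as &gt; and &quot; come back as plain characters, so no extra unescaping step should be needed.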

 

2. No clue what you’re looking for? Prettify your output first

Before I extract anything, I have a look at the web content. Soup helps you get through the code salad with a function called “prettify” that makes it readable:


from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")

print(soup.prettify())

 

where “textstring” holds the HTML you want to inspect, for example the contents of a file.
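
Prettified output of a full page can run to thousands of lines, so it may help to look at only the top of it. The following is a rough sketch of that idea; the helper name preview, its max_lines parameter, and the sample string are my own assumptions and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def preview(html, max_lines=20):
    """Return the first max_lines lines of the prettified markup."""
    soup = BeautifulSoup(html, "html.parser")
    lines = soup.prettify().splitlines()
    return "\n".join(lines[:max_lines])


# Made-up sample; in practice this would be the scraped page
textstring = (
    '<div class="comment"><span class="commtext">Hello '
    '<a href="https://example.com">link</a></span></div>'
)

print(preview(textstring, max_lines=5))
```

Each tag ends up on its own line, indented by nesting depth, which makes it much easier to spot the class names and tags worth targeting.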

3. Extracting URLs from <a> tags

Sometimes you find a link wrapped in an <a> tag and want to extract its URL. Here’s the code:


from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")

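# find_all() is the current name for the older findAll()
a = soup.find_all("a")

# a[1] is the second link in the document; index 0 would be the first
url = a[1]["href"].lower().strip()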
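
If you want every link on the page rather than one picked by index, a variation along these lines might work. It is only a sketch: the sample markup and base_url are hypothetical, and it skips the lower() call because URL paths can be case-sensitive.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up sample with an absolute link, a relative link, and an anchor without href
textstring = """
<p><a href="https://example.com/Page">absolute</a>
<a href="/about ">relative</a>
<a name="top">no href</a></p>
"""

# Hypothetical address of the page the markup was scraped from
base_url = "https://example.com/"

soup = BeautifulSoup(textstring, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# which avoids a KeyError on anchors without one
urls = [
    urljoin(base_url, a["href"].strip())
    for a in soup.find_all("a", href=True)
]
print(urls)
```

urljoin from the standard library turns relative links into absolute ones and leaves links that are already absolute untouched.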
