Over the past two weeks, I made great progress collecting data for a new research project of mine. I had to deal with substantial amounts of web content and parse it before I could use it for any analyses. I typically rely on Python and its Beautiful Soup library for such jobs, and the more I use it, the more I appreciate the little things. Here are my top three new hacks:
1. Getting rid of HTML tags
I had to extract raw text from web content I scraped. The content I wanted was hidden in a complete mess of HTML tags like this:
</span></div><br><div class="comment">
<span class="commtext c00">> "the models are 100% explainable"<p>In my experience this is largely illusory. People think they understand what the model is saying but forget that everything is based on assuming the model is a correct description of reality.<p>
Getting the “real” text out of it can be tricky. One way is to use regular expressions, but this can become unmanageable given the variety of HTML tags.
Here comes a little hack: use BeautifulSoup’s built-in text extraction function.
from bs4 import BeautifulSoup

# parse the scraped page; webcontent holds the raw HTML string
soup = BeautifulSoup(webcontent, "html.parser")

# get_text() strips every tag and returns only the text
comment = soup.get_text()
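By the way, if the extracted text runs words from neighboring tags together, get_text() also takes a separator and a strip flag. A minimal sketch (the HTML snippet here is a made-up stand-in for real scraped content):

from bs4 import BeautifulSoup

# a made-up snippet standing in for real scraped content
html = '<div class="comment"><span>&gt; "quoted text"</span><p>Some reply.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# separator=" " puts a space between text from adjacent tags,
# strip=True trims the whitespace around each piece
text = soup.get_text(separator=" ", strip=True)
print(text)  # > "quoted text" Some reply.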
2. No clue what you’re looking for? Prettify your output first
Before I extract anything, I have a look at the web content. Soup helps you get through the code salad with a function called prettify() that makes the output readable:
from bs4 import BeautifulSoup

# textstring holds the raw HTML string you scraped
soup = BeautifulSoup(textstring, "html.parser")

# print an indented, readable version of the parse tree
print(soup.prettify())
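To see what prettify() buys you, here is a minimal sketch with a tiny made-up snippet:

from bs4 import BeautifulSoup

html = '<div class="comment"><span>Hello</span></div>'
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parse tree as a string with one tag
# per line and indentation that reflects the nesting
print(soup.prettify())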
3. Extracting URLs from <a> tags
Sometimes you find an <a> tag in the parsed content and want to extract its URL, i.e. the value of its href attribute.
Here’s the code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")

# collect all <a> tags in the document
links = soup.find_all("a")

# take the href attribute of the second link and normalize it
url = links[1]["href"].lower().strip()
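And if you want every URL on the page rather than a single one, you can loop over all the <a> tags. Using tag.get("href") instead of tag["href"] avoids a KeyError when a tag has no href; a small sketch, reusing the textstring assumption from above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")

# .get("href") returns None instead of raising for <a> tags
# that lack an href, so we can filter those out safely
urls = [a.get("href") for a in soup.find_all("a") if a.get("href")]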