Beautiful Soup [rephrase.net]

Tuesday, 09 January 2007

Beautiful Soup

It's been said that one of the advantages of our new "Web 2.0" is that, unlike its predecessor, it is a golden land of feeds, APIs and other Web services.¹ Personally, I'll believe that when I'm no longer subscribed to half a dozen pirate RSS feeds (that is, feeds generated by screen-scrapers) and don't have the creation of half a dozen more on my to-do list.

Pity, though, because screen-scraping sucks. If the operators of these sites were on board with semantic markup, they'd probably be on board with RSS too. Since they aren't, scraping is often a frustrating exercise in extracting meaningful data from a mass of nested tables, unclosed tags, illegal characters and sundry other horrors.

That's where Leonard Richardson's Beautiful Soup comes in.

If it just cleaned up the input, it'd be a useful tool, but it doesn't stop at normalization; Beautiful Soup makes it easy to actually extract the data you want while ignoring everything else. It's not just useful, it's fantastic.

By way of example, it's fairly popular to write screen-scraping tools for the Internet Movie Database (e.g.: Python, Perl, PHP). There's no API. A few years ago I wrote a scraper for the list of the top 250 films, a page which is pretty much par for the course as far as markup is concerned:

<tr bgcolor="#e5e5e5" valign="top"><td align="right"><font face="Arial, Helvetica, sans-serif" size="-1"><b>1.</b></font></td><td align="center"><font face="Arial, Helvetica, sans-serif" size="-1">9.1</font></td><td><font face="Arial, Helvetica, sans-serif" size="-1"><a href="/title/tt0068646/">The Godfather</a> (1972)</font></td><td align="right"><font face="Arial, Helvetica, sans-serif" size="-1">187,822</font></td></tr>

It's not invalid or ill-formed, just ugly and not particularly useful: font and b tags, redundant presentational attributes (bgcolor, align, face, size), and no helpful classes or ids to find elements by selector.

My parser looked something like this:

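// fetch() stands in for whatever HTTP helper returns the page body as a string.
$page = fetch('http://imdb.com/chart/top');

// Captures: rank, rating, ID, title, year. The title group stops at the
// closing </a> so that the tag itself is not captured with the title.
preg_match_all("|<b>([0-9]+)\..+?([0-9]\.[0-9]).+?/title/(tt[0-9]+)/\">([^<]+)</a> \(([0-9]{4})|", $page, $matches);

// Loop over however many rows matched rather than assuming all 250 did.
for ($i = 0; $i < count($matches[0]); $i++) {
    $rank        = $matches[1][$i];
    $imdb_rating = $matches[2][$i];
    $id          = $matches[3][$i];
    $title       = $matches[4][$i];
    $year        = $matches[5][$i];

    // do something with the data
}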

As Jamie Zawinski put it, "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." That's not always true, but it is here: that expression is a blob of difficult-to-follow logic, and will stop working if the page markup changes. It could be neatened up and commented (with the x modifier), but will never be readable.
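For what it's worth, here is roughly what the neatened-up, commented version might look like. It's a sketch in Python rather than PHP (re.VERBOSE is Python's counterpart to the x modifier), run against the sample row quoted above. The group names, ROW and parse_entries are my own inventions for illustration, and the title group stops at the closing </a> to suit that markup.

```python
import re

# The sample row quoted earlier in the post.
ROW = (
    '<tr bgcolor="#e5e5e5" valign="top"><td align="right">'
    '<font face="Arial, Helvetica, sans-serif" size="-1"><b>1.</b></font></td>'
    '<td align="center"><font face="Arial, Helvetica, sans-serif" size="-1">'
    '9.1</font></td>'
    '<td><font face="Arial, Helvetica, sans-serif" size="-1">'
    '<a href="/title/tt0068646/">The Godfather</a> (1972)</font></td>'
    '<td align="right"><font face="Arial, Helvetica, sans-serif" size="-1">'
    '187,822</font></td></tr>'
)

# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
ENTRY = re.compile(r"""
    <b>(?P<rank>[0-9]+)\.          # rank, e.g. "1."
    .+?                            # skip the markup between cells
    (?P<rating>[0-9]\.[0-9])       # rating, e.g. "9.1"
    .+?
    /title/(?P<id>tt[0-9]+)/">     # IMDb ID taken from the link href
    (?P<title>[^<]+)</a>           # the link text is the title
    \s*\((?P<year>[0-9]{4})\)      # year in parentheses
""", re.VERBOSE | re.DOTALL)


def parse_entries(html):
    """Return one dict of captured groups per matching row."""
    return [m.groupdict() for m in ENTRY.finditer(html)]


print(parse_entries(ROW))
```

Easier on the eyes, certainly, but every comment still describes markup rather than data, so it breaks just as readily when the page changes.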

Compare to an implementation using Beautiful Soup:

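import httplib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3; with bs4, "from bs4 import BeautifulSoup"

http = httplib2.Http()
resp, body = http.request("http://imdb.com/chart/top")
soup = BeautifulSoup(body)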

# There is only one node with the text "Rank" and nothing
# else. It occurs in the first cell of the first row of the 
# list table.     
th_row = soup.find(text="Rank").findParent("tr")

# Each subsequent row is an entry in the list.
for td_row in th_row.findNextSiblings("tr"):
    rank, rating, title_data, votes = td_row.findAll("td")

    # Extract ID from href ("/title/tt{id}/")
    imdb_id = title_data.a["href"][9:-1]

    # Now get the two text nodes, title and year.
    title, year = title_data.findAll(text=True)
    title = title.strip()
    year = year.strip(" ()")

    # We only want the text from the other cells, not
    # the containing markup.        
    rank, rating, votes = [node.find(text=True).strip(" .") for node in [rank, rating, votes]]

It's not shorter, but it's more natural. It could also have been written without even examining the page markup, because it uses visible text (the "Rank" column header) to locate the required data. A similar implementation could be created using XPath, though I've never seen a library that made it anywhere near as natural.
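For comparison, a minimal sketch of the XPath route, using the limited XPath subset in Python's standard xml.etree.ElementTree. The TABLE string is a hypothetical, well-formed stand-in for the page: the header row's real markup isn't quoted in this post, so the th cells are an assumption, and the font tags are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the chart table (an assumption,
# not the page's actual markup).
TABLE = """
<table>
  <tr><th>Rank</th><th>Rating</th><th>Title</th><th>Votes</th></tr>
  <tr><td><b>1.</b></td><td>9.1</td>
      <td><a href="/title/tt0068646/">The Godfather</a> (1972)</td>
      <td>187,822</td></tr>
</table>
"""

root = ET.fromstring(TABLE)

# Find the row with a header cell reading "Rank", step up to its parent,
# then select every row of that table that contains <td> cells.
rows = root.findall(".//tr[th='Rank']/../tr[td]")

entries = []
for row in rows:
    rank, rating, title_cell, votes = row.findall("td")
    link = title_cell.find("a")
    entries.append({
        "rank": "".join(rank.itertext()).strip(" ."),
        "rating": rating.text.strip(),
        "id": link.get("href")[9:-1],       # "/title/tt{id}/"
        "title": link.text.strip(),
        "year": link.tail.strip(" ()"),     # text following the link
        "votes": votes.text.strip(),
    })

print(entries)
```

Note the catch: ElementTree refuses anything that isn't well-formed XML, so on a real page this would fall over long before the XPath expression got a look in.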

Beautiful Soup can do a lot more, however, and it doesn't require markup to be well-formed. Even without the power provided by subclassing the parser, it's incredibly flexible.

Anywhere a string is accepted, for example, you can pass in a regular expression object:

# Match the text nodes "1.", "2.", "3." etc. When text is given, the
# search returns the matching strings rather than the tags around them.
rank_match = re.compile(r"^\d+\.$")
for rank in soup.findAll(text=rank_match):
    rating, title_data, votes = rank.findParent("td").findNextSiblings("td")
    # etc.

Or pass in a function that takes a Tag object:

# True if a list rank <tr>, else False
def check_this_tag(tag):
    # must be a row
    if tag.name != "tr":
        return False

    # with exactly four children
    if len(tag) != 4:
        return False

    # the first child must contain "<b>{[0-9]+}.</b>"
    bold = getattr(tag.contents[0], "b", None)
    if bold is None or not re.match(r"\d+\.", bold.string or ""):
        return False

    # and so on

    return True

for tr in soup.findAll(check_this_tag):
    pass  # etc.

I think I'm in love.

¹ No, not WS-*.