Beautiful Soup

January 9, 2007

It’s been said that one of the advantages of our new “Web 2.0” is that, unlike its predecessor, it is a golden land of feeds, APIs and other Web services.[1] Personally, I’ll believe that when I’m no longer subscribed to half a dozen pirate RSS feeds (that is, feeds generated by screen-scrapers) and don’t have the creation of half a dozen more on my to-do list.

Pity, though, because screen-scraping sucks: if the operators of these sites were on board with semantic markup, they’d probably be on board with RSS; scraping is often a frustrating exercise in attempting to extract meaningful data from a mass of nested tables, with unclosed tags, illegal characters and sundry other horrors.

That’s where Leonard Richardson’s Beautiful Soup comes in.

If it just cleaned up the input, it’d be a useful tool, but it doesn’t stop at normalization; Beautiful Soup makes it easy to actually extract the data you want while ignoring everything else. It’s not just useful, it’s fantastic.

By way of example, it’s fairly popular to write screen-scraping tools for the Internet Movie Database (e.g.: Python, Perl, PHP). There’s no API. A few years ago I wrote a scraper for the list of the top 250 films, a page which is pretty much par for the course as far as markup is concerned:

<tr bgcolor="#e5e5e5" valign="top"><td align="right"><font face="Arial, Helvetica, sans-serif" size="-1"><b>1.</b></font></td><td align="center"><font face="Arial, Helvetica, sans-serif" size="-1">9.1</font></td><td><font face="Arial, Helvetica, sans-serif" size="-1"><a href="/title/tt0068646/">The Godfather</a> (1972)</font></td><td align="right"><font face="Arial, Helvetica, sans-serif" size="-1">187,822</font></td></tr>

It’s not invalid or ill-formed, just ugly and not particularly useful: font and b elements, redundant presentational attributes, and no helpful classes or ids to find elements by selector.

My parser looked something like this:

$page = fetch('http://imdb.com/chart/top');
preg_match_all("|<b>([0-9]+)\..+?([0-9]{1}\.[0-9]{1}).+?/title/(tt[0-9]+)/\">([^\(]+)[ ]\(([0-9]{4})|", $page, $matches);

for ($i=0; $i<250; $i++) {
    $rank        = $matches[1][$i];
    $imdb_rating = $matches[2][$i];
    $id          = $matches[3][$i];
    $title       = $matches[4][$i];
    $year        = $matches[5][$i];

    // do something with the data
}

As Jamie Zawinski put it, “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” That’s not always true, but it is here: that expression is a blob of difficult-to-follow logic, and will stop working if the page markup changes. It could be neatened up and commented (with the x modifier), but will never be readable.
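For the curious, here is roughly what the neatened-up, commented version might look like. This is a sketch in Python (re.VERBOSE is its counterpart to the x modifier) rather than the original PHP, and the pattern is adapted to the single sample row shown earlier, so the names ROW, ENTRY and parse_chart are mine and nothing here has been checked against the live page.

```python
import re

# One row of the chart, copied from the sample markup above.
ROW = (
    '<tr bgcolor="#e5e5e5" valign="top"><td align="right">'
    '<font face="Arial, Helvetica, sans-serif" size="-1"><b>1.</b></font></td>'
    '<td align="center"><font face="Arial, Helvetica, sans-serif" size="-1">9.1</font></td>'
    '<td><font face="Arial, Helvetica, sans-serif" size="-1">'
    '<a href="/title/tt0068646/">The Godfather</a> (1972)</font></td>'
    '<td align="right"><font face="Arial, Helvetica, sans-serif" size="-1">187,822</font></td></tr>'
)

# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
ENTRY = re.compile(r"""
    <b>(?P<rank>[0-9]+)\.</b>       # rank in bold, followed by a full stop
    .+?                             # skip markup up to the rating
    (?P<rating>[0-9]\.[0-9])        # rating with one decimal place
    .+?                             # skip markup up to the title link
    /title/(?P<id>tt[0-9]+)/">      # IMDb id taken from the href
    (?P<title>[^<]+)</a>            # link text is the title
    \s*\((?P<year>[0-9]{4})\)       # year in parentheses after the link
""", re.VERBOSE)


def parse_chart(html):
    """Return one dict per chart entry found in the given markup."""
    return [match.groupdict() for match in ENTRY.finditer(html)]


for entry in parse_chart(ROW):
    print(entry["rank"], entry["title"], entry["year"], entry["rating"], entry["id"])
```

Easier on the eyes, certainly, but every line still depends on the exact order and shape of the markup.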

Compare to an implementation using Beautiful Soup:

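# Assumes httplib2 for fetching and Beautiful Soup 3 for parsing.
import httplib2
from BeautifulSoup import BeautifulSoup

http = httplib2.Http()
resp, body = http.request("http://imdb.com/chart/top")
soup = BeautifulSoup(body)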

# There is only one node with the text "Rank" and nothing
# else. It occurs in the first cell of the first row of the 
# list table.     
th_row = soup.find(text="Rank").findParent("tr")

# Each subsequent row is an entry in the list.
for td_row in th_row.findNextSiblings("tr"):
    rank, rating, title_data, votes = td_row.findAll("td")

    # Extract ID from href ("/title/tt{id}/")
    imdb_id = title_data.a["href"][9:-1]

    # Now get the two text nodes, title and year.
    title, year = title_data.findAll(text=True)
    title = title.strip()
    year = year.strip(" ()")

    # We only want the text from the other cells, not
    # the containing markup.        
    rank, rating, votes = [node.find(text=True).strip(" .") for node in [rank, rating, votes]]

It’s not shorter, but it’s more natural. It could also have been written without even examining the page markup, because it uses visible text (the “Rank” column header) to locate the required data. A similar implementation could be created using XPath, though I’ve never seen a library that made it anywhere near as natural.
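To give a feel for the XPath-flavoured alternative, here is a minimal sketch using ElementTree from the Python standard library on the sample row from earlier. Two loud assumptions: ElementTree only understands a small subset of XPath, and it needs well-formed XML, which the sample row happens to be but real-world tag soup usually isn't. ROW and parse_row are names made up for the illustration.

```python
import xml.etree.ElementTree as ET

# One row of the chart, copied from the sample markup above.
ROW = (
    '<tr bgcolor="#e5e5e5" valign="top"><td align="right">'
    '<font face="Arial, Helvetica, sans-serif" size="-1"><b>1.</b></font></td>'
    '<td align="center"><font face="Arial, Helvetica, sans-serif" size="-1">9.1</font></td>'
    '<td><font face="Arial, Helvetica, sans-serif" size="-1">'
    '<a href="/title/tt0068646/">The Godfather</a> (1972)</font></td>'
    '<td align="right"><font face="Arial, Helvetica, sans-serif" size="-1">187,822</font></td></tr>'
)


def parse_row(row_markup):
    """Extract one chart entry from a well-formed <tr> element."""
    tr = ET.fromstring(row_markup)
    rank_td, rating_td, title_td, votes_td = tr.findall("td")
    link = title_td.find(".//a")
    return {
        "rank": rank_td.findtext(".//b").strip(" ."),
        "rating": rating_td.findtext("font").strip(),
        # href looks like "/title/tt{id}/"
        "id": link.get("href")[9:-1],
        "title": link.text.strip(),
        # the year is the text that trails the link, e.g. " (1972)"
        "year": link.tail.strip(" ()"),
        "votes": votes_td.findtext("font").strip(),
    }


print(parse_row(ROW))
```

It works, but note how much of it is about element positions and text/tail bookkeeping rather than about the data.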

Beautiful Soup can do a lot more, however, and it doesn’t require markup to be well-formed. Even without the power provided by subclassing the parser, it’s incredibly flexible.

Anywhere a string is accepted, for example, you can pass in a regular expression object:

import re

# Match the cells with "1.", "2.", "3." etc.
rank_match = re.compile(r"^\d+\.$")
for rank in soup.findAll("td", text=rank_match):
    rating, title_date, votes = rank.findParent("td").findNextSiblings("td")
    # etc.

Or pass in a function that takes a Tag object:

# True if a list rank <tr>, else False
def check_this_tag(tag):
    # must be a row
    if tag.name != "tr":
        return False

    # with four children
    if len(tag) != 4:
        return False

    # the first child must contain "<b>{[0-9]+}.</b>"
    elif not re.match(r"\d+\.", tag.contents[0].b.string):
        return False

    # and so on

    else:
        return True

for tr in soup.findAll(check_this_tag):
    pass  # etc.

I think I’m in love.


  1. No, not WS-*

Can craft

December 7, 2006

I saw a post at MetaFilter about aluminium-can folk art and decided to have a go. It turned out to be easier than I thought.

The metal is thin enough that it’s trivial to cut with tin snips or even regular scissors; a ruler and a knife are good for straight lines. I used a regular Stanley knife with a segmented blade and had no problems other than the expected blunting.

Removing the top and bottom of the can without accidentally crushing it is the hard part, and even that isn’t very hard. Filling the can with water and freezing it might help there, especially if using a saw.

That leaves a sheet of metal:

[Image: sheet metal]

Once the jagged edges are taken into account, you should be able to get a usable sheet of at least 20cm by 8cm.

It’s quite an anticlimax, actually, because you work with the aluminium in almost exactly the same way as with paper or cardboard; it’s just harder to cut and harder to glue. (Staples work fine.) It also can’t withstand repeated stress from folding, so origami is out of the question, but it’s well-suited to cut-out models that don’t require repeated folds. In exchange, you get something that can survive harsh conditions. If you leave the outside of the can visible, you might even get a little pop-culture cachet.

This is what I ended up with:

[Image: can flower]

(And a few more.)

Nothing impressive compared to what other people have produced, but that was always a given: someone has already built an entire house.

As is

December 6, 2006

Apache is great. My latest discovery is mod_asis, which “provides for sending files which contain their own HTTP headers”. You just put the headers at the top of the file, separated from the content with a blank line.

Status: 200 OK
Content-Type: text/plain; charset=UTF-8
Last-Modified: Wed, 29 Nov 2006 08:12:31 GMT

Hello!

There’s nothing there you can’t do in your programming language of choice, but it’s far more elegant than, say, the PHP equivalent:

<?php
header('HTTP/1.1 200 OK');
header('Content-Type: text/plain; charset=UTF-8');
header('Last-Modified: Wed, 29 Nov 2006 08:12:31 GMT');

?>Hello!

It still seems pretty useless, but there are times when custom headers are important. At the very least, it’s a perfect fit for unit-testing HTTP clients.
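If you do keep canned responses around as test fixtures, it can help to sanity-check them before a client ever sees them. The following is a rough Python sketch of the layout described above (headers, a blank line, then content); parse_asis is a hypothetical helper of my own, it assumes plain "\n" line endings, and it makes no attempt to mirror what Apache itself does with the file.

```python
def parse_asis(text):
    """Split an as-is style document into (status, headers, body)."""
    # Everything before the first blank line is headers; the rest is content.
    head, _, body = text.partition("\n\n")
    status = None
    headers = []
    for line in head.splitlines():
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name.lower() == "status":
            # "Status: 410 Gone" becomes (410, "Gone")
            code, _, reason = value.partition(" ")
            status = (int(code), reason)
        else:
            headers.append((name, value))
    return status, headers, body


EXAMPLE = """\
Status: 200 OK
Content-Type: text/plain; charset=UTF-8
Last-Modified: Wed, 29 Nov 2006 08:12:31 GMT

Hello!
"""

status, headers, body = parse_asis(EXAMPLE)
print(status)
print(headers)
print(repr(body))
```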

For a more common example, consider the simple deletion of a file from a server. If you have no intention of putting it back, requests to that URL should return a status code of 410 (“Gone”), so clients know not to retry the request later. Apache has no mod_psychic to divine the webmaster’s intentions, and will return a less-useful 404 (“Not Found”). It’s a deliberately vague response, because the long-term status of the resource isn’t clear.

With mod_asis you can add index.asis to your DirectoryIndex directive as a drop-in replacement for index.*:

AddHandler send-as-is asis
DirectoryIndex index.html index.php index.asis

Or work some magic with (e.g.) MultiViews. Tada, a 410:

Status: 410 Gone
Content-Type: text/html

<html>
 <head>
  <title>Content Removed</title>
 </head>
 <body>
  <h1>Content Removed</h1>
  <p>Sorry -- this page of saucy H.P. Lovecraft limericks is gone and shan't be coming back. Cthulhu fhtagn!</p>
 </body>
</html>

Okay, so in practice there’ll probably be an easier or better way, like Redirect gone or mod_rewrite with the [G] flag, or having things handled automatically by your content-management system.

Still, it’s a nice tool, and you can never have too many of those.

Resource and representation; or, miscellaneous things learned while reading W3C mailing list archives, part one

November 8, 2006

Roy Fielding:

At no time whatsoever is the resource transferred across the network when doing a GET. Only a REPRESENTATION of that resource is transferred, and the fragment refers to a target within the representation and not within the resource. That is why fragments are media-type specific.

Files

Practically speaking, 99% of experience with URIs is going to be with http URLs on the World Wide Web. It’s from using websites, developing websites, server administration. For me, at least, it was a little of this and a little of that, only learning just enough to be getting along with the task at hand. That task never involved writing a HTTP server, so there were definitely no specifications involved. (Who reads standards anyway?)

That limited experience encouraged a file-centric view of the Web. It’s natural enough, given that most Web server software automagically provides a 1:1 mapping between files in a directory and publicly-accessible URLs:

  • /~sam/public_html/index.html = http://example.com/index.html
  • C:\apache\htdocs\pages\giraffe.jpg = http://example.com/pages/giraffe.jpg

Ergo, resources are files/documents, and it’s the resource itself — at least a bitwise copy of it — sent down the wire in response to requests. The server gets a request for index.html, and that’s what it sends back: you can tell it’s the same file; just “view source”.

Fake files

Moving from hand-coded HTML to dynamic content ought to have fixed this misconception, but it made things worse. Instead of this:

http://example.com/page/giraffe.html

I had this:

http://example.com/blog/01/01/giraffe

Instead of a HTML file, I had a text document about giraffes, stored in a database, dynamically converted into HTML and interpolated into a template written in yet another language, sent as bits and bytes to a remote client, which rendered it — styled by another language again — into something pretty and human-readable. Which of those was the “resource”?

I thought of it as faking files. It was conning browsers into thinking they’d been given real files when, in fact, it was magic fairy gold. I thought it was a clever trick. A quick Google search turns up plenty of pages with titles like “fake files/directories using mod_rewrite”, so I’m clearly not alone.

There are other reasons to take this view, too, like the insistence on calling a HTTP 404 status code a “File Not Found” error. (It’s just “Not Found”.)

Resources

It wasn’t “faking files”, because a URI doesn’t identify a file. Not only in practical terms, because the server software can generate its output however it pleases, but theoretically. It’s not a “Uniform File Identifier”: a URI identifies a resource.

RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. […]

This specification does not limit the scope of what might be a resource; rather, the term “resource” is used in a general sense for whatever might be identified by a URI.

That sounds frighteningly ambiguous, but the generality is liberating. Even if you don’t want to use a URI in ways that aren’t intuitive (what does it mean for one to identify a “physical resource”?), it’s obvious that an abstract resource is more flexible than a file. A resource can be anything. (A file is a resource, because “any information that can be named can be a resource”; it’s just a limited one.)

Conceptually, it’s a whole different ballgame.

Negotiation

To return to the quote that began this entry: performing a HTTP GET request does not transfer a resource. It transfers a representation of that resource. What makes this interesting is the explicit permission to have more than one representation of the same thing.

  • A multilingual site will have pages available in more than one language.
  • Some clients prefer different formats. An essay could be presented in plain text, HTML, Atom, PDF, streaming audio, interpretive dance video…

The only necessary thing is that the essential characteristics of the resource are present in every representation. Since a URI can identify just about anything, that means it’s up to you to decide what those characteristics are.

My favourite example is a URI that identifies a circle. It being a simple matter of geometry, there are dozens of possible representations which provide enough information to construct the same circle:

  1. The radius, plus coordinates of the centre.
  2. Three or more points on the circumference.
  3. Three points making up an equilateral triangle, with the qualification that the circle meets the midpoint of each side.
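As a quick check that the three really do describe the same thing, here is a small Python sketch that rebuilds the centre and radius from each of the other two. The function names and the sample circle (centre (1, 2), radius 5) are my own choices for illustration, and the triangle helper simply trusts that its input is equilateral.

```python
import math


def circle_from_three_points(p1, p2, p3):
    """Return (centre, radius) of the circle through three non-collinear points."""
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        raise ValueError("points are collinear")
    a2, b2, c2 = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)


def circle_from_equilateral_triangle(p1, p2, p3):
    """Return (centre, radius) of the circle touching the midpoint of each side.

    Assumes the triangle is equilateral; this is not verified.
    """
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    centre = ((ax + bx + cx) / 3, (ay + by + cy) / 3)
    midpoint = ((ax + bx) / 2, (ay + by) / 2)
    return centre, math.hypot(midpoint[0] - centre[0], midpoint[1] - centre[1])


def same_circle(c1, c2, tol=1e-9):
    """Compare two (centre, radius) pairs, allowing for floating-point error."""
    (x1, y1), r1 = c1
    (x2, y2), r2 = c2
    return all(math.isclose(a, b, abs_tol=tol) for a, b in ((x1, x2), (y1, y2), (r1, r2)))


# Representation 1: radius plus centre.
reference = ((1.0, 2.0), 5.0)

# Representation 2: three points on the circumference.
on_circle = [(6, 2), (1, 7), (-4, 2)]

# Representation 3: an equilateral triangle around the circle.
h = 10 * math.cos(math.radians(30))
triangle = [(1, 12), (1 - h, -3), (1 + h, -3)]

print(same_circle(reference, circle_from_three_points(*on_circle)))
print(same_circle(reference, circle_from_equilateral_triangle(*triangle)))
```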

And so on and so on and so on. Each of these can be presented in many different formats:

  1. Plain text, XML, or some other vaguely human-readable format.
  2. A vector graphics format, such as SVG.
  3. Spoken aloud and recorded as an MP3.

They share the same essential characteristics — the data required to reconstruct the circle. That’s it. Not all clients can handle all formats, of course: my math is extremely rusty, so calculus is right out; and I’m not fluent in binary. In HTTP, the solution is that clients send an Accept header specifying which formats are acceptable. The server gives them what they want, or, if that’s not possible, replies with “406 Not Acceptable”.

There are good reasons not to use content negotiation: it can be confusing to users, it’s more expensive to implement, it causes errors in some situations, and it adds complexity that’s probably unnecessary. More to the point, it’s likely that most clients will want the same format anyway, so why bother?

On the other hand, it opens up a lot of interesting possibilities, and in some contexts the costs are minimized. It can be useful if diversity of formats is already going to be supported, for example, such as when creating an interface for a web service. Instead of different URIs for data in XML or JSON, the preferred type can be specified in the Accept header.
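The server side of that can be sketched in a few lines of Python. This is a deliberately simplified take: it honours q-values and wildcards but ignores the finer precedence rules in the HTTP specification, and the names parse_accept, negotiate and CIRCLE, along with the circle data itself, are made up for the example.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, most preferred first."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # no q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the order the client sent
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header, available):
    """Pick a representation; return (status, media_type, body)."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        for offered, body in available.items():
            main_type = offered.split("/")[0]
            if media_type in (offered, main_type + "/*", "*/*"):
                return 200, offered, body
    return 406, None, None


# Three representations of the same (hypothetical) circle resource.
CIRCLE = {
    "text/plain": "centre=(1, 2) radius=5",
    "application/json": '{"centre": [1, 2], "radius": 5}',
    "image/svg+xml": '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="1" cy="2" r="5"/></svg>',
}

print(negotiate("image/svg+xml, application/json;q=0.5", CIRCLE))
print(negotiate("audio/mpeg", CIRCLE))
```

One URI, several formats, and a 406 for the client that insists on an MP3.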

Methods

There’s an aptly-titled section in Fielding’s dissertation called “Manipulating Shadows”:

Defining resource such that a URI identifies a concept rather than a document leaves us with another question: how does a user access, manipulate, or transfer a concept such that they can get something useful when a hypertext link is selected? REST answers that question by defining the things that are manipulated to be representations of the identified resource, rather than the resource itself. An origin server maintains a mapping from resource identifiers to the set of representations corresponding to each resource. A resource is therefore manipulated by transferring representations through the generic interface defined by the resource identifier.

How can a resource be “manipulated by transferring representations”?

The circle example from the previous section is a good illustration. Changing the radius of a circle results in a different circle, either larger or smaller, but that doesn’t mean that a radius and a centre point are the circle. They’re just a representation of it.

On the Web, the representation submitted to the server is most likely to be the body of a HTTP POST request — from a HTML form, for example — which consists of nothing more than specially encoded key-value pairs. The server can take that data and construct weblog comments, airline bookings, whatever.
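To make that concrete, here is a minimal Python sketch of the server's half of the exchange, using parse_qs from the standard library to decode a form body. The field names (author, text) and the comment_from_form helper are invented for the illustration; a real weblog would validate and store the result rather than just returning a dict.

```python
from urllib.parse import parse_qs


def comment_from_form(body):
    """Build a comment from an application/x-www-form-urlencoded request body."""
    fields = parse_qs(body, keep_blank_values=True)
    return {
        # parse_qs returns a list of values per key; take the first of each
        "author": fields.get("author", ["anonymous"])[0],
        "text": fields.get("text", [""])[0],
    }


# What a browser might send after the form is submitted.
body = "author=Sam&text=Giraffes+are+tall%21"
print(comment_from_form(body))
```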

Atom

This explains everything I never understood about the Atom Publishing Protocol. How is it possible to POST an Atom entry to a collection and create a resource that will display in HTML? Especially one that will display alongside significant non-entry data: CSS for presentation, sidebars, navigation etc. Why is the link URL different from the edit-URI and different from the ID URI? Doesn’t the Atom representation identified by the edit-URI better represent the entry than the one located by the link?

The technical side, the mechanics of it, that was always easy. It was the theory that didn’t make sense. It was like “faking files”: a kind of dodgy hack that worked, certainly, but felt dirty, like taking advantage of a weakness in the system. Like a kludge.

If anything, the truth is the exact opposite. Having a separate edit-URI feels like a nod to practicality; there’s nothing stopping a pure implementation using content-negotiation to serve both HTML and Atom from the same address. If there’s a hack there, it’s in mapping files to URLs without thought for what resources those URIs will identify. It’s quick and convenient, and sure, it works. But it’s still a kludge.

(A rose by any other name?)

Further Reading

David Simon On “The Average Reader”

From his Believer interview with Nick Hornby:

My standard for verisimilitude is simple and I came to it when I started to write prose narrative: fuck the average reader. I was always told to write for the average reader in my newspaper life. The average reader, as they meant it, was some suburban white subscriber with two-point-whatever kids and three-point-whatever cars and a dog and a cat and lawn furniture. He knows nothing and he needs everything explained to him right away, so that exposition becomes this incredible, story-killing burden. Fuck him. Fuck him to hell.

DVD Flick - GPL DVD authoring from DivX, FLV etc.

Nice Win32 front-end for DVDAuthor, amongst others.

Philosophy as an experimental science

[A] restive contingent of our tribe is convinced that it can shed light on traditional philosophical problems by going out and gathering information about what people actually think and say about our thought experiments. [...]

It always irritated me in philosophy discussions when someone would seriously make an argument that something was "intuitively" true or false. There are already some interesting results:

Recently, a team of philosophers led by Machery came up with situations that had the same form as Kripke's and presented them to two groups of undergraduates — one in New Jersey and another in Hong Kong. The Americans, it turned out, were significantly more likely to give the responses that Kripke took to be obvious; the Chinese students had intuitions that were consonant with the older theory of reference.

ASCII Rave in Haskell

The idea is to be able to make instrumental sounds by typing onomatopoeic words.

Software now available.

insightful internet comment of the day

Pretty much everyone laughs at suffering of some sort or another, and everyone has a limit, a degree of suffering beyond which they'll find no humor. And everyone thinks that people who draw the line elsewhere are either hypersensitive or monstrous.

  • Mr. President Dr. Steve Elvis America

(You can't argue with qualifications like those!)