Archive: March 2007

cron and PYTHONPATH

March 4, 2007

For obvious reasons Dreamhost won’t let users install arbitrary Python packages into the shared site-packages directory, so I keep them in another directory (~/pylib) and add it to my PYTHONPATH.

I was beating my head against the wall because module imports would fail when scripts were run as cron jobs. It turns out that cron runs them in a limited environment, so my custom environment variables, including PYTHONPATH, don’t get loaded.

There are several fixes, but the easiest is to set the variable from within the crontab file:

PYTHONPATH=/home/sam/pylib
0 0 * * * something.py > something.log
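
One of the other fixes, as a rough sketch: have the script add the private directory to its own import path before importing anything that lives there. The helper name add_private_lib is made up for illustration, and this assumes cron still sets HOME so that "~" expands correctly.

```python
import os
import sys


def add_private_lib(path="~/pylib"):
    """Put a private package directory at the front of sys.path (once)."""
    full = os.path.expanduser(path)
    if full not in sys.path:
        sys.path.insert(0, full)
    return full


# Call this before importing anything that lives in ~/pylib.
add_private_lib()
```

The crontab approach keeps the scripts untouched, which is why it is the easier of the two; this one only helps scripts that you are willing to edit.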

Hooray!

(I wonder if I’ll ever learn to read the manual first…)

Audiobook: A Series of Unfortunate Events

March 27, 2007

The Bad Beginning, The Reptile Room, The Wide Window, The Miserable Mill, The Austere Academy, The Ersatz Elevator, The Vile Village, The Hostile Hospital, The Carnivorous Carnival, The Slippery Slope, The Grim Grotto, The Penultimate Peril, The End, by Lemony Snicket.

Presentation

The best thing about the books in A Series of Unfortunate Events is their existence as actual physical artifacts. If there were ever a compelling argument against e-paper, it’s these: beautiful, rough-cut hardcovers, tastefully illustrated by Brett Helquist. They’re just wonderful, especially given the awful covers on so many children’s books; compare, for example, the grotesque gaudiness of Artemis Fowl. The greatest tragedy of the film adaptation is that the charmless tie-in editions of the books replaced the originals on most store shelves.

It wouldn’t be outrageous to make a similar claim for the audio versions: Tim Curry (Rocky Horror etc.) reads, and he’s perfect.

There are two different editions of the first volume, one read entirely by Mr. Curry and one which he merely narrates, accompanied by sound effects and a full cast for the dialogue. I have only heard the original. Additionally — and unfortunately — he’s replaced on the third, fourth and fifth books by Daniel Handler, under his Lemony Snicket alias. He does a creditable job, but just can’t compare.

Content

The story proper begins at Briny Beach, where Violet, Klaus and Sunny Baudelaire have their holiday interrupted by news that their parents have perished in a terrible fire, which has also destroyed their house and all of their possessions. As if that wasn’t quite enough unpleasantness for a morning, the three newly-orphaned children are adopted by a distant relative, the villainous Count Olaf, whose only interest in them is their parents’ fortune, held in trust until Violet comes of age.

It’s charmingly dark. The Bad Beginning even opens with an admonition that the book not only has no happy ending, but also no happy beginning or middle; the reader would be better off closing it and looking elsewhere for something more pleasant. It’s true: very little happens to the Baudelaires that isn’t bad, and what few happy times they do have usually consist of a brief respite between one horrible circumstance and the next. Nor is it all cartoon-style peril, where the worst that can happen is a bump on the head from a falling anvil. It’s always the nicest ones who have to die…

No matter how dark it gets, the style is delightful. It’s self-conscious, with frequent authorial asides warning the reader to lay the books aside and take up something fun, like knitting, or bemoaning the way that Mr. Snicket himself was once shipwrecked, or poisoned, or trapped in a pit full of ravenous armadillos. There are lots of nice little touches, like the explanatory parentheticals (here, the word “parenthetical” refers to a distracting mid-sentence interruption), which begin as vocabulary lessons for young readers but become saturated in irony as the series progresses.

The first seven books follow roughly the same formula: Violet, Klaus and Sunny are sent to live with a new guardian, only to be faced with another of Count Olaf’s schemes. It doesn’t take long for the novelty to wear off, so it comes as a relief when the sixth and seventh books gently transition into a new formula, based on the investigation of a mystery first uncovered at The Austere Academy. By the time the orphans visit The Hostile Hospital, they’ve taken control of their own lives and are actively attempting to unravel the Byzantine tangle of secrets they’ve become trapped within.

Events come to a head in the aptly-titled The Penultimate Peril, with virtually every character in the entire series — at least, those that are still alive — in residence at the Hotel Denouement, though prospective readers should bear in mind that the mystery is never really resolved. What makes The End, the final book, so charming and interesting, is that it makes the not-knowing palatable. The entire volume is, in effect, a continuation of the denouement that began at the (also aptly-titled) hotel, and its character is markedly different.

The End is the end of, essentially, a long coming-of-age story, charting the Baudelaires’ intellectual, emotional and moral maturation. They begin as children, thrown from the care of one adult to another, their only hope of escaping trouble being to somehow alert an adult (notably Mr. Poe, their bank manager) to Olaf’s latest treachery. By The End, they’ve become fully independent. (Most blatantly, Sunny grows from unintelligible baby to monosyllabic toddler to, finally, articulate gourmet chef.)

The moral character of the novels also changes. The Bad Beginning establishes a dichotomy between villains on the one hand, such as Count Olaf, and “noble people”, like Olaf’s kindly neighbour, on the other, but it is steadily eroded through the sequels. Most of the “noble people” suffer from a weakness of character that prevents them from helping the children: Josephine is a coward, Monty is too trusting, Jerome is too self-interested; others fall to peer pressure, blind faith in the rule of law, or undue deference to authority. The villains, even Olaf, are humanized, while the Baudelaires find themselves committing more than a few morally ambiguous acts.

It’s hard to complain that the story is never fully resolved when, in fact, that seems to be the point. Nothing is simple. Even villains have their good sides; even noble people can be villainous — and the difference between the two is only a matter of perspective. Every answer raises two new questions. Every person has two parents with stories of their own, four grandparents with stories of their own, ad infinitum.

They don’t live happily ever after, of course. If there’s a moral here, it’s that there is no end to the series of unfortunate events, just the books with that name. We call that series “life”. For some people it’s better, for some it’s worse, but what matters is how you live it. Try to do the right thing, find what happiness you can along the way, and ask yourself: “What’s the opposite of what Olaf would do?”

ESPN Soup

March 4, 2007

Someone emailed out of the blue after stumbling across my previous post on Beautiful Soup, wanting help extracting pitch-by-pitch baseball match data from the ESPN match-result pages. (For example, this game.) I thought: “why not?” If you’re new to screen-scraping, it might be of interest to you too.

Note

Actually harvesting ESPN data in this way is against the site’s terms of use, and republishing it, especially in a commercial context, would be like fresh blood in the water to their lawyers. It is not a course of action I recommend or endorse. It does, however, make a pretty good case-study in screen-scraping structured data out of markup not designed for that purpose. I believe that any use of ESPN materials in this post constitutes fair use/fair dealing for the purpose of research or study.

On with it

First, the setup: we need to get the page in question and parse it. httplib2 is a good choice for the fetching. For parsing, Beautiful Soup.

import httplib2
from BeautifulSoup import BeautifulSoup

url = "http://scores.espn.go.com/mlb/playbyplay?gameId=260726120&full=1"

http = httplib2.Http(".cache")
resp, body = http.request(url)
soup = BeautifulSoup(body)

What we’re looking for is the pitch-by-pitch data for each half, i.e., everything from “San Francisco - Top of 1st” on down. The page markup looks like this:

<table cellpadding="3" cellspacing="1" class="tablehead">
    <tr style="background-color: #000000;" class="stathead">
        <td colspan=2>San Francisco - Top of 1st</td>
        <td colspan=2 align="center">SCORE</td>
    </tr>
    <tr class="colhead">
        <td colspan=2>Pedro Astacio pitching for Washington</td>
        <td align="center">SFO</td>
        <td align="center">WAS</td>
    </tr>
    <tr class="oddrow">
        <td nowrap>Randy Winn</td>
        <td>Ball, Strike (foul), Strike (foul), Ball, <span class="bi">R Winn singled to center</span></td>
        <td align="center">0</td>
        <td align="center">0</td>            
    </tr>

There are lots of ways we could go about finding this in the page tree. We could look for some specific text, then navigate from it to the table:

td = soup.find(text="San Francisco - Top of 1st")
table = td.findPrevious("table")
# or:
table = td.parent.parent

Obviously, that’s not good enough if we want the script to work for a game where San Francisco wasn’t playing. Just the “Top of 1st” part, then:

import re

expression = re.compile(r"Top of 1st$")
td = soup.find(text=expression)

Conveniently, though, that table is the only one on the page with a class of “tablehead”, so we’ll search by that instead.

table = soup.find("table", {"class":"tablehead"})

This method might be less robust than searching by text. The problem with searching by attributes like “class” is that they’re non-essential: when the page is redesigned, there’s a reasonable chance that the class will change, be replaced by an id, and so on. The table itself might be replaced with an unordered list or a series of divs. It’s only the essential data (like the scores) that is guaranteed to stay.

We’ve already noticed that every half starts with a row containing unique text, like this:

San Francisco - Top of 1st

Or, more generally:

.*? - (Top|Bottom) of \d+(st|nd|rd|th)
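
To show what a pattern like that buys us, here is a small sketch (Python 3, standard library only) that splits a half-inning heading into team, half and inning. The names HALF_RE and parse_half are my own inventions for illustration, and the pattern uses \d+ for the inning number so that extra innings such as "10th" match too.

```python
import re

# Named groups make the pieces easy to pull out afterwards.
HALF_RE = re.compile(
    r"^(?P<team>.+?) - (?P<half>Top|Bottom) of (?P<inning>\d+)(?:st|nd|rd|th)$"
)


def parse_half(text):
    """Return (team, half, inning) for a heading, or None if it isn't one."""
    match = HALF_RE.match(text.strip())
    if match is None:
        return None
    return match.group("team"), match.group("half"), int(match.group("inning"))


result = parse_half("San Francisco - Top of 1st")
```

Anything that doesn't look like a heading (the "SCORE" cell, say) comes back as None, so the same function can double as a test for "is this the start of a half?".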

Again, we could search by that text using a regular expression, but each of those heading rows has a class of “stathead”, so it’s easier to search by that.

halves = table.findAll("tr", {"class":"stathead"})
for half in halves:
    pass  # process each half here

Each of the halves we’ve found is a row (<tr>) with two cells, the first of which contains the relevant text. We can select it in half a dozen ways:

text = half.contents[0].contents[0]
text = half.find("td").contents[0]
text = half.td.contents[0]
text = half.td.string
text = half.find(text=True)

I prefer the last, because it most accurately models the way I’m thinking about the problem: “find the text in this element”. The others are more conscious of the specific markup involved.

halves = table.findAll("tr", {"class":"stathead"})
for half in halves:
    battingteam = half.find(text=True)

The score data is found in the rows following that one:

rows = half.findNextSiblings("tr")

# The pitcher is listed in the first row
pitcher = rows[0].find(text=True)

for row in rows[1:]:

Unfortunately, the call to findNextSiblings returned many more rows than we want. All of the halves are in the same table, so asking for every following row is the same as asking for every row in the current half, as well as every row in every half after that. Ideally we would only select the rows that we want, but it’s easier to get them all and just stop when we find one we don’t like.

for row in rows[1:]:
    # If the class is "stathead", we're at the next half
    if row["class"] == "stathead":
        break
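
The same "take everything until the next heading" idea can be sketched without Beautiful Soup at all, using itertools.takewhile from the standard library. The plain dicts below are stand-ins for the row elements, and the sample data (including the second heading) is made up for illustration; rows_in_half is a hypothetical helper name.

```python
from itertools import takewhile


def rows_in_half(following_rows):
    """Keep rows up to, but not including, the next "stathead" row."""
    return list(
        takewhile(lambda row: row.get("class") != "stathead", following_rows)
    )


# Stand-in data: everything that follows one heading row in the table.
sample_rows = [
    {"class": "colhead", "text": "Pedro Astacio pitching for Washington"},
    {"class": "oddrow", "text": "Randy Winn"},
    {"class": "colhead", "text": "0 Runs, 1 Hits, 0 Errors"},
    {"class": "stathead", "text": "Washington - Bottom of 1st"},  # next half
    {"class": "oddrow", "text": "(rows belonging to the next half)"},
]

kept = rows_in_half(sample_rows)
```

Using row.get("class") rather than row["class"] means a row with no class at all is simply kept instead of raising an error.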

There are three more types of rows we’ll have to deal with, and each one needs to be handled differently.

At the end of each half is a summary of runs, hits and errors:

# If the class is "colhead", it's showing the runs/hits/errors summary
elif row["class"] == "colhead":
    summary = row.find(text=True)

Some rows note events such as a change of pitchers or a pinch-hitter:

# If the first cell contains nothing but a space, the second cell has
# information about relief etc.
elif row.contents[0].string == "&nbsp;":
    secondcell = row.contents[1]
    # Ignore the markup, get all the text and join it together
    info = "".join(secondcell.findAll(text=True))

There’s some extra markup there we could use to find out more about the data — when the pitcher changes, for example, it’s listed in bold green, but when the catcher changes the text is normal — but that also makes it harder to extract the text. We can’t just find the first text node: we have to recursively find all text nodes and join them together.
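
For anyone curious what "find all text nodes and join them" amounts to, here is a rough equivalent in Python 3 that swaps Beautiful Soup for the standard library's html.parser. TextCollector and all_text are names I've invented for the sketch; it is not how Beautiful Soup does it internally, just the same idea in miniature.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every piece of text, however deeply the markup is nested."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.chunks.append(data)


def all_text(markup):
    """Strip the tags from a fragment and return its text as one string."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


cell = 'Ball, Strike (foul), <span class="bi">R Winn singled to center</span>'
text = all_text(cell)
```

The span disappears and its contents are stitched back into the surrounding text, which is exactly what the "".join(...findAll(text=True)) idiom does.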

The rest of the rows list, pitch-by-pitch, each batter’s performance:

else:
    data = row.findAll("td")
    # First cell contains the batter's name
    batter = data[0].find(text=True)

    # Second cell lists pitch-by-pitch results.
    pitchbypitch = "".join(data[1].findAll(text=True))

    # Third and fourth cells list the current score
    team1score = data[2].find(text=True)
    team2score = data[3].find(text=True)

As with the previous type of row, the cells with pitch-by-pitch play data have some text in bold, some in bold green, some in bold red. If we were scraping for statistics, we could do some more interesting parsing to pull out stats on how players went out, how they scored, and so on, but that’s something for another day.
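
As a taste of what that parsing might look like, here is a hedged sketch (Python 3) that separates the individual pitches from the final outcome in one of those strings. It assumes that every pitch is written as "Ball" or "Strike" with an optional note in brackets, which is all the sample data shows; any other label would be treated as the start of the outcome. PITCH_RE and split_pitches are hypothetical names.

```python
import re

# A single pitch: "Ball" or "Strike", optionally followed by "(note)".
PITCH_RE = re.compile(r"^(Ball|Strike)(?: \(([^)]*)\))?$")


def split_pitches(pitchbypitch):
    """Return ([(kind, note), ...], outcome) for one batter's cell text."""
    parts = [part.strip() for part in pitchbypitch.split(",")]
    pitches = []
    for index, part in enumerate(parts):
        match = PITCH_RE.match(part)
        if match is None:
            # First thing that isn't a pitch: the rest is the outcome,
            # which may itself contain commas.
            return pitches, ", ".join(parts[index:])
        pitches.append((match.group(1), match.group(2)))
    return pitches, ""


pitches, outcome = split_pitches(
    "Ball, Strike (foul), Strike (foul), Ball, R Winn singled to center"
)
```

Splitting on every comma would mangle outcomes like "grounded into double play, third to second to first", which is why the function stops at the first piece that isn't a pitch and glues the remainder back together.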

The complete script has a few print statements thrown in there, for output like this:

San Francisco - Top of 1st
 Pedro Astacio pitching for Washington:
  Randy Winn: Ball, Strike (foul), Strike (foul), Ball, R Winn singled to center (0/0)
  Omar Vizquel: Strike (bunted foul), Ball, O Vizquel grounded into double play, third to second to first, R Winn out at second (0/0)
  Shea Hillenbrand: Strike (looking), S Hillenbrand lined out to left (0/0)
-- 0 Runs, 1 Hits, 0 Errors --

I don’t even like baseball…

This page contains the archive of articles posted to rephrase.net during March 2007.