Someone emailed out of the blue after stumbling across my previous post on Beautiful Soup, wanting help extracting pitch-by-pitch baseball match data from the ESPN match-result pages. (For example, this game.) I thought: “why not?” If you’re new to screen-scraping, it might be of interest to you too.
Actually harvesting ESPN data in this way is against the site’s terms of use, and republishing it, especially in a commercial context, would be like fresh blood in the water to their lawyers. It is not a course of action I recommend or endorse. It does, however, make a pretty good case-study in screen-scraping structured data out of markup not designed for that purpose. I believe that any use of ESPN materials in this post constitutes fair use/fair dealing for the purpose of research or study.
First, the setup: we need to get the page in question and parse it. httplib2 is a good choice for the fetching. For parsing, Beautiful Soup.
import re
import httplib2
from BeautifulSoup import BeautifulSoup
url = "http://scores.espn.go.com/mlb/playbyplay?gameId=260726120&full=1"
http = httplib2.Http(".cache")
resp, body = http.request(url)
soup = BeautifulSoup(body)
What we’re looking for is the pitch-by-pitch data for each half, i.e., everything from “San Francisco - Top of 1st” on down. The page markup looks like this:
<table cellpadding="3" cellspacing="1" class="tablehead">
<tr style="background-color: #000000;" class="stathead">
<td colspan=2>San Francisco - Top of 1st</td>
<td colspan=2 align="center">SCORE</td>
</tr>
<tr class="colhead">
<td colspan=2>Pedro Astacio pitching for Washington</td>
<td align="center">SFO</td>
<td align="center">WAS</td>
</tr>
<tr class="oddrow">
<td nowrap>Randy Winn</td>
<td>Ball, Strike (foul), Strike (foul), Ball, <span class="bi">R Winn singled to center</span></td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
There are lots of ways we could go about finding this in the page tree. We could look for some specific text, then navigate from it to the table:
td = soup.find(text="San Francisco - Top of 1st")
table = td.findPrevious("table")
# or:
table = td.parent.parent
Obviously, that’s not good enough if we want the script to work for a game where San Francisco wasn’t playing. Just the “Top of 1st” part, then:
expression = re.compile(r"Top of 1st$")
td = soup.find(text=expression)
Conveniently, though, that table is the only one on the page with a class of “tablehead”, so we’ll search by that instead.
table = soup.find("table", {"class":"tablehead"})
This method might be more fragile than searching by text, though. The problem with searching by attributes like “class” is that they’re non-essential: when the page is redesigned, there’s a reasonable chance that the class will change, be replaced by an id, and so on. The table itself might be replaced with an unordered list or a series of divs. It’s only the essential data (like the scores) that is guaranteed to stay.
We’ve already noticed that every half starts with a row containing unique text, like this:
San Francisco - Top of 1st
Or, more generally:
.*? - (Top|Bottom) of \d+(st|nd|rd|th)
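As a rough sketch of how that pattern could be put to work with Python's re module, the function below splits a header into team, half and inning. The names HALF and parse_half are mine, not part of the script, and the inning is matched with \d+ so that extra innings (10th and beyond) also match.

```python
import re

# Hypothetical helper pattern: team name, "Top" or "Bottom", inning number.
HALF = re.compile(
    r"^(?P<team>.+?) - (?P<half>Top|Bottom) of (?P<inning>\d+)(?:st|nd|rd|th)$"
)


def parse_half(text):
    """Return (team, half, inning) for a half-inning header, else None."""
    match = HALF.match(text.strip())
    if match is None:
        return None
    return match.group("team"), match.group("half"), int(match.group("inning"))
```

Anything that isn't a half-inning header (the “SCORE” cell, say) simply comes back as None.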
Again, we could search by that text using a regular expression, but each row has a class of “stathead”, so it’s easier to search by that.
halves = table.findAll("tr", {"class":"stathead"})
for half in halves:
    # process
Each of the halves we’ve found is a row (<tr>) with two cells, the first of which contains the relevant text. We can select it in several ways:
text = half.contents[0].contents[0]
text = half.find("td").contents[0]
text = half.td.contents[0]
text = half.td.string
text = half.find(text=True)
I prefer the last, because it most accurately models the way I’m thinking about the problem: “find the text in this element”. The others are more conscious of the specific markup involved.
halves = table.findAll("tr", {"class":"stathead"})
for half in halves:
    battingteam = half.find(text=True)
The score data is found in the rows following that one:
    rows = half.findNextSiblings("tr")
    # The pitcher is listed in the first row
    pitcher = rows[0].find(text=True)
    for row in rows[1:]:
Unfortunately, the call to findNextSiblings returned many more rows than we want. All of the halves are in the same table, so asking for every following row is the same as asking for every row in the current half, as well as every row in every half after that. Ideally we would only select the rows that we want, but it’s easier to get them all and just stop when we find one we don’t like.
    for row in rows[1:]:
        # If the class is "stathead", we're at the next half
        if row["class"] == "stathead":
            break
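The same get-everything-then-stop idea can also be written with itertools.takewhile. This is only a sketch: the (class, text) tuples are hypothetical stand-ins for the real <tr> tags, so it runs without Beautiful Soup or a network connection, and the second half's contents are made up for illustration.

```python
from itertools import takewhile

# Hypothetical stand-ins for the rows that follow one half's header row.
following_rows = [
    ("colhead", "Pedro Astacio pitching for Washington"),
    ("oddrow", "Randy Winn"),
    ("evenrow", "Omar Vizquel"),
    ("colhead", "0 Runs, 1 Hits, 0 Errors"),
    ("stathead", "Washington - Bottom of 1st"),  # the next half starts here
    ("colhead", "(pitcher row of the next half)"),
]


def rows_in_this_half(rows):
    """Keep rows up to, but not including, the next "stathead" row."""
    return list(takewhile(lambda row: row[0] != "stathead", rows))


this_half = rows_in_this_half(following_rows)
```

With real tags the predicate would test row["class"] instead of row[0].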
There are three more types of rows we’ll have to deal with, and each one needs to be handled differently.
At the end of each half is a summary of runs, hits and errors:
        # If the class is "colhead", it's showing the runs/hits/errors summary
        elif row["class"] == "colhead":
            summary = row.find(text=True)
Some rows note events such as a change of pitchers or a pinch-hitter:
        # If the first cell contains nothing but a space, the second cell has
        # information about relief etc.
        elif row.contents[0].string == " ":
            secondcell = row.contents[1]
            # Ignore the markup, get all the text and join it together
            info = "".join(secondcell.findAll(text=True))
There’s some extra markup there we could use to find out more about the data — when the pitcher changes, for example, it’s listed in bold green, but when the catcher changes the text is normal — but that also makes it harder to extract the text. We can’t just find the first text node: we have to recursively find all text nodes and join them together.
The rest of the rows list, pitch-by-pitch, each batter’s performance:
        else:
            data = row.findAll("td")
            # First cell contains the batter's name
            batter = data[0].find(text=True)
            # Second cell lists pitch-by-pitch results.
            pitchbypitch = "".join(data[1].findAll(text=True))
            # Third and fourth cells list the current score
            team1score = data[2].find(text=True)
            team2score = data[3].find(text=True)
As with the previous type of row, the cells with pitch-by-pitch play data have some text in bold, some in bold green, some in bold red. If we were scraping for statistics, we could do some more interesting parsing to pull out stats on how players went out, how they scored, and so on, but that’s something for another day.
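As a small taste of that extra parsing, here is a minimal sketch that splits the joined pitch-by-pitch text into the individual pitches and the play that ended the at-bat. It assumes every pitch entry starts with “Ball”, “Strike” or “Foul”; that list is a guess based on the sample rows, and the name split_pitches is made up for the example.

```python
import re

# Assumption: pitch entries begin with one of these words. The real pages may
# use labels this sketch doesn't know about.
PITCH = re.compile(r"^(Ball|Strike|Foul)\b")


def split_pitches(pitchbypitch):
    """Split "Ball, Strike (foul), ..., <play>" into (pitches, play)."""
    parts = [part.strip() for part in pitchbypitch.split(",")]
    pitches = []
    # Leading entries that look like pitches are collected one by one...
    while parts and PITCH.match(parts[0]):
        pitches.append(parts.pop(0))
    # ...and whatever is left is the play, which may contain commas itself.
    return pitches, ", ".join(parts)
```

The play description is put back together with commas because plays such as double plays are themselves comma-separated.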
The complete script has a few print statements thrown in there, for output like this:
San Francisco - Top of 1st
Pedro Astacio pitching for Washington:
Randy Winn: Ball, Strike (foul), Strike (foul), Ball, R Winn singled to center (0/0)
Omar Vizquel: Strike (bunted foul), Ball, O Vizquel grounded into double play, third to second to first, R Winn out at second (0/0)
Shea Hillenbrand: Strike (looking), S Hillenbrand lined out to left (0/0)
-- 0 Runs, 1 Hits, 0 Errors --
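To try the row-handling rules without fetching anything from ESPN or installing Beautiful Soup, here is a sketch built on the standard library's html.parser instead. It is a swapped-in technique, not the script described above: RowCollector and group_by_half are hypothetical names, the input is the markup sample from earlier in the post, and substitution rows are simply skipped.

```python
from html.parser import HTMLParser

# Copy of the markup sample shown earlier, closed off with </table>.
SAMPLE = """
<table cellpadding="3" cellspacing="1" class="tablehead">
<tr style="background-color: #000000;" class="stathead">
<td colspan=2>San Francisco - Top of 1st</td>
<td colspan=2 align="center">SCORE</td>
</tr>
<tr class="colhead">
<td colspan=2>Pedro Astacio pitching for Washington</td>
<td align="center">SFO</td>
<td align="center">WAS</td>
</tr>
<tr class="oddrow">
<td nowrap>Randy Winn</td>
<td>Ball, Strike (foul), Strike (foul), Ball, <span class="bi">R Winn singled to center</span></td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect a (class, [cell text, ...]) pair for every <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None  # cells of the open row, None outside a row
        self._text = None   # text pieces of the open cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
            self.rows.append((dict(attrs).get("class"), self._cells))
        elif tag == "td" and self._cells is not None:
            self._text = []

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # Join all text in the cell, ignoring nested markup like <span>
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr":
            self._cells = self._text = None

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)


def group_by_half(rows):
    """Apply the row rules from the walkthrough to (class, cells) pairs."""
    halves = []
    for css_class, cells in rows:
        if css_class == "stathead":
            # A new half starts here
            halves.append({"half": cells[0], "pitcher": None,
                           "summary": None, "batters": []})
        elif not halves:
            continue  # rows before the first half aren't play-by-play
        elif css_class == "colhead":
            # The first colhead row names the pitcher, a later one is the summary
            key = "pitcher" if halves[-1]["pitcher"] is None else "summary"
            halves[-1][key] = cells[0]
        elif cells and cells[0] == "":
            continue  # substitution notes, skipped in this sketch
        else:
            halves[-1]["batters"].append(tuple(cells))
    return halves


collector = RowCollector()
collector.feed(SAMPLE)
collector.close()
halves = group_by_half(collector.rows)
```

Each entry in halves then holds the header text, the pitcher line, the summary (None for this truncated sample) and one tuple per batter.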
I don’t even like baseball…
cron and PYTHONPATH
For obvious reasons Dreamhost won’t let users install arbitrary Python packages into the shared site-packages directory, so I keep them in another directory (~/pylib) and add it to my PYTHONPATH.
I was beating my head against the wall because module imports would fail when scripts were run as cron jobs. It turns out that cron runs them in a limited environment, so my custom environment variables, including PYTHONPATH, don’t get loaded.
There are several fixes, but the easiest is to set the variable from within the crontab file:
PYTHONPATH=/home/sam/pylib
0 0 * * * something.py > something.log
Hooray!
(I wonder if I’ll ever learn to read the manual first…)
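One of the other possible fixes, sketched here under the assumption that the packages really do live in ~/pylib, is to extend sys.path at the top of the script itself so that it no longer depends on the environment cron provides. The helper name add_private_lib is made up for the example.

```python
import os
import sys


def add_private_lib(path="~/pylib"):
    """Put a private package directory at the front of sys.path, once."""
    full = os.path.expanduser(path)
    if full not in sys.path:
        sys.path.insert(0, full)
    return full


add_private_lib()
# Imports of packages installed under ~/pylib can follow from here.
```

The crontab approach keeps the scripts untouched, while this one keeps the crontab untouched; which is tidier depends on how many scripts there are.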
WordPress 2.1 is incompatible with the previous version of Multiply.
I’ve already had some mail from people who upgraded WordPress and found no joy, but the plugin has now been updated and, I hope, fixed. That said, there are some things that prospective upgraders should be aware of.
Everything seems to be working on my local install, but I have not had time for proper testing, and won’t for the foreseeable future. I won’t be available at all until Monday at the earliest: if you try this upgrade, you’re on your own. (Back up first!)
If you’re game, install the latest version of the plugin and follow the instructions on upgrading.
It’s been said that one of the advantages of our new “Web 2.0” is that, unlike its predecessor, it is a golden land of feeds, APIs and other Web services.[1] Personally, I’ll believe that when I’m no longer subscribed to half a dozen pirate RSS feeds (that is, feeds generated by screen-scrapers) and don’t have the creation of half a dozen more on my to-do list.
Pity, though, because screen-scraping sucks: if the operators of these sites were on board with semantic markup, they’d probably be on board with RSS; scraping is often a frustrating exercise in attempting to extract meaningful data from a mass of nested tables, with unclosed tags, illegal characters and sundry other horrors.
That’s where Leonard Richardson’s Beautiful Soup comes in.
If it just cleaned up the input, it’d be a useful tool, but it doesn’t stop at normalization; Beautiful Soup makes it easy to actually extract the data you want while ignoring everything else. It’s not just useful, it’s fantastic.
By way of example, it’s fairly popular to write screen-scraping tools for the Internet Movie Database (e.g.: Python, Perl, PHP). There’s no API. A few years ago I wrote a scraper for the list of the top 250 films, a page which is pretty much par for the course as far as markup is concerned:
<tr bgcolor="#e5e5e5" valign="top"><td align="right"><font face="Arial, Helvetica, sans-serif" size="-1"><b>1.</b></font></td><td align="center"><font face="Arial, Helvetica, sans-serif" size="-1">9.1</font></td><td><font face="Arial, Helvetica, sans-serif" size="-1"><a href="/title/tt0068646/">The Godfather</a> (1972)</font></td><td align="right"><font face="Arial, Helvetica, sans-serif" size="-1">187,822</font></td></tr>
It’s not invalid or ill-formed, just ugly and not particularly useful. font, b, redundant inline CSS, and no helpful classes or ids to find elements by selector.
My parser looked something like this:
$page = fetch('http://imdb.com/chart/top');
preg_match_all("|<b>([0-9]+)\..+?([0-9]{1}\.[0-9]{1}).+?/title/(tt[0-9]+)/\">([^\(]+)[ ]\(([0-9]{4})|", $page, $matches);
for ($i=0; $i<250; $i++) {
    $rank = $matches[1][$i];
    $imdb_rating = $matches[2][$i];
    $id = $matches[3][$i];
    $title = $matches[4][$i];
    $year = $matches[5][$i];
    // do something with the data
}
As Jamie Zawinski put it, “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” That’s not always true, but it is here: that expression is a blob of difficult-to-follow logic, and will stop working if the page markup changes. It could be neatened up and commented (with the x modifier), but will never be readable.
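For a sense of what the neatened-up, commented version might look like, here is a sketch in Python using re.VERBOSE, the equivalent of the x modifier. It is a loose translation rather than the original PHP: the group names are mine, and the title group stops at the closing </a>, because the original's `[^\(]+` appears to swallow that tag along with the title.

```python
import re

# One row of the chart, as shown in the markup sample.
SAMPLE_ROW = (
    '<tr bgcolor="#e5e5e5" valign="top"><td align="right">'
    '<font face="Arial, Helvetica, sans-serif" size="-1"><b>1.</b></font></td>'
    '<td align="center"><font face="Arial, Helvetica, sans-serif" size="-1">9.1</font></td>'
    '<td><font face="Arial, Helvetica, sans-serif" size="-1">'
    '<a href="/title/tt0068646/">The Godfather</a> (1972)</font></td>'
    '<td align="right"><font face="Arial, Helvetica, sans-serif" size="-1">187,822</font></td></tr>'
)

ROW = re.compile(r"""
    <b>(?P<rank>\d+)\.</b>          # rank, e.g. "1."
    .+?                             # skip markup up to the rating
    >(?P<rating>\d\.\d)<            # rating, e.g. "9.1"
    .+?                             # skip markup up to the title link
    /title/(?P<imdb_id>tt\d+)/">    # IMDb id from the href
    (?P<title>[^<]+)</a>            # the link text is the title
    \s*\((?P<year>\d{4})\)          # year in parentheses
""", re.VERBOSE | re.DOTALL)


def parse_rows(html):
    """Return one dict of captured fields per chart row found in html."""
    return [match.groupdict() for match in ROW.finditer(html)]


films = parse_rows(SAMPLE_ROW)
```

It is easier to follow than the one-liner, but it is still tied to the exact markup, which is the real complaint.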
Compare to an implementation using Beautiful Soup:
import httplib2
from BeautifulSoup import BeautifulSoup
http = httplib2.Http(".cache")
resp, body = http.request("http://imdb.com/chart/top")
soup = BeautifulSoup(body)
# There is only one node with the text "Rank" and nothing
# else. It occurs in the first cell of the first row of the
# list table.
th_row = soup.find(text="Rank").findParent("tr")
# Each subsequent row is an entry in the list.
for td_row in th_row.findNextSiblings("tr"):
    rank, rating, title_data, votes = td_row.findAll("td")
    # Extract ID from href ("/title/tt{id}/")
    imdb_id = title_data.a["href"][9:-1]
    # Now get the two text nodes, title and year.
    title, year = title_data.findAll(text=True)
    title = title.strip()
    year = year.strip(" ()")
    # We only want the text from the other cells, not
    # the containing markup.
    rank, rating, votes = [node.find(text=True).strip(" .") for node in [rank, rating, votes]]
It’s not shorter, but it’s more natural. It would also have been possible to write it without even examining the page markup, because it uses visible text (the “Rank” column header) to locate the required data. A similar implementation could be created using XPath, though I’ve never seen a library that made it anywhere near as natural.
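One detail in the script that is still tied to the markup is the `[9:-1]` slice, which relies on the href being exactly "/title/tt{id}/". A slightly more forgiving alternative, sketched here with a hypothetical helper name, is to pull the digits out with a small regular expression:

```python
import re

# Matches the numeric part of an IMDb title link such as "/title/tt0068646/"
TITLE_HREF = re.compile(r"/title/tt(\d+)/")


def imdb_id_from_href(href):
    """Return the digits after "tt", or None if href isn't a title link."""
    match = TITLE_HREF.search(href)
    return match.group(1) if match else None
```

For the sample row this gives the same result as the slice, but it also copes with an absolute URL and returns None rather than garbage for an unrelated link.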
Beautiful Soup can do a lot more, however, and it doesn’t require markup to be well-formed. Even without the power provided by subclassing the parser, it’s incredibly flexible.
Anywhere a string is accepted, for example, you can pass in a regular expression object:
import re
# Match the cells with "1.", "2.", "3." etc.
rank_match = re.compile(r"^\d+\.$")
for rank in soup.findAll("td", text=rank_match):
    rating, title_data, votes = rank.findParent("td").findNextSiblings("td")
    # etc.
Or pass in a function that takes a Tag object:
# True if a list rank <tr>, else False
def check_this_tag(tag):
    # must be a row
    if tag.name != "tr":
        return False
    # with four children
    if len(tag) != 4:
        return False
    # the first child must contain "<b>{[0-9]+}.</b>"
    elif not re.match(r"\d+\.", tag.contents[0].b.string):
        return False
    # and so on
    else:
        return True
for tr in soup.findAll(check_this_tag):
    # etc.
I think I’m in love.
[1] No, not WS-*.
Almost had a heart attack:
This is just a notice that your DreamHost Account #XXXX ("XXXX's Account") has a balance of $1335.57 (including any charges not due until 2009-01-13), with $1335.57 due (since 2008-12-13).
Most people know two things about the Hays Code. One is that the bedrooms of all married couples could contain only twin beds, which had to be at least 27 inches apart. The other is that although the Code was written in 1930, it was not enforced until 1934, and that as a result, the "pre-Code cinema" of the early 1930s violated its rules with impunity in a series of "wildly unconventional films" that were "more unbridled, salacious, subversive, and just plain bizarre" than in any other period of Hollywood's history.
Neither of these things is true.
From his Believer interview with Nick Hornby:
My standard for verisimilitude is simple and I came to it when I started to write prose narrative: fuck the average reader. I was always told to write for the average reader in my newspaper life. The average reader, as they meant it, was some suburban white subscriber with two-point-whatever kids and three-point-whatever cars and a dog and a cat and lawn furniture. He knows nothing and he needs everything explained to him right away, so that exposition becomes this incredible, story-killing burden. Fuck him. Fuck him to hell.
Nice Win32 front-end for DVDAuthor, amongst others.
[A] restive contingent of our tribe is convinced that it can shed light on traditional philosophical problems by going out and gathering information about what people actually think and say about our thought experiments. [...]
It always irritated me in philosophy discussions when someone would seriously make an argument that something was "intuitively" true or false. There are already some interesting results:
Recently, a team of philosophers led by Machery came up with situations that had the same form as Kripke's and presented them to two groups of undergraduates — one in New Jersey and another in Hong Kong. The Americans, it turned out, were significantly more likely to give the responses that Kripke took to be obvious; the Chinese students had intuitions that were consonant with the older theory of reference.