ESPN Soup [rephrase.net]

Sunday, 04 March 2007

Someone emailed out of the blue after stumbling across my previous post on Beautiful Soup, wanting help extracting pitch-by-pitch baseball match data from the ESPN match-result pages. (For example, this game.) I thought: "why not?" If you're new to screen-scraping, it might be of interest to you too.

Note

Actually harvesting ESPN data in this way is against the site's terms of use, and republishing it, especially in a commercial context, would be like fresh blood in the water to their lawyers. It is not a course of action I recommend or endorse. It does, however, make a pretty good case-study in screen-scraping structured data out of markup not designed for that purpose. I believe that any use of ESPN materials in this post constitutes fair use/fair dealing for the purpose of research or study.

On with it

First, the setup: we need to get the page in question and parse it. httplib2 is a good choice for the fetching. For parsing, Beautiful Soup.

import httplib2
from BeautifulSoup import BeautifulSoup

url = "http://scores.espn.go.com/mlb/playbyplay?gameId=260726120&full=1"

http = httplib2.Http(".cache")
resp, body = http.request(url)
soup = BeautifulSoup(body)

What we're looking for is the pitch-by-pitch data for each half, i.e., everything from "San Francisco - Top of 1st" on down. The page markup looks like this:

<table cellpadding="3" cellspacing="1" class="tablehead">
    <tr style="background-color: #000000;" class="stathead">
        <td colspan=2>San Francisco - Top of 1st</td>
        <td colspan=2 align="center">SCORE</td>
    </tr>
    <tr class="colhead">
        <td colspan=2>Pedro Astacio pitching for Washington</td>
        <td align="center">SFO</td>
        <td align="center">WAS</td>
    </tr>
    <tr class="oddrow">
        <td nowrap>Randy Winn</td>
        <td>Ball, Strike (foul), Strike (foul), Ball, <span class="bi">R Winn singled to center</span></td>
        <td align="center">0</td>
        <td align="center">0</td>            
    </tr>

There are lots of ways we could go about finding this in the page tree. We could look for some specific text, then navigate from it to the table:

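# find(text=...) returns the text node itself, not the <td> that contains it
td = soup.find(text="San Francisco - Top of 1st")
table = td.findPrevious("table")
# or walk up the tree: text node -> <td> -> <tr> -> <table>
table = td.parent.parent.parent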

Obviously, that's not good enough if we want the script to work for a game where San Francisco wasn't playing. Just the "Top of 1st" part, then:

import re

expression = re.compile(r"Top of 1st$")
td = soup.find(text=expression)

Conveniently, though, that table is the only one on the page with a class of "tablehead", so we'll search by that instead.

table = soup.find("table", {"class":"tablehead"})

This method may be more fragile than searching by text. The problem with searching by attributes like "class" is that they're non-essential: when the page is redesigned, there's a reasonable chance that the class will change, be replaced by an id, and so on. The table itself might be replaced with an unordered list or a series of divs. It's only the essential data (like the team names and scores) that is guaranteed to stay.

We've already noticed that every half starts with a row containing unique text, like this:

San Francisco - Top of 1st

Or, more generally:

.*? - (Top|Bottom) of [0-9]+(st|nd|rd|th)
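
As a rough illustration of what that pattern buys us, here is a minimal sketch (Python 3, standard library only) that pulls the team, the half and the inning number out of a header line. The names HALF_HEADER and parse_half_header are made up for this example, and allowing more than one digit for the inning is my assumption about how extra innings would be labelled.

```python
import re

# Hypothetical pattern name. \d+ rather than a single digit is an assumption,
# so that a header such as "... - Top of 10th" would also match.
HALF_HEADER = re.compile(
    r"^(?P<team>.+?) - (?P<half>Top|Bottom) of (?P<inning>\d+)(?:st|nd|rd|th)$"
)


def parse_half_header(text):
    """Return (team, half, inning) for a header line, or None if it isn't one."""
    match = HALF_HEADER.match(text.strip())
    if match is None:
        return None
    return match.group("team"), match.group("half"), int(match.group("inning"))


print(parse_half_header("San Francisco - Top of 1st"))
print(parse_half_header("Pedro Astacio pitching for Washington"))
```

The first call should give the three parts of the header; the second is a pitcher row, so it returns None.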

Again, we could search by that text using a regular expression, but each row has a class of "stathead", so it's easier to search by that.

halves = table.findAll("tr", {"class":"stathead"})
for half in halves:
    pass  # process each half here

Each of the halves we've found is a row (<tr>) with two cells, the first of which contains the relevant text. We can select it in several ways:

text = half.contents[0].contents[0]
text = half.find("td").contents[0]
text = half.td.contents[0]
text = half.td.string
text = half.find(text=True)

I prefer the last, because it most accurately models the way I'm thinking about the problem: "find the text in this element". The others are more conscious of the specific markup involved.

halves = table.findAll("tr", {"class":"stathead"})
for half in halves:
    battingteam = half.find(text=True)

The score data is found in the rows following that one:

rows = half.findNextSiblings("tr")

# The pitcher is listed in the first row
pitcher = rows[0].find(text=True)

for row in rows[1:]:
    pass  # per-row handling is built up in the steps that follow

Unfortunately, the call to findNextSiblings returned many more rows than we want. All of the halves are in the same table, so asking for every following row is the same as asking for every row in the current half, as well as every row in every half after that. Ideally we would only select the rows that we want, but it's easier to get them all and just stop when we find one we don't like.

for row in rows[1:]:
    # If the class is "stathead", we're at the next half
    if row["class"] == "stathead":
        break
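
The same "take everything until the next header" idea can be written with itertools.takewhile. This is only a sketch in Python 3: the rows are plain (class, text) tuples standing in for the <tr> elements Beautiful Soup would return, the "Bottom of 1st" header and the row after it are made up, and rows_in_half is a hypothetical helper name.

```python
from itertools import takewhile

# Stand-ins for the rows that follow one "stathead" row: (class, first text).
# Illustrative data modelled on the markup above, not real parser output.
following_rows = [
    ("colhead", "Pedro Astacio pitching for Washington"),
    ("oddrow", "Randy Winn"),
    ("oddrow", "Omar Vizquel"),
    ("colhead", "0 Runs, 1 Hits, 0 Errors"),
    ("stathead", "Washington - Bottom of 1st"),  # made-up header: next half starts here
    ("oddrow", "a batter from the next half"),
]


def rows_in_half(rows):
    """Keep rows up to, but not including, the next "stathead" row."""
    return list(takewhile(lambda row: row[0] != "stathead", rows))


for css_class, text in rows_in_half(following_rows):
    print(css_class, "|", text)
```

Whether this reads better than a loop with a break is a matter of taste; the loop is easier to extend when each row type needs different handling.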

There are three more types of rows we'll have to deal with, and each one needs to be handled differently.

At the end of each half is a summary of runs, hits and errors:

# If the class is "colhead", it's showing the runs/hits/errors summary
elif row["class"] == "colhead":
    summary = row.find(text=True)

Some rows note events such as a change of pitchers or a pinch-hitter:

# If the first cell contains nothing but a space, the second cell has
# information about relief etc.
elif row.contents[0].string == "&nbsp;":
    secondcell = row.contents[1]
    # Ignore the markup, get all the text and join it together
    info = "".join(secondcell.findAll(text=True))

There's some extra markup there we could use to find out more about the data -- when the pitcher changes, for example, it's listed in bold green, but when the catcher changes the text is normal -- but that also makes it harder to extract the text. We can't just find the first text node: we have to recursively find all text nodes and join them together.
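
To show why the join is needed, here is a small sketch that does the same thing with nothing but Python 3's standard-library html.parser (so not Beautiful Soup itself). TextCollector and all_text are names invented for this example; the cell markup is the sample row from earlier in the post.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node, however deeply nested, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        self.chunks.append(data)


def all_text(markup):
    """Return the concatenation of all text nodes in a fragment of markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


cell = (
    "<td>Ball, Strike (foul), Strike (foul), Ball, "
    '<span class="bi">R Winn singled to center</span></td>'
)
print(all_text(cell))
```

The text arrives in two pieces, one before the <span> and one inside it, which is exactly why taking only the first text node would lose the result of the at-bat.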

The rest of the rows list, pitch-by-pitch, each batter's performance:

else:
    data = row.findAll("td")
    # First cell contains the batter's name
    batter = data[0].find(text=True)

    # Second cell lists pitch-by-pitch results.
    pitchbypitch = "".join(data[1].findAll(text=True))

    # Third and fourth cells list the current score
    team1score = data[2].find(text=True)
    team2score = data[3].find(text=True)

As with the previous type of row, the cells with pitch-by-pitch play data have some text in bold, some in bold green, some in bold red. If we were scraping for statistics, we could do some more interesting parsing to pull out stats on how players went out, how they scored, and so on, but that's something for another day.
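
Even without the markup, the joined string can be taken apart a little. The following is a hedged sketch in Python 3, not part of the script: split_pitches is a hypothetical helper, and it assumes that every pitch entry starts with "Ball" or "Strike" and that whatever follows the last pitch describes the outcome. That holds for the sample rows shown here, but I haven't checked it against other games.

```python
def split_pitches(pitchbypitch):
    """Split a joined cell string into (list of pitches, outcome of the at-bat)."""
    tokens = pitchbypitch.split(", ")
    pitches = []
    # Assumption: pitch entries always start with "Ball" or "Strike";
    # everything after the last such entry describes the outcome.
    while tokens and tokens[0].startswith(("Ball", "Strike")):
        pitches.append(tokens.pop(0))
    # The outcome can contain commas of its own, so glue it back together.
    return pitches, ", ".join(tokens)


pitches, outcome = split_pitches(
    "Strike (bunted foul), Ball, O Vizquel grounded into double play, "
    "third to second to first, R Winn out at second"
)
balls = sum(1 for pitch in pitches if pitch.startswith("Ball"))
strikes = sum(1 for pitch in pitches if pitch.startswith("Strike"))
print(pitches, balls, strikes)
print(outcome)
```

Splitting only at the front of the string matters because the outcome text has commas too; a plain split would scatter the double play across three entries.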

The complete script has a few print statements thrown in there, for output like this:

San Francisco - Top of 1st
 Pedro Astacio pitching for Washington:
  Randy Winn: Ball, Strike (foul), Strike (foul), Ball, R Winn singled to center (0/0)
  Omar Vizquel: Strike (bunted foul), Ball, O Vizquel grounded into double play, third to second to first, R Winn out at second (0/0)
  Shea Hillenbrand: Strike (looking), S Hillenbrand lined out to left (0/0)
-- 0 Runs, 1 Hits, 0 Errors --

I don't even like baseball...