Sam's infrequently-updated cabinet of curiosities
Friday, 27 April 2007

PyParsing, A Casual Introduction

PyParsing is a Python library for building recursive descent parsers. An example speaks a thousand words:

>>> from pyparsing import Word, OneOrMore
>>> words = Word("abcdefghijklmnopqrstuvwxyz")
>>> sentence = OneOrMore(words) + "."
>>> print sentence.parseString("i am a simple sentence.")
['i', 'am', 'a', 'simple', 'sentence', '.']

This document is a record of my first attempt at using PyParsing: we're going to be building a parser for Google-style search queries. It's not a hard project, but the syntax is deceptively simple. Basic search terms are trivial, but then we have to deal with quoted strings, boolean operators, special term-operators like site: and inurl:, parenthesised groups...

In approximately ascending order of difficulty, these examples will be the basis for a test suite; the parser will be considered complete when it can handle all of them.

>>> tests = [
... 'paris',
... 'love paris',
... 'paris -hate',
... '"Paris, je t\'aime"',
... 'love incity:paris',
... 'love incity:paris with:"pretty french girl"',
... 'love OR "just lust?"',
... '(love OR "just lust?") incity:paris',
... '(love OR "just lust?") (in:paris with:girl)',
... '(((love OR lust) in:paris) with:girl) OR "watch TV at home"',
... '((tv at:home alone) OR (in:paris with:girl)) -(difficult decision)'
... ]
>>> def test(parser):
...     for test in tests:
...         try:
...             print test, "->", parser.parseString(test)
...         except:
...             print "FAILED"
...             return
...     print "HOORAY!"

It won't be complete, but it'll do. The tests aren't automatic, because for that we'd need to decide up-front on the kind of data structure to parse the queries into, and it's easier to learn by diving in than by collecting requirements. We'll just adjust the output as necessary.

1. --

A basic query consists of one or more words. With apologies to most of the population of the world, we'll assume that a word can only contain the letters [a-zA-Z].

>>> from pyparsing import *
>>> term = Word(alphas)
>>> terms = OneOrMore(term) + StringEnd()

>>> test(terms)
paris -> ['paris']
love paris -> ['love', 'paris']
paris -hate -> FAILED

Already a failure! The parser can't handle the minus sign. The easiest fix is to just add "-" to the list of allowed characters:

>>> term = Word(alphas + "-")

2. --

Negative terms have different semantics, so it's better to represent them in the grammar with a different rule. It's easier to read, and we can get the parser to trigger different actions for different types of token.

>>> simple = Word(alphas)
>>> negative = "-" + simple
>>> terms = OneOrMore(simple | negative) + StringEnd()

>>> test(terms)
paris -> ['paris']
love paris -> ['love', 'paris']
paris -hate -> ['paris', '-', 'hate']
"Paris, je t'aime" -> FAILED

3. --

The quoted string was a disaster waiting to happen, what with all those spaces and commas and things. PyParsing offers a couple of solutions, so we'll use the built-in dblQuotedString.

>>> single = dblQuotedString | Word(alphas)

The quotes aren't technically part of the search term, so we'll use a parser action to strip them. Parser actions are just functions (or classes, or anything else callable like a function), and they can be added inline. These two examples are effectively the same:

quoted = dblQuotedString
quoted.setParseAction(lambda s,l,t: [t[1:-1]])
single = quoted | Word(alphas)

single = dblQuotedString.setParseAction(removeQuotes) | Word(alphas)

4. --

Moving on, it's obvious that we're going to get an error from the colon in incity:paris, too, so it needs its own rule. Combining with the double-quote fix, we get:

>>> single = dblQuotedString.setParseAction(removeQuotes) | Word(alphas)
>>> special = Word(alphas) + ":" + single
>>> negative = "-" + (special | single)
>>> terms = OneOrMore(special | single | negative)

>>> print terms.parseString("love incity:paris")
['love', 'incity', ':', 'paris']
>>> print terms.parseString('love incity:paris with:"pretty french girl"')
['love', 'incity', ':', 'paris', 'with', ':', 'pretty french girl']

It does work, but not very well. There are three tokens for every special term, when we really only want one. We could do something fancy here, like return a different structure (a class instance?) as the token, and validate the term at the same time:

allowed_operators = ["incity", "in", "with"]
def parse_special(string_, string_pos, tokens):
    operator, colon, term = tokens
    if operator not in allowed_operators:
        raise IllegalQueryError
    return SpecialQueryTerm(term, [operator])


But that's a little too complicated, since we have neither a SpecialQueryTerm class nor an IllegalQueryError exception. PyParsing has some built-in functionality we can use, like Combine, which combines the tokens into one, and Group, which groups them into a list. Let's try Combine.

>>> special = Combine(Word(alphas) + ":" + single)
>>> negative = "-" + (special | single)
>>> terms = OneOrMore(special | single | negative) + StringEnd()

>>> print terms.parseString('love incity:paris with:"pretty french girl"')
['love', 'incity:paris', 'with:pretty french girl']

Our tests again:

>>> test(terms)
paris -> ['paris']
love paris -> ['love', 'paris']
paris -hate -> ['paris', '-', 'hate']
"Paris, je t'aime" -> ["Paris, je t'aime"]
love incity:paris -> ['love', 'incity:paris']
love incity:paris with:"pretty french girl" -> ['love', 'incity:paris', 'with:pretty french girl']
love OR "just lust?" -> ['love', 'OR', 'just lust?']
(love OR "just lust?") incity:paris -> FAILED

5. --

The critical failure was caused by the parentheses, but there's a subtler problem with the first boolean test. It seems to work, but "OR" is being matched as though it were just another word, as in the hypothetical query "OR OR AND AND AND". We should fix that.

>>> operator = CaselessLiteral("or") | CaselessLiteral("and")
>>> terms = single + ZeroOrMore(Optional(operator) + single)
>>> query = OneOrMore(terms) + StringEnd()

6. --

A single level of parentheses couldn't be simpler. We could use a QuotedString, with the two brackets the "quotes", but this works too:

>>> parenthetical = Suppress("(") + Group(terms) + Suppress(")")

Suppress means "match these tokens, but don't add them to the output".

Unfortunately, for nested parentheses we need recursion. After the extreme suffering I was put through the last time I wrote a recursive parser, I was expecting that statement to be met with an ominous crash of thunder. As it turns out, PyParsing makes recursion absurdly easy too. We just use an instance of the special class Forward() as a placeholder, then set its value with the << operator.

This is about the simplest possible recursive parser for nested parentheses:

terms = Forward()
parenthesis = "(" + terms + ")"
terms << OneOrMore(Word(alphas) | parenthesis)

Ours is only slightly more complex:

>>> terms = Forward()
>>> parenthetical = Suppress("(") + Group(terms) + Suppress(")")
>>> negative = "-" + (special | single | parenthetical)
>>> term = special | single | parenthetical | negative
>>> parser = terms << term + ZeroOrMore(Optional(operator) + terms)

>>> test(parser)
paris -> ['paris']
love paris -> ['love', 'paris']
paris -hate -> ['paris', '-', 'hate']
"Paris, je t'aime" -> ["Paris, je t'aime"]
love incity:paris -> ['love', 'incity:paris']
love incity:paris with:"pretty french girl" -> ['love', 'incity:paris', 'with:pretty french girl']
love OR "just lust?" -> ['love', 'or', 'just lust?']
(love OR "just lust?") incity:paris -> [['love', 'or', 'just lust?'], 'incity:paris']
(love OR "just lust?") (in:paris with:girl) -> [['love', 'or', 'just lust?'], ['in:paris', 'with:girl']]
(((love OR lust) in:paris) with:girl) OR "watch TV at home" -> [[[['love', 'or', 'lust'], 'in:paris'], 'with:girl'], 'or', 'watch TV at home']
((tv at:home alone) OR (in:paris with:girl)) -(difficult decision) -> [[['tv', 'at:home', 'alone'], 'or', ['in:paris', 'with:girl']], '-', ['difficult', 'decision']]

7. --

Hooray! Now to combine it all:

from pyparsing import *

# simple tokens
simple = Word(alphas)
quoted = dblQuotedString.setParseAction(removeQuotes)
single = simple | quoted
special = Combine(Word(alphas) + ":" + single)

# negative terms
negative = "-" + (special | single | parenthetical)

# boolean operators
operator = CaselessLiteral("or") | CaselessLiteral("and")

# recursive parenthetical groups
terms = Forward()
parenthetical = Suppress("(") + Group(terms) + Suppress(")")

# bring it all together!
term = special | single | parenthetical | negative 
terms << term + ZeroOrMore(Optional(operator) + terms)
query = OneOrMore(terms) + StringEnd()

It's still not done, but it'll do. It's left as an exercise for the reader to add support for syntax like "inurl:(parenthesised terms)", though the beauty of a parser like this is that it's a very, very easy to do so.

It needs to support a large fraction of Unicode. The numbers 0-9 would be a good place to start. It should work around bad input, not die horribly. It should have an explicit order of operations: is "a or b and c" the same as "(a or b) and c" or "a or (b and c)"? And, of course, the hard bit is taking the data from the query and actually doing something useful with it.

I have no need whatsoever for such a parser, but it was fun to write this much of it: PyParsing is a great tool. At the very least, it's useful as an infinitely more readable replacement for complex regular expressions.

8. Other Resources