pyparsing, a Python module for parsing text using a context-free grammar, has been updated with the release of version 1.2 in June, 2004, and version 1.2.1 this past week.
pyparsing's approach to defining grammars differs from the conventional lex/yacc approach. No external file using BNF or regex-style syntax is required. pyparsing's parse grammars are defined right in the Python parse code itself, using a library of parsing construction classes to compose the grammar. pyparsing includes such classes as:
- Literal and CaselessLiteral
Also, pyparsing is written in pure Python, simplifying its inclusion in other Python projects.
pyparsing uses Python's operator overloading to simplify the definition of conjunction and alternation constructs. For example, a parser for "Hello, World!" might look like:
from pyparsing import Word, alphas, oneOf
greeting = Word( alphas ) + "," + Word( alphas ) + oneOf("! ? .")
To actually parse the string "Hello, World!", one calls the parseString method on greeting:
hello = "Hello, World!"
print hello, "->", greeting.parseString( hello )
This returns the results in the default list form:
Hello, World! -> ['Hello', ',', 'World', '!']
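The greeting grammar above uses only the + operator (conjunction). As a minimal sketch of alternation, the following combines + with the | operator. The names integer, identifier, operand, and assignment are made up for illustration, and the code assumes Python 3 with a current pyparsing release installed rather than the 1.2 series described here.

```python
from pyparsing import Word, alphas, nums

# Illustrative grammar elements (not part of pyparsing itself)
integer = Word(nums)
identifier = Word(alphas)

# '|' builds an alternation: try integer first, then identifier
operand = integer | identifier

# '+' builds a conjunction: each piece must match in sequence
assignment = identifier + "=" + operand

numeric_tokens = assignment.parseString("x = 42").asList()
name_tokens = assignment.parseString("y = x").asList()
print(numeric_tokens)
print(name_tokens)
```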
The results returned from parseString can be accessed as a simple list of tokens, as a hierarchical parse tree, as a dictionary, or as an object with named attributes (depending on the options and constructs specified in the grammar). As a new feature in 1.2, hierarchical parse results can now be reported in XML format; this feature enables one to easily write scripts to extract data from text data files and create representative XML with a single method call!
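To make the different access styles concrete, here is a small sketch that attaches results names to the greeting grammar. The names salutation, addressee, and punctuation are chosen for illustration only, and the code assumes Python 3 with a current pyparsing release, where asList() and asDict() are available on the returned results.

```python
from pyparsing import Word, alphas, oneOf

# Same greeting grammar, with a results name on each interesting piece
greeting = (
    Word(alphas).setResultsName("salutation")
    + ","
    + Word(alphas).setResultsName("addressee")
    + oneOf("! ? .").setResultsName("punctuation")
)

result = greeting.parseString("Hello, World!")

tokens = result.asList()           # plain list of tokens
addressee = result.addressee       # attribute-style access
salutation = result["salutation"]  # dictionary-style access
named = result.asDict()            # only the named tokens, as a dict

print(tokens, addressee, salutation, named)
```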
Since version 1.2, pyparsing also supports parsing modes for scanning a string for matches, or for transforming a string (provided that transformation actions have been defined for specific grammar elements).
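A rough sketch of the two modes follows, using the scanString and transformString methods as they exist in current pyparsing releases. The sample text and the doubling action are invented for illustration, and the code assumes Python 3.

```python
from pyparsing import Word, nums

text = "order 12 ships 3 boxes"

# Scanning: find every integer anywhere in the text.
# scanString yields (tokens, start, end) for each match.
integer = Word(nums)
found = [(tokens[0], text[start:end])
         for tokens, start, end in integer.scanString(text)]

# Transforming: a parse action supplies the replacement text,
# and everything that does not match is copied through unchanged.
doubler = Word(nums).setParseAction(lambda t: str(int(t[0]) * 2))
transformed = doubler.transformString(text)

print(found)
print(transformed)
```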
Other features introduced in 1.2 and 1.2.1 are:
- Added SkipTo(expression) token type, simplifying grammars that only
want to specify delimiting expressions, and want to match any characters
- Added helper method dictOf(key,value), making it easier to work with
the Dict class, and structure the returned tokens as a Python dictionary.
- Added optional argument listAllMatches (default=False) to
setResultsName(). Setting listAllMatches to True overrides the default
modal setting of tokens to results names; instead, the results name
acts as an accumulator for all matching tokens within the local
- Added definition for htmlComment to help support HTML scanning and
- Added getName() method to ParseResults. This method is helpful when
a grammar specifies ZeroOrMore or OneOrMore of a MatchFirst or Or
expression, and the parsing code needs to know which expression matched.
- Added items() and values() methods to ParseResults, to better support
using ParseResults as a dictionary.
- Added parseFile() as a convenience function to parse the contents of an
entire text file. Accepts either a file name or a file object.
- Performance improvements of 20-50%!
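A few of the features listed above can be sketched briefly. The grammars and input strings below are invented for illustration, and the code assumes Python 3 with a current pyparsing release; behavior in the 1.2 series may differ in detail.

```python
from pyparsing import Literal, OneOrMore, SkipTo, Suppress, Word, alphas, dictOf, nums

# SkipTo: take everything up to a delimiter without describing it
field = SkipTo(Literal(";")).setResultsName("value") + Literal(";")
field_value = field.parseString("alpha beta;").value

# dictOf: build dictionary-style results from key and value expressions
key = Word(alphas) + Suppress("=")
value = Word(nums)
settings = dictOf(key, value).parseString("width=80 height=24")
width = settings["width"]
setting_names = sorted(settings.keys())

# listAllMatches: by default a results name keeps only the last match...
last_only = OneOrMore(Word(nums).setResultsName("num")).parseString("1 2 3")
last_num = last_only["num"]

# ...while listAllMatches=True accumulates every match under that name
accumulating = OneOrMore(
    Word(nums).setResultsName("num", listAllMatches=True)
).parseString("1 2 3")
all_nums = list(accumulating["num"])

print(field_value, width, setting_names, last_num, all_nums)
```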
pyparsing also comes with a directory of sample programs. Some new examples include:
- A Delphi Form parser, dfmparse.py, plus a couple of
sample Delphi forms as tests
- An EBNF parser, including a demo where it parses its own
- Extended fourFn.py to support exponentiation, and simple built-in
- A beautiful example for parsing Mozilla calendar files
- urlExtractor.py, an example of using scanString and parse actions to
extract web links and their target URLs from a given HTML web page
Visit pyparsing's project home page at http://pyparsing.sourceforge.net. Please let me know if you find pyparsing helpful in your text processing efforts.
(I'd like to thank Pavel Volkovitskiy, Wilson Fowlie, Rick Walia, Brad Clements, Dang Griffith, Seo Sanghyeon, Petri Savolainen, Eric van der Vlist, and Amaury Le Leyzour for their testing, suggestions, feedback, and contributions to these latest pyparsing releases!)
-- Paul McGuire