Parser-tools

Here are notes on various parsing tools that might help develop specifications or implementations of sweet-expressions.

Overall

We want an LL-based tool, to help check recursive descent, enable us to create a good spec, and possibly develop an implementation.

A list of parser generators is here: http://en.wikipedia.org/wiki/Comparison_of_parser_generators

Requirements:

  • LL... or recursive descent. LL(k) preferred.
  • OSS, GPL-compatible
  • EBNF, so can say x* instead of multiple productions

Hints:

  • Maintained/Widely-used
  • Gens multiple languages (to avoid baked-in stuff)

Below are a list of tools, followed by other notes.

ANTLR - capable, BSD, LL(*)

ANTLR is probably one of the most popular LL-based parser generators today.

As a practical matter must buy book ($24 for v3 electrons, inc. PDF). Extremely popular, active development, lively support. Many generated languages, including Javascript (for v3). On Fedora, install "antlr3" (just "antlr" installs v2) - very convenient!

The current version as of 2012 is version 3 (v3). ANTLR v4 due out Dec. 2012, v4 book draft out. But ANTLR v4 is radically different; it accepts far more than LL (which is probably a good thing for many users, but not what we want for our purposes). Version 4 is Java-only for now, and looks like documentation is really sketchy for v4. It might be best to stick with v3 for now.

Notation looks really clean.

http://antlr.org/

   Example (v3):
    expr    : term ( ( PLUS | MINUS )  term )* ;
    term    : factor ( ( MULT | DIV ) factor )* ;
    factor  : NUMBER ;

   Example (v4) (it appears it has some LL left-recursion extensions):
    grammar Expr;       
    prog:   (expr NEWLINE)* ;
    expr:   expr ('*'|'/') expr
        |   expr ('+'|'-') expr
    |   INT
    |   '(' expr ')'
    ;

Here are some potential ANTLR pointers:

Sadly, changing modes in a lexer isn't so easy in most LL tools.

Some old ANTLR v2 docs: http://www.antlr2.org/doc/lexer.html http://ftp.camk.edu.pl/camk/chris/antlrman/antlrman.pdf

Coco/R

capable: Attributed grammar, LL(k). GPL (with exceptions, no prob).
Different programs for different generated languages (ugh,
but not relevant?) Coco/R EXAMPLE:

CompilationUnit
= [ "package" Qualident ';' ]
  { ImportDeclaration }
  { TypeDeclaration }
           (. CODE HERE .)
.

http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/

Grammatica

"C# and Java parser generator (compiler compiler). It improves upon simlar tools (like yacc and ANTLR) by creating well-commented and readable source code, by having automatic error recovery and detailed error messages, and by support for testing and debugging grammars without generating source code."
"Grammatica supports LL(k) grammars with an unlimited number of look-ahead tokens." GNU LGPL. Example:

ImportList = "IMPORTS" SymbolsFromModule* ";" ;

http://grammatica.percederberg.net/

JavaCC

Claims "most popular" for Java. Originally by Sun.
Originally wasn't OSS, now it is.

Slightly ugly notation; productions have type declarations and look
like Java. Reasonable if you're doing Java of course, but
not the point in this case. You enter ".jj" files. Example:
void enumerator() :
{}
{
<ID> ("=" constant_expression())?
}

APG

Capable, GPL, web version. Lots of capabilities. Can gen Javascript,
and that's a big plus.
Ugh: Uses prefix repetition notation *(...), a BNF notation
I (wheeler) really hate (it's not what the textbooks do,
so it's painful to read).

Parsec

Haskell's Parsec library combined with Haskell's
Monad Transformer library allows using the same syntax for both INDENT
/ DEDENT / SAME -guarded parsers, and basic parsers (like n-expr) that
don't have INDENT/DEDENT/SAME, and allowing the second to be used
inside the first.

Wheeler agrees that having a parser with significant indent-processing capabilities would be a big plus. But Wheeler didn't include Parsec for several reasons, all based on the fact that Parsec is totally tied to Haskell:

  • I have serious notation concerns. We want to create a spec that will be read by others as part of the SRFI. ANTLR's notation is really excellent, it looks "just like the books". APG's is nasty. When I look at the Parsec example here: http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_Hours/Parsing it's clear that the Parsec notation is basically... Haskell. (Well, what a surprise :-) ). I like Parsec's notation better than APG's, but Parsec's notation is a REALLY different notation than "usual" BNFs and is not at all "what the books use".
  • A lot of people don't grok Haskell, and it's certainly not my strongest language either. I particularly worry that we'll need to handle certain cases specially if we want to seriously implement the spec with the tool, and at that point I'll end up throwing up my hands. I'm confident of my ability to fiddle with ANTLR and its ilk, but not with Parsec/Haskell. There are ports of Parsec, but they're tied to their languages too, and it's not clear that the ports are as widely used/supported.
  • It'd also be nice to be able to generate Javascript, so that we could have it working directly on the website. Parsec can't do that, again since it's tied to Haskell. That's not as important as the other issues.

Rejected tools

AXE - Baked into C++.

HiLexed - looks unfinished.

SableCC - LALR.

Random web pages and notes

The traditional way to parse indentation-sensitive languages is based
on Python's approach, inserting INDENT and DEDENT tokens as appropriate.

http://erezsh.wordpress.com/2008/07/12/python-parsing-1-lexing/
- gives pseudocode and shows a C# ANTLR implementation

http://stackoverflow.com/questions/232682/how-would-you-go-about-implementing-off-side-rule
- Inserts LINE (newline) as first stage (with indents), then adds
INDENT/DEDENT while keeping LINE. Says it has better results

http://www.antlr.org/pipermail/antlr-interest/2012-June/044736.html
- ANTLR indent/dedent

https://github.com/sirthias/parboiled/wiki/Indentation-Based-Grammars
- Parboiled builds in indentation processing

http://mail.python.org/pipermail/tutor/2009-January/066395.html
- Email discussion about parsing, with INDENT/DEDENT

http://stackoverflow.com/questions/1513020/how-do-i-do-python-style-indent-dedent-tokens-with-alex-haskell
- Discussion about implementing indentation with Alex & Haskell.
Asserts that most people do this in the lexer.

http://antlr.1301665.n2.nabble.com/emitting-multiple-tokens-INDENT-DEDENT-td7578225.html
- Sample indentation processing with ANTLR

https://github.com/antlr/examples-v3/blob/master/java/python/PythonTokenSource.java
- ANTLR source Python tokenizer

http://mobilesystemsengineering.appspot.com/staticdocs/AdvancedMobileSystemsEngineering/Solution1.txt
- ANTLR repairs for Python; may be OBE

http://www.jarvana.com/jarvana/view/org/python/jython/2.5.0/jython-2.5.0-sources.jar!/org/python/antlr/PythonTokenSource.java?format=ok

https://github.com/bamboo/boo/tree/master/src

http://www.jarvana.com/jarvana/view/org/python/jython/2.5.0/jython-2.5.0-sources.jar!/org/python/antlr/PythonTokenSource.java?format=ok

http://shaurz.wordpress.com/2008/03/11/haskell-style-parser-combinators-in-scheme/
- Haskell-style parser combinators in Scheme. "We can define a parser as a procedure which takes an input string and an index (i.e. the index of the current character on the input). If the parse succeeds, it returns a value and an index ≥ the original index to the unconsumed input. On failure it returns #f for the both the value and index (the index is checked for failure and the value is ignored)."
- Links to more.

http://weblambda.blogspot.com/2009/10/building-parser-combinators-in-scheme.html
- Comments on How to build a parser combinator in Scheme.

http://stackoverflow.com/questions/5507665/are-there-any-ll-parser-generators-for-functional-languages-such-as-haskell-or-s

Several sources talk about sllgen, a simple LL(1) tool in Scheme
used in many courses per "Essentials of Programming Languages"

http://www.ccs.neu.edu/course/cs7400/lectures/lecture02.pdf
- Simple example use of sllgen

http://docs.racket-lang.org/eopl/index.html
- Talks about "Essentials of Programming Languages" language in DrRacket,
cross-refs sllgen stuff.

http://www.cs.indiana.edu/eopl/code/interps/sllgen.scm
- sllgen source code. Scheme LL(1) parser generator

http://3e8.org/zb/eggs/lexing.txt
"Notes on Scheme lexers and parsers"
- A text file of someone else's research on Scheme parsers.
Talks about sllgen, essence, packrat.

http://www.softpanorama.org/Algorithms/Compilers/recursive_descent_parsing.shtml
- One of the many pages on recursive descent.


Related

Wiki: Join