Here are notes on various parsing tools that might help develop specifications or implementations of sweet-expressions.
We want an LL-based tool, to help check recursive descent, enable us to create a good spec, and possibly develop an implementation.
A list of parser generators is here: http://en.wikipedia.org/wiki/Comparison_of_parser_generators
Requirements:
Hints:
Below are a list of tools, followed by other notes.
ANTLR is probably one of the most popular LL-based parser generators today.
As a practical matter must buy book ($24 for v3 electrons, inc. PDF). Extremely popular, active development, lively support. Many generated languages, including Javascript (for v3). On Fedora, install "antlr3" (just "antlr" installs v2) - very convenient!
The current version as of 2012 is version 3 (v3). ANTLR v4 due out Dec. 2012, v4 book draft out. But ANTLR v4 is radically different; it accepts far more than LL (which is probably a good thing for many users, but not what we want for our purposes). Version 4 is Java-only for now, and looks like documentation is really sketchy for v4. It might be best to stick with v3 for now.
Notation looks really clean.
Example (v3):
expr : term ( ( PLUS | MINUS ) term )* ;
term : factor ( ( MULT | DIV ) factor )* ;
factor : NUMBER ;
Example (v4) (it appears it has some LL left-recursion extensions):
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
| '(' expr ')'
;
Here are some potential ANTLR pointers:
Sadly, changing modes in a lexer isn't so easy in most LL tools.
Some old ANTLR v2 docs: http://www.antlr2.org/doc/lexer.html http://ftp.camk.edu.pl/camk/chris/antlrman/antlrman.pdf
capable: Attributed grammar, LL(k). GPL (with exceptions, no prob).
Different programs for different generated languages (ugh,
but not relevant?) Coco/R EXAMPLE:
CompilationUnit
= [ "package" Qualident ';' ]
{ ImportDeclaration }
{ TypeDeclaration }
(. CODE HERE .)
.
http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/
"C# and Java parser generator (compiler compiler). It improves upon simlar tools (like yacc and ANTLR) by creating well-commented and readable source code, by having automatic error recovery and detailed error messages, and by support for testing and debugging grammars without generating source code."
"Grammatica supports LL(k) grammars with an unlimited number of look-ahead tokens." GNU LGPL. Example:
ImportList = "IMPORTS" SymbolsFromModule* ";" ;
http://grammatica.percederberg.net/
Claims "most popular" for Java. Originally by Sun.
Originally wasn't OSS, now it is.
Slightly ugly notation; productions have type declarations and look
like Java. Reasonable if you're doing Java of course, but
not the point in this case. You enter ".jj" files. Example:
void enumerator() :
{}
{
<id> ("=" constant_expression())?
}</id>
Capable, GPL, web version. Lots of capabilities. Can gen Javascript,
and that's a big plus.
Ugh: Uses prefix repetition notation *(...), a BNF notation
I (wheeler) really hate (it's not what the textbooks do,
so it's painful to read).
Haskell's Parsec library combined with Haskell's
Monad Transformer library allows using the same syntax for both INDENT
/ DEDENT / SAME -guarded parsers, and basic parsers (like n-expr) that
don't have INDENT/DEDENT/SAME, and allowing the second to be used
inside the first.
Wheeler agrees that having a parser with significant indent-processing capabilities would be a big plus. But Wheeler didn't include Parsec for several reasons, all based on the fact that Parsec is totally tied to Haskell:
AXE - Baked into C++.
HiLexed - looks unfinished.
SableCC - LALR.
The traditional way to parse indentation-sensitive languages is based
on Python's approach, inserting INDENT and DEDENT tokens as appropriate.
http://erezsh.wordpress.com/2008/07/12/python-parsing-1-lexing/
http://stackoverflow.com/questions/232682/how-would-you-go-about-implementing-off-side-rule
http://www.antlr.org/pipermail/antlr-interest/2012-June/044736.html
https://github.com/sirthias/parboiled/wiki/Indentation-Based-Grammars
http://mail.python.org/pipermail/tutor/2009-January/066395.html
http://antlr.1301665.n2.nabble.com/emitting-multiple-tokens-INDENT-DEDENT-td7578225.html
https://github.com/antlr/examples-v3/blob/master/java/python/PythonTokenSource.java
https://github.com/bamboo/boo/tree/master/src
http://shaurz.wordpress.com/2008/03/11/haskell-style-parser-combinators-in-scheme/
http://weblambda.blogspot.com/2009/10/building-parser-combinators-in-scheme.html
Several sources talk about sllgen, a simple LL(1) tool in Scheme
used in many courses per "Essentials of Programming Languages"
http://www.ccs.neu.edu/course/cs7400/lectures/lecture02.pdf
http://docs.racket-lang.org/eopl/index.html
http://www.cs.indiana.edu/eopl/code/interps/sllgen.scm
http://3e8.org/zb/eggs/lexing.txt
"Notes on Scheme lexers and parsers"
http://www.softpanorama.org/Algorithms/Compilers/recursive_descent_parsing.shtml