The main reason for the changes in this patch is that the lexicon entry in the Grammar() is both unneccessary and inconsequently used. Also, chart parsing is optimized considerably.
The lexicon is only used when parsing with the Earley parser, and it's not even necessary for the algorithm. By removing the lexicon, much of the code can be simplified.
This has consequences for the following files: cfg.py, parse/chart.py, draw/chart.py, parse/featurechart.py. (Only minor changes in the last two). Attached is a zip archive with patch files and the updated python sources.
Affected existing bugs and patches:
- Patch #2001163 is included (bug in Nonterminal.__cmp__).
- Patch #1999576 is included (agenda-based parsing).
- A corrected version of patch #1999613 is included (Bottom-Up Combine prediction).
- This patch solves bug #1997700.
****************************************
* cfg.py: cfg.patch
In addition to the necessary changes to remove the lexicon, I have rewritten the code for loading grammars. CFG, PCFG and FCFG all now use the same two functions (parse_production and parse_grammar), which are generalized versions of the previous functions for loading FCFG grammars. This reduces the code size considerably, but also allows for directives (e.g., the 'start' directive) in CFGs and PCFGs.
Another small addition: Grammar.productions() is compacted by replacing if-then-else's by dict.get().
* parse/chart.py: parse-chart.patch
Patch #1999576 is included (agenda-based parsing), which makes parsing a LOT faster:
- ChartParser.__init__ takes a new argument (use_agenda) which is True by default. But it only works if NUM_EDGES =< 1 for all rules.
- to be able to work with an agenda, the Earley CompleterRule has to be applied to both complete and incomplete edges, which makes it equal to the SingleEdgeFundamentalRule. Thus the Completer is just a subclass of SingleEdge.
- ChartParser has a new method chart_parse(), which does the parsing and returns the final Chart. This is called by nbest_parse(). chart_parse() can e.g. be used if the chart needs some massaging before the trees can be extracted.
Furthermore, by removing the lexicon, we no longer need an argument when initializing the Earley ScannerRule. Thus makeEarleyStrategy() is no longer needed. Instead there's a new constant EARLEY_STRATEGY.
Some improvements unrelated to the lexicon issue:
- Chart.insert() now takes any number of child_pointer_lists; FundamentalRule therefore only has to call insert() once for each new_edge.
- BottomUpInitRule was duplicated - is no longer.
- A new bottom-up algorithm which applies predict directly followed by combine (called Bottom-up Kilbury, e.g., by Wiren (1992)). The relevant predict rule is called BottomUpPredictCombineRule, and the strategy is called BUC_STRATEGY. The new strategy is included in the demo() - it's about 2 times faster than BU_STRATEGY, which in turn is 4 times faster than TD_STRATEGY (on the ATIS grammar). Note: this is a corrected version of patch #1999613.
- A small fix to SteppingChartParser, where the default strategy is [] and not None. Otherwise there was a TypeError when calling set_strategy.
* draw/chart.py: draw-chart.patch
Only minor changes. The ScannerRule can (and must) now be used in the (pseudo) Earley algorithm, instead of TopdownMatch.
* parse/featurechart.py: parse-featurechart.patch
Only minor changes. EarleyChartParser._check_lexicon_coverage() can be removed - Grammar.check_coverage() suffices.
Patches and new Python files