Welcome to PyLangParser's wiki!

Parse C source code from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/

Parse SQL scripts from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/

Parse GTK-Doc style comments from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/
(parses function name, arguments, annotations and return value)

Parse food recipes from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/

pylangparser - Simple language parsing from Python.
The project provides classes for parsing formal languages in an easy way.
It uses no external libraries, only the standard modules unittest, re and pprint.
There is a Lexer and a Parser class. The lexer produces a list of tokens that the
parser then uses to build the AST. The lexer can also be used as a standalone
component. Customized ASTs are supported as well.
Grammars are defined directly in Python code.

In the examples folder you will find both simple example scripts demonstrating
basic usage of the parser and some more useful and complex ones. For example,
there is a script for parsing C source code and building and iterating the AST.
An SQL parser will be added soon too.

Note: Documentation is not fully complete yet. Existing APIs can still change.

Feel free to send suggestions, comments and patches.

Example usage of the Parser:

The test defines a simple calculator language, ABCMATH, and demonstrates how programs written
in that language are parsed.

#!/usr/bin/python
from pylangparser import *

# define all tokens in the language
IF = Keyword(r'if')

KEYWORDS = IF

PLUS = Operator(r'+')
MINUS = Operator(r'-')
ASSIGNMENT = Operator(r'=')
SEMICOLON = Operator(r';')
EQ = Operator(r'==')
LE = Operator(r'<')  # note: matches '<' (less than), despite the name
GT = Operator(r'>')
LPAR = Operator(r'(')
RPAR = Operator(r')')

# order is important: the first operator that matches is taken, so a longer
# operator must be listed before any operator that is its prefix ('==' before '=')
OPERATORS = EQ & PLUS & MINUS & ASSIGNMENT & LE & GT & SEMICOLON & \
    LPAR & RPAR

IGNORE_CHARS = Ignore(r'[ \t\v\f\n]+')

COMMENTS = Ignore(r'\#.*\n')

IDENTIFIER = Symbols(r'[A-Za-z_]+[A-Za-z0-9_]*')

CONSTANT = Symbols(r'[0-9]+')

TOKENS = KEYWORDS & OPERATORS & CONSTANT & IDENTIFIER & \
    COMMENTS & IGNORE_CHARS

# certain tokens should be left out of the AST
IgnoreTokensInAST(SEMICOLON & LPAR & RPAR)

# define our grammar

arthm_operator = \
    OperatorParser(PLUS) | \
    OperatorParser(MINUS)

comp_operator = \
    OperatorParser(LE) | \
    OperatorParser(GT) | \
    OperatorParser(EQ)

operand = \
    SymbolsParser(CONSTANT) | \
    SymbolsParser(IDENTIFIER)

arthm_expression = \
    SymbolsParser(IDENTIFIER) & \
    OperatorParser(ASSIGNMENT) & \
    (operand << Optional(arthm_operator << operand)) & \
    OperatorParser(SEMICOLON)

condition = \
    operand << \
    comp_operator << \
    operand

# if_statement and statement have a circular dependency, which is why
# we have to use RecursiveParser
statement = RecursiveParser()

if_statement = \
    KeywordParser(IF) & \
    OperatorParser(LPAR) & \
    condition & \
    OperatorParser(RPAR) & \
    statement

# notice the usage of the '+=' operator below
statement += \
    if_statement | arthm_expression

# use AllTokensConsumed so that the parser parses the
# complete source
program = AllTokensConsumed(ZeroOrMore(statement))

# our source code
source = """

# example program written in ABCMATH

p = 12;

if (p == 12)
  if (p == 5)
    p = 3 + 2;

"""

# obtain list of tokens present in the source
lexer = Lexer(TOKENS)
tokens = lexer.parseTokens(source)
print(tokens)

# build AST
result = program(tokens, 0)
result.pretty_print()

When the program is run, it will output the following tree:

[[['p'], ['='], ['12']],
 [['if'],
  [['p'], ['=='], ['12']],
  [['if'], [['p'], ['=='], ['5']], [['p'], ['='], [['3'], ['+'], ['2']]]]]]

The tree can be reorganized so that it is easier to interpret, for example by placing each operator before its operands.
Let's modify our code a bit.

First we modify the arthm_expression parser:

def update_arthm_expression(result):
    token = result.get_token()

    if len(token) == 3:
        # p = 1
        # ('p', '=', '1') or ('p', '=', ('3', '+', '2'))
        (lo, op, ro) = token
        if not ro.is_basic_token():
            ro = update_arthm_expression(ro)
        token = (op, lo, ro)

    result.set_token(token)
    return result

arthm_expression = \
    CustomizeResult(SymbolsParser(IDENTIFIER) & \
    OperatorParser(ASSIGNMENT) & \
    operand & \
    Optional(arthm_operator & operand) & \
    OperatorParser(SEMICOLON), update_arthm_expression)

And then the if_statement parser:

def update_condition(result):
    # p == 1
    # ('p', '==', '1')
    token = result.get_token()
    (lo, op, ro) = token
    result.set_token((op, lo, ro))
    return result

if_statement = \
    KeywordParser(IF) & \
    OperatorParser(LPAR) & \
    CustomizeResult(condition, update_condition) & \
    OperatorParser(RPAR) & \
    statement

The result tree will look a bit different now:

[[['='], ['p'], ['12']],
 [['if'],
  [['=='], ['p'], ['12']],
  [['if'], [['=='], ['p'], ['5']], [['='], ['p'], [['+'], ['3'], ['2']]]]]]
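
The reordering performed by the two update functions can be tried in isolation on plain Python lists. The sketch below is only an illustration: to_prefix is a hypothetical helper, it works on nested lists rather than on ParserResult objects, and it assumes that every three-element list is a binary expression written as left operand, operator, right operand.

```python
def to_prefix(node):
    """Recursively rewrite [left, op, right] lists as [op, left, right].

    Hypothetical helper for illustration; it is not part of pylangparser.
    """
    if isinstance(node, list) and len(node) == 3:
        left, op, right = node
        # operands may themselves be binary expressions, so recurse into them
        return [op, to_prefix(left), to_prefix(right)]
    # anything else (a plain token) is returned unchanged
    return node


infix = ['p', '=', ['3', '+', '2']]
prefix = to_prefix(infix)
print(prefix)  # ['=', 'p', ['+', '3', '2']]
```

With the operator in first position, code that walks the tree can dispatch on element 0 of each group before looking at the operands.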

Always use CheckErrors or AllTokensConsumed as the top-level parser in order
to get relevant information about parse errors, such as the row and column reported here:

Traceback (most recent call last):
  File "simple_calc_language.py", line 103, in <module>
    result = program(tokens, 0)
  File "../pylangparser.py", line 915, in __call__
    "Unknown symbol: %s" % tokens[i].get_token())
pylangparser.ParseException: row: 7, column: 7,
    message: Unknown symbol: (

List of supported Tokens:

Keyword
Symbols
Operator
Ignore

If case-insensitive matching is desired when parsing tokens, set the ignorecase constructor argument when creating Token instances:

IF = Keyword(r'if', ignorecase=True)
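
The ordering rule from the calculator example (a longer operator listed before its prefix) and case-insensitive matching can both be reproduced with the standard re module. The following is a minimal sketch, not pylangparser's Lexer: tokenize is a hypothetical function, and mapping ignorecase to re.IGNORECASE is an assumption about how such an option could work.

```python
import re


def tokenize(source, patterns, ignorecase=False):
    """Split source into tokens; the first pattern that matches wins.

    Hypothetical first-match lexer for illustration only.
    """
    flags = re.IGNORECASE if ignorecase else 0
    compiled = [re.compile(p, flags) for p in patterns]
    tokens, pos = [], 0
    while pos < len(source):
        for regex in compiled:
            match = regex.match(source, pos)
            if match:
                tokens.append(match.group())
                pos = match.end()
                break
        else:
            # no pattern matched at this offset
            raise ValueError("Unknown symbol at offset %d" % pos)
    return tokens


# '==' listed before '=' versus the other way round
good_order = [re.escape('=='), re.escape('=')]
bad_order = [re.escape('='), re.escape('==')]

print(tokenize('===', good_order))  # ['==', '=']
print(tokenize('===', bad_order))   # ['=', '=', '=']
print(tokenize('IF', ['if'], ignorecase=True))  # ['IF']
```

In the second call '==' is never produced, because '=' always matches first.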

List of supported Parsers:

KeywordParser
OperatorParser
SymbolsParser
Optional
ZeroOrMore
Repeat
AllTokensConsumed
RecursiveParser
IgnoreResult
CustomizeResult
CheckErrors
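
RecursiveParser in the list above is what the calculator example uses to break the circular dependency between statement and if_statement. The general pattern (declare a placeholder, use it in other rules, fill it in later with '+=') can be sketched with the standard library alone. Everything in this sketch is hypothetical: Forward, literal, sequence and choice are illustrative stand-ins and do not reflect how pylangparser is implemented.

```python
class Forward:
    """Placeholder parser whose real definition is supplied later via '+='."""

    def __init__(self):
        self.target = None

    def __iadd__(self, parser):
        self.target = parser
        return self  # keep the same object, so earlier references stay valid

    def __call__(self, tokens, i):
        return self.target(tokens, i)


def literal(text):
    """Match exactly one token equal to text."""
    def parse(tokens, i):
        if i < len(tokens) and tokens[i] == text:
            return text, i + 1
        return None
    return parse


def sequence(*parsers):
    """Match all parsers one after another; fail if any of them fails."""
    def parse(tokens, i):
        results = []
        for parser in parsers:
            matched = parser(tokens, i)
            if matched is None:
                return None
            value, i = matched
            results.append(value)
        return results, i
    return parse


def choice(*parsers):
    """Return the result of the first parser that matches."""
    def parse(tokens, i):
        for parser in parsers:
            matched = parser(tokens, i)
            if matched is not None:
                return matched
        return None
    return parse


# statement := 'if' statement | 'x'
statement = Forward()
if_statement = sequence(literal('if'), statement)
statement += choice(if_statement, literal('x'))

tree, consumed = statement(['if', 'if', 'x'], 0)
print(tree)  # ['if', ['if', 'x']]
```

Because '+=' returns the same placeholder object, the reference that if_statement already holds sees the completed definition.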

Parsers can be combined using the following operators: |, & and <<

p1 & p2
and
p1 << p2

mean almost the same thing, but there is a small difference in how the results are grouped. To illustrate it, let's take variable declaration parsing in C as an example:

int a, b, c, d;

The grammar may look like:

additional_declarator_with_modifier = \
            OperatorParser(COMMA) & declarator_with_modifier

variable_declaration = \
        (type_specifier & declarator_with_modifier << \
            ZeroOrMore(additional_declarator_with_modifier) & \
            OperatorParser(SEMICOLON))

or:

additional_declarator_with_modifier = \
            OperatorParser(COMMA) & declarator_with_modifier

variable_declaration = \
        (type_specifier & declarator_with_modifier & \
            ZeroOrMore(additional_declarator_with_modifier) & \
            OperatorParser(SEMICOLON))

The AST in the two cases is, respectively:

['int'], [['a'], ['b'], ['c'], ['d']]

and

['int'], [['a'], [['b'], ['c'], ['d']]]
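
Assuming the two ASTs above are listed in the same order as the two grammars, << places the additional declarators in the same group as the first one, while & keeps them as a nested sub-group. The difference, and one way of normalizing the nested form, can be sketched on plain lists. join_flat, join_nested and flatten are hypothetical helpers for illustration and are not part of pylangparser.

```python
def join_flat(first, rest):
    """Put the first item and the remaining items into one group."""
    return [first] + list(rest)


def join_nested(first, rest):
    """Keep the remaining items as a sub-group next to the first item."""
    return [first, list(rest)]


def flatten(group):
    """Collapse nested sub-groups into a single flat list of tokens."""
    flat = []
    for item in group:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


declarators_flat = join_flat('a', ['b', 'c', 'd'])
declarators_nested = join_nested('a', ['b', 'c', 'd'])

print(declarators_flat)             # ['a', 'b', 'c', 'd']
print(declarators_nested)           # ['a', ['b', 'c', 'd']]
print(flatten(declarators_nested))  # ['a', 'b', 'c', 'd']
```

The flat shape is convenient when every declarator is handled the same way; the nested shape keeps the first declarator distinguishable from the ones that follow a comma.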

Iterating the AST:

The result of applying a parser combination to some input is a ParserResult.
A ParserResult may contain a simple token, another ParserResult or a tuple of ParserResult objects.
A ParserResult can be traversed using the get_sub_group(index) function, indexing or iteration. Indexes start from 1; index 0 refers to the whole tree.

result = parser(tokens, 0)

sub_group = result.get_sub_group(1)
sub_group.pretty_print()

Or

sub_group = result[1]
sub_group.pretty_print()

Or

for sub_group in result:
    sub_group.pretty_print()
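
The three access styles can be combined into a recursive walk over the whole tree. Since the snippets above need a parsed result to run, the sketch below uses a hypothetical stand-in class, Group, that only mimics the conventions described here (1-based sub-groups, index 0 for the whole tree). Its attributes and the leaves helper are assumptions made for illustration and are not the ParserResult API.

```python
class Group:
    """Hypothetical stand-in for a parse result; not pylangparser's class."""

    def __init__(self, token=None, children=()):
        self.token = token
        self.children = list(children)

    def get_sub_group(self, index):
        # index 0 means the whole tree, sub-groups are numbered from 1
        if index == 0:
            return self
        return self.children[index - 1]

    def __getitem__(self, index):
        return self.get_sub_group(index)

    def __iter__(self):
        return iter(self.children)


def leaves(group):
    """Collect the tokens of all leaf groups, depth first."""
    if not group.children:
        return [group.token]
    found = []
    for sub_group in group:
        found.extend(leaves(sub_group))
    return found


# stand-in for the tree of "p = 3 + 2;"
expression = Group(children=[Group('3'), Group('+'), Group('2')])
tree = Group(children=[Group('p'), Group('='), expression])

print(tree[1].token)                # p
print(tree.get_sub_group(0) is tree)  # True
print(leaves(tree))                 # ['p', '=', '3', '+', '2']
```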

To check whether a given group/sub-group is the result of applying a particular parser, use the check_parser(parser) and check_parser_instance(parser_class) functions:

result = program(tokens, 0)
sub_group = result.get_sub_group(1)
if sub_group.check_parser(if_statement):
    print("this is an if-statement")

For more detailed info check the source code and the c_parser.py example.

Each group/sub-group can be pretty-printed with the pretty_print() function:

result.pretty_print()
sub_group.pretty_print()

You can download and try the examples: https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/

Project Members:

Ognyan Tonchev (admin)
