Parse C source code from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/
Parse SQL scripts from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/
Parse GTK-Doc style comments from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/
(parses function name, arguments, annotations and return value)
Parse food recipes from Python https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/
pylangparser - Simple language parsing from Python.
The project provides classes for parsing formal languages in an easy way.
It uses no external libraries, only the standard unittest, re and pprint modules.
There is a Lexer and a Parser class. The lexer produces a list of tokens that the
parser then uses to build the AST. The lexer can also be used as a standalone
component. Customized ASTs are supported.
Grammars are defined directly in the Python code.
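To make the lexer-then-parser split concrete, here is a minimal regex-based tokenizer sketch built only on the re module. It does not use pylangparser and is not its implementation; TOKEN_SPEC, MASTER and tokenize are hypothetical names chosen for illustration, and the token types merely echo the categories listed further down.

```python
import re

# Hypothetical token specification: (token type, regular expression).
# Order matters: the keyword pattern must be tried before the generic one.
TOKEN_SPEC = [
    ("KEYWORD", r"\bif\b"),
    ("SYMBOLS", r"[A-Za-z_]\w*|\d+"),
    ("OPERATOR", r"==|[=+\-*/();{}]"),
    ("IGNORE", r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (type, text) tuples, skipping ignored input."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError("Unknown symbol at offset %d: %r" % (pos, source[pos]))
        if match.lastgroup != "IGNORE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("if (p == 12) p = 3 + 2;")
```

A parser would then walk this flat token list and group the tokens into a tree.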
In the examples folder you will find both simple example scripts demonstrating
basic usage of the parser and some more useful and complex ones. For example,
there is a script for parsing C source code and building and iterating the AST.
An SQL parser will be added soon too.
Note: Documentation is not fully complete yet. Existing APIs can still change.
Feel free to send suggestions, comments and patches.
The test defines a simple calculator language, MATABC, and demonstrates how programs
written in that language are parsed.
When the program is run, it will output the following tree:
[[['p'], ['='], ['12']],
[['if'],
[['p'], ['=='], ['12']],
[['if'], [['p'], ['=='], ['5']], [['p'], ['='], [['3'], ['+'], ['2']]]]]]
The tree can be reorganized so that it is easier to interpret, for example by
moving each operator in front of its operands. Let's modify our code a bit.
First we modify the arthm_expression parser:
def update_arthm_expression(result):
    token = result.get_token()
    if len(token) == 3:
        # p = 1
        # ('p', '=', '1') or ('p', '=', ('3', '+', '2'))
        (lo, op, ro) = token
        if not ro.is_basic_token():
            ro = update_arthm_expression(ro)
        token = (op, lo, ro)
        result.set_token(token)
    return result
arthm_expression = \
    CustomizeResult(SymbolsParser(IDENTIFIER) & \
                    OperatorParser(ASSIGNMENT) & \
                    operand & \
                    Optional(arthm_operator & operand) & \
                    OperatorParser(SEMICOLON), update_arthm_expression)
And then the if_statement parser:
def update_condition(result):
    # p == 1
    # ('p', '==', '1')
    token = result.get_token()
    (lo, op, ro) = token
    result.set_token((op, lo, ro))
    return result
if_statement = \
    KeywordParser(IF) & \
    OperatorParser(LPAR) & \
    CustomizeResult(condition, update_condition) & \
    OperatorParser(RPAR) & \
    statement
The result tree will look a bit different now:
[[['='], ['p'], ['12']],
[['if'],
[['=='], ['p'], ['12']],
[['if'], [['=='], ['p'], ['5']], [['='], ['p'], [['+'], ['3'], ['2']]]]]]
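The same reordering can be tried out on plain nested lists shaped like the printed trees. This is only a sketch: it does not touch pylangparser result objects, and is_leaf, to_prefix and the OPERATORS set are hypothetical helpers introduced here.

```python
# Assumed operator set for the calculator language (illustrative only).
OPERATORS = {"=", "==", "+", "-", "*", "/"}


def is_leaf(node):
    # A leaf is printed as a one-element list holding the token text.
    return len(node) == 1 and isinstance(node[0], str)


def to_prefix(node):
    """Move the operator of every (left, op, right) triple to the front."""
    if is_leaf(node):
        return node
    children = [to_prefix(child) for child in node]
    if len(children) == 3 and is_leaf(children[1]) and children[1][0] in OPERATORS:
        left, op, right = children
        return [op, left, right]
    return children


# The first tree shown above, written as a Python literal.
tree = [[['p'], ['='], ['12']],
        [['if'],
         [['p'], ['=='], ['12']],
         [['if'], [['p'], ['=='], ['5']], [['p'], ['='], [['3'], ['+'], ['2']]]]]]

prefix_tree = to_prefix(tree)
```

With the operator first, an interpreter can dispatch on the first element of each group and treat the remaining elements as arguments.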
Always use CheckErrors or AllTokensConsumed as the top-level parser in order
to get relevant information about parse errors:
Traceback (most recent call last):
  File "simple_calc_language.py", line 103, in <module>
    result = program(tokens, 0)
  File "../pylangparser.py", line 915, in __call__
    "Unknown symbol: %s" % tokens[i].get_token())
pylangparser.ParseException: row: 7, column: 7,
message: Unknown symbol: (
List of supported Tokens:
Keyword
Symbols
Operator
Ignore
If case-insensitive matching is desired when parsing tokens, set the ignorecase constructor argument when creating Token instances:
IF = Keyword(r'if', ignorecase=True)
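What case-insensitive matching means for a keyword pattern can be shown with the re module alone. Whether pylangparser maps ignorecase onto re.IGNORECASE internally is an assumption; make_keyword_matcher is a hypothetical helper, not part of the library.

```python
import re


def make_keyword_matcher(pattern, ignorecase=False):
    """Return a function that tests whether a text is exactly the keyword."""
    flags = re.IGNORECASE if ignorecase else 0
    compiled = re.compile(pattern, flags)
    return lambda text: compiled.fullmatch(text) is not None


strict_if = make_keyword_matcher(r"if")
relaxed_if = make_keyword_matcher(r"if", ignorecase=True)

# strict_if accepts only "if"; relaxed_if also accepts "IF", "If" and "iF".
```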
List of supported Parsers:
KeywordParser
OperatorParser
SymbolsParser
Optional
ZeroOrMore
Repeat
AllTokensConsumed
RecursiveParser
IgnoreResult
CustomizeResult
CheckErrors
Parsers can be combined using the following operators: |, & and <<
p1 & p2
and
p1 << p2
mean almost the same thing, but there is a small difference. To illustrate it, let's take variable declaration parsing in C as an example:
int a, b, c, d;
The grammar may look like:
additional_declarator_with_modifier = \
OperatorParser(COMMA) & declarator_with_modifier
variable_declaration = \
(type_specifier & declarator_with_modifier << \
ZeroOrMore(additional_declarator_with_modifier) & \
OperatorParser(SEMICOLON))
or:
additional_declarator_with_modifier = \
OperatorParser(COMMA) & declarator_with_modifier
variable_declaration = \
(type_specifier & declarator_with_modifier & \
ZeroOrMore(additional_declarator_with_modifier) & \
OperatorParser(SEMICOLON))
And the AST in both cases:
['int'], [['a'], ['b'], ['c'], ['d']]
and
['int'], [['a'], [['b'], ['c'], ['d']]]
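The two shapes differ only in whether the repeated declarators are spliced into the same group as the first one or kept as a nested group of their own. The sketch below mimics both shapes with plain lists; it does not use pylangparser, and combine_flat and combine_nested are hypothetical names.

```python
def combine_flat(first, rest):
    """Splice the repeated items into the same group as the first item."""
    return [first] + rest


def combine_nested(first, rest):
    """Keep the repeated items as a separate group next to the first item."""
    return [first, rest]


first = ['a']
rest = [['b'], ['c'], ['d']]

flat = combine_flat(first, rest)
nested = combine_nested(first, rest)

# The flat form can be walked with one loop; the nested form needs
# one extra level of descent to reach 'b', 'c' and 'd'.
flat_names = [leaf[0] for leaf in flat]
nested_names = [nested[0][0]] + [leaf[0] for leaf in nested[1]]
```

Both forms carry the same declarators, so the choice mainly affects how the code that walks the AST is written.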
The result of applying a parser combination to some input is a ParserResult.
A ParserResult may contain a simple token, another ParserResult or a tuple of ParserResults.
A ParserResult can be traversed using the get_sub_group(index) function, indexing or iteration. Indexes start from 1; index 0 refers to the whole tree.
result = parser(tokens, 0)
sub_group = result.get_sub_group(1)
sub_group.pretty_print()
Or
sub_group = result[1]
sub_group.pretty_print()
Or
for sub_group in result:
    sub_group.pretty_print()
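The 1-based numbering is easy to get wrong, so here is a toy stand-in that follows the same convention. GroupSketch is hypothetical and is not pylangparser's ParserResult; it only models the indexing rule described above.

```python
class GroupSketch:
    """Toy container: index 0 is the whole group, sub-groups start at 1."""

    def __init__(self, children):
        self.children = list(children)

    def get_sub_group(self, index):
        if index == 0:
            return self
        if index < 0:
            raise IndexError("sub-group indexes start from 1")
        # Raises IndexError when index is past the last sub-group.
        return self.children[index - 1]

    def __getitem__(self, index):
        return self.get_sub_group(index)

    def __iter__(self):
        return iter(self.children)


group = GroupSketch(["assignment", "if-statement"])
first = group.get_sub_group(1)   # same element as group[1]
whole = group[0]                 # the group itself
names = [child for child in group]
```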
To check if a given group/sub-group is a result of applying a particular parser use the check_parser(parser) and check_parser_instance(parser_class) functions:
result = program(tokens, 0)
sub_group = result.get_sub_group(1)
if sub_group.check_parser(if_statement):
    print("this is an if-statement")
For more detailed info check the source code and the c_parser.py example.
Each group/sub-group can be pretty-printed with the pretty_print() function:
result.pretty_print()
sub_group.pretty_print()
You can download and try the examples: https://sourceforge.net/p/pylangparser/code/ci/master/tree/examples/