[Pyparsing] text grammar -- first rough
From: spir <den...@fr...> - 2008-11-18 17:52:57
purpose
* lighter grammar to manipulate at design time
* standard & easily legible format to communicate
* conversion at design- or run-time -- adaptable

design goals
* clear clear clear
* adapted to pyParsing's way
* limited, simple, not full-featured

contents:
  vocabulary
  organisation
  expressions
  actions
  features

=== v o c a b u l a r y ==============================

This list is intended to help identify elements, factors and views on the topic of a text grammar for pyParsing -- to help, and not to mess up, things, notions and words. Feel free to change, add, redefine. Unsupported features are not listed here.

literal
    Literal string to be taken as is. Usually in quotes or apostrophes.
set choice
    Set used as a choice list of character literals. Expressed like a (regexp-like) range.
alternative choice
    Expression of a choice between alternative possibilities. In pyParsing MatchFirst='|'. Question: support of longest match=Or='^'?
row
    A sequence of items that must appear sequentially for a pattern to match. Expressed with ',' in BNF, with '+' in pyParsing code. Question: support of '&'=Each?
quantifier
    Operator used to define the allowed number of repetitions of a (sub-)expression. Usually '?', '*', '+'. Possible support of literal numbers for exact length.
item
    Any part of a pattern expression: literal, token, range, or a (sub-)group. May be quantified, named, grouped.
group
    Grouping of (sub-)expressions to allow use of quantity, naming, or any other overall action, on the whole group. Usually expressed as '(...)'.
token
    (Sub-)pattern expressed for the sake of clarity or to avoid repetition; appears in the expression of a (super-)pattern. Can be opposed to 'product'.
product
    Pattern result relevant for the application, either as a final result item or for further processing. Often contains tokens. Usually packed, concatenated, and/or mutated. May be automatically named/typed.
pack
    ~ pyParsing Group(). Expression of a product as a sequence of tokens.
concatenation
    ~ pyParsing Combine(). Expression of a product as a concatenation of tokens.
mutation
    ~ pyParsing parseAction which returns a value. Expression of a product as a transformation of a raw parse result, using a built-in or custom function.
task
    ~ pyParsing parseAction with no return value. A task is an action executed when a match is found, but it does not change the result.
id, name
    Identifier of a pattern. Will become the attribute 'nature' of its products.
use, use case
    Use of a (sub-)pattern in a specific situation inside a super-pattern, defined with a custom name. Will become the attribute 'role' of its products.

=== o r g a n i s a t i o n =============================

For the sake of clarity, useless symbols that do not actually express anything will be avoided. E.g. the standard (E)BNF '::=', the ',' separator, the final ';' and '[...]' for options will not be used. These characters may still be used in an expressive function. We need:
* pattern name: left side
* pattern expression: middle side
* pattern action(s): right side

As the name will become a code identifier, it will hold no whitespace. There is thus no need for a (visible) separator with the expression; whitespace is enough. Still, using e.g. ':' may help legibility. We need a separator for the action part of a line. It can be the same as the first separator, if any, or another one. I would propose as the simplest solution to use ':' for both separations (i.e. BNF '::=' reduced to ':'), which leads to

name    : expression          : action
---------------------------------------------
integer : [0-9]+              : int
decimal : integer '.' integer : float
number  : decimal | integer   : "num"
add     : {number '+' number} : "add"
calc    : add+

where:
* an action without quotes is a function call
* an action with quotes is pattern naming (--> result nature)
* {} may be packing = Group-ing

pattern : name ':' expression (':' action*)
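For illustration, here is a rough sketch of what hand-written pyParsing code for the example grammar above might look like. This is only my reading of the table, not a fixed translation scheme; the int/float conversions and the result names are taken straight from the action column.

from pyparsing import Word, Combine, Group, OneOrMore, nums

# integer : [0-9]+ : int                 -- mutation: convert the match to a python int
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))

# decimal : integer '.' integer : float  -- concatenate, then convert to a python float
decimal = Combine(Word(nums) + '.' + Word(nums))
decimal.setParseAction(lambda t: float(t[0]))

# number : decimal | integer : "num"     -- naming: the result nature is "num"
number = (decimal | integer)("num")

# add : {number '+' number} : "add"      -- {} read as packing = Group-ing
add = Group(number + '+' + number)("add")

# calc : add+
calc = OneOrMore(add)

print(calc.parseString("1.5 + 2 3 + 4"))   # -> [[1.5, '+', 2], [3, '+', 4]]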
=== e x p r e s s i o n =========================

Here the pattern expression elements are described -- and also expressed using the in-progress specification of pyParsing's text grammar itself ;-)

=== literal

A literal is expressed in quotes. We may choose only apostrophes (simple quotes) for this, which allows use of double quotes for another function -- such as identifying use cases, or concatenation. Backslash is used for escaping single chars which happen to be grammar codes -- first of all, backslash itself. It is also used to identify characters by ordinals (code numbers). I would rather support only decimal & hex (with 'x') ordinals, and have these ordinals be of fixed length, resp. 3 & 2 digits -- left-padded with '0'. This is more legible & simpler, esp. when a number follows an ordinal. (Below dd means decimal digit, hd means hex digit.)

char    : char_literal | '\\' dd dd dd | '\\x' hd hd
literal : '\'' char+ '\''

=== set/choice

A set is written inside []; it may contain both characters and ranges. Unlike pyParsing's srange() function, it basically expresses a *choice*, with no further need of Or, MatchFirst, or oneOf. For instance, [A-Z] is more or less equivalent to MatchFirst(list(srange("[A-Z]"))) or Word(srange("[A-Z]"), exact=1). A set can be negated with a negation code: I would support the use of '!' for this role. See quantifiers below for the expression of Word(set). Ranges are expressed as usual with a pair of characters separated by '-'. A range can also be named using a pyParsing term (like 'nums' (*)). Characters can be identified either literally, or using their ordinal (code number) preceded by the escape char.

range : (char '-' char) | range_name
set   : '!'? '[' (range | char)+ ']'

(*) Ranges could be renamed! E.g. nums --> digit or dec_digit. Also, no plural: a set, which holds ranges, already expresses a choice.
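As a quick check of the equivalence claimed above, the one-character choice really is just Word(..., exact=1) in today's pyParsing; the second half of this snippet is only a guess at how a negated set '![...]' might translate (here: CharsNotIn), since the proposal does not pin that down.

from pyparsing import Word, CharsNotIn, srange

# [A-Z]  -- a set used as a one-character choice
upper = Word(srange("[A-Z]"), exact=1)
print(upper.parseString("Quick"))      # -> ['Q']

# ![0-9] -- assumed reading of a negated set, mapped to CharsNotIn
not_digit = CharsNotIn(srange("[0-9]"), exact=1)
print(not_digit.parseString("x1"))     # -> ['x']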
=== token

A token inside another pattern is identified by its name, with no further character.

token : identifier

=== group

A group of tokens is built with (...). This allows use of quantities, role identifying, maybe more.

group : '(' expression ')'

=== item row

Items that must appear in a row are simply written one after another. A space is a sufficient separator; no use of ',' or '+' is needed. I would support that a space separates items in a row even when not necessary for parsing: this seems both simpler and clearer.

row : item (' ' item)+

=== quantifier

todo ******************

=== a c t i o n s ====================================

Actually, naming, packing and concatenation, as well as what is called parse actions, all can be seen as parse actions, in the sense that they are additional jobs done on, or with, a result -- once this result has been distinguished. Right? This applies both to actions on results of full patterns and to ones executed on sub-expressions. About the latter case, I wonder if including this feature in a text grammar is really useful -- except for the naming of *roles*. Reasons are:
~ It may rarely be used. (?)
~ It complicates the grammar.
~ The programmer can define a sub-pattern to apply the action on.
~ The action can be added once the text grammar is converted into code.

If needed, the syntax should use the same codes as the ones for whole-pattern actions, so that these codes allow wrapping a sub-expression. I will leave this aside for now, only keeping this requirement in mind, and concentrate on actions on whole patterns. As all these kinds of parsing actions apply to results, they all fit well on the right side of a line. This has the side advantage of avoiding expression overload. Examples:

integer : [0-9]+ : int "int"
    --> convert integer to a python int; set its 'nature' attribute to "int"
decimal : integer '.' integer : <> float
    --> concatenate decimal, convert it to a python float
mult : num '*' num : {}
    --> pack mult into a pyParsing Group (sequence)

I find the organisation really convenient, and the choice of codes rather easy to use. An alternative would be to use () for packing=Group-ing, which would leave {} free for Dict-ing. But then it would not be possible anymore to pack=Group() sub-expressions (which I personally do not find a serious drawback -- but others may). Naming of pattern use cases, i.e. result roles, happens rather naturally inside expressions:

decimal : integer "l_int" '.' integer "r_int" : <> float

Now, there remains a possible issue: actually, there are 2 kinds of parse actions:
* Tasks (think of Pascal procedures) execute an additional job when a matching string is found. Example: count the number of integers, add them, print them. We could call this kind of action a 'task'. Naming can be seen as a kind of task.
* Mutations return a transformed result. Very different, I guess. Grouping and combining, even if they do not really change the result's content, should rather be seen as mutations, as they change at least its type.

The point is that both kinds of actions are implemented the same way. That is the reason why they are, logically enough, both called 'actions'. But I guess they obviously do not serve the same kind of purpose. They have a very different sense, meaning a different semantic from the point of view of the programmer. Tasks can be seen as parsing side-effects. Mutations also mean that the potential results of the pattern are relevant for the application: they are products. As a consequence, I would prefer expressing tasks and mutations differently. This would also allow distinguishing relevant results. We may prefix mutations with e.g. '->', which is imo quite obvious:

integer : [0-9]+ : ->int "int"
decimal : integer '.' integer : <> ->float

Now, I have no clue which code would be sensible & meaningful for tasks, if any.
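To make the task/mutation distinction concrete on the pyParsing side: both are attached with setParseAction()/addParseAction(), and only the return value tells them apart. A minimal sketch -- the counter is just an invented example of a side-effect task, not anything the text grammar would generate:

from pyparsing import Word, nums

integer = Word(nums)

# mutation: the parse action returns a value, so the result is transformed
integer.setParseAction(lambda t: int(t[0]))

# task: the parse action returns nothing, the result is left untouched
seen = []                        # hypothetical side effect: collect matched integers
def count_integer(tokens):
    seen.append(tokens[0])

integer.addParseAction(count_integer)

print(integer.parseString("42"))   # -> [42]
print(seen)                        # -> [42]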
=== n a m i n g , t y p i n g ====================================

When a result is generated by a whole pattern, the pattern's name, if any, defines the result's type -- or more precisely its *nature*. When results are relevant, they should be able to hold their nature by default. When a (sub-)pattern is used inside one or more other (super-)patterns, it can express that its results have several possible *roles*. Below, 'integer' and 'decimal' define product natures. I used setName() for this.

integer = Word(nums).setName("integer")
decimal = Combine(integer("int_part") + '.' + integer("dec_part")).setName("decimal")
num = decimal | integer
mult = num("left_num") + '*' + num("right_num")

Depending on the situation, the integer pattern can be used as the int_part or dec_part of a decimal. These terms thus define product roles instead. Moreover, both integers and decimals can be nums, and as such be used as the left or right operand of a multiplication, which are different roles again.

I find all of this relevant. I would really like that both kinds of pattern naming / result typing be possible. The present implementation of pyParsing is such that achieving this goal is not straightforward -- but still possible. We could do it so:
* The programmer selects which kinds of results are relevant: switches in the text grammar can define which results should then know about their type (nature and/or role).
* Patterns know 'who' they are: write the code grammar in a scope (class or separate module), use the scope's dict to set an 'id' attribute on each pattern. No need to name whole patterns manually, as their name is anyway the variable name. Still, keep it possible to use another name? Rather call this 'id' than 'name', which is used for debugging.
* Patterns can define a use case: implemented with setResultsName("use") or simply the ("use") pseudo-call. We can copy it to a 'use' attribute to avoid confusion.
* Results, or only selected ones, hold a reference to their source pattern: either with a parse action that adds a dict item, or as an argument at the result's initialisation.
* Selected results read relevant info from the pattern: either with an additional parse action, or at initialisation when they have the proper reference. Storage can be done as an attribute or a dict item.

=== f e a t u r e s ===================================

List of non-basic, yet unsupported, features to think about. Note that a few additional codes or syntactic forms may well compromise the grammar's overall readability. Furthermore, individual operations can easily be added by hand once the text grammar is translated to pyParsing code. Not to forget is the main goal of clarity, which fits well with decomposition of complex expressions. As a consequence, here is a list of criteria to consider for supporting a feature:
--> very common need (how to measure that?)
--> difficult to express by combining existing ones
--> obvious format that fits well in the grammar

operations
* '!' or '~' not (overall negation)
* '&'=Each() (unordered And)
* '^'=Or() (longest match)
* FollowedBy() (look-ahead - right side condition)
* '~'=NotAny() (!FollowedBy())
* SkipTo() (as the name says)
* Forward --> recursive patterns: useless in a text grammar?

tokens
* Keyword()
* White()
* QuotedString()
* CharsNotIn()
* overall CaseLess()
* Regexp helpers
* nestedExpr()
* delimitedList()
* OnlyOnce()

actions
* Suppress()
* Dict()
* replaceWith()

switches / parameters
Switches are all False by default (easy to remember). Params have the same default value as in pyParsing (ditto). Switches/params can change (between lines).
* auto naming of all patterns
* auto naming of pattern class
* define important (product) / incidental (token) results
* respect/ignore whitespace
* set of whitespace chars
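For the last two switches, pyParsing already exposes the matching knobs, so a text-grammar switch could presumably just forward to them. A small sketch of what the generated code might call, the switch syntax itself being still open:

from pyparsing import ParserElement, Word, alphas

# 'set of whitespace chars' -- e.g. treat only blanks and tabs as whitespace,
# so that newlines become significant for patterns defined afterwards
ParserElement.setDefaultWhitespaceChars(" \t")

# 'respect/ignore whitespace' -- per pattern, via leaveWhitespace()
word = Word(alphas).leaveWhitespace()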