Thread: [Pyparsing] text grammar -- first rough
Brought to you by:
ptmcg
From: spir <den...@fr...> - 2008-11-18 17:52:57
purpose
* lighter grammar to manipulate at design time
* standard & easily legible format to communicate
* conversion at design- or run-time -- adaptable

design goals
* clear, clear, clear
* adapted to pyParsing's way
* limited, simple, not full-featured

contents: vocabulary, organisation, expressions, actions, features

=== v o c a b u l a r y ==============================

This list is intended to help identify elements, factors, and views on the topic of a text grammar for pyParsing -- to help, and not mess up, things, notions, and words. Feel free to change, add, redefine. Unsupported features are not listed here.

literal
    Literal string to be taken as is. Usually in quotes or apostrophes.
set choice
    Set as a choice list of character literals. Expressed like a (regexp-like) range.
alternative choice
    Expression of a choice between alternative possibilities. In pyParsing, MatchFirst = '|'. Question: support of longest match = Or = '^'?
row
    A sequence of items that must appear sequentially for a pattern to match. Expressed with ',' in BNF, with '+' in pyParsing code. Question: support of '&' = Each?
quantifier
    Operator used to define the number of occurrences of a (sub-)expression. Usually '?', '*', '+'. Possible support of literal numbers for exact length.
item
    Any part of a pattern expression: literal, token, range, or a (sub-)group. May be quantified, named, grouped.
group
    Grouping of (sub-)expressions to allow use of quantity, naming, or any other overall action on the whole group. Usually expressed as '(...)'.
token
    (Sub-)pattern result expressed for the sake of clarity or to avoid repetition; appears in the expression of a (super-)pattern. Can be opposed to 'product'.
product
    Pattern result relevant for the application, either as a final result item or for further processing. Often contains tokens. Usually packed, concatenated, and/or mutated. May be automatically named/typed.
pack
    ~ pyParsing Group(). Expression of a product as a sequence of tokens.
concatenation
    ~ pyParsing Combine(). Expression of a product as a concatenation of tokens.
mutation
    ~ pyParsing parse action which returns a value. Expression of a product as a transformation of a raw parse result, using a built-in or custom function.
task
    ~ pyParsing parse action with no return value. A task is an action executed when a match is found, but it does not change the result.
id, name
    Identifier of a pattern. Will become the attribute 'nature' of its products.
use, use case
    Use of a (sub-)pattern in a specific situation inside a super-pattern, defined with a custom name. Will become the attribute 'role' of its products.

=== o r g a n i s a t i o n =============================

For the sake of clarity, useless symbols that do not actually express anything will be avoided. E.g. the standard (E)BNF '::=', the ',' separator, the final ';', and '[...]' for options will not be used. These characters may still be used in an expressive function.

We need:
* pattern name: left side
* pattern expression: middle
* pattern action(s): right side

As the name will become a code identifier, it will hold no whitespace. There is thus no need for a (visible) separator before the expression; whitespace is enough. Still, using e.g. ':' may help legibility. We need a separator for the action part of a line. It can be the same as the first separator, if any, or another one. I would propose, as the simplest solution, to use ':' for both separations (i.e. BNF '::=' reduced to ':'), which leads to

name : expression : action
---------------------------------------------
integer : [0-9]+ : int
decimal : integer '.' integer : float
number : decimal | integer : "num"
add : {number '+' number} : "add"
calc : add+

where:
* an action without quotes is a function call
* an action with quotes is pattern naming (--> result nature)
* {} may be packing = Group-ing

pattern : name ':' expression (':' action*)

=== e x p r e s s i o n =========================

Here the pattern expression elements are described -- and also expressed using the in-progress specification of pyParsing's text grammar itself ;-)

=== literal

A literal is expressed in quotes. We may choose only apostrophes (single quotes) for this, which allows use of double quotes for another function -- such as identifying use cases, or concatenation. Backslash is used for escaping single characters which happen to be grammar codes -- first of all, backslash itself. It is also used to identify characters by ordinals (code numbers). I would rather support only decimal & hex (with 'x') ordinals, and require that these ordinals have a fixed length of resp. 3 & 2 digits -- left-padded with '0'. This is more legible & simpler, esp. when a number follows an ordinal. (Below, dd means decimal digit, hd means hex digit.)

char : char_literal | '\\' dd dd dd | '\\x' hd hd
literal : '\'' char+ '\''

=== set/choice

A set is written inside []; it may contain both characters and ranges. Unlike pyParsing's srange() function, it basically expresses a *choice*, with no further need of Or, MatchFirst, or oneOf. For instance, [A-Z] is more or less equivalent to MatchFirst(list(srange("[A-Z]"))) or Word(srange("[A-Z]"), exact=1). A set can be negated with a negation code: I would support the use of '!' for this role. See quantifiers below for the expression of Word(set). Ranges are expressed as usual with a pair of characters separated by '-'. A range can also be named using a pyParsing term (like 'nums'(*)). Characters can be identified either literally, or by their ordinal (code number) preceded by the escape char.

range : (char '-' char) | range_name
set : '!'? '[' (range | char)+ ']'

(*) Ranges could be renamed! E.g. nums --> digit or dec_digit. Also, no plural: a set, which holds ranges, expresses a choice.

=== token

A token inside another pattern is identified by its name, with no further character.

token : identifier

=== group

A group of tokens is built with (...). This allows use of quantities, role identifying, maybe more.

group : '(' expression ')'

=== item row

Items that must appear in a row are simply written one after another. A space is a sufficient separator; no use of ',' or '+' is needed. I would require that a space separate items in a row even when not necessary for parsing: this seems both simpler and clearer.

row : item (' ' item)+

=== quantifier

todo ******************

=== a c t i o n s ====================================

Actually naming, packing, and concatenation, as well as what are called parse actions, can all be seen as parse actions, in the sense that they are additional jobs done on, or with, a result -- once this result has been distinguished. Right? This applies both to actions on the results of full patterns and to ones executed on sub-expressions. About the latter case, I wonder if including this feature in a text grammar is really useful -- except for the naming of *roles*. Reasons are:
~ It may rarely be used. (?)
~ It complicates the grammar.
~ The programmer can define a sub-pattern to apply the action on.
~ The action can be added once the text grammar is converted into code.
If needed, the syntax should use the same codes as the ones for whole-pattern actions, so these codes should allow wrapping a sub-expression. I will leave this aside for now, only keeping this requirement in mind, and concentrate on actions on whole patterns. As all these kinds of parsing actions apply to results, they all fit well on the right side of a line. This has the side advantage of avoiding expression overload.
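[Editor's note: for concreteness, here is a sketch of how the example grammar in the organisation section above could compile into pyparsing constructs. This rendering is mine, not part of the proposal; setParseAction plays the mutation role, the ("num") call plays the naming role, and Group() plays the packing role.]

```python
from pyparsing import Combine, Group, Word, nums

# integer : [0-9]+ : int          -- mutation: convert the match to a Python int
integer = Word(nums).setParseAction(lambda t: int(t[0]))

# decimal : integer '.' integer : float   -- <> concatenation, then float()
decimal = Combine(Word(nums) + "." + Word(nums)).setParseAction(lambda t: float(t[0]))

# number : decimal | integer : "num"      -- naming (result nature)
number = (decimal | integer)("num")

# add : {number '+' number} : "add"       -- {} packs the match into a Group
add = Group(number + "+" + number)

r = add.parseString("3.5 + 12")
print(r.asList())  # [[3.5, '+', 12]]
```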
Examples:

integer : [0-9]+ : int "int"
    --> convert integer to a Python int; set its 'nature' attribute to "int"
decimal : integer '.' integer : <> float
    --> concatenate decimal, convert it to a Python float
mult : num '+' num : {}
    --> pack mult into a pyParsing Group (sequence)

I find the organisation really convenient, and the choice of codes rather easy to use. An alternative would be to use () for packing = Group-ing, which would leave {} free for Dict-ing. But then it would no longer be possible to pack/Group() sub-expressions (which I personally do not find a serious drawback -- but others may).

Naming of pattern use cases, i.e. result roles, happens rather naturally inside expressions:

decimal : integer "l_int" '.' integer "r_int" : <> float

Now, a possible issue remains: there are actually two kinds of parse actions:
* Tasks (think of Pascal procedures) execute an additional job when a matching string is found. Example: count the integers, add them, print them. We could call this kind of action a 'task'. Naming can be seen as a kind of task.
* Mutations return a transformed result. Very different, I guess. Grouping and combining, even if they do not really change the result's content, should rather be seen as mutations, as they change at least its type.
The point is that both kinds of action are implemented the same way -- the reason why, logically enough, both are called 'actions'. But I guess they obviously do not serve the same kinds of purpose. They have a very different sense, meaning a different semantic from the programmer's point of view. Tasks can be seen as parsing side effects. Mutations also mean that the potential results of the pattern are relevant for the application: they are products. As a consequence, I would prefer expressing tasks and mutations differently. This would also allow a distinction of relevant results. We may prefix mutations with e.g. '->', which is imo quite obvious:

integer : [0-9]+ : ->int "int"
decimal : integer '.' integer : <> ->float

Now, I have no clue which code would be sensible & meaningful for tasks, if any.

=== n a m i n g , t y p i n g ====================================

When a result is generated by a whole pattern, then the pattern's name, if any, defines the result's type -- or more precisely its *nature*. When results are relevant, they should be able to hold their nature by default. When a (sub-)pattern is used inside one or more other (super-)patterns, then it can express that its results have several possible *roles*. Below, 'integer' and 'decimal' define product natures. I used setName() for this.

integer = Word(nums).setName("integer")
decimal = Combine(integer("int_part") + '.' + integer("dec_part")).setName("decimal")
num = decimal | integer
mult = num("left_num") + '*' + num("right_num")

Depending on the situation, the integer pattern can be used as the int_part or dec_part of a decimal. These terms thus define product roles instead. Moreover, both integers and decimals can be nums, and as such used as the left or right operand of a multiplication, which are different roles again. I find all of this relevant. I would really like both kinds of pattern naming / result typing to be possible. The present implementation of pyParsing is such that achieving this goal is not straightforward -- but still possible. We could do it so:
* The programmer selects which kinds of results are relevant: switches in the text grammar can define which results should then know about their type (nature and/or role).
* Patterns know 'who' they are: write the code grammar in a scope (class or separate module), and use the scope's dict to set an 'id' attribute on each pattern. No need to name whole patterns manually, as their name is the variable name anyway. Still, leave it possible to use another name? Rather call this 'id' than 'name', which is used for debugging.
* Patterns can define a use case: implemented with setResultsName("use"), or simply the ("use") pseudo-call. We can copy it to a 'use' attribute to avoid confusion.
* Results, or only selected ones, hold a reference to their source pattern: either with a parse action that adds a dict item, or as an argument at the result's initialisation.
* Selected results read relevant info from the pattern: either with an additional parse action, or at initialisation when they have the proper reference. Storage can be done as an attribute or a dict item.

=== f e a t u r e s ===================================

A list of non-basic, as yet unsupported, features to think about. Note that a few additional codes or syntactic forms may well compromise the grammar's overall readability. Furthermore, individual operations can easily be added by hand once the text grammar is translated to pyParsing code. Not to be forgotten is the main goal of clarity, which fits well with the decomposition of complex expressions. As a consequence, here is a list of criteria to consider for supporting a feature:
--> very common need (how to measure that?)
--> difficult to express by combining existing ones
--> obvious format that fits well in the grammar

operations
* '!' or '~' not (overall negation)
* '&' = Each() (unordered And)
* '^' = Or() (longest match)
* FollowedBy() (look-ahead - right-side condition)
* '~' = NotAny() (!FollowedBy())
* SkipTo() (as the name says)
* Forward() --> recursive patterns: useless in a text grammar?

tokens
* Keyword()
* White()
* QuotedString()
* CharsNotIn()
* overall caselessness (CaselessLiteral())
* Regex helpers
* nestedExpr()
* delimitedList()
* OnlyOnce()

actions
* Suppress()
* Dict()
* replaceWith()

switches / parameters
Switches are all False by default (easy to remember). Params have the same default values as in pyParsing (ditto). Switches/params can change (between lines).
* auto naming of all patterns
* auto naming of pattern class
* define important (product) / incidental (token) results
* respect/ignore whitespace
* set of whitespace chars
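[Editor's note: the task/mutation distinction discussed above maps directly onto pyparsing's parse actions today; this small illustration (mine, not part of the proposal) shows that a parse action returning None leaves the tokens untouched, while a returned value replaces them.]

```python
from pyparsing import Word, nums

matches = []  # filled in by the task below

# Task: pure side effect -- returns None, so the tokens pass through unchanged.
def tally(tokens):
    matches.append(tokens[0])

# Mutation: returns a value, which replaces the raw token string.
def to_int(tokens):
    return int(tokens[0])

# Parse actions run in the order given: first the task, then the mutation.
integer = Word(nums).setParseAction(tally, to_int)

result = integer.parseString("42")
print(result[0])   # 42, as a Python int
print(matches)     # ['42']
```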
From: Ralph C. <ra...@in...> - 2008-11-18 20:40:20
Hi spir,

I don't wish to discourage your using the list as a sounding board, but I for one would have more inclination to read in detail if there was an example at the top that contrasts how I'd have to write something now in PyParsing compared to the better way I could write it with your suggestions.

Cheers,

Ralph.
From: Paul M. <pt...@au...> - 2008-11-19 00:50:57
Ralph -

Actually, spir/Denis is proposing a specialized form of BNF, which would be compilable into pyparsing constructs (I think that's where we left this idea last), probably using a pyparsing parser.

I do agree, though, that writing up an actual example would be very instructive, both in conveying the concepts and in actually testing the syntax out. How about this for a test structure, a log message:

    An integer message number
    A date/time timestamp
    Severity (Debug/Info/Warning/Error)
    Message - everything up to the end of line

Example:

    10001 2008/11/18 12:34:56 Error A bad thing happened

Use the syntax to name each field, and each field in the date-timestamp, and convert the integers to actual integers and the date-timestamp into a datetime.datetime (that is, a datetime object as imported from the Python datetime module).

You can also look back on the examples I've posted over the past few years on comp.lang.python and use one of them.

-- Paul
From: spir <den...@fr...> - 2008-11-19 09:51:21
Paul McGuire wrote:
> Actually, spir/Denis is proposing a specialized form of BNF, which would be
> compilable into pyparsing constructs (I think that's where we left this idea
> last), probably using a pyparsing parser.

Exactly. It would be highly useful for a present project of mine: I need to write the grammar at runtime because it depends on user choices. Then I realized that it may be worthwhile for other needs -- a kind of alternative, or design-time, format. That is the reason why I keep my explorations public: it could be collaborative work. Now, just ignore it if you do not care.

> I do agree, though, that writing up an actual example would be very
> instructive, both in conveying the concepts, and in actually testing the
> syntax out. How about this for a test structure, a log message:

I do agree, too. Simply illustrating, testing and such things are not my cup of tea... to each his/her favourite playing field. If you're interested, contribute with what you like to do or what you are good at.

> An integer message number
> Date/time timestamp
> Severity (Debug/Info/Warning/Error)
> Message everything up to the end of line
>
> Example:
> 10001 2008/11/18 12:34:56 Error A bad thing happened
>
> Use the syntax to name each field, each field in the date-timestamp, and
> convert the integers to actual integers and the date-timestamp into a
> datetime.datetime (that is a datetime object as imported from the Python
> datetime module).

Anyway, some notes:
* Nothing is fixed in the grammar format yet. All is open to criticism. Just propose better alternatives with a rationale.
* There may be parsing directives to allow programmer control (like a preprocessor).
* The right side is a field for all post-parsing actions, including Group-ing, Combine-ing, and pattern naming (= result typing).
* All patterns could be automatically named (using the left-side name) with a "grammar directive": this feature does not fit here, as some patterns are not to be kept (they define tokens, as opposed to products).
* All "post-processed" patterns could be named: this would be all right here. No need to rename on the right side like I did below -- all products would then hold their type "naturally".
* Some fields can be specially typed to catch their special *role* in a specific situation. I wrote the 'text' pattern to illustrate this feature with 'severity': results would get 'word' & 'severity' respectively as .nature & .role attributes.

d : [0-9]
number : d+ : int "number"
date : d4 '/' d2 '/' d2
time : d2 ':' d2 ':' d2
date_time : date time : <> datetime.datetime "date_time"
word : [printable]+
text : word"severity" word* : <> str "text"
message : number date_time text : () "message"
NL : '\10' | '\13' | '\13\10'
log : message (NL message)+

comments:
~ d, NL: there may be constants for such things --> [digit] [new_line]
~ date, time: use of ints as quantifiers seems both useful and easy.
~ date_time, text: <> means concatenation (= Combine())
~ message: () or {} could mean packing (= Group()). Not to be confused with in-pattern expression grouping with (), too... {} for Dict() instead? See previous & next posts.

I realize that a whole lot of charset constants could be worthwhile, too. Imo, the separation of pattern expression and post-parsing action is very good for legibility. About right-side actions: I will send a separate post on the topic -- I now have a better image of what fits in, and how.

Ralph Corderoy wrote:
> I don't wish to discourage your using the list as a sounding board, but I
> for one would have more inclination to read in detail if there was an
> example at the top that contrasts how I'd have to write something now in
> PyParsing compared to the better way I could write it with your suggestions.

nou spennn yer prrishuz taym n inerdji ! *I* will not supply samples only for advocating. I like design instead. I do this because I wish to do it, because it fulfills a need of mine, and it brings me pleasure. If *you* feel like having illustrations, and if you think it would be worthwhile for others, maybe: do it.

denis