[Pyparsing] text grammar -- first rough
From: spir <den...@fr...> - 2008-11-18 17:52:57
purpose
* lighter grammar to manipulate at design time
* standard & easily legible format to communicate
* conversion at design- or run-time -- adaptable

design goals
* clear clear clear
* adapted to pyParsing's way
* limited, simple, not full-featured

contents:
  vocabulary
  organisation
  expressions
  actions
  features

=== v o c a b u l a r y ==============================

This list is intended to help identify elements, factors and views on the topic of a text grammar for pyParsing -- to help, and not to mess up, things, notions and words. Feel free to change, add, redefine. Unsupported features are not listed here.

literal
    Literal string to be taken as is. Usually in quotes or apostrophes.
set choice
    Set used as a choice list of character literals. Expressed like a (regexp-like) range.
alternative choice
    Expression of a choice between alternative possibilities. In pyParsing MatchFirst='|'. Question: support of longest match=Or='^'?
row
    A sequence of items that must appear sequentially for a pattern to match. Expressed with ',' in BNF, with '+' in pyParsing code. Question: support of '&'=Each?
quantifier
    Operator used to define the allowed number of repetitions of a (sub-)expression. Usually '?', '*', '+'. Possible support of literal numbers for exact length.
item
    Any part of a pattern expression: literal, token, range, or a (sub-)group. May be quantified, named, grouped.
group
    Grouping of (sub-)expressions to allow use of quantity, naming, or any other overall action, on the whole group. Usually expressed as '(...)'.
token
    (Sub-)pattern expressed for the sake of clarity or to avoid repetition; appears in the expression of a (super-)pattern. Can be opposed to 'product'.
product
    Pattern result relevant for the application, either as a final result item or for further processing. Often contains tokens. Usually packed, concatenated, and/or mutated. May be automatically named/typed.
pack
    ~ pyParsing Group(). Expression of a product as a sequence of tokens.
concatenation
    ~ pyParsing Combine(). Expression of a product as a concatenation of tokens.
mutation
    ~ pyParsing parseAction which returns a value. Expression of a product as a transformation of a raw parse result, using a built-in or custom function.
task
    ~ pyParsing parseAction with no return value. A task is an action executed when a match is found, but it does not change the result.
id, name
    Identifier of a pattern. Will become the attribute 'nature' of its products.
use, use case
    Use of a (sub-)pattern in a specific situation inside a super-pattern, defined with a custom name. Will become the attribute 'role' of its products.

=== o r g a n i s a t i o n =============================

For the sake of clarity, useless symbols that do not actually express anything will be avoided. E.g. the standard (E)BNF '::=', the ',' separator, the final ';' and '[...]' for options will not be used. These characters may still be used in an expressive function. We need:
* pattern name: left side
* pattern expression: middle side
* pattern action(s): right side

As the name will become a code identifier, it will hold no whitespace. There is thus no need for a (visible) separator with the expression; whitespace is enough. Still, using e.g. ':' may help legibility. We need a separator for the action part of a line. It can be the same as the first separator, if any, or another one. I would propose as the simplest solution to use ':' for both separations (i.e. BNF '::=' reduced to ':'), which leads to

name    : expression          : action
---------------------------------------------
integer : [0-9]+              : int
decimal : integer '.' integer : float
number  : decimal | integer   : "num"
add     : {number '+' number} : "add"
calc    : add+

where:
* an action without quotes is a function call
* an action with quotes is pattern naming (--> result nature)
* {} may be packing = Group-ing

pattern : name ':' expression (':' action*)
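For illustration, here is a rough sketch of what hand-written pyParsing code for the example grammar above might look like. This is only my reading of the table, not a fixed translation scheme; the int/float conversions and the result names are taken straight from the action column.

from pyparsing import Word, Combine, Group, OneOrMore, nums

# integer : [0-9]+ : int                 -- mutation: convert the match to a python int
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))

# decimal : integer '.' integer : float  -- concatenate, then convert to a python float
decimal = Combine(Word(nums) + '.' + Word(nums))
decimal.setParseAction(lambda t: float(t[0]))

# number : decimal | integer : "num"     -- naming: the result nature is "num"
number = (decimal | integer)("num")

# add : {number '+' number} : "add"      -- {} read as packing = Group-ing
add = Group(number + '+' + number)("add")

# calc : add+
calc = OneOrMore(add)

print(calc.parseString("1.5 + 2 3 + 4"))   # -> [[1.5, '+', 2], [3, '+', 4]]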
=== e x p r e s s i o n =========================

Here the pattern expression elements are described -- and also expressed using the in-progress specification of pyParsing's text grammar itself ;-)

=== literal

A literal is expressed in quotes. We may choose only apostrophes (simple quotes) for this, which allows use of double quotes for another function -- such as identifying use cases, or concatenation. Backslash is used for escaping single chars which happen to be grammar codes -- first of all, backslash itself. It is also used to identify characters by ordinals (code numbers). I would rather support only decimal & hex (with 'x') ordinals, and have these ordinals be of fixed length, resp. 3 & 2 digits -- left-padded with '0'. This is more legible & simpler, esp. when a number follows an ordinal. (Below dd means decimal digit, hd means hex digit.)

char    : char_literal | '\\' dd dd dd | '\\x' hd hd
literal : '\'' char+ '\''

=== set/choice

A set is written inside []; it may contain both characters and ranges. Unlike pyParsing's srange() function, it basically expresses a *choice*, with no further need of Or, MatchFirst, or oneOf. For instance, [A-Z] is more or less equivalent to MatchFirst(list(srange("[A-Z]"))) or Word(srange("[A-Z]"), exact=1). A set can be negated with a negation code: I would support the use of '!' for this role. See quantifiers below for the expression of Word(set). Ranges are expressed as usual with a pair of characters separated by '-'. A range can also be named using a pyParsing term (like 'nums' (*)). Characters can be identified either literally, or using their ordinal (code number) preceded by the escape char.

range : (char '-' char) | range_name
set   : '!'? '[' (range | char)+ ']'

(*) Ranges could be renamed! E.g. nums --> digit or dec_digit. Also, no plural: a set, which holds ranges, already expresses a choice.
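As a quick check of the equivalence claimed above, the one-character choice really is just Word(..., exact=1) in today's pyParsing; the second half of this snippet is only a guess at how a negated set '![...]' might translate (here: CharsNotIn), since the proposal does not pin that down.

from pyparsing import Word, CharsNotIn, srange

# [A-Z]  -- a set used as a one-character choice
upper = Word(srange("[A-Z]"), exact=1)
print(upper.parseString("Quick"))      # -> ['Q']

# ![0-9] -- assumed reading of a negated set, mapped to CharsNotIn
not_digit = CharsNotIn(srange("[0-9]"), exact=1)
print(not_digit.parseString("x1"))     # -> ['x']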
=== token

A token inside another pattern is identified by its name, with no further character.

token : identifier

=== group

A group of tokens is built with (...). This allows use of quantities, role identifying, maybe more.

group : '(' expression ')'

=== item row

Items that must appear in a row are simply written one after another. A space is a sufficient separator; no use of ',' or '+' is needed. I would support that a space separates items in a row even when not necessary for parsing: this seems both simpler and clearer.

row : item (' ' item)+

=== quantifier

todo ******************

=== a c t i o n s ====================================

Actually, naming, packing and concatenation, as well as what is called parse actions, all can be seen as parse actions, in the sense that they are additional jobs done on, or with, a result -- once this result has been distinguished. Right? This applies both to actions on results of full patterns and to ones executed on sub-expressions. About the latter case, I wonder if including this feature in a text grammar is really useful -- except for the naming of *roles*. Reasons are:
~ It may rarely be used. (?)
~ It complicates the grammar.
~ The programmer can define a sub-pattern to apply the action on.
~ The action can be added once the text grammar is converted into code.

If needed, the syntax should use the same codes as the ones for whole-pattern actions, so that these codes allow wrapping a sub-expression. I will leave this aside for now, only keeping this requirement in mind, and concentrate on actions on whole patterns. As all these kinds of parsing actions apply to results, they all fit well on the right side of a line. This has the side advantage of avoiding expression overload. Examples:

integer : [0-9]+ : int "int"
    --> convert integer to a python int; set its 'nature' attribute to "int"
decimal : integer '.' integer : <> float
    --> concatenate decimal, convert it to a python float
mult : num '*' num : {}
    --> pack mult into a pyParsing Group (sequence)

I find the organisation really convenient, and the choice of codes rather easy to use. An alternative would be to use () for packing=Group-ing, which would leave {} free for Dict-ing. But then it would not be possible anymore to pack=Group() sub-expressions (which I personally do not find a serious drawback -- but others may). Naming of pattern use cases, i.e. result roles, happens rather naturally inside expressions:

decimal : integer "l_int" '.' integer "r_int" : <> float

Now, there remains a possible issue: actually, there are 2 kinds of parse actions:
* Tasks (think of Pascal procedures) execute an additional job when a matching string is found. Example: count the number of integers, add them, print them. We could call this kind of action a 'task'. Naming can be seen as a kind of task.
* Mutations return a transformed result. Very different, I guess. Grouping and combining, even if they do not really change the result's content, should rather be seen as mutations, as they change at least its type.

The point is that both kinds of actions are implemented the same way. That is the reason why they are, logically enough, both called 'actions'. But I guess they obviously do not serve the same kind of purpose. They have a very different sense, meaning a different semantic from the point of view of the programmer. Tasks can be seen as parsing side-effects. Mutations also mean that the potential results of the pattern are relevant for the application: they are products. As a consequence, I would prefer expressing tasks and mutations differently. This would also allow distinguishing relevant results. We may prefix mutations with e.g. '->', which is imo quite obvious:

integer : [0-9]+ : ->int "int"
decimal : integer '.' integer : <> ->float

Now, I have no clue which code would be sensible & meaningful for tasks, if any.
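To make the task/mutation distinction concrete on the pyParsing side: both are attached with setParseAction()/addParseAction(), and only the return value tells them apart. A minimal sketch -- the counter is just an invented example of a side-effect task, not anything the text grammar would generate:

from pyparsing import Word, nums

integer = Word(nums)

# mutation: the parse action returns a value, so the result is transformed
integer.setParseAction(lambda t: int(t[0]))

# task: the parse action returns nothing, the result is left untouched
seen = []                        # hypothetical side effect: collect matched integers
def count_integer(tokens):
    seen.append(tokens[0])

integer.addParseAction(count_integer)

print(integer.parseString("42"))   # -> [42]
print(seen)                        # -> [42]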
=== n a m i n g , t y p i n g ====================================

When a result is generated by a whole pattern, the pattern's name, if any, defines the result's type -- or more precisely its *nature*. When results are relevant, they should be able to hold their nature by default. When a (sub-)pattern is used inside one or more other (super-)patterns, it can express that its results have several possible *roles*. Below, 'integer' and 'decimal' define product natures. I used setName() for this.

integer = Word(nums).setName("integer")
decimal = Combine(integer("int_part") + '.' + integer("dec_part")).setName("decimal")
num = decimal | integer
mult = num("left_num") + '*' + num("right_num")

Depending on the situation, the integer pattern can be used as the int_part or dec_part of a decimal. These terms thus define product roles instead. Moreover, both integers and decimals can be nums, and as such be used as the left or right operand of a multiplication, which are different roles again.

I find all of this relevant. I would really like that both kinds of pattern naming / result typing be possible. The present implementation of pyParsing is such that achieving this goal is not straightforward -- but still possible. We could do it so:
* The programmer selects which kinds of results are relevant: switches in the text grammar can define which results should then know about their type (nature and/or role).
* Patterns know 'who' they are: write the code grammar in a scope (class or separate module), use the scope's dict to set an 'id' attribute on each pattern. No need to name whole patterns manually, as their name is anyway the variable name. Still, keep it possible to use another name? Rather call this 'id' than 'name', which is used for debugging.
* Patterns can define a use case: implemented with setResultsName("use") or simply the ("use") pseudo-call. We can copy it to a 'use' attribute to avoid confusion.
* Results, or only selected ones, hold a reference to their source pattern: either with a parse action that adds a dict item, or as an argument at the result's initialisation.
* Selected results read relevant info from the pattern: either with an additional parse action, or at initialisation when they have the proper reference. Storage can be done as an attribute or a dict item.

=== f e a t u r e s ===================================

List of non-basic, yet unsupported, features to think about. Note that a few additional codes or syntactic forms may well compromise the grammar's overall readability. Furthermore, individual operations can easily be added by hand once the text grammar is translated to pyParsing code. Not to forget is the main goal of clarity, which fits well with decomposition of complex expressions. As a consequence, here is a list of criteria to consider for supporting a feature:
--> very common need (how to measure that?)
--> difficult to express by combining existing ones
--> obvious format that fits well in the grammar

operations
* '!' or '~' not (overall negation)
* '&'=Each() (unordered And)
* '^'=Or() (longest match)
* FollowedBy() (look-ahead - right side condition)
* '~'=NotAny() (!FollowedBy())
* SkipTo() (as the name says)
* Forward --> recursive patterns: useless in a text grammar?

tokens
* Keyword()
* White()
* QuotedString()
* CharsNotIn()
* overall CaseLess()
* Regexp helpers
* nestedExpr()
* delimitedList()
* OnlyOnce()

actions
* Suppress()
* Dict()
* replaceWith()

switches / parameters
Switches are all False by default (easy to remember). Params have the same default value as in pyParsing (ditto). Switches/params can change (between lines).
* auto naming of all patterns
* auto naming of pattern class
* define important (product) / incidental (token) results
* respect/ignore whitespace
* set of whitespace chars
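For the last two switches, pyParsing already exposes the matching knobs, so a text-grammar switch could presumably just forward to them. A small sketch of what the generated code might call, the switch syntax itself being still open:

from pyparsing import ParserElement, Word, alphas

# 'set of whitespace chars' -- e.g. treat only blanks and tabs as whitespace,
# so that newlines become significant for patterns defined afterwards
ParserElement.setDefaultWhitespaceChars(" \t")

# 'respect/ignore whitespace' -- per pattern, via leaveWhitespace()
word = Word(alphas).leaveWhitespace()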