pyparsing-users Mailing List for Python parsing module (Page 17)
Brought to you by: ptmcg
From: Boštjan J. <ml...@ja...> - 2008-12-03 07:59:39
|
Hello! I'd like to parse a page by searching whether it contains a given word; after that word, the text follows a known syntax. Let me explain:

<a lot of text of unknown length, with possible line breaks>
<searched word>
<known syntax to parse>
.....

If I try to use Word(alphas), it stops at line 2, character 15 (is there a limit)? Should I just use Python's index method instead? I hope the question/explanation was clear. Boštjan |
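[Editor's note -- not part of the archived thread: the usual pyparsing answer here is SkipTo, which scans over any amount of text, newlines included, until the target expression matches, so Word(alphas) is not needed for the free-form prefix at all. A minimal sketch follows; the "BEGIN" marker and the numeric payload are invented for illustration.]

from pyparsing import SkipTo, Suppress, Literal, Word, nums

marker = Literal("BEGIN")            # stands in for the "searched word"
payload = Word(nums)("value")        # stands in for the known syntax

# SkipTo consumes everything (line breaks included) up to the marker
pattern = Suppress(SkipTo(marker)) + Suppress(marker) + payload

text = """arbitrary text of unknown length
spread over several lines
BEGIN 12345"""

print(pattern.parseString(text)["value"])    # -> 12345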
From: spir <den...@fr...> - 2008-11-19 11:43:01
|
Here is how I imagine the right field of pattern lines -- as of now: Wed 19 Nov 2008, 12:41 here.

=== What fits in?

Post-match jobs: all actions that apply on, or use, results generated according to this precise pattern. This includes:

* mutation: a (post-)parse action that returns a transformed result.
* task: a parse action that performs an additional operation when a match is found -- but leaves the result unchanged.
* format: a kind of intermediate between mutation and task. It will not change the result's content, only its format. This includes packing (= Group-ing), concatenation (= Combine), maybe toList() and Dict().
* pattern name = result type: products, or relevant results, can get a 'nature' field, possibly implemented as an attribute or dict item. It says what kind of result it is, hence no need to reparse when the kinds of results are not predictable: appropriate actions can be launched directly. Grammar/parser directives may let relevant results be automatically typed, using the pattern name.
* 'star' (?) (new idea): product identification, a kind of flag. A special code, e.g. '*', used to identify relevant patterns, meaning the ones needed for further processing.

=== Order

There is a kind of natural order, maybe. It could be, for instance:

action : star? (mutation | task | format)* name?

=== Comments

separator: I am not fully happy with ':' being the separator between the expression and action fields. It is all right, especially because it is the same as the name/expression separator, but there may be better; it is not obvious enough to my taste, actually.

mutations: must obviously be defined elsewhere; this is outside the scope of the grammar itself. They can be built-in types/functions such as int(). Typically, I guess, they cope only with the result as argument.

mutations & formats: it is not clear to me whether the formatted result should then really be of the target type (e.g. list, int) or a ParseResults object holding this content. I am not sure I understand what really happens when using such functions on ParseResults itself. There are very heterogeneous cases:
* int() or whatever: parse action returning a new result object
* asList(): ParseResults method
* Dict(), Group(), Combine(): ParserElement subtypes

=== star: token vs product

Paul chose to write a single-pass parser generator. This is perfectly good, especially when the grammar is in pure code ;-). Still, for the programmer, results do not all have the same status: some are intermediate results (I call them tokens), some are the results one needs to cope with after parsing (products) -- even if only for output. I find it rather natural to make a kind of flagging possible. I have some use for that in mind, but probably there is more. Note: there is no absolute need to write down token patterns; they could be sub-expressions of products. But doing so highly enhances legibility and avoids repetition. This is even more relevant for a text grammar whose first purpose is clarity. Actually, I would support 'starring' (or anything with similar semantics) even if it had no real use: because it makes sense... It may also be stored on the product itself, together with nature/role. Precisely, as I see it, this flag is particularly meaningful in combination with nature and/or role: it shows what's important -- and what's not. The 'star' code would allow the programmer, after parsing, to sort out relevant results. This code would also allow, and filter, automatic actions such as naming, suppression, or whatever. These actions could be controlled with parser 'switch' directives (all default values False).

All of this mess can be written manually for the patterns to which it applies. But global control seems clearer to me and avoids useless grammar overload. Now, this is nothing specific to the text grammar; instead, global control could exist for pyParsing coded grammars, like the present whitespace control.

=== grammar/parser directives

Some rough ideas.

whitespace
_ respect_whitespace (default: ignore)
_ list of whitespace chars (default: same as pyParsing)

product
_ all_products (default: no) -- avoids starring all
_ product_list (default: []) -- alternative to starring in-line
_ type_dict (default: {}) --> automatic instantiation (*)

typing/naming
_ type_all (default: no)
_ type_products

actions
_ product_format (default: None) -- default format, e.g. Dict
_ product_action (default: None) -- default conversion, e.g. int
_ product_task (default: None) -- default side task, e.g. count, add, print
_ all_format (default: None) -- default format, e.g. Dict
_ all_action (default: None) -- default conversion, e.g. int
_ all_task (default: None) -- default side task, e.g. count, add, print

(*) This is a specific need of mine; no idea whether it matches a common requirement. Objects will be instantiated with a type defined by the product's nature, and init data taken from the product itself. It looks like:

type_dict[product.nature](product)

This could be a parse action applying to all products -- but products only.

denis |
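[Editor's note -- not part of the archived thread: in pyparsing as it exists, denis's 'mutation' vs 'task' distinction maps onto parse actions that return a value versus ones that return None. A minimal sketch, with all names invented here.]

from pyparsing import Word, nums, OneOrMore

seen = []

def mutate(tokens):          # "mutation": returns a transformed result
    return int(tokens[0])

def task(tokens):            # "task": side effect only, result unchanged
    seen.append(tokens[0])

integer = Word(nums).setParseAction(mutate).addParseAction(task)

print(OneOrMore(integer).parseString("1 22 333").asList())   # [1, 22, 333]
print(seen)                                                  # [1, 22, 333]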
From: spir <den...@fr...> - 2008-11-19 09:51:21
|
Paul McGuire wrote:
> Ralph -
>
> Actually, spir/Denis is proposing a specialized form of BNF, which would be
> compilable into pyparsing constructs (I think that's where we left this idea
> last), probably using a pyparsing parser.

Exactly. It would be highly useful for a present project of mine: I need to write the grammar at runtime because it depends on user choices. Then I realized that it may be worthwhile for other needs -- a kind of alternative, or design-time, format. That is the reason why I keep my explorations public: it could become a collaborative work. Now, just ignore it if you do not care.

> I do agree, though, that writing up an actual example would be very
> instructive, both in conveying the concepts, and in actually testing the
> syntax out. How about this for a test structure, a log message:

I do agree, too. Simply illustrating, testing and such things are not my cup of tea... to each one his/her favorite playing field. If you're interested, contribute with what you like to do or what you are good at.

> An integer message number
> Date/time timestamp
> Severity (Debug/Info/Warning/Error)
> Message everything up to the end of line
>
> Example:
> 10001 2008/11/18 12:34:56 Error A bad thing happened
>
> Use the syntax to name each field, each field in the date-timestamp, and
> convert the integers to actual integers and the date-timestamp into a
> datetime.datetime (that is a datetime object as imported from the Python
> datetime module).

Anyway... some notes:
* Nothing is fixed in the grammar format yet. All is open to criticism; just propose better alternatives, with rationale.
* There may be parsing directives to allow programmer control (like a preprocessor).
* The right side is a field for all post-parsing actions, including Group-ing, Combine-ing, and pattern naming (= result typing).
* All patterns could be automatically named (using the left-side name) with a "grammar directive": this feature does not fit here, as some patterns are not to be kept (they define tokens, as opposed to products).
* All "post-processed" patterns could be named: that would be all right here. No need to rename on the right side like I did below -- all products would then hold their type "naturally".
* Some fields can be specially typed to catch their special *role* in a specific situation. I wrote the 'text' pattern to illustrate this feature with 'severity': results would get 'word' & 'severity' respectively as .nature & .role attributes.

d : [0-9]
number : d+ : int "number"
date : d4 '/' d2 '/' d2
time : d2 ':' d2 ':' d2
date_time : date time : <> datetime.datetime "date_time"
word : [printable]+
text : word"severity" word* : <> str "text"
message : number date_time text : () "message"
NL : '\10' '\13' '\13\10'
log : message (NL message)+

comments:
~ d, NL: there may be constants for such things --> [digit] [new_line]
~ date, time: use of ints as quantifiers seems both useful and easy.
~ date_time, text: <> means concatenation (= Combine())
~ message: () or {} could mean packing (= Group()). Not to be confused with in-pattern expression grouping with (), too... {} for Dict() instead? See previous & next posts.

I realize that a whole lot of charset constants could be worthwhile, too. Imo, the separation of pattern expression and post-parsing action is very good for legibility. About right-side actions: I will send a separate post on the topic -- I now have a better image of what fits in, and how.

> You can also look back on the examples I've posted over the past few years
> on comp.lang.python and use one of them.
>
> -- Paul
>
> -----Original Message-----
> From: Ralph Corderoy [mailto:ra...@in...]
> Sent: Tuesday, November 18, 2008 2:40 PM
> To: spir
> Cc: [pyParsing]
> Subject: Re: [Pyparsing] text grammar -- first rough
>
> Hi spir,
>
> I don't wish to discourage your using the list as a sounding board, but I
> for one would have more inclination to read in detail if there was an
> example at the top that contrasts how I'd have to write something now in
> PyParsing compared to the better way I could write it with your suggestions.

nou spennn yer prrishuz taym n inerdji ! *I* will not supply samples only for advocating; I like design instead. I do this because I wish to do it, because it fulfills a need of mine and brings me pleasure. If *you* feel like having illustrations, if you think it would be worthwhile for others maybe: do it.

denis

> Cheers,
>
> Ralph.
|
From: Paul M. <pt...@au...> - 2008-11-19 00:50:57
|
Ralph -

Actually, spir/Denis is proposing a specialized form of BNF, which would be compilable into pyparsing constructs (I think that's where we left this idea last), probably using a pyparsing parser. I do agree, though, that writing up an actual example would be very instructive, both in conveying the concepts and in actually testing the syntax out. How about this for a test structure, a log message:

An integer message number
Date/time timestamp
Severity (Debug/Info/Warning/Error)
Message: everything up to the end of line

Example:

10001 2008/11/18 12:34:56 Error A bad thing happened

Use the syntax to name each field, each field in the date-timestamp, and convert the integers to actual integers and the date-timestamp into a datetime.datetime (that is, a datetime object as imported from the Python datetime module). You can also look back on the examples I've posted over the past few years on comp.lang.python and use one of them.

-- Paul

-----Original Message-----
From: Ralph Corderoy [mailto:ra...@in...]
Sent: Tuesday, November 18, 2008 2:40 PM
To: spir
Cc: [pyParsing]
Subject: Re: [Pyparsing] text grammar -- first rough

Hi spir,

I don't wish to discourage your using the list as a sounding board, but I for one would have more inclination to read in detail if there was an example at the top that contrasts how I'd have to write something now in PyParsing compared to the better way I could write it with your suggestions.

Cheers,

Ralph. |
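[Editor's note -- not part of the archived thread: for reference, Paul's test case can be written directly in pyparsing. This is only one possible rendering; the result names and the helper function are chosen here, not taken from the thread.]

from datetime import datetime
from pyparsing import Word, nums, oneOf, restOfLine, Group

integer = Word(nums).setParseAction(lambda t: int(t[0]))

date = Group(integer("year") + "/" + integer("month") + "/" + integer("day"))
clock = Group(integer("hour") + ":" + integer("minute") + ":" + integer("second"))

def to_datetime(tokens):
    # assemble the named sub-fields into a datetime.datetime
    d, c = tokens["date"], tokens["clock"]
    return datetime(d["year"], d["month"], d["day"],
                    c["hour"], c["minute"], c["second"])

timestamp = (date("date") + clock("clock")).setParseAction(to_datetime)

severity = oneOf("Debug Info Warning Error")
log_line = (integer("msgnum") + timestamp("timestamp")
            + severity("severity") + restOfLine("message"))

r = log_line.parseString("10001 2008/11/18 12:34:56 Error A bad thing happened")
print("%s | %s | %s" % (r["timestamp"], r["severity"], r["message"].strip()))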
From: Ralph C. <ra...@in...> - 2008-11-18 20:40:20
|
Hi spir,

I don't wish to discourage your using the list as a sounding board, but I for one would have more inclination to read in detail if there was an example at the top that contrasts how I'd have to write something now in PyParsing compared to the better way I could write it with your suggestions.

Cheers,

Ralph. |
From: spir <den...@fr...> - 2008-11-18 17:52:57
|
purpose
* lighter grammar to manipulate at design time
* standard & easily legible format to communicate
* conversion at design- or run-time -- adaptable

design goals
* clear, clear, clear
* adapted to pyParsing's way
* limited, simple, not full-featured

contents: vocabulary, organisation, expressions, actions, features

=== v o c a b u l a r y ==============================

This list is intended to help identify elements, factors, and views on the topic of a text grammar for pyParsing -- to help, and not mess up, things, notions, words. Feel free to change, add, redefine. Unsupported features are not listed here.

literal: a literal string to be taken as is. Usually in quotes or apostrophes.

set choice: a set as a choice list of character literals. Expressed like a (regexp-like) range.

alternative choice: expression of a choice between alternative possibilities. In pyParsing, MatchFirst = '|'. Question: support of longest_match = Or = '^'?

row: a sequence of items that must appear sequentially for a pattern to match. Expressed with ',' in BNF, with '+' in pyParsing code. Question: support of '&' = Each?

quantifier: an operator used to define the allowed number of (sub-)expressions. Usually '?', '*', '+'. Possible support of literal numbers for exact length.

item: any part of a pattern expression: literal, token, range, or a (sub-)group. May be quantified, named, grouped.

group: grouping of (sub-)expressions to allow use of quantity, naming, or any other overall action on the whole group. Usually expressed as '(...)'.

token: a (sub-)pattern result expressed for the sake of clarity or to avoid repetition; appears in the expression of a (super-)pattern. Can be opposed to 'product'.

product: a pattern result relevant for the application, either as a final result item or for further processing. Often contains tokens. Usually packed, concatenated, and/or mutated. May be automatically named/typed.

pack: ~ pyParsing Group(). Expression of a product as a sequence of tokens.

concatenation: ~ pyParsing Combine(). Expression of a product as a concatenation of tokens.

mutation: ~ pyParsing parseAction which returns a value. Expression of a product as a transformation of a raw parse result, using a built-in or custom function.

task: ~ pyParsing parseAction with no return value. A task is an action executed when a match is found, but it does not change the result.

id, name: identifier of a pattern. Will become attribute 'nature' of its products.

use, use case: use of a (sub-)pattern in a specific situation inside a super-pattern, defined with a custom name. Will become attribute 'role' of its products.

=== o r g a n i s a t i o n =============================

For the sake of clarity, useless symbols that do not actually express anything will be avoided: e.g. the standard (E)BNF ':==', the ',' separator, the final ';', and '[...]' for options will not be used. These characters may still be used in an expressive function.

We need:
* pattern name: left side
* pattern expression: middle side
* pattern action(s): right side

As the name will become a code identifier, it will hold no whitespace. There is thus no need for a (visible) separator with the expression; whitespace is enough. Still, using e.g. ':' may help legibility. We need a separator for the action part of a line. It can be the same as the first separator, if any, or another one. I would propose, as the simplest solution, to use ':' for both separations (id est, BNF ':==' reduced to ':'), which leads to

name : expression : action
---------------------------------------------
integer : [0-9]+ : int
decimal : integer '.' integer : float
number : decimal | integer : "num"
add : {number '+' number} : "add"
calc : add+

where:
* an action without quotes is a function call
* an action with quotes is pattern naming (--> result nature)
* {} may be packing = Group-ing

pattern : name ':' expression (':' action*)

=== e x p r e s s i o n =========================

Here the pattern expression elements are described -- and also expressed using the in-progress specification of pyParsing's text grammar itself ;-)

=== literal

A literal is expressed in quotes. We may choose only apostrophes (simple quotes) for this, which allows use of double quotes for another function -- such as identifying use cases, or concatenation. Backslash is used for escaping single chars which happen to be grammar codes -- first of all, backslash itself. It is also used to identify characters by ordinals (code numbers). I would rather support only decimal & hex (with 'x') ordinals, and have these ordinals be of fixed length, resp. 3 & 2 digits, left-padded with '0'. This is more legible & simpler, esp. when a number follows an ordinal. (Below, dd means decimal digit, hd means hex digit.)

char : char_literal | '\\' dd dd dd | '\\x' hd hd
literal : '\'' char+ '\''

=== set/choice

A set is written inside []; it may contain both characters and ranges. Unlike pyParsing's srange() function, it basically expresses a *choice*, with no further need of Or, MatchFirst, or oneOf. For instance, [A-Z] is more or less equivalent to MatchFirst(list(srange("[A-Z]"))) or Word(srange("[A-Z]"), exact=1). A set can be negated with a negation code: I would support the use of '!' for this role. See quantifiers below for the expression of Word(set). Ranges are expressed as usual with a pair of characters separated by '-'. A range can also be named using a pyParsing term (like 'nums' (*)). Characters can be identified either literally, or using their ordinal (code number) preceded by the escape char.

range : (char '-' char) | range_name
set : '!'? '[' (range | char)+ ']'

(*) Ranges could be renamed! E.g. nums --> digit or dec_digit. Also, no plural: a set, which holds ranges, expresses a choice.

=== token

A token inside another pattern is identified by its name, with no further character.

token : identifier

=== group

A group of tokens is built with (...). This allows use of quantities, role identifying, maybe more.

group : '(' expression ')'

=== item row

Items that must appear in a row are simply written one after another. A space is a sufficient separator; no use of ',' or '+' is needed. I would support that a space separate items in a row even when not necessary for parsing: this seems both simpler and clearer.

row : item (' ' item)+

=== quantifier

todo ******************

=== a c t i o n s ====================================

Actually naming, packing, and concatenation, as well as what are called parse actions, can all be seen as parse actions, in the sense that they are additional jobs done on, or with, a result -- once this result has been distinguished. Right? This applies both to actions on results of full patterns and to ones executed on sub-expressions. About the latter case, I wonder whether including this feature in a text grammar is really useful -- except for naming of *roles*. Reasons are:
~ It may rarely be used. (?)
~ It complicates the grammar.
~ The programmer can define a sub-pattern to apply the action on.
~ The action can be added once the text grammar is converted into code.

If needed, the syntax should use the same codes as the ones for whole-pattern actions, so that these codes allow wrapping a sub-expression. I will leave this aside for now, only keeping the requirement in mind, and concentrate on actions on whole patterns. As all these kinds of parsing actions apply on results, they all fit well in the right side of a line. This has the side advantage of avoiding expression overload. Examples:

integer : [0-9]+ : int "int"
	--> convert integer to a python int; set its 'nature' attribute to "int"
decimal : integer '.' integer : <> float
	--> concatenate decimal, convert it to a python float
mult : num '+' num : {}
	--> pack mult into a pyParsing Group (sequence)

I find the organisation really convenient, and the choice of codes rather easy to use. An alternative would be to use () for packing = Group-ing, which would leave {} free for Dict-ing. But then it would not be possible anymore to pack/Group() sub-expressions (which I personally do not find a serious drawback -- but others may). Naming of pattern use cases, i.e. result roles, happens rather naturally inside expressions:

decimal : integer "l_int" '.' integer "r_int" : <> float

Now, a possible issue remains: there are actually 2 kinds of parse actions:

* Tasks (think of Pascal procedures) execute an additional job when a matching string is found. Example: count the number of integers, add them, print them. We could call this kind of action 'tasks'. Naming can be seen as a kind of task.

* Mutations return a transformed result. Very different, I guess. Grouping and combining, even if they do not really change the result's content, should rather be seen as mutations, as they change at least its type.

The point is that both kinds of actions are implemented the same way -- the reason why they are, logically enough, both called 'actions'. But I guess they obviously do not serve the same kind of purpose; they have a very different sense, meaning a different semantic, from the point of view of the programmer. Tasks can be seen as parsing side effects. Mutations also mean that the potential results of the pattern are relevant for the application: they are products. As a consequence, I would prefer expressing tasks and mutations differently. This would also allow distinguishing relevant results. We may prefix mutations with e.g. '->', which is imo quite obvious:

integer : [0-9]+ : ->int "int"
decimal : integer '.' integer : <> ->float

Now, I have no clue which code would be sensible & meaningful for tasks, if any.

=== n a m i n g ,   t y p i n g ====================================

When a result is generated by a whole pattern, the pattern's name, if any, defines the result's type -- or more precisely its *nature*. When results are relevant, they should be able to hold their nature by default. When a (sub-)pattern is used inside one or more other (super-)patterns, it can express that its results have several possible *roles*. Below, 'integer' and 'decimal' define product natures. I used setName("id") for this.

integer = Word(nums).setName("integer")
decimal = Combine(integer("int_part") + '.' + integer("dec_part")).setName("decimal")
num = decimal | integer
mult = num("left-num") + '*' + num("right-num")

Depending on the situation, the integer pattern can be used as the int_part or the dec_part of a decimal. These terms thus define product roles instead. Moreover, both integers and decimals can be nums, and as such used as left- or right-operands of a multiplication, which are different roles again.

I find all of this relevant. I would really like both kinds of pattern naming / result typing to be possible. The present implementation of pyParsing is such that achieving this goal is not straightforward -- but still possible. We could do it so:

* The programmer selects which kinds of results are relevant: switches in the text grammar can define which results should then know about their type (nature and/or role).
* Patterns know 'who' they are: write the code grammar in a scope (class or separate module), and use the scope's dict to set an 'id' attribute on each pattern. There is no need to name whole patterns manually, as their name is anyway the variable name. Still, let it be possible to use another name? Rather call this 'id' than 'name', which is used for debugging.
* Patterns can define a use case: implemented with setResultsName("use") or simply the ("use") pseudo-call. We can copy it to a 'use' attribute to avoid confusion.
* Results, or only selected ones, hold a reference to their source pattern: either with a parse action that adds a dict item, or as an argument at the result's initialisation.
* Selected results read relevant info from the pattern: either with an additional parse action, or at initialisation when they have the proper reference. Storage can be done as an attribute or a dict item.

=== f e a t u r e s ===================================

A list of non-basic, yet unsupported, features to think about. Note that a few additional codes or syntactic forms may well compromise the grammar's overall readability. Furthermore, individual operations can easily be added by hand once the text grammar is translated to pyParsing code. Not to forget is the main goal of clarity, which fits well with decomposition of complex expressions. As a consequence, here is a list of criteria to consider for supporting a feature:
--> very common need (how to measure that?)
--> difficult to express by combining existing ones
--> obvious format that fits well in the grammar

operations
* '!' or '~' not (overall negation)
* '&' = Each() (unordered And)
* '^' = Or() (longest match)
* FollowedBy() (look ahead -- right-side condition)
* '~' = NotAny (!FollowedBy())
* SkipTo() (as the name says)
* Forward --> recursive patterns: useless in a text grammar?

tokens
* Keyword()
* White()
* QuotedString()
* CharsNotIn()
* overall CaseLess()
* Regexp

helpers
* nestedExpr()
* delimitedList()
* OnlyOnce()

actions
* Suppress()
* Dict()
* replaceWith()

switches / parameters
Switches are all False by default (easy to remember). Params have the same default values as in pyParsing (ditto). Switches/params can change (between lines).
* auto-naming of all patterns
* auto-naming of pattern class
* define important (product) / incidental (token) results
* respect/ignore whitespace
* set of whitespace chars |
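[Editor's note -- not part of the archived thread: the small calc grammar denis writes in his proposed notation under "organisation" above translates almost line for line into pyparsing. The sketch below mirrors his left-hand names and is an editorial illustration, not part of the proposal.]

from pyparsing import Word, nums, Combine, Group, OneOrMore

# integer : [0-9]+ : int
integer = Word(nums).setParseAction(lambda t: int(t[0]))
# decimal : integer '.' integer : float
decimal = Combine(Word(nums) + "." + Word(nums)).setParseAction(lambda t: float(t[0]))
# number : decimal | integer : "num"
number = (decimal | integer)("num")
# add : {number '+' number} : "add"
add = Group(number + "+" + number)("add")
# calc : add+
calc = OneOrMore(add)

print(calc.parseString("1+2 3.5+4").asList())   # [[1, '+', 2], [3.5, '+', 4]]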
From: Paul M. <pt...@au...> - 2008-11-16 18:52:29
|
Denis -

Thanks for your contributions on this list - please don't be discouraged if you don't get many replies to your messages. The readers here are mostly lurkers (not that there's anything wrong with that!), or folks who post with particular questions they need answered. I too look at the list as something of an archive of past pyparsing discussions. I've had a chance to look over your notes in brief, but I really want to give them more thought and consideration than I can spare just now.

In general, I would say that the best way to start prototyping the integration of your ideas with pyparsing as it exists is through parse actions (for embellishing parse results) and helper methods (to simplify the construction of expressions, or to link them to parse actions). Whatever you do, I *strongly* suggest that you make these enhancements under the control of the developer, and not automatically applied across the board. Pyparsing creates many intermediate parse expressions when building the grammar, and parse results when parsing the source text, many of which get either discarded or absorbed into larger expressions. Also, if your enhancements stay in the realm of extensions that users may add or not of their own choosing, then forward compatibility of existing code will be preserved, and it will be easier to add your ideas to the core pyparsing code.

Here is one idea that might address your question about linking results to the original pyparsing expression (untested):

from pyparsing import *

def linkToResults(expr):
    def parseAction(tokens):
        tokens["expr"] = expr
    expr.addParseAction(parseAction)

integer = Word(nums).setName("integer")
decimal = Combine(integer("int_part") + '.' + integer("dec_part"))
linkToResults(decimal)
print decimal.parseString("3.14159").dump()

Which prints:

['3.14159']
- dec_part: 14159
- expr: Combine:({integer "." integer})
- int_part: 3

Now the parse results that get created by expr will contain an added field named "expr" that points back to the original expression (as shown by the dump() output). If this works well as a prototype, then it may be just as easy to add it as a member function of ParserElement, so that any expression can be linked back to from its results.

-- Paul |
From: spir <den...@fr...> - 2008-11-16 18:27:59
|
Hello, pyParsing world!

[rough version -- I have not read over this message]

-0- intro ========

Now is the time for serious things! Below is a kind of study on a subject I have brought up & approached several times already: result typing. Here 'type' is primarily used in its non-technical sense. Probably there are whole fields of parsing use where result types are not that important: each time, in fact, that the types of results are predictable, they need not be explicitly defined. For instance, a file may contain, or one may extract, only data of a single type. Or a file format may define the types of data in a constant repeated order, such as x y color x y color... Still, the general situation, I guess, is that we cannot predict in which order the types of valuable data will occur in source texts, so that having them in the results would be highly helpful. This especially applies when parsing texts written in any kind of /language/. The type of a result is similar to that of any other kind of data item: it carries the sense of the result. Without it, we are unable to do anything with it; just as, without a type, we -- as well as the language "decoder" itself -- do not even know what kind of operation may apply to a bit of data. If the results do not hold their type, we are obliged to re-parse them only to determine what kind of thing they are. pyParsing provides two functions to give patterns names -- I will talk about them later.

-1- pattern idS, result typeS ========

What is the name of a pattern? What is the type of a result? What, actually, is the link between patterns & results? Patterns define or generate results. They are /classes/ of results, in a similar manner as (programming) types are classes of instances. Actually, patterns could be (programming) types -- but this wouldn't fit in pyParsing. Results are like pattern samples; they share characteristics which are specified in/by patterns. A pattern identifier (name, id) thus defines its potential results' type.

pattern object <--> result type
pattern object id <--> result type id

Now, there are actually several kinds of pattern IDs, matching several kinds of result types:

integer = Word(nums).setName("int")
decimal = integer("int_part") + '.' + integer("dec_part")

Basically, a pattern usually defines the /nature/ of results, as in the first line above. Now, a single pattern may have several use cases, as in the second line, which define several results' /roles/. I intentionally used setName to define pattern names and setResultsName (abbreviated as a call) to define use cases -- but obviously nothing forces us to do that. The example can be extended to show the difference between result nature and role more acutely:

integer = Word(nums).setName("integer")
decimal = Combine(integer("int_part") + '.' + integer("dec_part")).setName("decimal")
num = decimal | integer
mult = num("left-num") + '*' + num("right-num")

Both integers & decimals (nature) may be left-nums or right-nums (role).

pattern id <--> result nature
pattern use <--> result role

Depending on the application, a result's nature, role, or both may be relevant information.

-2- pyParsing =======================

As I have used pyParsing for a few weeks only, I may say stupid things. But I have tried hard to find friendly ways to get such info from parse results -- and I could not find any. Actually, I ended up with:
* additional data on patterns
* a custom result type
* changes in pyParsing code

First, patterns basically do not know anything about themselves. Especially, they do not know who they are, not even their (variable) name. If patterns were types, they would know it; but custom types do not have a __name__ attribute to receive their (variable) name. Pity. We can nevertheless give a pattern a name with setName, or setResultsName. The main problem anyway is that there is no interconnection between patterns and results. A result has no access to the pattern that yielded it, not even a simple reference. A pattern only passes the resultsName at result init time:

pattern --o--> results : resultsName only
pattern <--x-- results : nothing

An additional obstacle comes from the protection of result access by __slots__, for performance reasons, which prevents setting/reading custom attributes. Fortunately, patterns are not protected.

-3- letting patterns know ===============

We can use a simple trick to let patterns know a bit about themselves. If they are put in a scope (e.g. a separate module or class), we have access to a dict that holds names and objects together. With that information, we have all we need to tweak the patterns' guts. Assuming the grammar is in a class, we could even have a class method to do the job. [Note: the attribute can't be called '.name', as this name (!) is used by pyParsing to format pattern repr output, esp. for error display.] It may look like that:

class Grammar(object):
    ''' pyParsing grammar '''
    integer = Word(nums)
    decimal = Combine(integer("int_part") + '.' + integer("dec_part"))
    num = (decimal | integer).setName("num")
    mult = Group(num("left-num") + '*' + num("right-num"))
    calc = OneOrMore(mult)

    @classmethod
    def _setNames(Grammar):
        ''' give patterns their name '''
        # exclude '_*' names
        attribs = Grammar.__dict__.items()
        namedPatterns = filter(lambda (name,pattern): name[0]!='_', attribs)
        # set .id attributes
        for (name,pattern) in namedPatterns:
            pattern.id = name
        Grammar.patterns = [pattern for (name,pattern) in namedPatterns]

Grammar._setNames()
for pattern in Grammar.patterns:
    print "%s: %s" %(pattern.id,pattern)

===>>
num: num
integer: W:(0123...)
calc: {Group:({{num "*"} num})}...
decimal: Combine:({{W:(0123...) "."} W:(0123...)})
mult: Group:({{num "*"} num})

Now we have a proper tool to automatically name patterns. Manual setName is no longer necessary; it can serve more specific needs, such as delivering clearer info to users. We are ready to transmit to results information about their use; resultsName can set info about use cases.

-4- results structure ====================

I have posted a message displaying a type called 'Data'. [Still not really all right -- I have discovered a bug.] If ever results could magically receive the information that patterns now hold about their nature and role, we could use Data objects to properly hold and display typed results. Output may then look like that:

calc:[mult:[dec:1.1 <str>:* int:2] mult:[int:1 <str>:* dec:2.2]]

The types shown as prefixes would be taken from the most accurate information available:
* role = pattern use (e.g. left_num)
* nature = pattern id (e.g. integer)
* pattern format (as presently held in .name, e.g. W:(0123...))
* result type (e.g. <str> or <int>)

Now, we have to find a way to let the results know about all that.

-5- passing info to results =====================

As results have no access to patterns, we are presently blocked. If we just gave them a reference to the patterns, we would be unblocked. I did some explorations & trials, and it seems all right. Things to do:

* Add specific fields to patterns: id, nature.
* Add a reference to the pattern at the result's instantiation. This happens 3 times in the method _parseNobuffer of the class ParserElement; 'self' can be added there as a new argument for result initialisation. For instance:

retTokens = ParseResults(tokens, self.resultsName, asList=self.saveAsList, modal=self.modalResults, pattern=self)

This arg becomes a 'pattern' param in ParseResults' __new__ & __init__ methods.

* Add private attribs to ParseResults. In __init__:

self.__pattern = pattern
self.__nature = pattern.id
self.__role = pattern.use

And matching accessors (because access is protected), e.g.:

def pattern(self):
    return self.__pattern

denis |
From: spir <den...@fr...> - 2008-11-16 10:14:49
|
Hello,

[As I don't know if there is anybody else on this list, well... I use it like a log for ideas and trials using pyParsing, and an opportunity to express them clearly (?). denis]

Here is an implementation of a custom type used to give parse results an alternative structure, and an illustration of what it is intended for. Data (sic!) is primarily used to natively give (nested) parse results a /type/. I will come back to this point of view, that results should be typed, in a further message. So, Data allows results to have a type -- like ordinary data, hence the name -- and to show in a type:content format.

Actually, the implementation is uselessly complicated, because presently it is able to receive content from several kinds of sources: a final parse result, parse results created during the parsing process, ordinary data, or objects that already are of type Data. This makes its typing and content reading overly complex. (ToDo: implement __new__ for the case when the content is a Data object.) Additionally, it holds currently useless list-like operator overloading. For more specific use, it could be written in a dozen lines, as it was before. I also added a Seq type to avoid a problem with built-in lists. Additionally, Data is able to receive content from any kind of simple or sequential object.

The type property may be defined from several sources, here listed from the most specific to the least:
* arg passed at init [ParseResults object only]
* ResultsType retrieved from getName() [ditto]
* pattern's .use or .ResultsType
* pattern's .id or .name
* pattern's type_name
* result's own type_name

Some sources of info listed above for typing an object belong in fact to further explorations about pattern naming that I will present in another post. Here is the Data thing:

def typ(obj):
    return obj.__class__.__name__

class Seq(list):
    ''' specialized sequence type with improved str
        Overrides list's behaviour of calling repr instead of str on items. '''
    def __str__(self):
        if len(self) == 0:
            return '[]'
        text = str(self[0])
        for item in self[1:]:
            if isinstance(item, list):
                item = Seq(item)
            text += " ,%s" %item
        return "[%s]" %text

class Data(object):
    ''' nestable type:content object with built-in toolset '''
    def __init__(self, content, type=None, pattern=None):
        ''' store startup data '''
        self.type = type
        # read info from pattern, if available
        self.read_pattern(pattern)
        # case content is ParseResults: extract proper info
        if isinstance(content,ParseResults):
            content, self.type = self.from_result(content, type)
        # case (new) content is Data object: copy
        if isinstance(content,Data):
            self.type, self.pattern = content.type, content.pattern
            self.content, self.isSimple = content.content, content.isSimple
        # case content is ordinary data: record it
        else:
            self.content = self.recursive_record(content)
        # define type if not given by user, nor read from pattern
        if not self.type:
            self.type = "<%s>" %typ(self.content)
        #print "* new Data - %s" %self

    def read_pattern(self,pattern):
        ''' if available, read info from pattern about source of result '''
        self.pattern = pattern
        self.nature = self.role = self.pattern_type_name = None
        # get info about source of result
        if pattern:
            # pattern_type_name (e.g. Literal, MatchFirst, Group...)
            self.pattern_type_name = typ(pattern)
            # role <-- pattern.use: pattern use case
            try:
                self.role = pattern.use
            except AttributeError:
                self.role = pattern.resultsName
            # nature <-- pattern name/id : pattern naming
            try:
                self.nature = pattern.id
            except AttributeError:
                try:
                    self.nature = pattern.name
                except AttributeError:
                    pass
            # if not yet set, try and define type from this info
            if not self.type:
                if self.role:
                    self.type = self.role
                elif self.nature:
                    self.type = self.nature
                elif self.pattern_type_name:
                    self.type = "<%s>" %self.pattern_type_name

    def from_result(self,content,type):
        ''' define properties from result data '''
        # try & set type from user-defined info
        if (not type) and content.getName():
            type = content.getName()
        # jump inside Group if not sequence
        if len(content)==1:
            content = content[0]
        # take result as list
        if isinstance(content,ParseResults):
            content = content.asList()
        return content, type

    def recursive_record(self,content):
        ''' record content according to its structure '''
        # return if isSimple
        if not isinstance(content,list):
            self.isSimple = True
            return content
        # === case complex / nested
        # mutate each nested item to Data object
        # (may already be a Data object -- or not)
        content = Seq(content)
        self.isSimple = False
        seq = Seq()
        for item in content:
            if isinstance(item,Data):
                seq.append(item)
            else:
                seq.append(Data(item))
        return seq

    def treeView(self, noType=False, showGroup=False, level=0):
        ''' return full & legible tree view of object's data '''
        tree = ''
        # this level's line
        tree += level * '\t'
        if not noType:
            tree += "%s: " %self.type
        if self.isSimple or showGroup:
            tree += "%s" %self.content_text()
        tree += "\n"
        # recursion for nested results
        if not self.isSimple:
            for item in self.content:
                tree += item.treeView(noType, showGroup, level+1)
        # final result
        return tree

    def leaves(self, noType=False):
        ''' return a flat list of 'terminal', low-level object items -- actually called 'leaves' '''
        seq = Seq()
        # case simple result : add content to seq
        if self.isSimple:
            if noType:
                seq.append(self.content)
            else:
                seq.append(self)
        # case compound result: recursively explore nested result
        else:
            for item in self.content:
                seq.extend(item.leaves(noType))
        return seq

    def allFlat(self, noType=False):
        ''' return full flat list of object's items -- either compound or simple '''
        seq = Seq()
        # in all cases : add content to seq
        if noType:
            seq.append(self.content)
        else:
            seq.append(self)
        # case compound result: recursively explore nested result
        if not self.isSimple:
            for item in self.content:
                seq.extend(item.allFlat(noType))
        return seq

    def __len__(self):
        try:
            return len(self.content)
        except TypeError:
            return 0
    def __getitem__(self,index):
        return self.content[index]
    def __getslice__(self,i1,i2):
        return self.content[i1:i2]
    def __repr__(self):
        ''' type:content format '''
        return "%s:%s" %(self.type, self.content_text())
    def content_text(self):
        ''' content expression for either simple or sequential content '''
        # case simple content: just output as is
        if self.isSimple:
            return str(self.content)
        # case compound content: recursive text seq in []
        else:
            text = str(self.content[0])
            for item in self.content[1:]:
                text += " %s" %item
            return "[%s]" %text

Below are illustrations for two use cases:

-1- Parsing is done normally. The results feed a Data object. Both normal and Data results are printed, so that the difference is made clear. Additionally, the tree view, the 'leaves' & the flat list of all-level nested results are also shown -- see my previous post for more info about these latter things. Contained results are recursively converted into Data objects, so that all end up typed.

-2- The parser is cheated to make it 'natively' return Data objects instead of ParseResults ones. Actually, for the sake of illustration, only important (named) results are converted. But it makes no difference to convert all, for anyway nested results will be recursively converted into Data objects.

# === Data retrieved from final parse results ==================
class Grammar(object):
    # tokens
    integer = Word(nums)
    integer.setParseAction(lambda i: int(i[0]))
    point = Literal('.')
    decimal = Combine(integer + point + integer)
    decimal.setParseAction(lambda x: float(x[0]))
    #decimal = Group(decimal)("dec")
    add = Literal('+')
    mult = Literal('*')
    # symbols
    num = decimal | integer
    mult_op = Group(num + mult + num)("mult_op")
    add_op = Group((mult_op|num) + add + (mult_op|num))("add_op")
    #group = Group(l_paren + in_op + r_paren)("group")
    operation = (add_op|mult_op)
    calcs = OneOrMore(operation)("calcs")
calcs = Grammar.calcs

# source text
text = "1+2.2*3 4.4*5+6.6"
print text

# standard result
results = calcs.parseString(text)
print "=== standard results:", results

# custom use & output
data = Data(results)
print "\n=== data:\n", data
print "\n=== default treeview :\n", data.treeView()
print "\n=== treeview with group w/o lead type:\n", data.treeView(showGroup=True, noType=True)
print "\n=== show lowest-level flat sequence:\n", data.leaves()
print "\n=== show lowest-level flat sequence w/o type:\n", data.leaves(noType=True)
print "\n=== show flat sequence of items on all levels /lines :"
for item in data.allFlat():
    print item

# === Data 'natively' returned by parsing process ===============
class Grammar(object):
    # tokens
    integer = Word(nums)
    integer.setParseAction(lambda i: int(i[0]))
    point = Literal('.')
    decimal = Combine(integer + point + integer)
    decimal.setParseAction(lambda x: float(x[0]))
    #decimal = Group(decimal)("dec")
    add = Literal('+')
    mult = Literal('*')
    # symbols
    num = (decimal | integer)("num")
    mult_op = Group(num + mult + num)("mult_op")
    add_op = Group((mult_op|num) + add + (mult_op|num))("add_op")
    #group = Group(l_paren + in_op + r_paren)("group")
    operation = (add_op|mult_op)
    calcs = OneOrMore(operation)("calcs")
    #integer.addParseAction(toData)
    #decimal.addParseAction(toData)
    #mult_op.setParseAction(toData)
    #add_op.setParseAction(toData)
    #calcs.setParseAction(toData)

    @classmethod
    def _setToData(Grammar):
        patterns = filter(lambda(n,p): n[0]!='_', Grammar.__dict__.items())
        print "patterns: %s" %([name for (name,pattern) in patterns])
        named_patterns = filter(lambda(n,p): p.resultsName, patterns)
        print "named patterns: %s" %([name for (name,pattern) in named_patterns])
        for name, pattern in named_patterns:
            pattern.setParseAction(lambda result: Data(result))

print
print "\n========================================\n"
Grammar._setToData()
calcs = Grammar.calcs

# standard result
results = calcs.parseString(text)
print "=== standard results holding data: %s:\n%s" %(results.__class__, results)
data = Data(results)
print "\n=== data: %s:\n%s" %(data.__class__,data)
print "\n=== data treeview :\n", data.treeView()
print "\n=== data leaves:\n", data.leaves()
print "\n=== show flat sequence of items on all levels /lines :"
for item in data.allFlat():
    print item

======================================================
O U T P U T
======================================================

C:/prog/ACTIVE~1/pythonw.exe -u "D:/prog/parsing/Data.pyw"
1+2.2*3 4.4*5+6.6
=== standard results: [[1, '+', [2.2000000000000002, '*', 3]], [[4.4000000000000004, '*', 5], '+', 6.5999999999999996]]

=== data:
calcs:[<Seq>:[<int>:1 <str>:+ <Seq>:[<float>:2.2 <str>:* <int>:3]] <Seq>:[<Seq>:[<float>:4.4 <str>:* <int>:5] <str>:+ <float>:6.6]]

=== default treeview :
calcs:
	<Seq>:
		<int>: 1
		<str>: +
		<Seq>:
			<float>: 2.2
			<str>: *
			<int>: 3
	<Seq>:
		<Seq>:
			<float>: 4.4
			<str>: *
			<int>: 5
		<str>: +
		<float>: 6.6

=== treeview with group w/o lead type:
[<Seq>:[<int>:1 <str>:+ <Seq>:[<float>:2.2 <str>:* <int>:3]] <Seq>:[<Seq>:[<float>:4.4 <str>:* <int>:5] <str>:+ <float>:6.6]]
	[<int>:1 <str>:+ <Seq>:[<float>:2.2 <str>:* <int>:3]]
		1
		+
		[<float>:2.2 <str>:* <int>:3]
			2.2
			*
			3
	[<Seq>:[<float>:4.4 <str>:* <int>:5] <str>:+ <float>:6.6]
		[<float>:4.4 <str>:* <int>:5]
			4.4
			*
			5
		+
		6.6

=== show lowest-level flat sequence:
[<int>:1 ,<str>:+ ,<float>:2.2 ,<str>:* ,<int>:3 ,<float>:4.4 ,<str>:* ,<int>:5 ,<str>:+ ,<float>:6.6]

=== show lowest-level flat sequence w/o type:
[1 ,+ ,2.2 ,* ,3 ,4.4 ,* ,5 ,+ ,6.6]

=== show flat sequence of items on all levels /lines :
calcs:[<Seq>:[<int>:1 <str>:+ <Seq>:[<float>:2.2 <str>:* <int>:3]] <Seq>:[<Seq>:[<float>:4.4 <str>:* <int>:5] <str>:+ <float>:6.6]]
<Seq>:[<int>:1 <str>:+ <Seq>:[<float>:2.2 <str>:* <int>:3]]
<int>:1
<str>:+
<Seq>:[<float>:2.2 <str>:* <int>:3]
<float>:2.2
<str>:*
<int>:3
<Seq>:[<Seq>:[<float>:4.4 <str>:* <int>:5] <str>:+ <float>:6.6]
<Seq>:[<float>:4.4 <str>:* <int>:5]
<float>:4.4
<str>:*
<int>:5
<str>:+
<float>:6.6

========================================

patterns: ['mult_op', 'point', 'decimal', 'calcs', 'add', 'num', 'add_op', 'integer', 'operation', 'mult']
named patterns: ['mult_op', 'calcs', 'num', 'add_op']
=== standard results holding data: <class 'pyparsing.ParseResults'>:
[calcs:[add_op:[num:1 <str>:+ mult_op:[num:2.2 <str>:* num:3]] add_op:[mult_op:[num:4.4 <str>:* num:5] <str>:+ num:6.6]]]

=== data: <class '__main__.Data'>:
calcs:[add_op:[num:1 <str>:+ mult_op:[num:2.2 <str>:* num:3]] add_op:[mult_op:[num:4.4 <str>:* num:5] <str>:+ num:6.6]]

=== data treeview :
calcs:
	add_op:
		num: 1
		<str>: +
		mult_op:
			num: 2.2
			<str>: *
			num: 3
	add_op:
		mult_op:
			num: 4.4
			<str>: *
			num: 5
		<str>: +
		num: 6.6

=== data leaves:
[num:1 ,<str>:+ ,num:2.2 ,<str>:* ,num:3 ,num:4.4 ,<str>:* ,num:5 ,<str>:+ ,num:6.6]

=== show flat sequence of items on all levels /lines :
calcs:[add_op:[num:1 <str>:+ mult_op:[num:2.2 <str>:* num:3]] add_op:[mult_op:[num:4.4 <str>:* num:5] <str>:+ num:6.6]]
add_op:[num:1 <str>:+ mult_op:[num:2.2 <str>:* num:3]]
num:1
<str>:+
mult_op:[num:2.2 <str>:* num:3]
num:2.2
<str>:*
num:3
add_op:[mult_op:[num:4.4 <str>:* num:5] <str>:+ num:6.6]
mult_op:[num:4.4 <str>:* num:5]
num:4.4
<str>:*
num:5
<str>:+
num:6.6 |
From: spir <den...@fr...> - 2008-11-12 15:35:08
|
Hello,

If ever there is someone alive on the list, these days... I had some trouble understanding the output structure of parsing results; it is probably very different from what I was +/- unconsciously expecting. Below are some tools used to get more usable results, according to my very personal taste. Possibly some of you may find them useful -- or know better ways to obtain similar results. Comments welcome.

listAll() is used to get a flat list out of nested results, with compound and included results listed in sequence. It avoids the need to recursively walk through the nested structure when an action has to be performed on each single result. By default, listAll actually returns (type, content) tuples, else simple results. The type is given by typ(): either the type set at pattern definition (with ('type') or setResultsName('type')), or the 'real' type of the result. I use listAll e.g. for instantiating objects whose types are given by the result's type (mapping) and whose init data are taken from the content of the results. For instance:

for item in listAll(calc.parseString(text)):
    Type = typ(item[0])
    data = item[1]
    symbols.append(Type(data))

Well, in fact, I don't really use it like that anymore, because:
* listAll (and the other funcs below) now exist as ParseResults methods.
* I created a custom result type that natively holds type and content as fields, so that parseString now returns objects of that kind for all named results (through a dedicated parse action).

pickLeaves() is very similar to listAll, except that it skips all compound results to jump inside instead -- at any level -- and retains only 'terminal' ones: hence its name. Very nice to get a low-level overview of the results. If these build a complete representation of the source, then pickLeaves gives it back at the lowest *relevant* level, as defined by the various grouping patterns in the grammar.

treeView() builds a (python-like) hierarchical picture of the results. Compact and clear. Both are mainly intended for testing. Both also return types (as given by typ()) by default, in addition to the content.

showSeq() can be used to properly format pickLeaves screen output. This also applies to listAll.

denis

#=============================
def typ(result):
    try:
        #print "type --- %s:%s" %(result.getName(),result)
        return result.getName()
    except AttributeError:
        return "<%s>" %result.__class__.__name__

def listAll(tree, noType=False):
    seq = []
    for part in tree:
        isValue = not(isinstance(part,ParseResults))
        isSimple = (not isValue) and (len(part) == 1)
        # case simple result
        if isValue:
            if noType:
                seq.append(part)
            else:
                seq.append((typ(part),part))
        elif isSimple:
            if noType:
                seq.append(part[0])
            else:
                seq.append((typ(part),part[0]))
        # recursively explore nested result
        else:
            if noType:
                seq.append(part)
            else:
                seq.append((typ(part),part))
            seq.extend(listAll(part, noType))
    return seq

# o u t p u t   f u n c s
def showSeq(seq):
    if len(seq) == 0:
        return ''
    # define if seq holds types, or not
    noType = not isinstance(seq[0],tuple)
    # build return text
    text = seq[0] if noType else "%s:%s" %(seq[0][0],str(seq[0][1]))
    for item in seq[1:]:
        if noType:
            text += " , %s" %str(item)
        else:
            text += " , %s:%s" %(item[0],str(item[1]))
    # add [...]
    return "[%s]" %text

def pickLeaves(tree, noType=False):
    seq = []
    for part in tree:
        isValue = not(isinstance(part,ParseResults))   # str, int, float...
        isSimple = (not isValue) and (len(part) == 1)  # unique item inside
        # case value result : add value to seq
        if isValue:
            if noType:
                seq.append(part)
            else:
                seq.append((typ(part),part))
        # case simple result : add content to seq
        elif isSimple:
            if noType:
                seq.append(part[0])
            else:
                seq.append((typ(part),part[0]))
        # case compound result: recursively explore nested result
        else:
            seq.extend(pickLeaves(part, noType))
    return seq

def treeView(results, level=0, skipAnonymous=False, defaultType=None, TAB='\t'):
    NL = '\n'
    texte = ''
    for result in results:
        # case named result
        try:
            texte += level*TAB + result.getName() + ': ' + str(result) + NL
        # case anonymous result
        except AttributeError:
            if not skipAnonymous:
                if defaultType:
                    type = defaultType
                else:
                    type = "<%s>" %(result.__class__.__name__)
                texte += level*TAB + type + ': ' + str(result) + NL
        # case compound result: walk through recursive nesting
        if result.__class__ == ParseResults and len(result) > 1:
            texte += treeView(result, level+1)
    return texte

# =================================
# examples
# =================================

# === g r a m m a r
from pyparsing import *   # !!!

class Grammar(object):
    integer = Word(nums).setParseAction(lambda i: int(i[0]))
    point = Literal('.')
    decimal = Combine(integer + point + integer).setParseAction(lambda x: float(x[0]))
    num = Group(decimal | integer)("num")
    plus = Literal('+')("plus")
    op = Group(num + plus + num)("op")
    calc = OneOrMore(op)
calc = Grammar.calc

# === i n p u t   t e x t
text = "1+2 3.0+4 5.0+6.0"

# === standard results
results = calc.parseString(text)
print "=== standard results :"
print results

# === show leaves
print "=== lowest-level flat sequence :"
leaves = pickLeaves(results)
print showSeq(leaves)

# === show treeView
print "=== tree view :"
print treeView(results)

# =================================
# === g r a m m a r
class Grammar(object):
    # tokens
    add = Literal('+')
    mult = Literal('*')
    l_paren = Literal('(')
    r_paren = Literal(')')
    num = Group(Word(nums).setParseAction(lambda i: int(i[0])))("num")
    # symbols
    mult_op = Group(num + mult + num)("mult_op")
    add_op = Group((mult_op|num) + add + (mult_op|num))("add_op")
    #group = Group(l_paren + in_op + r_paren)("group")
    operation = (add_op|mult_op)
    calc = OneOrMore(operation)
calc = Grammar.calc

# === i n p u t   t e x t
text = " 1+2*3 4*5+6"

# === standard results
results = calc.parseString(text)
print;print "=== standard results :"
print results

# === show leaves
print "=== lowest-level flat sequence :"
leaves = pickLeaves(results)
print showSeq(leaves)

# === show treeView
print "=== tree view :"
print treeView(results) |
From: spir <den...@fr...> - 2008-11-09 19:25:54
|
Hello, New to the list. I'm trying to understand how things work with pattern types, result types, and naming through setResultsName() or (). Example:

a = Literal('a')
b = Literal('b')
c1 = (a|b)('| ')
c2 = MatchFirst([a,b])('first')
c3 = (a^b)('^ ')
c4 = Or([a,b])('or ')
c5 = Combine(a|b)('combi')
c0 = Group(a|b)('group')
patterns = [c1,c2,c3,c4,c5,c0]
t = "a"

# === use
for p in patterns:
    result = p.parseString(t)
    token = result[0]
    print p.resultsName, token.__class__.__name__, token,
    try:
        print token.getName()
    except AttributeError:
        print 'no_getName()'
print 'full result: %s' %(result)

# === output
| str a no_getName()
first str a no_getName()
^ str a no_getName()
or str a no_getName()
combi str a no_getName()
group ParseResults ['a'] group
full result: [['a']]

This leads me to the following questions ;-)
-1- Why aren't all results of type ParseResults?
-2- Which patterns, actually, generate ParseResults results, and which don't?
-3- Is there another way to access resultsName, from a result instance, than getName()?
-4- Is there another way to guess a result's "kind" (i.e. which pattern has generated it)?
-5- Do we need to systematically Group results to identify them? I wish to get rid of Group where useless, in order not to get results such as [['a']] (['a'] is already enough wrapping).

Denis
|
From: Ujjaval S. <usm...@gm...> - 2008-11-03 01:05:23
|
Hi Eike, Thanks for that. Actually, the reason my grammar was not working is that I had to put \r inside CharsNotIn(), where I only had '|'. That did the trick for me.

Cheers,

On Thu, Oct 30, 2008 at 3:17 AM, Eike Welk <eik...@gm...> wrote:
> Hello Ujjaval!
>
> On Tuesday 28 October 2008, you wrote:
> > Now to parse such a sentence, I changed your parser code to the
> > following: Here, I want to parse this string as a string that
> > starts with 'ABC' followed by '|' and ends with '\r'. I need
> > everything in between with '|' as delimiter in a list, including
> > 'XYZ' as last element in this case.
>
> Look at:
> LineEnd()
>
> Your parsers normally don't see '\n' because the whitespace is removed
> by the parsing machinery. If you want to use the end-of-line
> frequently as an element in your grammar, you could tell Pyparsing
> that '\n' should not be treated as whitespace:
> ParserElement.setDefaultWhitespaceChars('\t ')
>
> But you have to care for all the newlines yourself then, which might
> become tedious. Look at
> indentedBlock(...)
> as an example of how Paul (Pyparsing's author) does it. (I use
> indentedBlock myself.)
>
> Kind regards,
> Eike.
|
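Put together, the fix described above comes out to roughly the sketch below. This is not the poster's actual code; the grammar shape and names are taken from his message further down this thread, and the example data is made up:

from pyparsing import Keyword, Literal, CharsNotIn, Optional, delimitedList

start_kw = Keyword('ABC')
# adding '\r' to the exclusion set keeps a field from swallowing the line end
fieldContents = Optional(CharsNotIn('|\r'), '')
fields = delimitedList(fieldContents, '|')
fieldSep = Literal('|').suppress()
the_parser = start_kw + fieldSep + fields + Literal('\r').suppress()

print(the_parser.parseString('ABC | here | be | fields\r'))
# roughly: ['ABC', ' here ', ' be ', ' fields'] -- surrounding blanks may vary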
From: Eike W. <eik...@gm...> - 2008-10-29 16:18:14
|
Hello Ujjaval!

On Tuesday 28 October 2008, you wrote:
> Now to parse such a sentence, I changed your parser code to the
> following: Here, I want to parse this string as a string that
> starts with 'ABC' followed by '|' and ends with '\r'. I need
> everything in between with '|' as delimiter in a list, including
> 'XYZ' as last element in this case.

Look at:
LineEnd()

Your parsers normally don't see '\n' because the whitespace is removed by the parsing machinery. If you want to use the end-of-line frequently as an element in your grammar, you could tell Pyparsing that '\n' should not be treated as whitespace:
ParserElement.setDefaultWhitespaceChars('\t ')

But you have to care for all the newlines yourself then, which might become tedious. Look at
indentedBlock(...)
as an example of how Paul (Pyparsing's author) does it. (I use indentedBlock myself.)

Kind regards,
Eike.
|
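To make the two options above concrete, here is a small sketch with made-up data; it only illustrates the whitespace behaviour, not the original poster's grammar:

from pyparsing import Word, alphas, Literal, LineEnd, OneOrMore, ParserElement

# Option 1: '\n' is whitespace by default, so an explicit LineEnd() is the
# way to mark a line break in the grammar
word = Word(alphas)
line = OneOrMore(word) + LineEnd().suppress()
print(line.parseString("some words here\n"))    # -> ['some', 'words', 'here']

# Option 2: remove '\n' from the whitespace set *before* building the grammar
# (elements keep the default whitespace in effect when they were created);
# afterwards, every newline must be matched explicitly somewhere in the grammar
ParserElement.setDefaultWhitespaceChars('\t ')
word2 = Word(alphas)
line2 = OneOrMore(word2) + Literal('\n').suppress()
print(line2.parseString("some words here\n"))   # -> ['some', 'words', 'here']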
From: Paul M. <pt...@au...> - 2008-10-28 23:34:05
|
Sorry to not have been plugged in recently, and thanks Eike for taking up the slack!

Please look into the latest pyparsing release, with the helper method "originalTextFor". This method is intended to displace the usage of keepOriginalText, and does so without using the inspect module.

Instead of hacking _parseNoCache, you can do the same kind of parse profiling using debug actions to tally up the number of parse attempts, successes, and failures for one or more expressions in a grammar. I added this code to a grammar that contained the expressions named in the varnames string:

parseTally = {}
def updTally(v,i):
    if v not in parseTally:
        parseTally[v] = [0,0,0]
    parseTally[v][i] += 1

def tallyTry(n):
    def tally(*args):
        updTally(n,0)
    return tally
def tallyFail(n):
    def tally(*args):
        updTally(n,2)
    return tally
def tallyMatch(n):
    def tally(*args):
        updTally(n,1)
    return tally

varnames = "IDENTIFIER hexnumber decnumber realnumber boolvalue stringliteral label".split()
for v in varnames:
    vars()[v].setName(v).setDebugActions(tallyTry(v),tallyMatch(v),tallyFail(v))

After the parser was run, I could enumerate the values in the parseTally dict, and see if there were any expressions that were tested *many* times, but only matched a few. It would be possible that such an expression was part of a MatchFirst, and, as long as the grammar remained unambiguous, I could move that expression further to the right in the list of alternatives. (Nowadays, parseTally would benefit from being a defaultdict - this code is a bit old.)

Here is an optimization I added to the cStyleComment expression. Originally, this expression read:

cStyleComment = Combine( Literal("/*") +
                         ZeroOrMore( CharsNotIn("*") | ( "*" + ~Literal("/") ) ) +
                         Literal("*/") ).streamline().setName("cStyleComment enclosed in /* ... */")
restOfLine = Optional( CharsNotIn( "\n\r" ), default="" ).setName("rest of line up to \\n").leaveWhitespace()
dblSlashComment = "//" + restOfLine
cppStyleComment = ( dblSlashComment | cStyleComment )

Now if I was ignoring cppStyleComment, then this expression would get evaluated a *lot*. I realized that both alternatives start with a leading '/' character, so if I wasn't currently pointing at a '/', there was no point in checking either alternative. So I modified cppStyleComment to:

cppStyleComment = FollowedBy("/") + ( dblSlashComment | cStyleComment )

(This is all really old code - I have since replaced most of these internal expressions with Regexes.)

HTH,
-- Paul

-----Original Message-----
From: dav...@l-... [mailto:dav...@l-...]
Sent: Thursday, September 25, 2008 3:26 PM
To: Eike Welk; pyp...@li...
Subject: Re: [Pyparsing] Speeding up a parse

Well, it's something that I was actually looking into doing. I profiled my parse actions by hand (using a stopwatch class), and when the code gets handed off to me, it's actually a really small amount of time that is spent in my code (usually milliseconds).

I did try psyco, and it worked great, but I had to disable the "keepOriginalText" chunk of my code, because it imports inspect, which hoses psyco (a known psyco issue). As a matter of fact, it gave me almost a 100% speedup, which is fantastic, and doesn't require much code. But I lose the ability to keep the source.

I think some more investigation of this would be a bit more in my favor, as psyco is a "known" method of speeding up, and frankly did an amazing job when I got it working.

--dw

> -----Original Message-----
> From: Eike Welk [mailto:eik...@gm...]
> Sent: Thursday, September 25, 2008 3:19 PM
> To: pyp...@li...
> Subject: Re: [Pyparsing] Speeding up a parse
>
> Hi David!
>
> I've read that guessing which parts of a program need optimizing is
> usually impossible. You need to profile your program.
> [...]
|
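Since originalTextFor is only mentioned in passing above, here is a minimal sketch of how it can be used; the little grammar is made up for illustration:

from pyparsing import Word, alphas, nums, Suppress, originalTextFor

name = Word(alphas)
pair = name + Suppress('=') + Word(nums)

print(pair.parseString("answer = 42"))
# -> ['answer', '42']  -- the individual tokens

print(originalTextFor(pair).parseString("answer = 42"))
# -> ['answer = 42']   -- the exact slice of source text that matched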
From: Ujjaval S. <usm...@gm...> - 2008-10-28 02:44:36
|
Hi Eike, That's exactly what I wanted. Thanks for that. It worked for me. One more question following what I've done, which is probably really stupid... I wanted to end each text with a newline character. For example:

text6 = 'ABC | iöü | 应iöü | XYZ\r'

Now to parse such a sentence, I changed your parser code to the following. Here, I want to parse this string as a string that starts with 'ABC', is followed by '|', and ends with '\r'. I need everything in between, with '|' as delimiter, in a list -- including 'XYZ' as the last element in this case.

start_kw = Keyword('ABC')
fieldContents = Optional(CharsNotIn('|'), '')
fields = delimitedList(fieldContents, '|', False)
fieldSep = Literal('|').suppress()
the_parser = (start_kw + fieldSep + fields + Literal('\r').suppress())

I can't get it to work. Could you tell me what I am doing wrong?

Thanks,

On Tue, Oct 28, 2008 at 2:30 AM, Eike Welk <eik...@gm...> wrote:
> On Monday 27 October 2008, Eike Welk wrote:
> >
> > Here is an example for CharsNotIn:
> > http://pastebin.com/f7d6a3331
>
> I just noticed that pastebin can't correctly handle Asian characters.
> But I guess you understand how the example was meant anyway. Just
> paste some Asian characters into the example strings and replace
> these numbers (HTML entities?) with them. The original characters
> were taken from Chinese and Japanese iPod ads.
>
> Kind regards,
> Eike.
|
From: Eike W. <eik...@gm...> - 2008-10-27 15:30:25
|
On Monday 27 October 2008, Eike Welk wrote:
>
> Here is an example for CharsNotIn:
> http://pastebin.com/f7d6a3331

I just noticed that pastebin can't correctly handle Asian characters. But I guess you understand how the example was meant anyway. Just paste some Asian characters into the example strings and replace these numbers (HTML entities?) with them. The original characters were taken from Chinese and Japanese iPod ads.

Kind regards,
Eike.
|
From: Eike W. <eik...@gm...> - 2008-10-27 15:02:06
|
On Monday 27 October 2008, Ujjaval Suthar wrote:
> Hi everyone,
>
> I need to parse strings with a mix of English and any other unicode
> characters from any Asian or European languages.
>
> The format of the strings is like the following:
>
> ABC|[any unicode character]|[any unicode character]|XYZ

Hello Ujjaval!

If I understand your question right, CharsNotIn is the parser you are looking for. I don't see any general problem with Unicode. As you seem somewhat knowledgeable about the requirements of Asian languages, you could maybe propose a parser for words in Asian languages (or even post a patch).

Here is an example for CharsNotIn:
http://pastebin.com/f7d6a3331

I hope this helped you.
Kind regards,
Eike.
|
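The pastebin link above has long since expired; the example presumably looked something like this sketch (the data is made up, using arbitrary non-ASCII characters):

# -*- coding: utf-8 -*-
from pyparsing import Keyword, Literal, CharsNotIn, delimitedList

# fields may contain any characters at all -- ASCII or not -- except the delimiter
fieldContents = CharsNotIn('|')
record = (Keyword('ABC') + Literal('|').suppress()
          + delimitedList(fieldContents, '|'))

text = u'ABC|iöü|应用|XYZ'
print(record.parseString(text))
# roughly: ['ABC', u'iöü', u'应用', 'XYZ']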
From: Ujjaval S. <usm...@gm...> - 2008-10-27 04:39:56
|
Hi everyone,

I need to parse strings with a mix of English and any other unicode characters from any Asian or European languages. The format of the strings is like the following:

ABC|[any unicode character]|[any unicode character]|XYZ

In the above string, ABC and XYZ are literals which mark the start and the end of the string, while '|' is the delimiter for the content in between. How can I use pyparsing to parse this kind of string? The outcome should be a list of the unicode strings that appear between ABC and XYZ, separated by '|' in the input.

Thanks,
Ujjaval
|
From: <dav...@l-...> - 2008-09-25 20:26:32
|
Well, it's something that I was actually looking into doing. I profiled my parse actions by hand (using a stopwatch class), and when the code gets handed off to me, it's actually a really small amount of time that is spent in my code (usually milliseconds).

I did try psyco, and it worked great, but I had to disable the "keepOriginalText" chunk of my code, because it imports inspect, which hoses psyco (a known psyco issue). As a matter of fact, it gave me almost a 100% speedup, which is fantastic, and doesn't require much code. But I lose the ability to keep the source.

I think some more investigation of this would be a bit more in my favor, as psyco is a "known" method of speeding up, and frankly did an amazing job when I got it working.

--dw

> -----Original Message-----
> From: Eike Welk [mailto:eik...@gm...]
> Sent: Thursday, September 25, 2008 3:19 PM
> To: pyp...@li...
> Subject: Re: [Pyparsing] Speeding up a parse
>
> Hi David!
>
> I've read that guessing which parts of a program need optimizing is
> usually impossible. You need to profile your program. There is a
> profiler built into python:
> http://docs.python.org/lib/profile.html
>
> I have no experience with the profiler, but the basic usage is fairly
> simple.
>
> However, the profiler will probably show you that most of the time is
> spent in some function of the Pyparsing library. You expected it
> anyway, and it doesn't help you very much. You might catch a parse
> action that consumes much time this way, and you might spot parts of
> Pyparsing that need optimization.
>
> So maybe you should start to write a profiling extension for
> Pyparsing! I think it is feasible because the class ParserElement
> contains some high-level driver functions that are executed for each
> parser object (_parseNoCache, _parseCache). I think it could be done
> like this:
>
> You create a class variable:
>     ParserElement.profileStats = {}
> It maps:
>     <parser's name> : n_enter, n_success, n_fail, t_cumulative
>
> Then at the start of _parseNoCache or _parseCache you locate the
> matching entry,
>     ParserElement.profileStats[self.name]
> increment the enter counter, and store the current time.
>
> At the exit points you increase either the success or the failure
> counter, compute the time spent in the parser, and add it to the
> cumulative time value.
>
> At the end of the program you convert the dict to a list and store it
> in a text file. You should also add a sorting function.
>
> I think _parseCache looks pretty simple; adding something like my
> proposed profiling facility seems easy. I haven't looked carefully at
> anything, nor did I write any code. (I should have, maybe, instead of
> writing this lengthy email.) As packrat parsing gives you no big
> additional problems, I think you should just use _parseCache because
> it's easier.
>
> I hope this helps you at least somewhat.
> Kind regards,
> Eike.
|
From: Eike W. <eik...@gm...> - 2008-09-25 20:19:36
|
Hi David!

I've read that guessing which parts of a program need optimizing is usually impossible. You need to profile your program. There is a profiler built into python:
http://docs.python.org/lib/profile.html

I have no experience with the profiler, but the basic usage is fairly simple.

However, the profiler will probably show you that most of the time is spent in some function of the Pyparsing library. You expected it anyway, and it doesn't help you very much. You might catch a parse action that consumes much time this way, and you might spot parts of Pyparsing that need optimization.

So maybe you should start to write a profiling extension for Pyparsing! I think it is feasible because the class ParserElement contains some high-level driver functions that are executed for each parser object (_parseNoCache, _parseCache). I think it could be done like this:

You create a class variable:
    ParserElement.profileStats = {}
It maps:
    <parser's name> : n_enter, n_success, n_fail, t_cumulative

Then at the start of _parseNoCache or _parseCache you locate the matching entry,
    ParserElement.profileStats[self.name]
increment the enter counter, and store the current time.

At the exit points you increase either the success or the failure counter, compute the time spent in the parser, and add it to the cumulative time value.

At the end of the program you convert the dict to a list and store it in a text file. You should also add a sorting function.

I think _parseCache looks pretty simple; adding something like my proposed profiling facility seems easy. I haven't looked carefully at anything, nor did I write any code. (I should have, maybe, instead of writing this lengthy email.) As packrat parsing gives you no big additional problems, I think you should just use _parseCache because it's easier.

I hope this helps you at least somewhat.
Kind regards,
Eike.
|
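The proposal above stayed at the idea stage in this thread; a rough sketch of what it could look like follows. It wraps pyparsing's internal parse driver, so it depends on implementation details and may break between versions; profileStats and reportProfile are made-up names:

import time
from pyparsing import ParserElement

# name -> [n_enter, n_success, n_fail, t_cumulative]
ParserElement.profileStats = {}

_origParse = ParserElement._parseNoCache

def _profilingParse(self, instring, loc, doActions=True, callPreParse=True):
    stats = ParserElement.profileStats.setdefault(str(self), [0, 0, 0, 0.0])
    stats[0] += 1
    t0 = time.time()
    try:
        result = _origParse(self, instring, loc, doActions, callPreParse)
        stats[1] += 1
        return result
    except Exception:
        stats[2] += 1
        raise
    finally:
        # note: times are cumulative, so nested sub-expressions are also
        # counted inside their parents
        stats[3] += time.time() - t0

# replace both names: _parse is the entry point used without packrat,
# _parseNoCache is what _parseCache falls through to when packrat is enabled
ParserElement._parseNoCache = _profilingParse
ParserElement._parse = _profilingParse

def reportProfile():
    # print the most expensive expressions first
    items = sorted(ParserElement.profileStats.items(),
                   key=lambda kv: kv[1][3], reverse=True)
    for name, (enters, successes, fails, t) in items:
        print("%9.3fs %9d tries %9d matches   %s" % (t, enters, successes, name))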
From: <dav...@l-...> - 2008-09-24 23:28:14
|
One more thing: I also tried enablePackrat as well, but it had no discernible effect on the parse speed (though it did suck up a huge chunk of memory :) ).

> _____________________________________________
> From: Weber, David C @ Link
> Sent: Wednesday, September 24, 2008 5:55 PM
> To: pyp...@li...
> Subject: Speeding up a parse
>
> All,
>
> We've got a data file that we use for parsing "stuff". At present,
> this file is 80K lines long and takes about 3.3 minutes to parse,
> which is an awfully long time to wait for something like this. There
> are 122 rules for parsing this file, and unfortunately the syntax of
> the data within is not very strict. This leads to constructs such as:
>
> Interaction = \
>     Keyword("(Interaction") + \
>     INT_ID + \
>     INT_Name + \
>     INT_ISRType + \
>     OneOrMore(
>         INT_MOMInteraction |
>         INT_Description |
>         INT_DeliveryCategory |
>         INT_MessageOrdering |
>         INT_RoutingSpace
>     ) + \
>     ZeroOrMore(InteractionComponent) + \
>     ")"
>
> where the intent of the OneOrMore section is:
> 1.) All are optional
> 2.) They may appear in any order
>
> I've also tried Each([Optional(...), Optional(...)]) without much
> speedup success.
>
> I'm pretty sure that these constructs are causing a significant
> amount of backtracking, but I'm not sure of the best way to go about
> cleaning up the grammar.
>
> Also, I tried using psyco to speed up the parse, but I'm making use
> of the "keepOriginalText" option within the setParseAction() call, so
> that I can get a copy of the original text within my parse action.
> This seems to break psyco, based on one of the imports that is done.
>
> So two things:
>
> 1.) Any grammar speed-up rules for the above?
> 2.) Any ideas to get the original text, as well as make use of psyco?
>
> Thanks
>
> --dw
|
From: <dav...@l-...> - 2008-09-24 23:23:26
|
All,

We've got a data file that we use for parsing "stuff". At present, this file is 80K lines long and takes about 3.3 minutes to parse, which is an awfully long time to wait for something like this. There are 122 rules for parsing this file, and unfortunately the syntax of the data within is not very strict. This leads to constructs such as:

Interaction = \
    Keyword("(Interaction") + \
    INT_ID + \
    INT_Name + \
    INT_ISRType + \
    OneOrMore(
        INT_MOMInteraction |
        INT_Description |
        INT_DeliveryCategory |
        INT_MessageOrdering |
        INT_RoutingSpace
    ) + \
    ZeroOrMore(InteractionComponent) + \
    ")"

where the intent of the OneOrMore section is:
1.) All are optional
2.) They may appear in any order

I've also tried Each([Optional(...), Optional(...)]) without much speedup success.

I'm pretty sure that these constructs are causing a significant amount of backtracking, but I'm not sure of the best way to go about cleaning up the grammar.

Also, I tried using psyco to speed up the parse, but I'm making use of the "keepOriginalText" option within the setParseAction() call, so that I can get a copy of the original text within my parse action. This seems to break psyco, based on one of the imports that is done.

So two things:

1.) Any grammar speed-up rules for the above?
2.) Any ideas to get the original text, as well as make use of psyco?

Thanks

--dw
|
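For reference, the Each construct mentioned above looks roughly like this; the element names are shortened stand-ins, since the real sub-expressions aren't shown in the thread:

from pyparsing import Keyword, Word, alphanums, quotedString, Optional, Each

# stand-ins for the real sub-expressions of the data file
INT_Description      = Keyword("(Description") + quotedString + ")"
INT_DeliveryCategory = Keyword("(DeliveryCategory") + Word(alphanums) + ")"
INT_MessageOrdering  = Keyword("(MessageOrdering") + Word(alphanums) + ")"

# Each() matches all of the contained expressions, in any order; wrapping
# every one in Optional() makes them individually optional as well
optionalParts = Each([Optional(INT_Description),
                      Optional(INT_DeliveryCategory),
                      Optional(INT_MessageOrdering)])

print(optionalParts.parseString('(MessageOrdering receive) (Description "a test")'))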
From: Eike W. <eik...@gm...> - 2008-09-23 16:03:42
|
Packrat parsing seems to interfere with the indentedBlock parser. Some legitimate input is no longer recognized when packrat parsing is switched on. A script to demonstrate the problem is here:
http://pastebin.com/f5b006f4d

When line 7 is uncommented, an exception is raised while parsing the 3rd example 'program'. Maybe I'm using indentedBlock wrongly somehow?

Kind regards,
Eike.
|
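The pastebin script is gone, so the failing case can't be reproduced here; for readers unfamiliar with indentedBlock, this is only a generic sketch of the kind of setup involved, with a made-up micro-grammar:

from pyparsing import Keyword, Suppress, Word, alphas, indentedBlock, ParserElement

#ParserElement.enablePackrat()   # toggling a line like this changed the behaviour

indentStack = [1]
identifier = Word(alphas)
stmt = identifier   # block statements are bare identifiers in this toy grammar
funcDef = (Keyword("def") + identifier + Suppress("():")
           + indentedBlock(stmt, indentStack))

program = """\
def f():
    a
    b
"""
print(funcDef.parseString(program))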
From: Paul M. <pt...@au...> - 2008-09-07 15:15:47
|
Eike -

Well, this came together a bit faster than I'd thought. The solution begins with an understanding of operatorPrecedence. Here is the expression that is published in the online example simpleArith.py:

integer = Word(nums).setParseAction(lambda t:int(t[0]))
variable = Word(alphas,exact=1)
operand = integer | variable

expop = Literal('**')
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
factop = Literal('!')

expr = operatorPrecedence( operand,
    [(factop, 1, opAssoc.LEFT),
     (expop, 2, opAssoc.RIGHT),
     (signop, 1, opAssoc.RIGHT),
     (multop, 2, opAssoc.LEFT),
     (plusop, 2, opAssoc.LEFT),]
    )

As I mentioned earlier, this will only recognize "-a**-b" if given as "-a**(-b)". The reason for this is that operatorPrecedence cannot look "down" the precedence chain unless the next term is enclosed in parentheses. My first attempt at fixing this was to change the definition of operand to include an optional leading sign, and this parsed successfully. But then it dawned on me that the same thing can be accomplished by inserting another signop operator *above* expop, as in:

expr = operatorPrecedence( operand,
    [(factop, 1, opAssoc.LEFT),
     (signop, 1, opAssoc.RIGHT),
     (expop, 2, opAssoc.RIGHT),
     (signop, 1, opAssoc.RIGHT),
     (multop, 2, opAssoc.LEFT),
     (plusop, 2, opAssoc.LEFT),]
    )

I think this is the correct solution, rather than mucking about with operatorPrecedence itself, or requiring strange embellishments to one's operand expression. I don't think this is a general issue with unary operators or right-associated operations; I think this is just a special case born out of the definition of exponentiation with respect to leading sign operators.

I'll fix simpleArith.py to correctly include the extra unary sign operator precedence level, and also include some examples. I'll also include simpleArith2.py, which actually evaluates the parsed expression.

-- Paul
|
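A quick check of the fixed table, runnable as-is (the factorial level is omitted for brevity); the comment shows roughly the nested result to expect:

from pyparsing import Word, Literal, nums, alphas, oneOf, operatorPrecedence, opAssoc

integer = Word(nums).setParseAction(lambda t: int(t[0]))
variable = Word(alphas, exact=1)
operand = integer | variable

expop = Literal('**')
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')

# the extra unary-sign level *above* ** lets "-a**-b" parse as -(a**(-b))
expr = operatorPrecedence(operand,
    [(signop, 1, opAssoc.RIGHT),
     (expop, 2, opAssoc.RIGHT),
     (signop, 1, opAssoc.RIGHT),
     (multop, 2, opAssoc.LEFT),
     (plusop, 2, opAssoc.LEFT)])

print(expr.parseString("-a**-b"))
# roughly: [['-', ['a', '**', ['-', 'b']]]]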
From: Paul M. <pt...@au...> - 2008-09-07 14:32:02
|
I worked on this problem a bit last night. The opAssoc.RIGHT parameter is there to address this problem, and in fact, this expression:

a**b**c

does get correctly evaluated as (a**(b**c)). But things get muddled when unary signs are added, and I think this points up a bug in operatorPrecedence. As your pastebin code comments state:

#Power and unary operations are intertwined to get correct operator precedence:
#   -a**-b == -(a ** (-b))

which should be supported using this form of operatorPrecedence:

u_expr = Word(nums) | Word(alphas)
expression = operatorPrecedence( u_expr,
    [('**', 2, opAssoc.RIGHT),
     (oneOf('+ -'), 1, opAssoc.RIGHT),
     (oneOf('* /'), 2, opAssoc.LEFT),
     (oneOf('+ -'), 2, opAssoc.LEFT),
    ])

For the moment, this expression correctly parses "a**b**c" as "(a**(b**c))", but can only parse "-a**-b" if it is given as "-a**(-b)".

Ideally, your request should be handled by the operatorPrecedence API as it exists today - I don't think any more specialized interface should be needed. I'll see if I can make any progress on this problem in the next day or so.

-- Paul
|