[Pyre2-devel] Re: changes to re2 library
From: <ot...@py...> - 2005-04-08 12:12:48
Hi Pierre!

On 8/4/2005, "Pierre Barbier de Reuille" <pie...@ci...> wrote:

>Well, first feedback is a bit sad: I found a bug :S
>
>Here is the expression causing the bug:
>
>"(((?:\s*)(?P<num>\d+))+,)+\s*(?P<logic>[^ ]+)((?P<lastnum>\s*\d+)+)"
>
>... you cannot compile it with re2 but you can do it with re!

That's not sad! I'm surprised it coped at all with that monster! ;-)

I notice from experimenting with that expression myself that the bug
must have something to do with the '(?:'. When you change it to '(' or
'(?P<x>' it works.

FYI the re2 code has been hacked together in an eXtreme Programming type
of way, i.e. think up some unit tests, then get the code to pass those
tests as quickly and simply as you can, making sure nothing is "over
designed". I think true believers in eXtreme Programming also suggest
you refactor the code and make it more efficient on successive
iterations, making sure all the unit tests still pass.
Ummm... let's just say I haven't got up to that bit yet. ;-)

You should look at the nasty bit of code that splits up the expression
into its subgroups! (which btw is where that bug you mention will be)
(Hey, are you any good with state machines or parsers? You don't want
to try your hand at fixing it, do you?)

>Talking about efficiency: I think you first need a working module,
>with good user/developer documentation.

Yes. Exactly!

>Then, time to optimize! You'll want
>to profile your code and, if needed, code some parts (or all of it) in
>another language (most probably C) and export it in Python. But I think
>that will be in the future, when the interface and the functionalities
>will have stabilized enough.

Yes, true too. But it's the _method_ that lacks efficiency. When there
is no need for nested groups, re will always be more efficient than re2.

>By the way, if you could tell us as much as possible about the principle
>behind your modules ...
>you did part of it in this mail, but it would
>help us tracking bugs or even improving the code ...

Will do. Although I first wanted to get the python-dev people
interested, tempting them with small yet intriguing pieces of a puzzle
and hopefully allowing them to germinate some ideas, before thrusting
upon them some overly verbose attempt at a solution.
But I guess that's what PEPs are for. ;-)

So BTW, how are you finding that new lack of a '_value' key? e.g.

>>> m = re2.extract('(?P<x>([a-z]X)+)', 'bXcXdXeX')
>>> print m
bXcXdXeX
>>> m
{'x': {0: ['bX', 'cX', 'dX', 'eX']}}
>>> print m.dump()
---
x:
  0:
    - bX
    - cX
    - dX
    - eX

You can't really tell from the dump that 'x' actually has the value
'bXcXdXeX', can you?

Perhaps it would be more intuitive like this:

>>> m
{'x': ('bXcXdXeX', {0: ['bX', 'cX', 'dX', 'eX']})}
>>> print m.dump()
---
x:
  - bXcXdXeX
  - 0:
      - bX
      - cX
      - dX
      - eX

I can't, offhand, think of an example where this representation would be
ambiguous. Can you?

However, the problem with doing it that way is that it breaks this
functionality:

>>> m['x'][0][3] == 'eX'

Maybe the trick is to write our own dump() method, e.g. so the output
looks like this:

>>> print m.dump()
---
x (bXcXdXeX):
  0:
    - bX
    - cX
    - dX
    - eX

BTW, I've also been doing some background research:
http://py.redsoft.be/pyre2/wiki/index.php?title=Background_research

I even notice someone started writing a PEP very similar to this last
year.

Chris.
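
P.S. A rough sketch of what such a custom dump() might look like, in
case it helps the discussion. This is only an illustration under
assumptions: it is written for Python 3, the names dump and dump_lines
are made up here and are not part of re2, and it assumes the
(value, subgroups) tuple layout suggested above with plain strings as
list items.

```python
def dump_lines(node, depth=0):
    """Return YAML-like lines for a nested match structure (hypothetical helper)."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = []
        for key, child in node.items():
            if isinstance(child, tuple):
                # Assumed (value, subgroups) pair: show the value in the header.
                value, child = child
                lines.append(f"{pad}{key} ({value}):")
            else:
                lines.append(f"{pad}{key}:")
            lines.extend(dump_lines(child, depth + 1))
        return lines
    if isinstance(node, list):
        # Assumes list items are plain strings (the individual sub-matches).
        return [f"{pad}- {item}" for item in node]
    return [f"{pad}{node}"]


def dump(match):
    """Join the lines under a '---' document marker, like the dumps above."""
    return "\n".join(["---"] + dump_lines(match))


# The tuple representation proposed above, written out by hand.
m = {"x": ("bXcXdXeX", {0: ["bX", "cX", "dX", "eX"]})}
print(dump(m))
```

With this layout the old lookup m['x'][0][3] would have to become
m['x'][1][0][3], so the sketch only addresses the readability of the
dump, not the indexing problem.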