[Pyre2-devel] Re: changes to re2 library


Well, the first feedback is a bit sad: I found a bug :S

Here is the expression causing the bug:

"(((?:\s*)(?P<num>\d+))+,)+\s*(?P<logic>[^ ]+)((?P<lastnum>\s*\d+)+)"

... you cannot compile it with re2, but you can with re!
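
A minimal sketch for checking the standard-library half of that statement. It uses only the `re` module; the sample string is borrowed from the extract() session quoted further down, and re2 is not imported here, so this does not reproduce the failure itself:

```python
import re

# The expression from the report, unchanged.
pattern = r"(((?:\s*)(?P<num>\d+))+,)+\s*(?P<logic>[^ ]+)((?P<lastnum>\s*\d+)+)"

# Sample input taken from the extract() session quoted later in this mail.
buf = "123 234 345, 123 256, and 123 289"

compiled = re.compile(pattern)   # plain re accepts the pattern
match = compiled.match(buf)

# In plain re, a repeated group only keeps the text of its last iteration.
print(match.group("logic"))
print(match.group("num"))
print(match.group("lastnum"))
```

Running the same pattern through re2.compile() should then isolate the failure to the compile step.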

Talking about efficiency: I think you first need a working module, with
good user/developer documentation. Then it's time to optimize! You'll want
to profile your code and, if needed, code some parts (or all of it) in
another language (most probably C) and export it to Python. But I think
that will be in the future, when the interface and the functionality
have stabilized enough.

By the way, if you could tell us as much as possible about the principle
behind your modules ... you did part of it in this mail, but it would
help us track bugs or even improve the code ...

Pierre

ot...@py... wrote:
> On 4/7/2005, "Pierre Barbier de Reuille" <pie...@ci...>
> wrote:
>
>>Well, I really like your solution. It's even better than what I first
>>thought!
>
>
> Great.  Thanx!
>
>
>>I'll have a look at your latest code and test it a bit!
>
>
> It's always good to have someone else coming up with use cases, to get
> completely new and different ideas thrown at it.
>
>
>>You'll have feedback soon (I hope ...)
>
>
> That'd be great.
>
> I've also thought some more about how to merge with the re module.
> As I think (and everyone else seems to think) re2 shouldn't be a separate
> module.
>
> The only problem with it is that re2.compile() is actually a lot less
> efficient than the re.compile() function.  (And therefore I don't think
> re2.compile() should replace re.compile().)
>
> The inefficiency comes from the fact that re2.compile() actually splits
> the regular expression up into its component groups at each level of the
> group hierarchy.  It then compiles a pattern for each node in the
> hierarchy, substituting nested groups at each node with non-groups '(?:'.
> It then uses that hierarchy of patterns to extract a hierarchy of results.
> This is done by recursively matching each node, then passing the results
> of that match down the hierarchy to be subsequently matched by its children.
>
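
The principle described above can be sketched with plain `re`. This is not re2's code: the three per-node patterns are written by hand rather than derived automatically, the names `top`, `item`, `num` and `extract` are made up for the illustration, and the result is a simplified version of the structure re2 returns.

```python
import re

buf = "123 234 345, 123 256, and 123 289"

# Top node: nested groups are replaced by non-capturing '(?:' groups.
# Each repeated part is wrapped in one capturing group so that its whole
# text can be handed down to the child patterns.
top = re.compile(r'^((?:(?: *\d+)+,)+) *(?P<logic>[^ ]+)((?: *\d+)+).*$')

# Child nodes, one pattern per nested level (hand-written assumptions).
item = re.compile(r'(?:(?: *\d+)+,)')   # one "numbers," chunk
num = re.compile(r' *\d+')              # one number with its leading spaces

def extract(text):
    """Match the top node, then re-match its results with the children."""
    m = top.match(text)
    if m is None:
        return None  # top-level failure: the children are never tried
    lists, tail = m.group(1), m.group(3)
    return {
        0: [num.findall(chunk) for chunk in item.findall(lists)],
        'logic': m.group('logic'),
        2: num.findall(tail),
    }

result = extract(buf)
print(result)
```

Every iteration of a repeated group is recovered this way, whereas a single `re` match only keeps the last one.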
> All this, I'd imagine (as I haven't actually done any performance
> testing), is going to ~really~ slow things down - probably mostly at the
> compile() stage, but also during the matching, depending on the data set.
>
> Although if a match at the top of the hierarchy fails then subsequent
> attempts to match the descendants won't be performed - so that
> will provide ~some~ relief.  But I don't think re2 can or should ever be
> a replacement for re.
>
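
Since no performance testing has been done yet, here is a rough sketch of how the compile() cost could be measured with `timeit`. The per-node patterns are hypothetical stand-ins for whatever re2 really builds, so the numbers only give an order of magnitude for "one pattern versus several":

```python
import re
import timeit

# The single pattern from the quoted session.
single = r'^(( *\d+)+,)+ *(?P<logic>[^ ]+)(( *\d+)+).*$'

# Hypothetical per-node patterns (an assumption, not taken from re2).
per_node = [
    r'^((?:(?: *\d+)+,)+) *(?P<logic>[^ ]+)((?: *\d+)+).*$',
    r'(?:(?: *\d+)+,)',
    r' *\d+',
]

def compile_single():
    re.purge()            # clear re's cache so compilation really happens
    re.compile(single)

def compile_per_node():
    re.purge()
    for node_pattern in per_node:
        re.compile(node_pattern)

t_single = timeit.timeit(compile_single, number=200)
t_nodes = timeit.timeit(compile_per_node, number=200)
print("one pattern:    %.4f s" % t_single)
print("three patterns: %.4f s" % t_nodes)
```

The same harness could wrap re2.compile() directly once the module compiles the patterns of interest.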
> If it is going to get merged with the re library then I'm thinking it
> would best slot in as a new compile() method.
>
> So some names I have thought up are:
>  - compile2()
>  - recursive_compile()     - rcompile()  - compile_recursive()
>  - hierarchical_compile()  - hcompile()  - ...
>  - nested_compile()        - nc...       - ...
>
>
>
>
>>>I've made some changes to the re2 library based on your suggestion.
>>>
>>> now it ~actually~ behaves like this:
>>>
>>>
>>>>>>buf="123 234 345, 123 256, and 123 289"
>>>>>>regex=r'^(( *\d+)+,)+ *(?P<logic>[^ ]+)(( *\d+)+).*$'
>>>>>>pat2=re2.compile(regex)
>>>>>>x=pat2.extract(buf)
>>>>>>x
>>>
>>>{0: [{0: ['123', ' 234', ' 345']}, {0: [' 123', ' 256']}],
>>> 2: {0: [' 123', ' 289']}, 'logic': 'and'}
>>>
>>>
>>>>>>print x.dump()
>>>
>>>---
>>>0:
>>>    -
>>>        0:
>>>            - '123'
>>>            - ' 234'
>>>            - ' 345'
>>>    -
>>>        0:
>>>            - ' 123'
>>>            - ' 256'
>>>2:
>>>    0:
>>>        - ' 123'
>>>        - ' 289'
>>>logic: and
>>>
>>>
>>>
>>>>>>x[0]
>>>
>>>[{0: ['123', ' 234', ' 345']}, {0: [' 123', ' 256']}]
>>>
>>>
>>>>>>x[0][0]
>>>
>>>{0: ['123', ' 234', ' 345']}
>>>
>>>
>>>>>>str(x[0][0])
>>>
>>>'123 234 345,'
>>>
>>>
>>>>>>x[0][1]
>>>
>>>{0: [' 123', ' 256']}
>>>
>>>
>>>>>>str(x[0][1])
>>>
>>>' 123 256,'
>>>
>>>
>>>>>>x['logic']
>>>
>>>'and'
>>>
>>>
>>>>>>x.logic
>>>
>>>'and'
>>>
>>>
>>>>>>x[2]
>>>
>>>{0: [' 123', ' 289']}
>>>
>>>
>>>>>>str(x[2])
>>>
>>>' 123 289'
>
>
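
The session above shows a result that can be read by key (x['logic']), by attribute (x.logic) and converted back to its matched text with str(). A minimal sketch of an object behaving that way, assuming a dict subclass; the class name `Result` and its `text` attribute are hypothetical and not taken from re2:

```python
class Result(dict):
    """Hypothetical stand-in for the object returned by extract()."""

    def __init__(self, text, items=()):
        super().__init__(items)
        self.text = text          # the substring this node matched

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so named
        # groups become readable as attributes.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __str__(self):
        return self.text


# Build a small part of the structure from the session by hand.
tail = Result(' 123 289', {0: [' 123', ' 289']})
x = Result('123 234 345, 123 256, and 123 289', {2: tail, 'logic': 'and'})

print(x['logic'], x.logic)
print(repr(str(x[2])))
```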

-- 
Pierre Barbier de Reuille

INRA - UMR Cirad/Inra/Cnrs/Univ.MontpellierII AMAP
Botanique et Bio-informatique de l'Architecture des Plantes
TA40/PSII, Boulevard de la Lironde
34398 MONTPELLIER CEDEX 5, France

tel   : (33) 4 67 61 65 77    fax   : (33) 4 67 61 56 68