[Pyre2-devel] Re: changes to re2 library
From: Pierre B. de R. <pie...@ci...> - 2005-04-08 10:22:28
Well, the first feedback is a bit sad: I found a bug :S
Here is the expression causing the bug:
"(((?:\s*)(?P<num>\d+))+,)+\s*(?P<logic>[^ ]+)((?P<lastnum>\s*\d+)+)"
... you cannot compile it with re2, but you can with re!
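For reference, here is a minimal check of the "works with re" half, written for a current Python 3. The sample string is borrowed from the earlier example in this thread and is only an assumption about typical input; the re2 half is not shown because it simply fails at compile time.

```python
import re

# The expression from the bug report, unchanged.
EXPR = r"(((?:\s*)(?P<num>\d+))+,)+\s*(?P<logic>[^ ]+)((?P<lastnum>\s*\d+)+)"

# The standard re module compiles it without complaint.
pat = re.compile(EXPR)

# Sample input borrowed from the earlier example in this thread (an assumption).
m = pat.match("123 234 345, 123 256, and 123 289")

# Standard re keeps only the LAST repetition of a repeated group.
logic = m.group("logic")      # 'and'
last_num = m.group("num")     # '256'
tail = m.group("lastnum")     # ' 289'
```

Note that plain re only remembers the last repetition of each repeated group, which is exactly the gap the hierarchical extraction is meant to fill.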
Talking about efficiency: I think you first need a working module with
good user/developer documentation. Then it is time to optimize! You'll want
to profile your code and, if needed, write some parts (or all of it) in
another language (most probably C) and export it to Python. But I think
that will come later, once the interface and the functionality
have stabilized enough.
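To make the profiling step concrete, here is a rough sketch using the standard cProfile and pstats modules. It profiles the standard re.compile() as a stand-in, since that is what I can run here; swapping in re2.compile() is the obvious change. The helper names compile_uncached and profile_compile are made up for this example.

```python
import cProfile
import io
import pstats
import re

# The expression from the earlier example in this thread.
EXPR = r'^(( *\d+)+,)+ *(?P<logic>[^ ]+)(( *\d+)+).*$'


def compile_uncached(pattern):
    """Compile a pattern, clearing re's internal cache first.

    Without re.purge(), repeated re.compile() calls just hit the cache
    and the profile would show almost nothing.
    """
    re.purge()
    return re.compile(pattern)


def profile_compile(pattern, runs=200):
    """Profile `runs` uncached compilations and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(runs):
        compile_uncached(pattern)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
    return out.getvalue()


report = profile_compile(EXPR)
print(report)
```

The entries at the top of the cumulative listing are the first candidates for a rewrite in C, if any part ever needs one.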
By the way, could you tell us as much as possible about the principle
behind your module? You did part of it in this mail, but more would
help us track down bugs or even improve the code.
Pierre
ot...@py... wrote:
> On 4/7/2005, "Pierre Barbier de Reuille" <pie...@ci...>
> wrote:
>
>>Well, I really like your solution. It's even better than what I first
>>thought !
>
>
> Great. Thanx!
>
>
>>I'll have a look at your latest code and test it a bit !
>
>
> It's always good to have someone else coming up with use cases, to get
> completely new and different ideas thrown at it.
>
>
>>You'll have feedback soon (I hope ...)
>
>
> That'd be great.
>
>
> I've also thought some more about how to merge with the re module.
> As I think (and everyone else seems to think) re2 shouldn't be a separate
> module.
>
> The only problem with it is that re2.compile() is actually a lot less
> efficient than the re.compile() function. (And therefore I don't think
> re2.compile() should replace re.compile().)
>
> The inefficiency comes from the fact that re2.compile() actually splits
> the regular expression up into its component groups at each level of the
> group hierarchy. It then compiles a pattern for each node in the
> hierarchy, substituting nested groups at each node with non-groups '(?:'.
> It then uses that hierarchy of patterns to extract a hierarchy of results.
> This is done by recursively matching each node, then passing the results
> of that match down the hierarchy to be subsequently matched by its children.
>
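To check my reading of the description above, here is a minimal sketch of the idea on top of the standard re module. It is not the actual re2 code: the names Group, parse, flat, one_level and extract are invented for this example, the parser only understands escapes, simple [...] classes, (?P<name>...) and (?:...), and results are plain dicts rather than re2's result objects.

```python
import re

QUANT = re.compile(r"(?:[*+?]|\{\d+(?:,\d*)?\})\??")  # quantifier after a ')'


class Group:
    """One capturing group: literal chunks, child groups, name, quantifier."""
    def __init__(self, name=None):
        self.name, self.parts, self.quant = name, [], ""


def parse(pattern):
    """Split a pattern into its tree of capturing groups (simplified parser)."""
    root = Group()
    stack, opened = [root], []   # open Groups; True per open capturing paren
    i, in_class = 0, False
    while i < len(pattern):
        ch, top = pattern[i], stack[-1]
        if ch == "\\":                              # escaped char: copy both
            top.parts.append(pattern[i:i + 2]); i += 2
        elif in_class:                              # inside [...]: copy as-is
            in_class = ch != "]"
            top.parts.append(ch); i += 1
        elif ch == "(":
            named = re.match(r"\(\?P<(\w+)>", pattern[i:])
            if named or pattern[i + 1:i + 2] != "?":
                stack.append(Group(named.group(1) if named else None))
                opened.append(True)
                i += named.end() if named else 1
            else:                                   # (?:...) stays literal
                opened.append(False)
                top.parts.append(ch); i += 1
        elif ch == ")" and opened.pop():            # close a capturing group
            group = stack.pop()
            q = QUANT.match(pattern, i + 1)
            group.quant = q.group() if q else ""
            stack[-1].parts.append(group)
            i += 1 + len(group.quant)
        else:
            in_class = ch == "["
            top.parts.append(ch); i += 1
    return root


def flat(group):
    """Body of a node with every nested group turned into '(?:...)'."""
    return "".join(p if isinstance(p, str) else "(?:" + flat(p) + ")" + p.quant
                   for p in group.parts)


def one_level(group):
    """Body of a node capturing each direct child together with its quantifier."""
    return "".join(p if isinstance(p, str)
                   else "((?:" + flat(p) + ")" + p.quant + ")"
                   for p in group.parts)


def extract(group, text):
    """Match one node, then hand each child's text down to that child."""
    children = [p for p in group.parts if isinstance(p, Group)]
    if not children:
        return text
    m = re.fullmatch(one_level(group), text)
    if m is None:                # node failed: descendants are never tried
        return None
    result = {}
    for idx, child in enumerate(children):
        span, key = m.group(idx + 1), child.name or idx
        if child.quant:          # repeated child: one entry per repetition
            result[key] = [extract(child, piece.group())
                           for piece in re.finditer(flat(child), span)]
        else:
            result[key] = extract(child, span)
    return result


tree = parse(r'^(( *\d+)+,)+ *(?P<logic>[^ ]+)(( *\d+)+).*$')
x = extract(tree, "123 234 345, 123 256, and 123 289")
print(x)
```

On the example quoted further down, this yields the same nested structure as the extract() output shown there. It also makes the cost visible: every node is matched again on its own piece of text, and a failure at the top returns None without touching the descendants.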
> All this, I'd imagine (as I haven't actually done any performance
> testing), is going to ~really~ slow things down - probably mostly at the
> compile() stage, but also during the matching, depending on the data set.
>
> Although if a match at the top of the hierarchy fails, then subsequent
> attempts to match the descendants won't be performed - so that
> will provide ~some~ relief. But I don't think re2 can or should ever be
> a replacement for re.
>
> If it is going to get merged with the re library, then I'm thinking it
> would best slot in as a new compile() method.
>
> So some names I have thought up are:
> - compile2()
> - recursive_compile() - rcompile() - compile_recursive()
> - hierarchical_compile() - hcompile() - ...
> - nested_compile() - nc... - ...
>
>
>>>I've made some changes to the re2 library based on your suggestion.
>>>
>>> now it ~actually~ behaves like this:
>>>
>>>
>>>>>>buf="123 234 345, 123 256, and 123 289"
>>>>>>regex=r'^(( *\d+)+,)+ *(?P<logic>[^ ]+)(( *\d+)+).*$'
>>>>>>pat2=re2.compile(regex)
>>>>>>x=pat2.extract(buf)
>>>>>>x
>>>
>>>{0: [{0: ['123', ' 234', ' 345']}, {0: [' 123', ' 256']}], 2: {0: [' 123', ' 289']}, 'logic': 'and'}
>>>
>>>
>>>>>>print x.dump()
>>>
>>>---
>>>0:
>>> -
>>> 0:
>>> - '123'
>>> - ' 234'
>>> - ' 345'
>>> -
>>> 0:
>>> - ' 123'
>>> - ' 256'
>>>2:
>>> 0:
>>> - ' 123'
>>> - ' 289'
>>>logic: and
>>>
>>>
>>>
>>>>>>x[0]
>>>
>>>[{0: ['123', ' 234', ' 345']}, {0: [' 123', ' 256']}]
>>>
>>>
>>>>>>x[0][0]
>>>
>>>{0: ['123', ' 234', ' 345']}
>>>
>>>
>>>>>>str(x[0][0])
>>>
>>>'123 234 345,'
>>>
>>>
>>>>>>x[0][1]
>>>
>>>{0: [' 123', ' 256']}
>>>
>>>
>>>>>>str(x[0][1])
>>>
>>>' 123 256,'
>>>
>>>
>>>>>>x['logic']
>>>
>>>'and'
>>>
>>>
>>>>>>x.logic
>>>
>>>'and'
>>>
>>>
>>>>>>x[2]
>>>
>>>{0: [' 123', ' 289']}
>>>
>>>
>>>>>>str(x[2])
>>>
>>>' 123 289'
>
>
--
Pierre Barbier de Reuille
INRA - UMR Cirad/Inra/Cnrs/Univ.MontpellierII AMAP
Botanique et Bio-informatique de l'Architecture des Plantes
TA40/PSII, Boulevard de la Lironde
34398 MONTPELLIER CEDEX 5, France
tel : (33) 4 67 61 65 77 fax : (33) 4 67 61 56 68