We looked at both java.util.regex and, somewhat more seriously, the JRuby regex engine (itself a port of a C-based engine). In the latter case, Nicholas Riley saw a speedup of something like 30% (as I recall) over our existing engine.

But getting the semantics of the last "10%" of cases to match is what makes switching engines difficult.

The reality is that CPython's sre is quite good. We just need to put some more resources into fine-tuning our port, as well as fixing any gaping holes in performance, such as the one suggested by what we are seeing with HTMLParser. As for the fine-tuning, we know of several opportunities to speed things up; it's just a question of spending some time on them.
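
To check whether the regex calls themselves are the hot spot, independent of BeautifulSoup, one rough approach is to time a few scanning patterns on synthetic markup under both CPython and Jython. The sketch below is only that: the patterns, the helper names (make_sample, scan_all, time_patterns) and the sample data are made up for illustration and are not taken from HTMLParser.py, so the numbers say something about sre in general rather than about HTMLParser specifically.

```python
import re
import time

# Illustrative patterns only, loosely in the style of tag-scanning regexes.
# They are assumptions for this sketch and are NOT copied from HTMLParser.py.
PATTERNS = {
    "interesting": re.compile(r"[&<]"),
    "tag_name": re.compile(r"[a-zA-Z][-.a-zA-Z0-9:_]*"),
    "start_tag_lookahead": re.compile(r"<[a-zA-Z][^>]*(?=>)"),
}


def make_sample(rows=2000):
    """Build a synthetic HTML table so the script needs no external file."""
    row = '<tr><td class="f">5000</td><td>Field &amp; value</td></tr>\n'
    return "<table>\n" + row * rows + "</table>\n"


def scan_all(pattern, text):
    """Search repeatedly from the end of the previous match; return the match count."""
    count = 0
    pos = 0
    while True:
        m = pattern.search(text, pos)
        if m is None:
            return count
        count += 1
        # Step past an empty match so the loop always makes progress.
        if m.end() > m.start():
            pos = m.end()
        else:
            pos = m.end() + 1


def time_patterns(text):
    """Return {pattern name: (match count, elapsed seconds)} for each pattern."""
    results = {}
    for name in sorted(PATTERNS):
        start = time.time()
        count = scan_all(PATTERNS[name], text)
        results[name] = (count, time.time() - start)
    return results


if __name__ == "__main__":
    for name, (count, seconds) in sorted(time_patterns(make_sample()).items()):
        print("%-20s %7d matches  %.3f s" % (name, count, seconds))
```

Running the same file with both interpreters and comparing the per-pattern times should show whether one kind of pattern (for example the lookahead one) is disproportionately slow.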

- Jim

2009/11/3 Alex Grönholm <alex.gronholm@nextday.fi>
Eli Golovinsky wrote:
> Isn't CPython regex implementation in C? Couldn't Jython's regex
> implementation use the Java regular expression engine after (possibly)
> some simple translation from Python syntax to Java syntax?
>
>
Java's regular expressions have different semantics.
Jython's REs are notoriously slow, but IIRC this is being worked on, and I'd
expect faster REs already in the 2.5 line of releases.
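
As a small illustration of why a purely textual translation is not enough, here are two constructs from Python's re module: the (?P<name>...) / (?P=name) spelling for named groups and backreferences, and the conditional group (?(1)...). As far as I know, java.util.regex does not accept the (?P...) spelling and has no conditional groups, so patterns like these would need rewriting or emulation. This only shows syntax-level constructs and is not a claim about which differences matter most in practice; the function names and sample patterns are made up for the example.

```python
import re

# Named group plus named backreference, in Python's (?P...) spelling.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")

# Conditional group: require a closing '>' only if an opening '<' matched.
bracketed = re.compile(r"^(<)?(\w+@\w+\.\w+)(?(1)>)$")


def find_doubled(text):
    """Return the first immediately repeated word fragment, or None."""
    m = doubled.search(text)
    if m is None:
        return None
    return m.group("word")


def is_address(text):
    """True for 'a@b.com' or '<a@b.com>', False if the brackets are unbalanced."""
    return bracketed.match(text) is not None
```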
> ---
> gooli
>
>
>
> 2009/11/3 Jim Baker <jbaker@zyasoft.com>:
>
>> At least the problem is in a very small part of the code (which seems to be
>> the usual case for something so bad). The goahead, parse_starttag, and
>> parse_endtag methods in HTMLParser all use regexes extensively, so that
>> would be my first guess. Our regex implementation is a direct port of
>> CPython's, but it's certainly possible that we have not applied the same
>> subsequent performance optimizations for support of such things as
>> lookahead.
>>
>> So now we need to profile a little deeper with something like YourKit to see
>> what's really happening.
>>
>> 2009/11/3 Sébastien Boisgérault <Sebastien.Boisgerault@mines-paristech.fr>
>>
>>> Sébastien Boisgérault wrote:
>>>
>>> Eli Golovinsky wrote:
>>>
>>>
>>> Hi,
>>>
>>> I just tried to run BeautifulSoup (3.1.0.1) with Jython (2.5.1) and I
>>> was amazed to see how much slower it was than CPython (2.6). Parsing a
>>> page (http://www.fixprotocol.org/specifications/fields/5000-5999) with
>>> CPython took just under a second (0.844 second to be exact). With
>>> Jython it took 564 seconds - almost 700 times as much.
>>>
>>> Can anyone confirm this result? It doesn't seem reasonable for
>>> Jython to run 700 times slower than CPython.
>>>
>>>
>>> CPython is about 380x faster on my box.
>>>
>>> ouch ...
>>>
>>> SB
>>>
>>>
>>>
>>> Attached below are the execution profiles with CPython and Jython.
>>>
>>> AFAICT BeautifulSoup code performs OK with Jython (a few seconds tops
>>> spent in handle_* methods), but the HTMLParser code (goahead, parse_*
>>> methods) it calls is painfully slow.
>>>
>>>
>>>
>>> CPYTHON
>>>
>>> Tue Nov  3 14:24:46 2009    results
>>>
>>>          903568 function calls (903519 primitive calls) in 6.512 CPU seconds
>>>
>>>    Ordered by: cumulative time
>>>    List reduced from 137 to 20 due to restriction <20>
>>>
>>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>         1    0.000    0.000    6.512    6.512 profile:0(BeautifulSoup(data))
>>>         1    0.000    0.000    6.512    6.512 <string>:1(<module>)
>>>         1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1164(__init__)
>>>         1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1236(_feed)
>>>         1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1495(__init__)
>>>         1    0.000    0.000    6.484    6.484 HTMLParser.py:101(feed)
>>>         1    0.600    0.600    6.484    6.484 HTMLParser.py:132(goahead)
>>>     11083    0.556    0.000    3.784    0.000 HTMLParser.py:224(parse_starttag)
>>>     11083    0.060    0.000    2.552    0.000 BeautifulSoup.py:1013(handle_starttag)
>>>     11083    0.264    0.000    2.492    0.000 BeautifulSoup.py:1397(unknown_starttag)
>>>      8351    0.240    0.000    1.404    0.000 HTMLParser.py:305(parse_endtag)
>>>      8351    0.060    0.000    1.044    0.000 BeautifulSoup.py:1019(handle_endtag)
>>>      8351    0.120    0.000    0.984    0.000 BeautifulSoup.py:1427(unknown_endtag)
>>>      8349    0.484    0.000    0.912    0.000 BeautifulSoup.py:1351(_smartPop)
>>>     11084    0.168    0.000    0.676    0.000 BeautifulSoup.py:500(__init__)
>>>     12420    0.276    0.000    0.604    0.000 BeautifulSoup.py:1329(_popToTag)
>>>     19441    0.216    0.000    0.528    0.000 BeautifulSoup.py:1306(endData)
>>>     33250    0.244    0.000    0.368    0.000 BeautifulSoup.py:1269(isSelfClosingTag)
>>>     66134    0.344    0.000    0.344    0.000 :0(match)
>>>    103566    0.280    0.000    0.280    0.000 BeautifulSoup.py:554(__nonzero__)
>>>
>>> JYTHON
>>>
>>> Tue Nov  3 14:31:34 2009    results
>>>
>>>          383982 function calls (383944 primitive calls) in 390.007 CPU seconds
>>>
>>>    Ordered by: cumulative time
>>>    List reduced from 97 to 20 due to restriction <20>
>>>
>>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>         1    0.003    0.003  390.007  390.007 profile:0(BeautifulSoup(data))
>>>         1    0.000    0.000  390.004  390.004 <string>:0(<module>)
>>>         1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1495(__init__)
>>>         1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1164(__init__)
>>>         1    0.372    0.372  390.003  390.003 BeautifulSoup.py:1236(_feed)
>>>         1    0.000    0.000  389.553  389.553 HTMLParser.py:101(feed)
>>>         1  159.714  159.714  389.552  389.552 HTMLParser.py:132(goahead)
>>>     11083  112.086    0.010  159.921    0.014 HTMLParser.py:224(parse_starttag)
>>>      8351   68.361    0.008   69.394    0.008 HTMLParser.py:305(parse_endtag)
>>>     11083   45.443    0.004   45.443    0.004 HTMLParser.py:275(check_for_whole_start_tag)
>>>     11083    0.084    0.000    2.363    0.000 BeautifulSoup.py:1013(handle_starttag)
>>>     11083    0.536    0.000    2.278    0.000 BeautifulSoup.py:1397(unknown_starttag)
>>>      8351    0.051    0.000    1.009    0.000 BeautifulSoup.py:1019(handle_endtag)
>>>      8351    0.077    0.000    0.958    0.000 BeautifulSoup.py:1427(unknown_endtag)
>>>     19441    0.438    0.000    0.761    0.000 BeautifulSoup.py:1306(endData)
>>>     11084    0.374    0.000    0.726    0.000 BeautifulSoup.py:500(__init__)
>>>      8349    0.498    0.000    0.630    0.000 BeautifulSoup.py:1351(_smartPop)
>>>     38924    0.435    0.000    0.435    0.000 markupbase.py:49(updatepos)
>>>     12420    0.251    0.000    0.334    0.000 BeautifulSoup.py:1329(_popToTag)
>>>     17252    0.220    0.000    0.245    0.000 BeautifulSoup.py:118(setup)
>>>
>>>
>>>
>>>
>>> Perhaps something is wrong with my setup.
>>>
>>> Here's the code I used:
>>>
>>> import time
>>> from BeautifulSoup import BeautifulSoup
>>> data = open("fix-5000-5999.html").read()
>>> start = time.time()
>>> soup = BeautifulSoup(data)
>>> print time.time() - start
>>>
>>> ---
>>> gooli
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Come build with us! The BlackBerry(R) Developer Conference in SF, CA
>>> is the only developer event you need to attend this year. Jumpstart your
>>> developing skills, take BlackBerry mobile applications to market and stay
>>> ahead of the curve. Join us from November 9 - 12, 2009. Register now!
>>> http://p.sf.net/sfu/devconference
>>> _______________________________________________
>>> Jython-users mailing list
>>> Jython-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/jython-users
>>>
>>
>> --
>> Jim Baker
>> jbaker@zyasoft.com
>>
>>
>





--
Jim Baker
jbaker@zyasoft.com