From: Jim B. <jim...@py...> - 2015-12-10 20:02:41
Toby,

What I observed when running a slightly modified version of the unquote test
code is that it is quadratic (O(n^2)) with respect to the length of the
string n. (Please note, you didn't specify a Jython version, but 2.7 is what
we would urge anyone to use now, and certainly what we would work on for
performance updates.)

But no surprise there: urllib.unquote is using a pattern of appending to a
string with in-place string concatenation (using +=), which does not work
well on Jython, and in general on Java. (The fact that indexing is somewhat
slower is probably not relevant here.) For the underlying Java API, we would
need to keep the string in a StringBuilder; for Python, we would typically
use StringIO. Please note that PEP 8 even specifically recommends against
this usage for Python code:

> For example, do not rely on CPython's efficient implementation of in-place
> string concatenation for statements in the form a += b or a = a + b. This
> optimization is fragile even in CPython (it only works for some types) and
> isn't present at all in implementations that don't use refcounting. In
> performance sensitive parts of the library, the ''.join() form should be
> used instead. This will ensure that concatenation occurs in linear time
> across various implementations.

(https://www.python.org/dev/peps/pep-0008/#programming-recommendations)

I don't believe this issue has come up until now, however, because
urllib.unquote is most likely used for short strings in many/most
applications. But the poor performance could be readily fixed by using
StringIO instead.

- Jim

On Wed, Dec 9, 2015 at 8:27 PM, Toby Collett <Tob...@cr...> wrote:

> Hi all,
>
> I have been investigating a performance issue with a large data chunk
> being sent via an http POST request. The outcome is that the bottleneck
> seems to be in urlparse.unquote. The data is about 6MB of XML, which
> takes on the order of 10 mins to run through urlparse.unquote().
> This occurs in seconds in the cPython implementation.
>
> The performance difference seems to come down to slicing performance.
> For reference the unquote code is included below:
>
> def unquote(s):
>     """unquote('abc%20def') -> 'abc def'."""
>     res = s.split('%')
>     # fastpath
>     if len(res) == 1:
>         return s
>     s = res[0]
>     for item in res[1:]:
>         try:
>             s += _hextochr[item[:2]] + item[2:]
>         except KeyError:
>             s += '%' + item
>         except UnicodeDecodeError:
>             s += unichr(int(item[:2], 16)) + item[2:]
>     return s
>
> To investigate the slicing performance I ran some simple test code
> through jython and python. Timing is very ad hoc (no effort made to
> ensure the same level of background activity etc.), but the magnitude of
> the difference is large (3 times or more), and it is even more obvious
> when running the full unquote code.
>
> Is this the expected performance of jython slicing, or is there
> something that can be improved?
>
> Regards,
> Toby
>
> The test code
> ======
> import time
> for x in range(4, 7):
>     start = time.time()
>     print x
>     len([b[:2] + b[2:] for b in [u'%3casdf%3effff']*10**x])
>     print time.time() - start
> =======
> results
> $ python /tmp/a.py
> 4
> 0.00148105621338
> 5
> 0.0163309574127
> 6
> 0.152363061905
>
> $ jython /tmp/a.py
> 4
> 0.0190000534058
> 5
> 0.0709998607635
> 6
> 0.469000101089
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Jython-users mailing list
> Jyt...@li...
> https://lists.sourceforge.net/lists/listinfo/jython-users
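The linear-time rewrite Jim describes (accumulate the decoded pieces in a
list and call ''.join() once, instead of repeated += concatenation) can be
sketched roughly as follows. This is not the actual urllib patch: the name
unquote_join is made up for illustration, the _hextochr table is rebuilt
locally so the example is self-contained, and the Python 2 UnicodeDecodeError
branch is omitted so the sketch also runs on Python 3.

```python
# Sketch of a linear-time unquote using a list plus a single ''.join(),
# per the PEP 8 recommendation quoted above. Names here are illustrative,
# not from the real urllib source.

# Map every two-digit hex escape (any case mix) to its character,
# mirroring what urllib's internal _hextochr table provides.
_hexdig = '0123456789abcdefABCDEF'
_hextochr = dict((a + b, chr(int(a + b, 16))) for a in _hexdig for b in _hexdig)

def unquote_join(s):
    """unquote_join('abc%20def') -> 'abc def'."""
    res = s.split('%')
    if len(res) == 1:        # fast path: no escapes at all
        return s
    parts = [res[0]]         # collect decoded chunks; join once at the end
    for item in res[1:]:
        try:
            parts.append(_hextochr[item[:2]])
            parts.append(item[2:])
        except KeyError:
            # not a valid two-digit hex escape: keep the '%' literally
            parts.append('%')
            parts.append(item)
    return ''.join(parts)    # single concatenation: linear, not quadratic
```

Each append is amortized O(1) and the final join is O(n), so the whole
decode is linear in the input length on CPython and Jython alike, which is
exactly the property the quadratic += version lacks.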