
#206 Improve SmartQuote performance

None
closed-fixed
nobody
None
5
2024-04-09
2023-08-16
No

Profiling a representative sphinx-build (10 x docutils/docs/ref/rst/restructuredtext.txt, dummy builder) with py-spy shows, in the attached flamegraph, that the smartquote transform accounts for over 22% of the build time!

This PR attempts to improve that situation (at least down to 18%) by caching regex compilation.

3 Attachments

Discussion

  • Günter Milde

    Günter Milde - 2023-08-18

    Thank you for the patch.

    I wonder why there is a considerable performance hit despite the documentation saying that

    The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

    and Docutils uses far fewer than re._MAXCACHE == 512 regular expressions.
    Maybe "re.sub" is not a module-level matching function? OTOH, the doc says:

    Pattern.sub(repl, string, count=0)
    Identical to the sub() function, using the compiled pattern.
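A small, hedged experiment can help separate the two cases discussed here. The pattern, sample text, and function names below are made up for illustration. The module-level `re.sub()` with a string pattern and `Pattern.sub()` on a pre-compiled pattern give the same result; the module-level function additionally looks the pattern up in the cache on every call. Timings depend on the machine and Python version, so none are claimed here.

```python
import re
import timeit

# Illustrative pattern: a straight double quote directly before a word character.
PATTERN = r'"(?=\w)'
COMPILED = re.compile(PATTERN)
TEXT = 'He said "hello" and "goodbye".'


def with_module_function(text=TEXT):
    # String pattern: re.sub() fetches the compiled pattern from re's cache.
    return re.sub(PATTERN, '\u201c', text)


def with_compiled_pattern(text=TEXT):
    # Pre-compiled pattern: no cache lookup is needed.
    return COMPILED.sub('\u201c', text)


if __name__ == '__main__':
    for func in (with_module_function, with_compiled_pattern):
        seconds = timeit.timeit(func, number=50_000)
        print(f'{func.__name__}: {seconds:.3f} s')
```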

    The 22% refers to the parse (or, more precisely, parse+transform) time rather than the build time in a real-world use case (as the dummy builder is more efficient than, say, an HTML builder). Still, 22% of the time to create a document tree is impressive.

    Could you test the attached simplified version of your patch?
    (If the unconditional pre-compilation is considered too wasteful in case "smartquotes" are switched off, I'd rather consider a conditional import of the "smartquotes" module.)

    Another improvement may be achieved by simplifying the regexps themselves:
    The current version is taken from the "SmartyPants" module that also cares for HTML input and checks for HTML character entities (such as those for an en dash or a non-breaking space).

     
    • Günter Milde

      Günter Milde - 2023-11-19

      It turned out that "smartquotes" already used mostly pre-compiled regexes but repeated the compilation with every call to educateQuotes(). Replacing these pre-compilations with direct calls allowed Python's caching to kick in and improved performance a bit. Pre-compiling at module import (as proposed in the patch) improved it further, as did simplifying the regular expressions and introducing a preliminary test for quotes to "educate".
      After these optimizations, the time spent on "smartquotes" went down from 20% to 10% of the time "buildhtml.py" requires to build the Docutils documentation. (Tested with py-spy before the changes, after the changes, and with the option --smart-quotes=no.)
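The two techniques named above, pre-compiling at module import and a preliminary test for quotes, can be sketched as follows. This is a much simplified illustration, not the actual Docutils code: the function name `educate_quotes_sketch`, the patterns, and the quote rules are assumptions made for the example.

```python
import re

# Compiled once, when the module is imported (hypothetical, simplified rules):
# a straight quote at the start of the text or after whitespace or an
# opening bracket is treated as an opening quote.
_OPEN_DOUBLE = re.compile(r'(?<![^\s(\[{])"')
_OPEN_SINGLE = re.compile(r"(?<![^\s(\[{])'")


def educate_quotes_sketch(text):
    """Convert straight quotes to typographic quotes (simplified sketch)."""
    # Preliminary test: skip all regex work if there is nothing to convert.
    if '"' not in text and "'" not in text:
        return text
    text = _OPEN_DOUBLE.sub('\u201c', text)
    text = text.replace('"', '\u201d')   # remaining double quotes close
    text = _OPEN_SINGLE.sub('\u2018', text)
    text = text.replace("'", '\u2019')   # remaining single quotes close
    return text


if __name__ == '__main__':
    print(educate_quotes_sketch('She said "hello" and didn\'t wait.'))
```

Text without any straight quotes is returned unchanged after two cheap substring checks, which is where the preliminary test saves time on typical documents.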

       
  • Günter Milde

    Günter Milde - 2023-11-19
    • status: open --> open-fixed
     
  • Günter Milde

    Günter Milde - 2024-04-09
    • Status: open-fixed --> closed-fixed
     

