#159 Bad performance compared to PyYAML

open
nobody
None
major
enhancement
2020-01-27
2020-01-23
No

I just tried to convert my project from PyYAML to ruamel.yaml in the hope of using a better maintained alternative - see this branch if you're curious.

Unfortunately, my testsuite seemed to take quite a bit longer than usual, and a closer look confirmed that suspicion: running only the part of the testsuite related to the config took 34s instead of 14s, and a benchmark test which reads configdata.yml takes a median of 705ms instead of 28ms.

I did a quick test with timeit on that file, and while the difference doesn't seem to be as big, it's definitely noticeable as well:

$ python3 -m timeit -s 'import yaml' 'with open("configdata.yml") as f: yaml.load(f)'                                     
10 loops, best of 3: 202 msec per loop
$ python3 -m timeit -s 'from ruamel import yaml; import pathlib' 'y = yaml.YAML(); y.load(pathlib.Path("configdata.yml"))'
10 loops, best of 3: 678 msec per loop

This is with Python 3.6.2, ruamel.yaml 0.15.33 installed via pip, and I think with the C extensions (or at least python3 -c "from ruamel.yaml import CLoader" works fine).

(originally posted on 2017-09-20 at 05:34:00 by Florian Bruhin <The-Compiler@bitbucket>)

Discussion

  • Anthon van der Neut

    Thanks for considering ruamel.yaml and the easily reproducible issue.

    Initially I thought the problem was in the invocation: YAML(), without parameters, uses the round-trip loader, and that has overhead compared to the "normal" loaders that don't preserve comments etc. A more appropriate comparison would be to use:

    python3 -m timeit -s 'from ruamel import yaml; import pathlib' 'y = yaml.YAML(typ="unsafe", pure=True); y.load(pathlib.Path("configdata.yml"))'
    

    This gives 1.18s per loop on my machine (your machine is faster than mine: here your PyYAML timeit command runs in 384ms and your ruamel.yaml command, using the default round-trip loader, runs in 1.49s).

    But you should be using the parameter typ='safe' in this case with ruamel.yaml:

    python3 -m timeit -s 'from ruamel import yaml; import pathlib' 'y = yaml.YAML(typ="safe"); y.load(pathlib.Path("configdata.yml"))'
    10 loops, best of 3: 41.1 msec per loop
    

    This gives you the CLoader (which currently only supports YAML 1.1, but that should be OK for your source). This could give you around 21.6ms ((202/384) * 41.1) on your machine.
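
The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes that load times scale linearly between the two machines, and the variable and function names are made up for illustration. The numbers are the timings quoted in this thread, in milliseconds.

```python
# Timings quoted in the thread, in milliseconds.
pyyaml_reporter_ms = 202.0    # PyYAML yaml.load() on the reporter's machine
pyyaml_maintainer_ms = 384.0  # the same command on the maintainer's machine
safe_maintainer_ms = 41.1     # YAML(typ="safe") on the maintainer's machine


def scale_timing(measured_ms, reference_here_ms, reference_there_ms):
    """Estimate a timing on another machine from a shared reference benchmark.

    Assumes both machines differ by a constant speed factor (a simplification).
    """
    return measured_ms * (reference_there_ms / reference_here_ms)


estimate_ms = scale_timing(safe_maintainer_ms, pyyaml_maintainer_ms, pyyaml_reporter_ms)
print(f"estimated typ='safe' load on the reporter's machine: {estimate_ms:.1f} ms")
```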

    Please note that there is no CRoundTripLoader (yet), but that is definitely planned. In other words yaml=YAML() is currently always a pure Python loader.

    I have not looked at speed that much, and I am not sure which of my changes makes YAML(typ='unsafe', pure=True) slower than the equivalent PyYAML. It seems it partly has to do with the new API because:

    python3 -m timeit -s 'from ruamel import yaml; import pathlib' 'with open("configdata.yml") as f: yaml.safe_load(f)'
    

    gives me 815ms. So it looks like I need to get up to speed with profiling. There might be some round-trip specific 'stuff' that needs to be moved out of the more basic loader classes (probably at the cost of some code duplication).
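
One possible way to start such profiling is the standard library's cProfile module. The helper below is a minimal sketch, not something taken from ruamel.yaml: `profile_call` is a hypothetical name, and it simply runs any callable under the profiler and returns the hottest entries sorted by cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns a tuple (result, report_text), where report_text lists the
    `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical usage with the file from this thread (requires ruamel.yaml):
#   import pathlib
#   from ruamel.yaml import YAML
#   loader = YAML(typ="unsafe", pure=True)
#   _, report = profile_call(loader.load, pathlib.Path("configdata.yml"))
#   print(report)
```

Comparing the reports for the pure Python loader and for the old-style safe_load() call should show where the extra time is spent.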

    BTW, in your PyYAML code you are using yaml.load(), which is documented to be unsafe on uncontrolled input. If you continue to use PyYAML, at least switch to safe_load().
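
To illustrate the difference, here is a small sketch assuming PyYAML is installed. The sample documents are invented for this example: safe_load() builds plain Python types from ordinary YAML, but refuses a document that uses a Python-specific tag instead of executing it.

```python
import yaml  # PyYAML

# An ordinary document loads into plain dicts, strings and ints.
document = "name: example\nretries: 3\n"
config = yaml.safe_load(document)
print(config)

# A document carrying a Python-specific tag is rejected by safe_load()
# rather than being turned into a function call.
suspicious = "!!python/object/apply:os.system ['echo unsafe']\n"
rejected = False
try:
    yaml.safe_load(suspicious)
except yaml.YAMLError as exc:
    rejected = True
    print("rejected:", type(exc).__name__)
```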

    (originally posted on 2017-09-20 at 07:05:07)

     
  • Anthon van der Neut

    some minor speed ups through removal of indirection overhead, re #159

    → changeset c71d3e512c00

    (originally posted on 2018-08-20 at 22:09:23)

     
  • Anthon van der Neut

    caching indirected method call for minor speed improvements on reading, re #159

    → changeset 2902663179a2

    (originally posted on 2018-09-01 at 15:54:38)

     
  • Anthon van der Neut

    • Status: unread --> open
     
