Thread: [Docutils-users] reStructuredText parser doesn't scale linearly with the input file size

Hello all,

I have been trying to use rst2pdf to produce potentially very large
documents containing patches, such as those produced by a ``cvs diff''
command.

Unfortunately, I have observed that rst2pdf's run time doesn't seem to
scale linearly with the size of the input file, and when files
get large, it takes so long to run that it becomes
completely unusable for me (one such run took ~9 hours). This
behaviour is easy to observe with ~2MB .rst files, at least on
the hardware I'm using. Here are some data points:

$ ls -l foo*.rst
-rw-r--r-- 1 mhe dev 3900918 Jan  3 15:34 foo2.rst
-rw-r--r-- 1 mhe dev 1950459 Jan  3 15:33 foo.rst

The foo2.rst file is the foo.rst file concatenated with itself, and is
thus exactly two times the size of foo.rst.

$ time rst2pdf -c -o foo.pdf -s doc/sdlc.stylesheet \
    --custom-cover=doc/cover.tmpl foo.rst

real 9m1.998s
user 8m27.206s
sys 0m33.631s

$ time rst2pdf -c -o foo.pdf -s doc/sdlc.stylesheet \
    --custom-cover=doc/cover.tmpl foo2.rst

real 39m6.937s
user 37m17.725s
sys 1m45.256s
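
As a rough check of the scaling, the two wall-clock ("real") times above
can be turned into an apparent exponent k, assuming run time grows like
size**k. This is only a two-point estimate under that power-law
assumption, and the helper name to_seconds is made up for the sketch:

```python
import math

def to_seconds(minutes, seconds):
    """Convert a `time`-style XmY.YYYs reading to seconds."""
    return minutes * 60 + seconds

t_small = to_seconds(9, 1.998)    # foo.rst,  1950459 bytes
t_large = to_seconds(39, 6.937)   # foo2.rst, 3900918 bytes

size_ratio = 3900918 / 1950459    # foo2.rst is foo.rst twice over
time_ratio = t_large / t_small

# Assuming time ~ size**k, solve for k from the two measurements.
exponent = math.log(time_ratio) / math.log(size_ratio)

print(f"size ratio: {size_ratio:.2f}")
print(f"time ratio: {time_ratio:.2f}")
print(f"apparent exponent: {exponent:.2f}")
```

Doubling the input multiplies the run time by a bit more than four, so
the apparent exponent comes out slightly above 2, which is what roughly
quadratic behaviour would look like.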

While trying to pinpoint where exactly the problem lies, I have seen
that using rst2latex or rst2html instead of rst2pdf yields a similar
behaviour, so that seems to indicate a problem within the
reStructuredText parser itself, which is why I'm actually mailing you
guys.

I used cProfile to profile these rst2pdf runs. Here are the first
few lines of the profiling statistics, sorted by internal time and
then by cumulative time:

Wed Jan  4 10:49:32 2012    rst2pdf.prof

        69453616 function calls (67946896 primitive calls) in 628.887 CPU seconds

  Ordered by: internal time, cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  541748  139.312    0.000  192.712    0.000 statemachine.py:1110(__getitem__)
   21410   86.406    0.004  375.675    0.018 states.py:2278(explicit_list)
  125996   51.815    0.000   52.821    0.000 statemachine.py:1054(__init__)
  707880   46.462    0.000   89.487    0.000 statemachine.py:690(make_transitions)
   21535   46.456    0.002   46.456    0.002 {method 'index' of 'list' objects}
   21715   39.820    0.002   39.820    0.002 {method 'remove' of 'list' objects}
 4058512   13.771    0.000   40.898    0.000 statemachine.py:657(make_transition)
 3742055   10.998    0.000   14.382    0.000 re.py:229(_compile)
  707880   10.829    0.000   12.005    0.000 statemachine.py:613(add_transitions)
 6067804    9.313    0.000    9.685    0.000 {hasattr}
    2931    8.950    0.003   46.976    0.016 states.py:1516(line_block)
  202437    6.270    0.000    6.270    0.000 {range}
   21410    5.165    0.000   98.773    0.005 misc.py:44(apply)
  140863    5.094    0.000   11.796    0.000 basenodehandler.py:169(findsubclass)
 4945645    4.641    0.000    5.049    0.000 {getattr}
  353940    4.547    0.000  118.871    0.000 states.py:212(__init__)
 3515957    4.294    0.000   17.725    0.000 re.py:188(compile)
  353940    3.980    0.000  112.378    0.000 statemachine.py:559(__init__)
  353940    3.899    0.000   90.450    0.000 statemachine.py:606(add_initial_transitions)
 2826759    3.851    0.000    3.851    0.000 {isinstance}
955648/16    3.373    0.000    4.450    0.278 nodes.py:189(_fast_traverse)
 4039749    3.254    0.000    3.254    0.000 {method 'get' of 'dict' objects}
  109975    3.246    0.000    4.496    0.000 nodes.py:436(__init__)
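
For anyone wanting to produce a report in the same format, here is a
minimal sketch using the standard cProfile and pstats modules. The
workload() function is a made-up stand-in for the real rst2pdf run, and
profiling in-process like this is only one way to do it; a saved file
such as rst2pdf.prof can be passed to pstats.Stats() by name instead of
the profiler object.

```python
import cProfile
import io
import pstats

def workload(n):
    """Hypothetical stand-in for the real job being profiled."""
    lines = list(range(n))
    total = 0
    for offset in range(0, n, 10):
        total += len(lines[offset:])
    return total

# Collect profiling data for a single call.
profiler = cProfile.Profile()
profiler.enable()
workload(2000)
profiler.disable()

# Sort by internal time first, then cumulative time, and keep the
# first 10 entries, writing the report to a string buffer.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("time", "cumulative").print_stats(10)

report = out.getvalue()
print(report)
```

The "tottime" column is time spent inside the function itself, while
"cumtime" also includes the functions it calls, which is why
explicit_list() shows 86 seconds of its own but 375 seconds cumulative.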

The explicit_list() method looks particularly suspicious to me; more
specifically, this code:

       newline_offset, blank_finish = self.nested_list_parse(
             self.state_machine.input_lines[offset:],
             input_offset=self.state_machine.abs_line_offset() + 1,
             node=self.parent, initial_state='Explicit',
             blank_finish=blank_finish,
             match_titles=self.state_machine.match_titles)

...looks like it slices the whole input file from the current offset
to the *end*, thus copying a lot of data in the process, and it's
getting called a lot in my case. As far as I understand this code, it
would definitely explain the behaviour I'm observing. Also, there are
several other places in the code that do similar things with the
input.
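
The cost of that pattern can be sketched without docutils at all. The
example below counts how many elements get copied when a parser slices
lines[offset:] once per block. It uses a plain Python list as a stand-in
for input_lines, and the function name and the "one block every 50
lines" figure are assumptions made for illustration; whether the real
line container copies in the same way is part of the guess.

```python
def copied_lines(total_lines, block_every):
    """Count elements copied when `lines[offset:]` is taken once every
    `block_every` lines. A plain list stands in for input_lines."""
    lines = ["x"] * total_lines
    copied = 0
    for offset in range(0, total_lines, block_every):
        tail = lines[offset:]   # copies everything from offset to the end
        copied += len(tail)
    return copied

small = copied_lines(10_000, 50)
large = copied_lines(20_000, 50)
ratio = large / small

print(f"lines copied for 10000-line input: {small}")
print(f"lines copied for 20000-line input: {large}")
print(f"ratio: {ratio:.2f}")
```

Doubling the input doubles both the number of slices and the average
slice length, so the total amount copied grows about fourfold. That is
in the same range as the timing ratio measured between foo.rst and
foo2.rst.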

This is mostly a guess though, and I'm not claiming to have understood
exactly what the problem is. Unfortunately, I see no easy way to test
this theory without making large changes throughout the code...

Of course, I can send the .rst files mentioned above, or the data from
the profiling run if necessary; just ask me.

Thanks in advance,
Maxime Henrion

PS: I'm not subscribed to the mailing list, so please keep me Cc'ed in
your replies.
