#374 man: reduce memory usage as much as possible

Milestone: output: manpages
Status: closed
Labels: XSL (399)
Priority: 5
Updated: 2008-03-27
Created: 2007-02-16
Private: No

Hi,

I am getting HUGE amounts of memory usage when using xsltproc with the latest releases of docbook-xsl.

The input XML file is ~563Kb in size.

Versions 1.72.0 & 1.70.1 of docbook-xsl cause xsltproc to consume MORE THAN 1.5 GIGABYTES of memory when outputting in manpage format.

Compare this with version 1.68.1-1 of docbook-xsl (installed from an .rpm on my Fedora 4 system), which consumes ONLY 57 MEGABYTES of memory.
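For anyone wanting to reproduce figures like these: on Linux, a process's peak memory can be read from /proc (this is Linux-specific; `/usr/bin/time -v` reports the same figure where GNU time is installed). A minimal sketch, shown against the current shell for brevity - substitute the xsltproc PID:

```shell
# Linux-specific: peak virtual memory from /proc.
# Shown on the current shell ("self"); for xsltproc, start it in the
# background and read /proc/<pid>/status instead.
grep VmPeak /proc/self/status
```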

I'm happy to assist in debugging this issue, but I'll need some guidance - email me if I can be of use.

The input file I used is here:
http://www.aao.gov.au/local/www/ss/tmp/fvwm.1.xml

LaPSS>> xsltproc -version
Using libxml 20619, libxslt 10114 and libexslt 812
xsltproc was compiled against libxml 20619, libxslt 10114 and libexslt 812
libxslt 10114 was compiled against libxml 20619
libexslt 812 was compiled against libxml 20619

SCoTT. :)

Discussion

  • Scott Smedley

    Scott Smedley - 2007-02-16

    Logged In: YES
    user_id=370510
    Originator: YES

    If I comment out the 28 <substitution ...> lines in manpages/param.xsl the memory usage reduces to ~80Mb. Obviously the output is incorrect, but it indicates the area that is using so much memory. I've no idea how to fix it.

    Scott.

     
  • Michael(tm) Smith

    • assigned_to: nobody --> xmldoc
    • summary: HUGE memory usage with latest XSL releases --> man: HUGE memory usage with latest XSL releases
     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Scott,

    Thanks for the heads-up about this. Unfortunately, I don't have any good idea how to fix it either. :(

    What those <substitution> instances do is cause the XSLT processor to read in the entire contents of your output and do string search-and-replace on those contents (in order to deal with some characters that are treated as special by troff/groff, and to do some cleanup that would otherwise be very difficult). The problem is that there is no standard way to do that kind of string replacement in XSLT 1.0, and the way I have the stylesheet doing it now is the only way I know how -- though I recognize that it's very inefficient. It seems to cause the XSLT processor to use as much memory as it can get.
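    For illustration only - this is not the stylesheet's actual code, and the two escapes shown are just common groff conventions rather than the stylesheet's exact list - the substitution pass behaves roughly like running sed over the whole rendered stream:

```shell
# Illustration: whole-stream escaping of troff-special characters,
# in the spirit of the <substitution> pass (escapes are examples only).
printf 'a-b \\ c\n' | sed -e 's/\\/\\e/g' -e 's/-/\\-/g'
# prints: a\-b \e c
```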

    I will do some profiling to see if I can mitigate some of the memory issues. But going back to the way that the 1.68.1 stylesheet was doing it is not an option -- because it produced output bugs in many cases, and the current string-substitution mechanism is designed to fix those bugs. Unfortunately, I think that the limitations of XSLT 1.0 make this a real PITA to deal with.

     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    It looks like support for the EXSLT str:replace function has recently been added to libxslt/xsltproc -

    http://article.gmane.org/gmane.comp.gnome.lib.xslt/3267

    But I have no idea how long it will be before a new libxslt/xsltproc release with that support will be available (nor how long it will be after that before a new Debian package for it is available).

    So in the meantime, I'll look at my current string-substitution code and see if there's anything I can do to make it use less memory. I doubt that there is. I think it's just an inevitable side effect of the approach I'm using -- which requires reading the entire rendered output into memory multiple times and iterating over the entire contents each time to do the string replacements.

    The unfortunate fact is that XSLT 1.0 is not really designed to do string replacements efficiently. It would be much more efficient to do the post-processing using perl or sed or something. But the problem with that is that we have a requirement that the DocBook Project stylesheets have no dependencies other than an XSLT engine. So I can't introduce a required post-processing step using other tools. What I /could/ potentially do is ship a perl or sed script in the distribution that you could optionally use instead (and provide a parameter in the stylesheets for easily disabling the XSLT-based string substitution).

    Let me know what you think of that idea.
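    As a rough sketch of how that could look (every name here is hypothetical - the parameter, the script file, and the substitution list are made up for illustration, not taken from the distribution):

```shell
# Hypothetical flow: a shipped sed script does the cleanup instead of
# the in-stylesheet substitution. All names here are illustrative.
printf '%s\n' 's/\\/\\e/g' 's/-/\\-/g' > manpages-fixup.sed
# A real run might look like (parameter name is made up):
#   xsltproc --param some.subst.param 0 manpages/docbook.xsl doc.xml \
#     | sed -f manpages-fixup.sed > doc.1
# Demonstrated here on a sample line:
printf -- '--help\n' | sed -f manpages-fixup.sed
# prints: \-\-help
```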

     
  • Scott Smedley

    Scott Smedley - 2007-02-20

    Logged In: YES
    user_id=370510
    Originator: YES

    Hi Michael,

    > What I /could/ potentially do is
    > ship a perl or sed script in the distribution that you could optionally
    > instead (and provide a parameter in the stylesheets for easily disabling
    > the XSLT-based string substitution).

    Yes, that's effectively what I'm doing as a (temporary?) workaround. I hacked the stylesheets to bypass string substitution & wrote a simple perl script to do it afterwards. See: http://www.aao.gov.au/local/www/ss/tmp/string.subst.pl

    A parameter to turn it on/off would be better though.

    > Let me know what you think of that idea.

    It sounds like the best possible solution, given the constraints. Personally, I'd use it. Given the amount of memory the XSLT string substitution consumes, I'm sure other users would prefer (require?) a post-processing option too.

    Scott. :)

     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Scott,

    If possible, can you please upload/attach your XML source file? I would like to test with it if I can.

    --Mike

     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Sorry, I didn't read your description... I'll just download the source from the URL you provided.

    --Mike

     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Scott,

    The latest snapshot contains a significant change I made to try to resolve this.
    If possible, can you please try running your source through the latest snapshot.

    You can test with the snapshot by doing the following:

    - Change to some directory where you have perms to write files, e.g. /opt/scratch, download the snapshot into it, unzip the snapshot, run the install.sh script, then run your transform.

    For example,

    cd /opt/scratch
    wget http://docbook.sourceforge.net/snapshots/docbook-xsl-snapshot.zip
    unzip docbook-xsl-snapshot.zip
    ./docbook-xsl-snapshot/install.sh --batch
    . /opt/scratch/docbook-xsl-snapshot/.profile.incl

    That will point your catalog system at the snapshot. Then you can run your transformation by doing this:

    xsltproc http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl fvwm.1.xml

    ... and that /should/ cause xsltproc to map/resolve the remote stylesheet URL to /opt/scratch/docbook-xsl-snapshot/manpages/docbook.xsl
    ... in which case if you vim/less your ./fvwm.1 output file, you should see this near the top:

    Generator: DocBook XSL Stylesheets vsnapshot_6668

    (or whatever the current snapshot build number is at the time you read this)

    But if that doesn't work, then you can always just do:

    xsltproc /opt/scratch/docbook-xsl-snapshot/manpages/docbook.xsl fvwm.1.xml

    Anyway, please try if/when you have time, and let me know if it works better (without eating up all your available RAM...)

    --Mike

     
  • Scott Smedley

    Scott Smedley - 2007-03-14

    Logged In: YES
    user_id=370510
    Originator: YES

    Hi Michael,

    With the snapshot I downloaded on 28-Feb-2007, xsltproc is consuming ~135Mb of memory.

    Would you still like me to try the latest snapshot?

    Scott. :)

     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Scott,

    I think there's no need to try with the latest snapshot if you're
    already testing with a snapshot numbered 6657 or later -- because the
    latest relevant change I made to that code was at rev 6657 of the
    project svn repository.

    So my question now is whether the performance you're getting with the
    snapshot is acceptable to you? I know ~135Mb is still a lot of memory
    for just transforming a 0.5Mb document. Is it enough of an improvement
    that you can turn off the string-replace post-processing step you were
    using as an alternative? Or will you continue to use that?

    There may be some other places I can mess with to get it down a
    bit more. I don't really have a clear idea at this point about
    how much of a difference those changes might make.

    One thing to note is that the manpages stylesheet had to import
    the whole DocBook HTML stylesheet in order to run a transform.
    When I run a transform with the HTML stylesheet, it takes about
    75Mb of RAM. When I run one with the manpages stylesheet,
    it takes about 122Mb. So the maximum amount of memory I'd ever
    be able to reduce it by is some portion of the 47Mb difference
    between those two. Assuming I could figure out how to reduce
    it by, say, half (I'm thinking optimistically), that would
    amount to a reduction of maybe 24Mb.

    So in your environment, I guess even if I made the needed
    changes, you'd still be using 100Mb or so (instead of 135Mb)
    to run the transform.

    But if that amount of reduction in the RAM usage is important
    to you (to be able to not have to use/maintain your
    post-processing script), I can spend some time to try it.

    Also, can I ask how long the XSLT transformation takes to
    run in your environment? On my machine, it takes about 8 to
    10 seconds -- which is in range for what I see with other
    source docs of similar size. (Smaller docs can take just
    2 seconds or so).

    --Mike

     
  • Scott Smedley

    Scott Smedley - 2007-03-14

    Logged In: YES
    user_id=370510
    Originator: YES

    Hi Michael,

    I just realised that the ~135Mb figure I gave you was on a hacked version of the 27-Feb-07 snapshot. Unhacked, the memory usage is ~190Mb.

    So what did I hack? I commented out all the *ahem* crap in common/l10n.xml, i.e. I just kept the "en" stuff. That's all I changed.

    I just tried the same trick on the latest snapshot (13-March-2007). It gives the same numbers, i.e. ~190Mb of memory, or ~135Mb when I get rid of the superfluous l10n stuff.

    > So my question now is whether the performance you're getting with the
    > snapshot is acceptable to you? I know ~135Mb is still a lot of memory
    > for just transforming a 0.5Mb document.

    Yes. I have to confess it still sounds rather excessive. I'm a developer of the FVWM Window Manager, which consumes ~7Mb of RAM - so perhaps that has biased me! But it would be unfair of me to criticise given that I don't understand the intricacies of what xsltproc is actually doing.

    A little more information about my particular usage:

    The fvwm.1.xml file we're talking about here is just a test case. (I simply ran ESR's doclifter on the fvwm.1 man page.) My actual
    input files total ~1.2Mb. So that's more than twice the size of the fvwm.1.xml file.

    When I run xsltproc with my hacked version of the stylesheets, it actually consumes ~230Mb of memory. (I haven't checked how much it would be unhacked.) So, in short, I would _really_ like to obtain further memory reductions if it is at all possible.

    > One thing to note is that the manpages stylesheet had to import
    > the whole DocBook HTML stylesheet in order to run a transform.
    > When I run a transform with the HTML stylesheet, it takes about
    > 75Mb of RAM. When I run one with the manpages stylesheet,
    > it takes about 122Mb. So the maximum amount of memory I'd ever
    > be able to reduce it by is some portion of the 47Mb difference
    > between those two. Assuming I could figure out how to reduce
    > it by, say, half (I'm thinking optimistically), that would
    > amount to a reduction of maybe 24Mb.
    >
    > So in your environment, I guess even if I made the needed
    > changes, you'd still be using 100Mb or so (instead of 135Mb)
    > to run the transform.
    >
    > But if that amount of reduction in the RAM usage is important
    > to you (to be able to not have to use/maintain your
    > post-processing script), I can spend some time to try it.

    I would appreciate memory reductions of this size _enormously_!

    > Also, can I ask how long the XSLT transformation takes to
    > run in your environment?

    Ah! This is something else I was going to ask you about.

    > On my machine, it takes about 8 to 10 seconds

    What spec is your machine?

    It takes ~22 seconds on my system, which has plenty (1Gb) of RAM:

    LaPSS>> /bin/grep 'model name' /proc/cpuinfo
    model name : Intel(R) Pentium(R) M processor 1.73GHz

    Do you have any suggestions for how I might be able to make it run faster?

    Thanks again for all your help - it is much appreciated.

    Scott. :)

     
  • Scott Smedley

    Scott Smedley - 2007-03-14

    Logged In: YES
    user_id=370510
    Originator: YES

    Argh! Wait. I didn't read your last post properly.

    All I did was unzip the snapshot & run:

    xsltproc docbook-xsl-snapshot/manpages/docbook.xsl ./fvwm.1.xml

    Let me install it as you described & retry.

    Scott.

     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Scott,

    Actually, just running "xsltproc docbook-xsl-snapshot/manpages/docbook.xsl ./fvwm.1.xml"
    is all you need to do. I think you'll get the same results if you use the steps I
    suggested.

    I just normally suggest those steps because using the remote URL and
    mapping it through catalogs is the best way to do it if you want to
    script it and switch between production and snapshot versions of the
    stylesheets. If you afterwards just run
    "docbook-xsl-snapshot/uninstall.sh --batch", it will remove all the
    catalog pointers to the snapshot and get you back to however you had
    your environment set up before installing it.

    --Mike

     
  • Scott Smedley

    Scott Smedley - 2007-03-14

    Logged In: YES
    user_id=370510
    Originator: YES

    Hi Mike,

    > Actually, just running "xsltproc docbook-xsl-snapshot/manpages/docbook.xsl
    > ./fvwm.1.xml"
    > is all you need to do.

    I thought so too until I ran:

    strace -etrace=open -o /tmp/o xsltproc docbook-xsl-snapshot/manpages/docbook.xsl ./fvwm.1.xml

    & saw that xsltproc was reading files in /usr/share/sgml/docbook/.

    I tried to remove the docbook-dtds package but ran into dependency hell. Instead, I just ran xsltproc again with --novalid & apart from a few (probably unimportant) parse errors it still seemed to work ok. It's a bit strange to me that xsltproc should want to read files in /usr/share/sgml/docbook/, but it took just as long & consumed roughly the same amount of memory, so ...

    In short, everything I said previously still stands.

    Time for bed methinks.

    Scott. :)

     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Moving to feature requests as this is not strictly a bug.

    I would like to try to get the RAM usage down more if I could, but have not had time to work on it. At best, as I mentioned, I reckon I could probably only get it down to around 100Mb for this particular test case (and that is thinking really optimistically).

     
  • Michael(tm) Smith

    • labels: 321159 --> XSL
    • milestone: 541245 --> output: manpages
    • summary: man: HUGE memory usage with latest XSL releases --> man: reduce memory usage as much as possible
     
  • Michael(tm) Smith

    Logged In: YES
    user_id=118135
    Originator: NO

    Scott,

    This has been open for a year now and I've not come up with any brilliant ideas for getting the memory size down, so I'm going to close this for now.

    --Mike

     
  • Michael(tm) Smith

    • status: open --> closed
     
