#71 djvutoxml is very slow

libdjvu (18)
Tim Starling

djvutoxml is very slow, because it decodes every single image in the document. This is unnecessary to obtain the information it outputs. DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=false, which causes the page to be decoded to an image.

djvudump provides almost identical information, but is implemented efficiently. DjVuDocument::writeDjVuXML() should be rewritten to work chunk-by-chunk, like DjVuDumpHelper::dump(), instead of image by image.

Simple demonstrative benchmark on a 210-page, 17 MB file:

[0311][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvutoxml Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.xml

real 0m52.010s
user 0m50.051s
sys 0m0.283s

[0314][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvudump Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.dump

real 0m0.268s
user 0m0.179s
sys 0m0.008s


  • Tim Starling

    Logged In: YES
    Originator: YES

    Sorry, that should have read "DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=true"

  • Leon Bottou

    Logged In: YES
    Originator: NO

    Assigned to docbill, who is the author of djvutoxml.
    Unfortunately, I am not sure he has much time to work on this.

    A potential workaround is using djvused,
    but the output format will be slightly different.
    I understand that this can disrupt your workflow.

    * "djvused file.djvu -e ls"
    gives the page list.
    * "djvused file.djvu -e dump"
    gives the full structure.
    * "djvused file.djvu -e 'select <n> ; print-txt'"
    gives the hidden text with location for page <n>.
    * "djvused file.djvu -e 'select <n> ; print-ant'"
    gives the annotations and hyperlinks for page <n>.
    * "djvused file.djvu -e 'print-outline'"
    gives the document outline.
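
For scripted use, the djvused calls listed above could be wrapped along these lines. This is only a rough sketch: the helper names (djvused_args, page_script, run_djvused) are made up for illustration, and it assumes djvused is on the PATH. The only djvused options used are the ones quoted above.

```python
import subprocess


def djvused_args(path, script):
    """Build the argv list for one djvused invocation (djvused FILE -e SCRIPT)."""
    return ["djvused", path, "-e", script]


def page_script(page, command):
    """Select page <n>, then run a per-page command such as print-txt or print-ant."""
    return f"select {page}; {command}"


def run_djvused(path, script):
    """Run djvused and return its standard output as text.

    Raises subprocess.CalledProcessError if djvused exits with an error.
    """
    result = subprocess.run(
        djvused_args(path, script),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, run_djvused("file.djvu", page_script(3, "print-txt")) would request the hidden text of page 3, and run_djvused("file.djvu", "print-outline") the document outline.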

    The latter three commands use an s-expression format
    that is as expressive as XML (and much older, in fact).
    There are two easy ways to parse these s-expressions:
    * Use a lisp interpreter, such as the one in djvulibre-3.5/doc/minilisp.
    * Directly use the files miniexp.{h,cpp} from djvulibre-3.5/libdjvu.
    These files can be compiled without the rest of the library.
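
If neither option is convenient, a small hand-written reader is a third possibility. The following is a minimal sketch, not a port of miniexp: it assumes the output contains only lists, integers, bare symbols and double-quoted strings with backslash escapes (octal escapes are treated as UTF-8 bytes). Symbols and strings both come back as Python str, so that distinction is lost. The sample input in the usage note is made up for illustration.

```python
import re

# One token: "(", ")", a double-quoted string, or a bare atom.
TOKEN = re.compile(r'''\s*(?:(\()|(\))|"((?:[^"\\]|\\.)*)"|([^\s()"]+))''', re.S)
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r"}


def unescape(body):
    """Decode backslash escapes inside a quoted string."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
            continue
        octal = re.match(r"[0-7]{1,3}", body[i + 1:i + 4])
        if octal:
            # Octal escape: one raw byte of a UTF-8 sequence.
            out.append(int(octal.group(), 8) & 0xFF)
            i += 1 + len(octal.group())
        else:
            nxt = body[i + 1]
            out += SIMPLE_ESCAPES.get(nxt, nxt).encode("utf-8")
            i += 2
    return out.decode("utf-8", errors="replace")


def parse(text):
    """Parse s-expressions into nested Python lists; returns the top-level items."""
    text = text.rstrip()
    stack = [[]]
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group(1):
            stack.append([])
        elif m.group(2):
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif m.group(3) is not None:
            stack[-1].append(unescape(m.group(3)))
        else:
            atom = m.group(4)
            stack[-1].append(int(atom) if re.fullmatch(r"-?\d+", atom) else atom)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]
```

Calling parse(r'(page 0 0 2550 3300 (word 300 3000 500 3100 "Hello"))') yields one nested list whose first element is the symbol name, followed by the four coordinates and the child entries.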

    Djvused can also do the opposite, that is import text,
    annotations and outlines into an existing document.

    Useful pointers:
    - See man page djvused(1).
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/tools/djvutxt.cpp
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/ddjvuapi.h
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/miniexp.h
    - See the ddjvu_anno_get_xxx functions in ddjvuapi.cpp

  • Logged In: YES
    Originator: NO

    Thanks for the suggestion. I have approximately 16 hours a month dedicated to open source projects. With my current schedule of commitments, I expect to be able to resolve this within the next 6 weeks. If you need something sooner, let me know and I will see if I can swap some of my other commitments.


  • Tim Starling

    Logged In: YES
    Originator: YES

    Thanks for the workaround suggestions. I have decided to use djvudump for the time being. I have a PHP function which takes the djvudump output and converts it to an XML format similar to the output of djvutoxml. It is convertDumpToXML(), available here:


    I imagine we will switch back to using djvutoxml once it is fixed, since it is a better-documented, more stable and more easily parseable format, so it should be superior in the long run. I would prefer to see it fixed before you make any potentially incompatible changes to the djvudump output, but other than that, it's not urgent.
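
To illustrate the general idea of the dump-to-XML conversion (this is not the PHP convertDumpToXML() function, just a rough Python sketch), one could scan the dump for per-page INFO lines and emit one element per page. Two things here are assumptions: that each INFO line looks like "INFO [10]  DjVu 2550x3300, v24, 300 dpi, gamma=2.2", and the element and attribute names, which only loosely imitate djvutoxml output. The function name dump_to_xml is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout of a djvudump INFO line: "INFO [10]  DjVu WxH, vNN, DPI dpi, ..."
INFO_LINE = re.compile(r"INFO \[\d+\]\s+DjVu (\d+)x(\d+), v\d+, (\d+) dpi")


def dump_to_xml(dump_text):
    """Build a small XML tree with one OBJECT element per INFO line found."""
    root = ET.Element("DjVuXML")
    body = ET.SubElement(root, "BODY")
    for match in INFO_LINE.finditer(dump_text):
        width, height, dpi = match.groups()
        page = ET.SubElement(body, "OBJECT", width=width, height=height)
        ET.SubElement(page, "PARAM", name="DPI", value=dpi)
    return root
```

ET.tostring(dump_to_xml(text), encoding="unicode") then gives the serialized result. Because this reads only chunk headers, it never has to decode a page image, which is where the speed difference comes from.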