#71 djvutoxml is very slow

libdjvu (18)

djvutoxml is very slow because it decodes every single image in the document, which is unnecessary for obtaining the information it outputs. DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=false, which causes the page to be decoded to an image.

djvudump provides almost identical information, but is implemented efficiently. DjVuDocument::writeDjVuXML() should be rewritten to work chunk-by-chunk, like DjVuDumpHelper::dump(), instead of image by image.

Simple demonstrative benchmark on a 210-page, 17 MB file:

[0311][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvutoxml Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.xml

real 0m52.010s
user 0m50.051s
sys 0m0.283s

[0314][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvudump Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.dump

real 0m0.268s
user 0m0.179s
sys 0m0.008s


  • Tim Starling

    Tim Starling - 2007-04-20

    Logged In: YES
    Originator: YES

    Sorry, that should have read "DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=true"

  • Leon Bottou

    Leon Bottou - 2007-04-20

    Logged In: YES
    Originator: NO

    Assigned to docbill who is the author of djvutoxml.
    Unfortunately I am not sure he has much time to work on this.

    A potential workaround is using djvused,
    but the output format will be slightly different.
    I understand that this can disrupt your workflow.

    * "djvused file.djvu -e ls"
    gives the page list.
    * "djvused file.djvu -e dump"
    gives the full structure.
    * "djvused file.djvu -e 'select <n> ; print-txt'"
    gives the hidden text with location for page <n>.
    * "djvused file.djvu -e 'select <n> ; print-ant'"
    gives the annotations and hyperlinks for page <n>.
    * "djvused file.djvu -e 'print-outline'"
    gives the document outline.

    The latter three commands use an s-expression format
    that is as expressive as XML (and, in fact, much older).
    There are two easy ways to parse these s-expressions:
    * Use a lisp interpreter, such as the one in djvulibre-3.5/doc/minilisp.
    * Directly use the files miniexp.{h,cpp} from djvulibre-3.5/libdjvu.
    These files can be compiled without the rest of the library.

    Djvused can also do the opposite, that is import text,
    annotations and outlines into an existing document.

    Useful pointers:
    - See man page djvused(1).
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/tools/djvutxt.cpp
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/ddjvuapi.h
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/miniexp.h
    - See the ddjvu_anno_get_xxx functions in ddjvuapi.cpp
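
For readers who would rather not embed a Lisp interpreter or miniexp, the s-expression output described above can also be read with a small hand-written parser. The following is a minimal sketch in Python, not part of DjVuLibre. The function names `parse_sexprs`, `words` and `page_text_zones` are hypothetical, and the sample input in the usage note is illustrative rather than captured from a real file. It assumes that word zones have the form `(word x1 y1 x2 y2 "text")`, and it only undoes simple backslash escapes; octal escapes such as `\303` are not decoded.

```python
import re
import subprocess

# One token: "(", ")", a double-quoted string, or a bare atom (symbol/number).
TOKEN = re.compile(r'\s*(?:(\()|(\))|"((?:[^"\\]|\\.)*)"|([^\s()"]+))')


def parse_sexprs(text):
    """Parse s-expression text into a list of top-level nested lists."""
    text = text.strip()
    stack = [[]]  # stack[0] collects the top-level expressions
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        open_paren, close_paren, string, atom = m.groups()
        if open_paren:
            stack.append([])
        elif close_paren:
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif string is not None:
            # Undo simple escapes such as \" and \\ (octal escapes are left as-is).
            stack[-1].append(re.sub(r'\\(.)', r'\1', string))
        else:
            is_int = re.fullmatch(r'-?\d+', atom) is not None
            stack[-1].append(int(atom) if is_int else atom)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def words(node):
    """Yield (x1, y1, x2, y2, text) for every 'word' zone in the tree."""
    if isinstance(node, list):
        if node and node[0] == "word":
            yield tuple(node[1:6])
        else:
            for child in node:
                yield from words(child)


def page_text_zones(path, page):
    """Run the djvused command quoted above for one page and parse its output."""
    result = subprocess.run(
        ["djvused", path, "-e", f"select {page}; print-txt"],
        capture_output=True, text=True, check=True,
    )
    return parse_sexprs(result.stdout)
```

Usage, on an illustrative input: `list(words(parse_sexprs('(page 0 0 10 10 (word 1 2 3 4 "Hi"))')))` gives a single tuple containing the four coordinates and the word text. Because no page image is decoded, this route avoids the cost described in the original report.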

  • Dr Bill C Riemers

    Logged In: YES
    Originator: NO

    Thanks for the suggestion. I have approximately 16 hours a month dedicated to open source projects. With my current schedule of commitments, I expect to be able to resolve this within the next 6 weeks. If you need something sooner, let me know and I will see if I can swap some of my other commitments.


  • Tim Starling

    Tim Starling - 2007-04-21

    Logged In: YES
    Originator: YES

    Thanks for the workaround suggestions. I have decided to use djvudump for the time being. I have a PHP function that takes the djvudump output and converts it to an XML format similar to the output of djvutoxml. It is convertDumpToXML(), available here:


    I imagine we will switch back to using djvutoxml once it is fixed, since it is a better-documented, more stable and more easily parseable format, so it should be superior in the long run. I would prefer to see it fixed before you make any potentially incompatible changes to the djvudump output, but other than that, it's not urgent.
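
The general idea of turning djvudump output into XML can be sketched as follows. This is not the convertDumpToXML() function mentioned above, only a rough Python illustration of the approach. It assumes that each page contributes an INFO line of the form `INFO [10] DjVu 2550x3300, v24, 300 dpi, gamma=2.2`; the element and attribute names (`document`, `page`, `width`, `height`, `dpi`) and the function name `dump_to_xml` are hypothetical and do not match the djvutoxml schema.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of a per-page INFO line in djvudump output, e.g.
#   INFO [10]         DjVu 2550x3300, v24, 300 dpi, gamma=2.2
INFO_LINE = re.compile(
    r'^\s*INFO\s*\[\d+\]\s*DjVu\s+(\d+)x(\d+),\s*v\d+,\s*(\d+)\s*dpi'
)


def dump_to_xml(dump_text):
    """Build a small XML document with one <page> per INFO line found."""
    root = ET.Element("document")
    for line in dump_text.splitlines():
        m = INFO_LINE.match(line)
        if m:
            width, height, dpi = m.groups()
            ET.SubElement(root, "page", width=width, height=height, dpi=dpi)
    return ET.tostring(root, encoding="unicode")
```

A script built this way depends on the exact text layout of the dump, which is why changes to the djvudump output format would break it.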

