#71 djvutoxml is very slow

djvulibre
open
libdjvu (18)
5
2012-11-08
2007-04-20
Tim Starling
No

djvutoxml is very slow, because it decodes every single image in the document. This is unnecessary to obtain the information it outputs. DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=false, which causes the page to be decoded to an image.

djvudump provides almost identical information, but is implemented efficiently. DjVuDocument::writeDjVuXML() should be rewritten to work chunk-by-chunk, like DjVuDumpHelper::dump(), instead of image by image.
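
To illustrate what "chunk-by-chunk" means here, the following is a minimal Python sketch, not the DjVuDumpHelper code. It assumes the IFF-style container layout: an `AT&T` magic prefix, then chunks made of a 4-byte ID and a 4-byte big-endian length, padded to even size, with `FORM` chunks nesting further chunks. The names `iter_chunks` and `dump_structure` are hypothetical. The walker reads only chunk headers and never decodes image data, which is why this style of traversal is cheap.

```python
import struct
from typing import Iterator, List, Optional, Tuple

Chunk = Tuple[int, str, int, int]  # (depth, chunk id, payload offset, payload size)


def iter_chunks(data: bytes, start: int = 0,
                end: Optional[int] = None, depth: int = 0) -> Iterator[Chunk]:
    """Walk IFF-style chunks between start and end without decoding payloads."""
    if end is None:
        end = len(data)
    pos = start
    while pos + 8 <= end:
        chunk_id = data[pos:pos + 4].decode("ascii", "replace")
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        body = pos + 8
        if chunk_id == "FORM":
            # A FORM payload starts with a 4-byte type, then nested chunks.
            form_type = data[body:body + 4].decode("ascii", "replace")
            yield depth, "FORM:" + form_type, body, size
            yield from iter_chunks(data, body + 4, body + size, depth + 1)
        else:
            yield depth, chunk_id, body, size
        # Assumption: chunks are padded to an even length.
        pos = body + size + (size & 1)


def dump_structure(data: bytes) -> List[Chunk]:
    """Return the chunk tree of a DjVu byte string as a flat list."""
    if data[:4] != b"AT&T":
        raise ValueError("missing AT&T magic; not a DjVu file?")
    return list(iter_chunks(data, 4))
```

Calling `dump_structure(open(path, "rb").read())` returns one tuple per chunk. Page dimensions could then be read from the small `INFO` payload rather than from a decoded image.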

Simple demonstrative benchmark on a 210-page, 17 MB file:

[0311][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvutoxml Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.xml

real 0m52.010s
user 0m50.051s
sys 0m0.283s

[0314][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvudump Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.dump

real 0m0.268s
user 0m0.179s
sys 0m0.008s
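
For anyone wanting to repeat the comparison on another file, here is a small Python sketch of the same measurement. The helper names (`time_command`, `speedup`, `compare`) are hypothetical, and it measures wall-clock time only, so it corresponds to the `real` figures above rather than `user` or `sys`.

```python
import subprocess
import time
from typing import Sequence


def time_command(argv: Sequence[str]) -> float:
    """Run argv, discard its stdout, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast run was."""
    return slow_seconds / fast_seconds


def compare(path: str) -> float:
    """Time djvutoxml against djvudump on one file (both must be on PATH)."""
    slow = time_command(["djvutoxml", path])
    fast = time_command(["djvudump", path])
    return speedup(slow, fast)
```

With the `real` times reported above, `speedup(52.010, 0.268)` works out to roughly 194, meaning djvudump finished about 194 times faster on this file.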

Discussion

  • Tim Starling
    2007-04-20

    Logged In: YES
    user_id=758207
    Originator: YES

    Sorry, that should have read "DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=true"

     
  • Leon Bottou
    2007-04-20

    Logged In: YES
    user_id=42774
    Originator: NO

    Assigned to docbill, who is the author of djvutoxml.
    Unfortunately I am not sure he has much time to work on this.

    A potential workaround is using djvused,
    but the output format will be slightly different.
    I understand that this can disrupt your workflow.

    * "djvused file.djvu -e ls"
    gives the page list.
    * "djvused file.djvu -e dump"
    gives the full structure.
    * "djvused file.djvu -e 'select <n> ; print-txt'"
    gives the hidden text with location for page <n>.
    * "djvused file.djvu -e 'select <n> ; print-ant'"
    gives the annotations and hyperlinks for page <n>.
    * "djvused file.djvu -e 'print-outline'"
    gives the document outline.
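
The per-page commands listed above could be driven from a script along these lines. This is a hedged Python sketch that uses only the djvused invocations quoted in this comment. The function names are hypothetical, and `<n>` is passed through unchanged as the page selector.

```python
import subprocess
from typing import List


def djvused_argv(path: str, script: str) -> List[str]:
    """Build the argument list for: djvused <path> -e <script>."""
    return ["djvused", path, "-e", script]


def page_text_argv(path: str, page: int) -> List[str]:
    """Hidden text with locations for one page."""
    return djvused_argv(path, f"select {page} ; print-txt")


def page_annotations_argv(path: str, page: int) -> List[str]:
    """Annotations and hyperlinks for one page."""
    return djvused_argv(path, f"select {page} ; print-ant")


def outline_argv(path: str) -> List[str]:
    """Document outline."""
    return djvused_argv(path, "print-outline")


def run(argv: List[str]) -> str:
    """Run a command and return its stdout (requires djvused on PATH)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run(page_text_argv("file.djvu", 1))` would return the s-expression text for page 1. Passing the arguments as a list avoids any shell quoting of the `-e` script.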

    The latter three commands use an s-expression format
    that is as expressive as XML (and much older, in fact).
    There are two easy ways to parse these s-expressions:
    * Use a lisp interpreter, such as the one in djvulibre-3.5/doc/minilisp.
    * Directly use the files miniexp.{h,cpp} from djvulibre-3.5/libdjvu.
    These files can be compiled without the rest of the library.
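
If neither of those options is convenient, a small hand-written reader is also feasible. The Python sketch below is not miniexp and makes simplifying assumptions: it handles lists, integers, bare symbols, and double-quoted strings with basic backslash escapes only (octal escapes are not decoded). The names `Symbol` and `parse_sexpr` are hypothetical.

```python
from typing import Any, List, Tuple


class Symbol(str):
    """A bare (unquoted) token, kept distinct from quoted strings."""


def _skip_ws(text: str, i: int) -> int:
    while i < len(text) and text[i].isspace():
        i += 1
    return i


def _parse_string(text: str, i: int) -> Tuple[str, int]:
    out: List[str] = []
    while i < len(text):
        ch = text[i]
        if ch == '"':
            return "".join(out), i + 1
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # Simplified: only \n and \t are translated; others are literal.
            out.append({"n": "\n", "t": "\t"}.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated string")


def _parse(text: str, i: int) -> Tuple[Any, int]:
    i = _skip_ws(text, i)
    if i >= len(text):
        raise ValueError("unexpected end of input")
    ch = text[i]
    if ch == "(":
        items: List[Any] = []
        i += 1
        while True:
            i = _skip_ws(text, i)
            if i >= len(text):
                raise ValueError("missing closing parenthesis")
            if text[i] == ")":
                return items, i + 1
            item, i = _parse(text, i)
            items.append(item)
    if ch == ")":
        raise ValueError("unexpected closing parenthesis")
    if ch == '"':
        return _parse_string(text, i + 1)
    j = i
    while j < len(text) and not text[j].isspace() and text[j] not in '()"':
        j += 1
    token = text[i:j]
    try:
        return int(token), j
    except ValueError:
        return Symbol(token), j


def parse_sexpr(text: str) -> Any:
    """Parse the first s-expression in text into nested Python lists."""
    value, _ = _parse(text, 0)
    return value
```

The result is a plain nested list, so a caller can walk it recursively and pick out, for example, every sublist whose first element is the symbol `word`.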

    Djvused can also do the opposite, that is, import text,
    annotations and outlines into an existing document.

    Useful pointers:
    - See man page djvused(1).
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/tools/djvutxt.cpp
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/ddjvuapi.h
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/miniexp.h
    - See the ddjvu_anno_get_xxx functions in ddjvuapi.cpp

     
  • Logged In: YES
    user_id=290784
    Originator: NO

    Thanks for the suggestion. I have approximately 16 hours a month dedicated to open source projects. With my current schedule of commitments, I expect to be able to resolve this within the next 6 weeks. If you need something sooner, let me know and I will see if I can swap some of my other commitments.

    Bill

     
  • Tim Starling
    2007-04-21

    Logged In: YES
    user_id=758207
    Originator: YES

    Thanks for the workaround suggestions. I have decided to use djvudump for the time being. I have a PHP function which takes the djvudump output and converts it to an XML format similar to the output of djvutoxml. It is convertDumpToXML(), available here:

    http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/DjVuImage.php?revision=21411&view=markup

    I imagine we will switch back to using djvutoxml once it is fixed, since it is a better-documented, more stable and more easily parseable format, so it should be superior in the long run. I would prefer to see it fixed before you make any potentially incompatible changes to the djvudump output, but other than that, it's not urgent.