
#71 djvutoxml is very slow

Group: djvulibre
Status: open
Owner: Dr Bill C Riemers
Labels: libdjvu
Priority: 5
Updated: 2021-06-03
Created: 2007-04-20
Private: No

djvutoxml is very slow, because it decodes every single image in the document. This is unnecessary to obtain the information it outputs. DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=false, which causes the page to be decoded to an image.

djvudump provides almost identical information, but is implemented efficiently. DjVuDocument::writeDjVuXML() should be rewritten to work chunk-by-chunk, like DjVuDumpHelper::dump(), instead of image by image.

Simple demonstrative benchmark on a 210 page 17MB file:

[0311][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvutoxml Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.xml

real 0m52.010s
user 0m50.051s
sys 0m0.283s

[0314][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvudump Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.dump

real 0m0.268s
user 0m0.179s
sys 0m0.008s
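
To put the two timings side by side, here is a minimal Python sketch that converts the `real` values reported by `time` into seconds and computes their ratio. The helper name `parse_time` is hypothetical and only handles the `NmS.SSSs` form shown above.

```python
import re


def parse_time(value: str) -> float:
    """Convert a shell `time` value such as '0m52.010s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock ("real") times copied from the benchmark above.
djvutoxml_real = parse_time("0m52.010s")
djvudump_real = parse_time("0m0.268s")

ratio = djvutoxml_real / djvudump_real
print(f"djvutoxml took about {ratio:.0f} times as long as djvudump")
```

On these figures djvutoxml takes roughly 194 times as long as djvudump for the same file.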


Discussion

  • Tim Starling

    Tim Starling - 2007-04-20

    Logged In: YES
    user_id=758207
    Originator: YES

    Sorry, that should have read "DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=true"

     
  • Leon Bottou

    Leon Bottou - 2007-04-20

    Logged In: YES
    user_id=42774
    Originator: NO

    Assigned to docbill who is the author of djvutoxml.
    Unfortunately I am not sure he has much time to work on this.

    A potential workaround is using djvused,
    but the output format will be slightly different.
    I understand that this can disrupt your workflow.

    * "djvused file.djvu -e ls"
    gives the page list.
    * "djvused file.djvu -e dump"
    gives the full structure.
    * "djvused file.djvu -e 'select <n> ; print-txt'"
    gives the hidden text with location for page <n>.
    * "djvused file.djvu -e 'select <n> ; print-ant'"
    gives the annotations and hyperlinks for page <n>.
    * "djvused file.djvu -e 'print-outline'"
    gives the document outline.

    The latter three commands use an s-expression format
    that is as expressive as XML (and much older, in fact).
    There are two easy ways to parse these s-expressions:
    * Use a lisp interpreter, such as the one in djvulibre-3.5/doc/minilisp.
    * Directly use the files miniexp.{h,cpp} from djvulibre-3.5/libdjvu.
    These files can be compiled without the rest of the library.

    Djvused can also do the opposite, that is import text,
    annotations and outlines into an existing document.

    Useful pointers:
    - See man page djvused(1).
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/tools/djvutxt.cpp
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/ddjvuapi.h
    - See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/miniexp.h
    - See the ddjvu_anno_get_xxx functions in ddjvuapi.cpp
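
As a rough illustration of how little is needed to consume the s-expression output mentioned above, here is a minimal Python sketch of a reader. It is not miniexp and not part of djvulibre. The names `Symbol`, `parse` and `iter_words` are hypothetical, and the zone layout `(word xmin ymin xmax ymax "text")` as well as the backslash and octal escapes are assumptions based on the djvused(1) description of `print-txt`; check them against real output before relying on this.

```python
class Symbol(str):
    """Bare identifier such as `page` or `word` (as opposed to a quoted string)."""


_ESCAPES = {"n": "\n", "t": "\t", "r": "\r"}


def _read_string(text, i):
    """Read a quoted string whose opening quote precedes index i.

    Returns (decoded_string, index_after_closing_quote).
    """
    buf = bytearray()
    n = len(text)
    while i < n and text[i] != '"':
        if text[i] == "\\" and i + 1 < n:
            i += 1
            j = i
            while j < n and j - i < 3 and text[j] in "01234567":
                j += 1
            if j > i:  # octal escape, e.g. \303, stands for one raw byte
                buf.append(int(text[i:j], 8) & 0xFF)
                i = j
                continue
            buf += _ESCAPES.get(text[i], text[i]).encode("utf-8")
        else:
            buf += text[i].encode("utf-8")
        i += 1
    return buf.decode("utf-8", errors="replace"), i + 1


def parse(text):
    """Parse s-expressions into nested lists of int, str and Symbol."""
    stack = [[]]
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c.isspace():
            i += 1
        elif c == "(":
            stack.append([])
            i += 1
        elif c == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
            i += 1
        elif c == '"':
            value, i = _read_string(text, i + 1)
            stack[-1].append(value)
        else:
            j = i
            while j < n and not text[j].isspace() and text[j] not in '()"':
                j += 1
            token = text[i:j]
            try:
                stack[-1].append(int(token))
            except ValueError:
                stack[-1].append(Symbol(token))
            i = j
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def iter_words(node):
    """Yield ((xmin, ymin, xmax, ymax), text) for every `word` zone."""
    if not isinstance(node, list):
        return
    if len(node) >= 6 and isinstance(node[0], Symbol) and node[0] == "word":
        yield tuple(node[1:5]), node[5]
        return
    for child in node:
        yield from iter_words(child)


# Illustrative input, not taken from a real document.
sample = '(page 0 0 2550 3300 (line 100 200 900 260 (word 100 200 300 260 "Drei") (word 320 200 900 260 "caf\\303\\251")))'
for box, word in iter_words(parse(sample)):
    print(box, word)
```

In practice the input string would be the captured output of a command such as `djvused file.djvu -e 'select 1; print-txt'`.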

     
  • Dr Bill C Riemers

    Logged In: YES
    user_id=290784
    Originator: NO

    Thanks for the suggestion. I have approximately 16 hours a month dedicated to open source projects. With my current schedule of commitments, I expect to be able to resolve this within the next 6 weeks. If you need something sooner, let me know and I will see if I can swap some of my other commitments.

    Bill

     
  • Tim Starling

    Tim Starling - 2007-04-21

    Logged In: YES
    user_id=758207
    Originator: YES

    Thanks for the workaround suggestions. I have decided to use djvudump for the time being. I have a PHP function that takes the djvudump output and converts it to an XML format similar to the output of djvutoxml. It is convertDumpToXML(), available here:

    http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/DjVuImage.php?revision=21411&view=markup

    I imagine we will switch back to using djvutoxml once it is fixed, since it is a better-documented, more stable and more easily parseable format, so it should be superior in the long run. I would prefer to see it fixed before you make any potentially incompatible changes to the djvudump output, but other than that, it's not urgent.
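
For readers who only need page dimensions, the general idea of scraping djvudump output can be sketched in a few lines of Python. This is not the convertDumpToXML() code linked above. The function name `page_sizes` is hypothetical, and the `INFO` line layout in the sample is an assumption about typical djvudump output that should be checked against the installed version.

```python
import re

# Assumed layout of an INFO line:
#   INFO [10]         DjVu 2550x3300, v24, 300 dpi, gamma=2.2
INFO_RE = re.compile(
    r"^\s*INFO\s*\[\d+\]\s*DjVu\s+(\d+)x(\d+),\s*v(\d+),\s*(\d+)\s*dpi",
    re.MULTILINE,
)


def page_sizes(dump_text: str) -> list:
    """Return one {'width', 'height', 'dpi'} dict per INFO line found."""
    return [
        {
            "width": int(m.group(1)),
            "height": int(m.group(2)),
            "dpi": int(m.group(4)),
        }
        for m in INFO_RE.finditer(dump_text)
    ]


# Illustrative text only, not output captured from a real file.
sample_dump = """\
  FORM:DJVM [1234]
    FORM:DJVU [600] {p0001.djvu}
      INFO [10]         DjVu 2550x3300, v24, 300 dpi, gamma=2.2
    FORM:DJVU [600] {p0002.djvu}
      INFO [10]         DjVu 2480x3508, v24, 300 dpi, gamma=2.2
"""
print(page_sizes(sample_dump))
```

Because this depends on a human-readable debugging format, any change to the djvudump output could break it, which is the compatibility concern raised in the comment above.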

     
  • Tim Starling

    Tim Starling - 2021-05-18

    I see that it's still not fixed, 14 years later. Maybe it's time to remove the comments I added to the MediaWiki source about djvudump being temporary.

     
    • Leon Bottou

      Leon Bottou - 2021-05-21

      djvutoxml is essentially unmaintained.

      It relies on nonstandard XML code that was written by Lizardtech a very
      long time ago and that I find very hard to debug.

      The best supported tool for this kind of task is djvused (see my
      previous comment).

      Djvudump was mostly a debugging tool, but a reliable and useful one.

      - Leon


  • Dr Bill C Riemers

    You can consider this abandoned software. If I were to do anything with this, it would be to rewrite it as something that uses an off-the-shelf XML library. This tool was never intended as a long-term solution. It was more a proof of concept that met immediate needs.

    But I would not actually do the rewrite because I have no use for it. I also have no test bed to assure quality, and I have no evidence anyone else needs this updated badly enough to help support the effort it would take to ensure quality.

     

    Last edit: Dr Bill C Riemers 2021-05-27
  • Janusz

    Janusz - 2021-05-22

    A question to the OP: what do you need XML for? Perhaps djvu2hocr (available from https://jwilk.net/software/ocrodjvu and in some Linux distributions) would be of some use to you.

     
    • Tim Starling

      Tim Starling - 2021-06-03

      Feel free to close the task, we don't really need it that much.

      In 2006, DjVu support was added to MediaWiki. Now commons.wikimedia.org has about 278,000 DjVu files, so it's been pretty popular and successful. But the code in MediaWiki has not changed very much. We had a place in the database for type-specific metadata, and the original author simply dumped the complete output of djvutoxml into it. When we need page dimensions or text for search indexing, the media type handler is pulling up that XML and parsing it to extract the data it needs.

      In 2007, as a workaround for this bug, I just took the output of djvudump and massaged it into an XML format equivalent to what djvutoxml produces, so that there was no format change. So we have metadata for 278,000 images in a format equivalent to djvutoxml.

      I'm revisiting image metadata storage at the moment. I'll probably get rid of the XML format and store the few things we need as JSON instead. This is an opportunity to figure out what we actually need from the XML. I think that's just page dimensions and the text layer. Most image formats have a header parser written in PHP, so that might be an option for DjVu.
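
To give a sense of what a standalone header parser involves, here is a minimal Python sketch that walks the IFF chunk structure and reads page dimensions from `INFO` chunks without decoding any image data. It is an illustration, not MediaWiki or djvulibre code. The function names are hypothetical, and the byte layout used here (`AT&T` magic, big-endian chunk sizes, `INFO` holding width and height as big-endian 16-bit values followed by two version bytes and a little-endian 16-bit dpi) is my reading of the DjVu format and should be verified against the specification. It only covers files whose pages are stored in the same file (single-page or bundled), not indirect documents.

```python
import struct


def iter_chunks(data: bytes, start: int, end: int):
    """Yield (chunk_id, body_offset, size) for consecutive IFF chunks."""
    pos = start
    while pos + 8 <= end:
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        body = pos + 8
        yield chunk_id, body, size
        pos = body + size + (size & 1)  # chunks are padded to even length


def page_infos(data: bytes) -> list:
    """Collect width, height and dpi from every INFO chunk in the file."""
    if data[:4] != b"AT&T":
        raise ValueError("missing AT&T magic, probably not a DjVu file")
    pages = []

    def walk(start: int, end: int) -> None:
        for chunk_id, body, size in iter_chunks(data, start, end):
            if chunk_id == b"FORM":
                # A FORM body starts with a 4-byte type (DJVM, DJVU, ...)
                # followed by nested chunks.
                walk(body + 4, min(body + size, end))
            elif chunk_id == b"INFO" and size >= 8:
                width, height = struct.unpack(">HH", data[body:body + 4])
                (dpi,) = struct.unpack("<H", data[body + 6:body + 8])
                pages.append({"width": width, "height": height, "dpi": dpi})

    walk(4, len(data))
    return pages


def make_sample() -> bytes:
    """Build a tiny synthetic single-page container for demonstration only."""
    info = struct.pack(">HHBB", 2550, 3300, 24, 0)
    info += struct.pack("<H", 300) + bytes([22, 1])
    chunk = b"INFO" + struct.pack(">I", len(info)) + info
    form_body = b"DJVU" + chunk
    return b"AT&T" + b"FORM" + struct.pack(">I", len(form_body)) + form_body


print(page_infos(make_sample()))
```

The result is a plain list of dictionaries, so it could be serialised with `json.dumps` if the metadata ends up stored as JSON. Extracting the text layer would need more work, since that data is stored compressed.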

       
