DjVuLibre / Bugs / #71 djvutoxml is very slow

Tim Starling - 2007-04-20

Logged In: YES
user_id=758207
Originator: YES

Sorry, that should have read "DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with sync=true"

Leon Bottou - 2007-04-20

Logged In: YES
user_id=42774
Originator: NO

Assigned to docbill who is the author of djvutoxml.
Unfortunately I am not sure he has much time to work on this.

A potential workaround is using djvused,
but the output format will be slightly different.
I understand that this can disrupt your workflow.

* "djvused file.djvu -e ls"
gives the page list.
* "djvused file.djvu -e dump"
gives the full structure.
* "djvused file.djvu -e 'select <n> ; print-txt'"
gives the hidden text with location for page <n>.
* "djvused file.djvu -e 'select <n> ; print-ant'"
gives the annotations and hyperlinks for page <n>.
* "djvused file.djvu -e 'print-outline'"
gives the document outline.
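
For scripted use, the commands listed above can be wrapped in a few lines of Python. This is only a sketch: it assumes djvused is on the PATH, and the helper names (djvused_argv, page_script, run_djvused) are made up for illustration and are not part of DjVuLibre.

```python
import subprocess

def djvused_argv(path, script):
    """Build the argument list for: djvused <path> -e <script>."""
    return ["djvused", path, "-e", script]

def page_script(page, command):
    """Script that selects one page, then runs a print command on it."""
    return f"select {page}; {command}"

def run_djvused(path, script):
    """Run djvused and return its standard output as text.

    Requires the djvused binary to be installed; raises CalledProcessError
    if djvused exits with a non-zero status.
    """
    result = subprocess.run(djvused_argv(path, script),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

# Example calls (need a real file and an installed djvused):
#   run_djvused("file.djvu", page_script(3, "print-txt"))
#   run_djvused("file.djvu", "print-outline")
```

Passing the script as a single list element avoids shell quoting problems with the semicolon.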

The latter three commands use an s-expression format
that is as expressive as XML (and much older, in fact).
There are two easy ways to parse these s-expressions:
* Use a lisp interpreter, such as the one in djvulibre-3.5/doc/minilisp.
* Directly use the files miniexp.{h,cpp} from djvulibre-3.5/libdjvu.
These files can be compiled without the rest of the library.
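
If neither minilisp nor miniexp is convenient, a small hand-written reader may be enough for simple cases. The sketch below is a third option, not something DjVuLibre provides. It handles parentheses, integers, bare symbols, and double-quoted strings with backslash escapes, and it assumes that octal escapes in strings are UTF-8 bytes. The Symbol class and the function names are invented for this example.

```python
import re

class Symbol(str):
    """A bare atom such as page or word, kept distinct from quoted strings."""

# Tokens: "(", ")", a double-quoted string with escapes, or a bare atom.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\[\s\S])*"|[^\s()"]+')
OCTAL = re.compile(r"[0-7]{1,3}")
SIMPLE = {"n": "\n", "t": "\t", "r": "\r"}

def unescape(body):
    """Decode backslash escapes; octal escapes are collected as raw bytes."""
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out += body[i].encode("utf-8")
            i += 1
            continue
        m = OCTAL.match(body, i + 1)
        if m:
            out.append(int(m.group(), 8) & 0xFF)
            i = m.end()
        else:
            nxt = body[i + 1]
            out += SIMPLE.get(nxt, nxt).encode("utf-8")
            i += 2
    return out.decode("utf-8", errors="replace")

def parse(text):
    """Parse the first s-expression in text into lists, ints, and strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume ")"
            return items
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        if tok.startswith('"'):
            return unescape(tok[1:-1])
        try:
            return int(tok)
        except ValueError:
            return Symbol(tok)

    return read()

def words(node):
    """Collect the text of every (word x0 y0 x1 y1 "text") element, in order."""
    if not isinstance(node, list):
        return []
    if node and node[0] == "word":
        return [node[-1]]
    found = []
    for child in node[1:]:
        found.extend(words(child))
    return found
```

Feeding the output of a print-txt command to parse() and then to words() yields the page text as a flat list. Anything beyond that (floats, comments, multiple top-level expressions) is outside what this sketch covers.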

Djvused can also do the opposite, that is, import text,
annotations and outlines into an existing document.

Useful pointers:
- See man page djvused(1).
- See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/tools/djvutxt.cpp
- See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/ddjvuapi.h
- See http://djvu.cvs.sourceforge.net/djvu/djvulibre-3.5/libdjvu/miniexp.h
- See the ddjvu_anno_get_xxx functions in ddjvuapi.cpp

Dr Bill C Riemers - 2007-04-21

Logged In: YES
user_id=290784
Originator: NO

Thanks for the suggestion. I have approximately 16 hours a month dedicated to open source projects. With my current schedule of commitments, I expect to be able to resolve this within the next 6 weeks. If you need something sooner, let me know and I will see if I can swap some of my other commitments.

Bill

Tim Starling - 2007-04-21

Logged In: YES
user_id=758207
Originator: YES

Thanks for the workaround suggestions. I have decided to use djvudump for the time being. I have a PHP function that takes the djvudump output and converts it to an XML format similar to the output of djvutoxml. It is convertDumpToXML(), available here:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/DjVuImage.php?revision=21411&view=markup

I imagine we will switch back to using djvutoxml once it is fixed, since it is a better-documented, more stable and more easily parseable format, so it should be superior in the long run. I would prefer to see it fixed before you make any potentially incompatible changes to the djvudump output, but other than that, it's not urgent.
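
For readers who want the general idea without reading the PHP, here is a rough Python sketch of that kind of conversion. It is not a port of convertDumpToXML(). It assumes djvudump prints one line per INFO chunk of the form "INFO [10] DjVu 2550x3300, v24, 300 dpi, gamma=2.2", and the XML element names (pages, page) are hypothetical rather than the djvutoxml schema.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of a djvudump INFO line:
#   INFO [10]         DjVu 2550x3300, v24, 300 dpi, gamma=2.2
INFO_RE = re.compile(r"INFO \[\d+\]\s+DjVu (\d+)x(\d+),.*?(\d+) dpi")

def page_sizes(dump_text):
    """Return one dict per INFO line found in djvudump output."""
    pages = []
    for line in dump_text.splitlines():
        m = INFO_RE.search(line)
        if m:
            width, height, dpi = map(int, m.groups())
            pages.append({"width": width, "height": height, "dpi": dpi})
    return pages

def to_xml(pages):
    """Serialize page dicts as XML; the element names are made up."""
    root = ET.Element("pages")
    for page in pages:
        ET.SubElement(root, "page",
                      {key: str(value) for key, value in page.items()})
    return ET.tostring(root, encoding="unicode")
```

Because this depends on the human-readable djvudump text, any change to that output format would break it, which is the compatibility concern raised above.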

Tim Starling - 2021-05-18

I see that it's still not fixed, 14 years later. Maybe it's time to remove the comments I added to the MediaWiki source about djvudump being temporary.

- Leon Bottou - 2021-05-21
  
  djvutoxml is essentially unmaintained.
  
  It relies on nonstandard XML code that was written by Lizardtech a very
  long time ago and that I find very hard to debug.
  
  The best supported tool for this kind of task is djvused (see my
  previous comment).
  
  Djvudump was mostly a debugging tool, but a reliable and useful one.
  
  Leon
  
  On 5/18/21 2:53 AM, Tim Starling wrote:
  
  I see that it's still not fixed, 14 years later. Maybe it's time to
  remove the comments I added to the MediaWiki source about djvudump
  being temporary.
  
  [bugs:#71] https://sourceforge.net/p/djvu/bugs/71/ djvutoxml is
  very slow
  
  Status: open
  Group: djvulibre
  Labels: libdjvu
  Created: Fri Apr 20, 2007 03:22 AM UTC by Tim Starling
  Last Updated: Thu Nov 08, 2012 07:49 PM UTC
  Owner: Dr Bill C Riemers
  
  djvutoxml is very slow, because it decodes every single image in the
  document. This is unnecessary to obtain the information it outputs.
  DjVuDocument::writeDjVuXML() calls DjVuImage::get_page() with
  sync=false, which causes the page to be decoded to an image.
  
  djvudump provides almost identical information, but is implemented
  efficiently. DjVuDocument::writeDjVuXML() should be rewritten to work
  chunk-by-chunk, like DjVuDumpHelper::dump(), instead of image by image.
  
  Simple demonstrative benchmark on a 210 page 17MB file:
  
  [0311][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvutoxml Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.xml
  
  real 0m52.010s
  user 0m50.051s
  sys 0m0.283s
  
  [0314][tstarling@zwinger:/mnt/upload3/wikipedia/commons/c/c1]$ time djvudump Drei_Register_Arithmetischer_ahnfeng_zur_Practic.djvu > ~/drei.dump
  
  real 0m0.268s
  user 0m0.179s
  sys 0m0.008s
  

Dr Bill C Riemers - 2021-05-22

You can consider this abandoned software. If I were to do anything with this, it would be to rewrite it as something that uses an off-the-shelf XML library. This tool was never intended as a long-term solution; it was more a proof of concept that met immediate needs.

But I would not actually do the rewrite because I have no use for it. I also have no test bed to assure quality, and I have no evidence anyone else needs this updated badly enough to help support the effort it would take to ensure quality.

Last edit: Dr Bill C Riemers 2021-05-27

Janusz - 2021-05-22

A question to OP: what do you need XML for? Perhaps djvu2hocr (available at https://jwilk.net/software/ocrodjvu and in some Linux distributions) would be of some use to you.

- Tim Starling - 2021-06-03
  
  Feel free to close the task, we don't really need it that much.
  
  In 2006, DjVu support was added to MediaWiki. Now commons.wikimedia.org has about 278,000 DjVu files, so it's been pretty popular and successful. But the code in MediaWiki has not changed very much. We had a place in the database for type-specific metadata, and the original author simply dumped the complete output of djvutoxml into it. When we need page dimensions or text for search indexing, the media type handler is pulling up that XML and parsing it to extract the data it needs.
  
  In 2007, as a workaround for this bug, I just took the output of djvudump and massaged it into an XML format equivalent to what djvutoxml produces, so that there was no format change. So we have metadata for 278,000 images in a format equivalent to djvutoxml.
  
  I'm revisiting image metadata storage at the moment. I'll probably get rid of the XML format and store the few things we need as JSON instead. This is an opportunity to figure out what we actually need from the XML. I think that's just page dimensions and the text layer. Most image formats have a header parser written in PHP, so that might be an option for DjVu.
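
To make the header-parser idea concrete, here is a minimal Python sketch of a chunk-by-chunk reader that pulls page dimensions from INFO chunks without decoding any image data. It rests on my reading of the DjVu file layout: an "AT&T" magic, then IFF-style chunks (4-byte ID, 4-byte big-endian size, body padded to an even length), with each INFO body starting with big-endian width and height, two version bytes, and a little-endian dpi. It covers single-page and bundled files only, reads the whole file into memory, and its function names are invented for illustration.

```python
import struct

def iter_chunks(data, start, end):
    """Yield (chunk_id, body_offset, size) for each chunk in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        body = pos + 8
        yield chunk_id, body, size
        pos = body + size + (size & 1)  # chunk bodies are padded to even length

def page_infos(data):
    """Return width, height, and dpi for every INFO chunk in a DjVu byte string."""
    if data[:4] != b"AT&T":
        raise ValueError("missing AT&T magic, not a DjVu file")
    pages = []

    def walk(start, end):
        for chunk_id, body, size in iter_chunks(data, start, end):
            if chunk_id == b"FORM":
                # The first 4 bytes of a FORM body name its type (DJVU, DJVM, ...).
                walk(body + 4, min(body + size, end))
            elif chunk_id == b"INFO" and size >= 8:
                width, height = struct.unpack(">HH", data[body:body + 4])
                (dpi,) = struct.unpack("<H", data[body + 6:body + 8])
                pages.append({"width": width, "height": height, "dpi": dpi})

    walk(4, len(data))
    return pages

# Example call (needs a real file):
#   with open("file.djvu", "rb") as f:
#       print(page_infos(f.read()))
```

The text layer is a separate matter: as far as I know it sits in TXTa or BZZ-compressed TXTz chunks, so extracting it needs a BZZ decoder or a call out to djvused.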
  