
#131 Incorrect djvutoxml output

open
utilities (27)
5
2012-11-08
2009-01-23
No

From http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=512659

It is possible that djvutoxml produces (for a perfectly valid DjVu file)
an XML file which is not well-formed:

$ printf 'P1 3 3 0 0 0 0 0 0 0 0 0' > dummy.pbm
$ cjb2 dummy.pbm dummy.djvu
$ printf 'select 1\nset-txt\n(page 0 0 1 1 (line 0 0 1 1 "foo\\fbar"))\n.\n' > sedscript
$ cat sedscript
select 1
set-txt
(page 0 0 1 1 (line 0 0 1 1 "foo\fbar"))
.
$ djvused -s -f sedscript dummy.djvu
$ djvutoxml dummy.djvu > dummy.xml
$ xmllint --noout dummy.xml
dummy.xml:14: parser error : xmlParseCharRef: invalid xmlChar value 12
foo&#12;bar </LINE>
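
The same failure can be checked without xmllint. The following is a minimal sketch using Python's standard xml.etree.ElementTree parser; the one-element LINE fragment is a stand-in for line 14 of dummy.xml, not the full djvutoxml output, and the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML 1.0."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Stand-in for the offending line; form feed is code point 12.
bad = "<LINE>foo&#12;bar </LINE>"
# The same line with the form feed replaced by a space.
good = "<LINE>foo bar </LINE>"

print(is_well_formed(bad))
print(is_well_formed(good))
```

The character reference to code point 12 is rejected, while the variant with a plain space parses.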

Discussion

  • Leon Bottou

    Leon Bottou - 2009-01-23

    Correct. The XML-1.0 spec says that most control
    characters are disallowed, even using the &#xxx; notation.
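
For reference, the rule can be written down as a small predicate. This sketch follows the Char production of the XML 1.0 specification; the function names are illustrative and are not part of djvulibre.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_chars(text: str) -> list:
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if not is_xml10_char(ord(ch))]


print(invalid_chars("foo\fbar"))  # [(3, 12)]
```

A character that fails this test cannot appear in an XML 1.0 document, either literally or as a numeric character reference.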

    Remedies include defining a new tag <char code="xxx"/>
    or remapping these characters elsewhere in the Unicode space as suggested
    in http://lists.xml.org/archives/xml-dev/200006/msg00480.html.
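
Both remedies can be sketched in a few lines. This is only an illustration under stated assumptions: the Private Use Area base U+E000 is a hypothetical choice and not necessarily the mapping proposed in the linked message, the char element is not part of any existing DjVuXML DTD, and only the C0 control characters are handled.

```python
from xml.sax.saxutils import escape

# Hypothetical remapping target: start of the Private Use Area.
PUA_BASE = 0xE000


def _is_forbidden_control(cp: int) -> bool:
    """C0 control characters that XML 1.0 does not allow."""
    return cp < 0x20 and cp not in (0x9, 0xA, 0xD)


def remap_to_pua(text: str) -> str:
    """Remedy A: move forbidden controls to PUA_BASE + code, then escape."""
    remapped = "".join(
        chr(PUA_BASE + ord(ch)) if _is_forbidden_control(ord(ch)) else ch
        for ch in text
    )
    return escape(remapped)


def as_char_elements(text: str) -> str:
    """Remedy B: replace forbidden controls with <char code="N"/> elements."""
    parts = []
    for ch in text:
        if _is_forbidden_control(ord(ch)):
            parts.append('<char code="%d"/>' % ord(ch))
        else:
            parts.append(escape(ch))
    return "".join(parts)


print(as_char_elements("foo\fbar"))  # foo<char code="12"/>bar
```

Either output is well-formed XML 1.0, but a consumer has to know the convention in order to recover the original character.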

    I dislike both options.

    If our djvuxml guru docbill wants to fix, fine.

    Otherwise, so be it.
    A good reason for not fixing is to maintain bug-for-bug compatibility
    with the xml utilities of the commercial djvu products.

     
  • Leon Bottou

    Leon Bottou - 2009-01-23

    After conferring with docbill:

    - The bug is unlikely to happen in practice
    as such control characters in the text layer or in
    the annotations do not mean anything useful.

    In the unlikely case where the situation occurs,
    we have to choose between two evils:

    1-if we do not fix the bug, some xml utilities
    may choke on the djvuxml output

    2-if we fix the bug, the xml utilities of the
    commercial djvu package will choke on
    the djvulibre djvuxml output

    My intuition is that (2) is a greater risk for djvu users.
    But my intuition is often wrong with xml.
    For instance I do not know why people should ever
    use such a complicated contraption when simpler
    alternatives (s-expressions) have existed for a long time.

    - L.

     
  • roucaries bastien

    Could we open a bug against the commercial djvu package?

    Could we use a namespace in order to avoid compatibility problems? In case of a problem we skip the offender.

    I strongly believe that (2) is the solution.

     
  • Dr Bill C Riemers

    It just occurred to me that character references &#x01; through &#x1f; are allowed in XML 1.1. So this could be fixed simply by changing the version of XML.

    Bill
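
A rough sketch of what XML 1.1 output could look like follows. It assumes the RestrictedChar rules of the XML 1.1 specification, under which these control characters must be written as character references and U+0000 remains forbidden. The single LINE wrapper is a simplification of real djvutoxml output, and both function names are hypothetical.

```python
from xml.sax.saxutils import escape


def escape_xml11(text: str) -> str:
    """Escape text for XML 1.1, writing restricted characters as references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            raise ValueError("U+0000 is not allowed in XML 1.1 either")
        restricted = (
            0x1 <= cp <= 0x8
            or cp in (0xB, 0xC)
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F
        )
        out.append("&#%d;" % cp if restricted else escape(ch))
    return "".join(out)


def line_document(text: str) -> str:
    """Wrap one line of text in a minimal XML 1.1 document."""
    return ('<?xml version="1.1" encoding="UTF-8"?>\n'
            "<LINE>%s</LINE>\n" % escape_xml11(text))


print(line_document("foo\fbar"))
```

The result is only usable by consumers that accept the version="1.1" declaration; a parser limited to XML 1.0 would still reject it.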

     
  • U

    U - 2009-10-20

    It is not true that this bug is unlikely to happen in practice. In fact, I found characters with codes 11 and 31 in real-world DjVu files, presumably with OCR generated by Lizardtech software. Moreover, the DjVu specification (page 19) explicitly mentions 11, 29, 30 and 31 as characters that may appear in the text layer.

    The problem with XML 1.1 is that it is (still) not widely supported.

     
  • Dr Bill C Riemers

    Sounds like the best solution is to add options either to remap control characters as specified in http://lists.xml.org/archives/xml-dev/200006/msg00480.html or to output XML 1.1. I'm still on partial disability, so I cannot make this change anytime soon...

     
