#131 Incorrect djvutoxml output

Status: open

Owner: Dr Bill C Riemers

Labels: utilities (27)

Priority: 5

Updated: 2012-11-08

Created: 2009-01-23

Creator: roucaries bastien

Private: No

From http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=512659

It is possible for djvutoxml to produce (for a perfectly valid DjVu file)
an XML file which is not well-formed:

$ printf 'P1 3 3 0 0 0 0 0 0 0 0 0' > dummy.pbm
$ cjb2 dummy.pbm dummy.djvu
$ printf 'select 1\nset-txt\n(page 0 0 1 1 (line 0 0 1 1 "foo\\fbar"))\n.\n' > sedscript
$ cat sedscript
select 1
set-txt
(page 0 0 1 1 (line 0 0 1 1 "foo\fbar"))
.
$ djvused -s -f sedscript dummy.djvu
$ djvutoxml dummy.djvu > dummy.xml
$ xmllint --noout dummy.xml
dummy.xml:14: parser error : xmlParseCharRef: invalid xmlChar value 12
foo&#12;bar </LINE>

Discussion

Leon Bottou - 2009-01-23

Correct. The XML-1.0 spec says that most control
characters are disallowed, even using the &#xxx; notation.
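
As an illustration of that rule, here is a minimal Python sketch that tests code points against the Char production of XML 1.0. The helper names `is_xml10_char` and `invalid_xml10_chars` are made up for this example and are not part of DjVuLibre.

```python
from typing import List, Tuple


def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_xml10_chars(text: str) -> List[Tuple[int, int]]:
    """List (index, code point) pairs for characters XML 1.0 cannot carry."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if not is_xml10_char(ord(ch))]


# The text from the reproduction above contains a form feed (code 12).
print(invalid_xml10_chars("foo\fbar"))  # [(3, 12)]
```

Writing such a character as a numeric reference (`&#12;`) does not help, because a character reference must also resolve to a character matching Char.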

Remedies include defining a new tag <char code="xxx"/>
or remapping these characters elsewhere in the Unicode space as suggested
in http://lists.xml.org/archives/xml-dev/200006/msg00480.html.
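
A rough Python sketch of what the two remedies could look like. Everything here is an assumption made for illustration: the function names are hypothetical, the `<char code="N"/>` element is not defined by any DjVu XML DTD, and the Control Pictures block (U+2400 onward) is only one possible remapping target, not necessarily the one proposed in the linked post.

```python
from xml.sax.saxutils import escape

# Assumption: remap into the Unicode Control Pictures block, where
# U+2400 + n is the visible symbol for C0 control character n.
CONTROL_PICTURES_BASE = 0x2400
XML10_ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def _is_forbidden_control(ch):
    """True for C0 controls that XML 1.0 does not allow."""
    return ord(ch) < 0x20 and ord(ch) not in XML10_ALLOWED_CONTROLS


def remap_controls(text):
    """Remedy 2: replace forbidden controls with Control Pictures symbols."""
    return "".join(
        chr(CONTROL_PICTURES_BASE + ord(ch)) if _is_forbidden_control(ch) else ch
        for ch in text
    )


def controls_to_char_tags(text):
    """Remedy 1: escape the text and emit a hypothetical <char code="N"/>
    element for each forbidden control character."""
    parts = []
    for ch in text:
        if _is_forbidden_control(ch):
            parts.append('<char code="%d"/>' % ord(ch))
        else:
            parts.append(escape(ch))
    return "".join(parts)


print(ascii(remap_controls("foo\fbar")))   # 'foo\u240cbar'
print(controls_to_char_tags("foo\fbar"))   # foo<char code="12"/>bar
```

Either output is well-formed XML 1.0 content, but a consumer has to know the convention to recover the original character.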

I dislike both options.

If our djvuxml guru docbill wants to fix, fine.

Otherwise, so be it.
A good reason for not fixing is to maintain bug-for-bug compatibility
with the xml utilities of the commercial djvu products.

Leon Bottou - 2009-01-23

After conferring with docbill:

- The bug is unlikely to happen in practice
as such control characters in the text layer or in
the annotations do not mean anything useful.

In the unlikely case where the situation occurs,
we have to choose between two evils:

1-if we do not fix the bug, some xml utilities
may choke on the djvuxml output

2-if we fix the bug, the xml utilities of the
commercial djvu package will choke on
the djvulibre djvuxml output

My intuition is that (2) is a greater risk for djvu users.
But my intuition is often wrong with xml.
For instance I do not know why people should ever
use such a complicated contraption when simpler
alternatives (s-expressions) have existed for a long time.

- L.

roucaries bastien - 2009-01-23

Could we open a bug against the commercial DjVu package?

Could we use a namespace in order to avoid the compatibility problem? In case of a problem, we skip the offender.

I strongly believe that option 2 is the solution.

Dr Bill C Riemers - 2009-01-23

It just occurred to me that character references in the range &#x01; to &#x1F; are allowed in XML 1.1. So this could be fixed simply by changing the XML version.

Bill
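
A minimal Python sketch of the XML 1.1 approach, assuming the writer emits every restricted character as a numeric character reference. The function names are hypothetical and this is not how djvutoxml is implemented. U+0000 remains forbidden even in XML 1.1.

```python
from xml.sax.saxutils import escape


def _is_xml11_restricted(cp):
    """True for code points that XML 1.1 allows only as character references
    (the RestrictedChar production)."""
    return (
        0x01 <= cp <= 0x08
        or 0x0B <= cp <= 0x0C
        or 0x0E <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def to_xml11_text(text):
    """Escape text for use as element content in an XML 1.1 document."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            raise ValueError("U+0000 is not allowed even in XML 1.1")
        if _is_xml11_restricted(cp):
            out.append("&#x%X;" % cp)
        else:
            out.append(escape(ch))
    return "".join(out)


def xml11_document(tag, text):
    """Wrap the escaped text in a single element with an XML 1.1 declaration."""
    return '<?xml version="1.1" encoding="UTF-8"?>\n<%s>%s</%s>\n' % (
        tag, to_xml11_text(text), tag)


print(xml11_document("LINE", "foo\fbar"), end="")
```

The result is only useful to consumers whose parser actually implements XML 1.1.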

U - 2009-10-20

It is not true that this bug is unlikely to happen in practice. In fact, I found characters with codes 11 and 31 in real-world DjVu files, presumably with OCR generated by Lizardtech software. Moreover, the DjVu specification (page 19) explicitly mentions 11, 29, 30 and 31 as characters that may appear in the text layer.

The problem with XML 1.1 is that it is (still) not widely supported.

Dr Bill C Riemers - 2009-10-20

Sounds like the best solution is to add options either to remap control characters as specified in http://lists.xml.org/archives/xml-dev/200006/msg00480.html or to output XML 1.1. I'm still on partial disability, so I cannot make this change anytime soon...