From http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=512659
It is possible that djvutoxml produces (for a perfectly valid DjVu file)
an XML file which is not well-formed:
$ printf 'P1 3 3 0 0 0 0 0 0 0 0 0' > dummy.pbm
$ cjb2 dummy.pbm dummy.djvu
$ printf 'select 1\nset-txt\n(page 0 0 1 1 (line 0 0 1 1 "foo\\fbar"))\n.\n' > sedscript
$ cat sedscript
select 1
set-txt
(page 0 0 1 1 (line 0 0 1 1 "foo\fbar"))
.
$ djvused -s -f sedscript dummy.djvu
$ djvutoxml dummy.djvu > dummy.xml
$ xmllint --noout dummy.xml
dummy.xml:14: parser error : xmlParseCharRef: invalid xmlChar value 12
foobar </LINE>
Correct. The XML-1.0 spec says that most control
characters are disallowed, even using the &#xxx; notation.
Remedies include defining a new tag <char code="xxx"/>
or remapping these characters elsewhere in the unicode space as suggested
in http://lists.xml.org/archives/xml-dev/200006/msg00480.html.
I dislike both options.
If our djvuxml guru docbill wants to fix, fine.
Otherwise, so be it.
A good reason for not fixing is to maintain bug-for-bug compatibility
with the xml utilities of the commercial djvu products.
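To make the restriction concrete, here is a minimal Python sketch of the Char production from the XML 1.0 spec, plus a check that a stock parser really rejects the character reference from the report. The function names are made up for illustration and nothing here is djvulibre code.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_chars(text: str) -> list:
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(i, ord(c)) for i, c in enumerate(text) if not is_xml10_char(ord(c))]


def parses_as_xml(document: str) -> bool:
    """True if the standard-library (expat-based) parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(find_invalid_chars("foo\fbar"))             # [(3, 12)]
print(parses_as_xml("<LINE>foo&#12;bar</LINE>"))  # False
print(parses_as_xml("<LINE>foo bar</LINE>"))      # True
```

The form feed (code 12) is the only offender in the test string, and writing it as &#12; does not help, which matches the xmllint error above.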
After conferring with docbill:
- The bug is unlikely to happen in practice
as such control characters in the text layer or in
the annotations do not mean anything useful.
In the unlikely case where the situation occurs,
we have to choose between two evils:
1-if we do not fix the bug, some xml utilities
may choke on the djvuxml output
2-if we fix the bug, the xml utilities of the
commercial djvu package will choke on
the djvulibre djvuxml output
My intuition is that (2) is a greater risk for djvu users.
But my intuition is often wrong with xml.
For instance I do not know why people should ever
use such a complicated contraption when simpler
alternatives (s-expressions) have existed for a long time.
- L.
Could we open a bug against the commercial djvu package?
Could we use a namespace in order to avoid compatibility problems? In case of a problem we skip the offender.
I strongly believe that (2) is the solution.
It just occurred to me that character references for #x01-#x1f are allowed in XML 1.1, so this could be fixed simply by changing the version of XML.
Bill
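A rough Python sketch of what the XML 1.1 route could look like, assuming the writer declares version 1.1 and emits every RestrictedChar as a numeric character reference, as the XML 1.1 spec requires. The function names and the bare LINE wrapper are illustrative only; this is not how djvutoxml is written.

```python
def is_xml11_restricted(cp: int) -> bool:
    """True if cp is in the RestrictedChar production of XML 1.1."""
    return (
        0x1 <= cp <= 0x8
        or cp in (0xB, 0xC)
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def escape_xml11(text: str) -> str:
    """Escape element content for XML 1.1.

    Restricted characters become hexadecimal character references.
    NUL is rejected because XML 1.1 still forbids it. Surrogates and
    U+FFFE/U+FFFF are also outside Char but are not checked in this sketch.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            raise ValueError("NUL cannot be represented in XML 1.1")
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif is_xml11_restricted(cp):
            out.append(f"&#x{cp:X};")
        else:
            out.append(ch)
    return "".join(out)


# The declaration must say 1.1, otherwise the references stay illegal.
header = '<?xml version="1.1" encoding="UTF-8"?>'
print(header)
print("<LINE>" + escape_xml11("foo\fbar") + "</LINE>")  # <LINE>foo&#xC;bar</LINE>
```

Whether the consumer accepts the result depends entirely on the parser understanding the version="1.1" declaration.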
It is not true that this bug is unlikely to happen in practice. In fact, I found characters with codes 11 and 31 in real-world DjVu files, presumably with OCR generated by Lizardtech software. Moreover, the DjVu specification (page 19) explicitly mentions 11, 29, 30 and 31 as characters that may appear in the text layer.
The problem with XML 1.1 is that it is (still) not widely supported.
Sounds like the best solution is to add options to remap control characters as specified in http://lists.xml.org/archives/xml-dev/200006/msg00480.html or to output to XML 1.1. I'm still on partial disability, so I cannot make this change anytime soon...
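For the remapping option, a minimal Python sketch might look like the following. The target range is an assumption for illustration: it shifts the forbidden C0 controls into the Private Use Area starting at U+E000, which is not necessarily the mapping proposed in the linked xml-dev message. All names are hypothetical.

```python
PUA_BASE = 0xE000  # hypothetical offset, chosen only for this example


def _needs_remap(cp: int) -> bool:
    """C0 controls that XML 1.0 forbids (everything below 0x20 except TAB, LF, CR)."""
    return cp < 0x20 and cp not in (0x9, 0xA, 0xD)


def remap_controls(text: str) -> str:
    """Replace forbidden control characters by private-use code points."""
    return "".join(
        chr(PUA_BASE + ord(c)) if _needs_remap(ord(c)) else c for c in text
    )


def restore_controls(text: str) -> str:
    """Inverse of remap_controls, for a reader that knows the convention."""
    out = []
    for c in text:
        cp = ord(c) - PUA_BASE
        out.append(chr(cp) if 0 <= cp < 0x20 and _needs_remap(cp) else c)
    return "".join(out)


original = "col1\x0bcol2\x1fend"  # codes 11 and 31, as seen in real files
remapped = remap_controls(original)
print([hex(ord(c)) for c in remapped if ord(c) >= PUA_BASE])  # ['0xe00b', '0xe01f']
print(restore_controls(remapped) == original)                 # True
```

The round trip is only lossless if the text layer never contains U+E000 to U+E01F on its own, and any consumer that does not know the convention sees private-use characters instead of the controls.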