Hi
When running JTidy on a document that contains Unicode characters followed by digits, it produces incorrect output. The trigger is the Unicode character directly before the number 5 in the input document (see attachment).
The affected line in the input is:
<p class="Grundschrift-nicht-registerhaltig para-style-override-2">–5</p>
As you can see, the number is misplaced in the output, right after the Unicode hyphen:
<p class="Grundschrift-nicht-registerhaltig para-style-override-2">–<
5</p>
Interestingly, this does not happen if, for instance, the class attribute is not present on the p tag.
This seems to be an old problem: I found a post from more than 5 years ago describing a similar issue.
Any help is appreciated.
This is the input file
This is the wrong output file
The version is the latest one: HTML Tidy for Java (vers. 2009-12-01).
The same problem also occurs with hyphens in the text: Tidy inserts a Unicode 0 character (U+0000) after the hyphen, but only in certain places. This could happen when a new block is read and the buffer is misaligned.
Cause: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
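For anyone hitting the same error, here is a minimal Java sketch for locating and stripping the stray U+0000 in the tidied output before passing it to an XML parser. It uses only the standard library; NulCheck, findNulOffsets and stripNul are made-up names for illustration, not part of the JTidy API. It only works around the symptom and does not fix the underlying bug.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not JTidy API): finds and removes U+0000 in a string.
final class NulCheck {

    // The character that the XML parser rejects in the error above.
    private static final char NUL = 0;

    // Returns the char offsets of every U+0000 in the text.
    static List<Integer> findNulOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == NUL) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    // Returns a copy of the text with every U+0000 removed.
    static String stripNul(String text) {
        return text.replace(String.valueOf(NUL), "");
    }

    public static void main(String[] args) {
        // Assumed sample: an en dash, then a stray U+0000, then the digit.
        String tidied = "<p>\u2013" + NUL + "5</p>";
        System.out.println("U+0000 found at offsets: " + findNulOffsets(tidied));
        System.out.println("Length before: " + tidied.length()
                + ", after: " + stripNul(tidied).length());
    }
}
```

Printing the offsets also helps to check whether the bad characters show up at regular intervals, which would support the guess about a misaligned read buffer.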