#260 Unicode characters cause JTidy to mangle characters

open
nobody
None
1
2013-07-31
2013-06-25
No

Hi

When I run JTidy on a document that has Unicode characters next to numbers, it gets confused and produces wrong output. The trigger is the Unicode character before the number 5 in the input document (see attachment).

The affected location in the input is here:

<p class="Grundschrift-nicht-registerhaltig para-style-override-2">–5</p>

As you can see, the output is swapped around at the number, right after the Unicode dash:

<p class="Grundschrift-nicht-registerhaltig para-style-override-2">–<
5</p>

Interestingly, this does not happen if, for instance, the class attribute is not present on the p tag.
This seems to be an old problem. I found a posting from more than 5 years ago describing a similar problem.

Any help is appreciated.
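
To pin down which characters are involved before filing or debugging a report like this, it can help to dump the code points of the affected text. The following is only a minimal sketch: `CodePointDump` and `describe` are hypothetical names (not part of JTidy), and it assumes the dash in the snippet above is U+2013 EN DASH, which is how it appears here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of JTidy.
final class CodePointDump {

    /** Returns one "U+XXXX" entry per code point, separated by spaces. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the character before the 5 is an en dash (U+2013).
        String text = "\u20135";
        System.out.println(describe(text));
        // The en dash needs three bytes in UTF-8, the digit one.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length + " UTF-8 bytes");
    }
}
```

Running the same dump on the corresponding text from the output file would show exactly which code points were moved or inserted.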

Discussion

  • Daniel Stainhauser

    This is the input file

     
  • Daniel Stainhauser

    This is the wrong output file

     
  • Daniel Stainhauser

    And the version is the latest one, HTML Tidy for Java (vers. 2009-12-01)

     
  • Daniel Stainhauser

    The same problem also occurs with hyphens in the text. Tidy inserts a Unicode 0 character (U+0000) after the hyphen, but only in certain places. This could happen when a new block is read and the buffer is misaligned.
    Cause: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
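
The error above is raised because U+0000 is not an allowed character in XML 1.0. As a rough sketch only, the following shows one way to locate such characters in Tidy's output before handing it to an XML parser, and to strip them as a stopgap. `XmlCharScan` and its methods are hypothetical names, not part of JTidy; the allowed ranges follow the XML 1.0 `Char` production.

```java
// Hypothetical helper; not part of JTidy.
final class XmlCharScan {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char index of the first invalid character, or -1 if none. */
    static int firstInvalid(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Returns a copy of the input with all invalid characters removed. */
    static String strip(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().filter(XmlCharScan::isXmlChar).forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulated output: a hyphen followed by a stray U+0000.
        String tidied = "-\u00005";
        System.out.println("first invalid index: " + firstInvalid(tidied));
        System.out.println("stripped: " + strip(tidied));
    }
}
```

Stripping only hides the stray U+0000; it does not repair characters that were displaced, so it is a workaround rather than a fix for the underlying problem.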

     
