HTMLScanner calls String.toLowerCase() and String.toUpperCase() without specifying a Locale (Locale.ENGLISH or, better, Locale.ROOT), so when the default locale is Turkish, the uppercasing and lowercasing of element names breaks. For background on the problem, see http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html.
This can be tested by setting the default locale with Locale.setDefault(new Locale("tr", "TR")) and then trying to parse an HTML document. <title> gets reported to SAX as <TİTLE> when element names are uppercased (the default).
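The underlying behaviour can be reproduced without NekoHTML at all. The following is a minimal sketch using only java.lang.String and java.util.Locale; the class name TurkishLocaleDemo and the helper upperDefault are hypothetical and only stand in for the locale-less calls the report attributes to HTMLScanner.

```java
import java.util.Locale;

public class TurkishLocaleDemo {
    // Hypothetical helper: uppercases with the JVM default locale,
    // mirroring a locale-less String.toUpperCase() call.
    static String upperDefault(String name) {
        return name.toUpperCase();
    }

    // Hypothetical helper: lowercases with the JVM default locale.
    static String lowerDefault(String name) {
        return name.toLowerCase();
    }

    public static void main(String[] args) {
        Locale previous = Locale.getDefault();
        try {
            Locale.setDefault(new Locale("tr", "TR"));
            // In Turkish, 'i' uppercases to dotted capital 'İ' (U+0130).
            System.out.println(upperDefault("title")); // prints TİTLE
            // In Turkish, 'I' lowercases to dotless 'ı' (U+0131).
            System.out.println(lowerDefault("TITLE")); // prints tıtle
        } finally {
            // Restore the original default so other code is unaffected.
            Locale.setDefault(previous);
        }
    }
}
```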
The correct fix is described in my blog post: with Java 1.6, pass Locale.ROOT to all calls to String.toUpperCase() and String.toLowerCase() throughout the NekoHTML codebase. On earlier Java versions, a workaround is to use Locale.ENGLISH.
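As a sketch of what that change looks like at a single call site (the class LocaleSafeCase and its methods are hypothetical names for illustration, not part of NekoHTML), passing Locale.ROOT makes the result independent of the default locale:

```java
import java.util.Locale;

public class LocaleSafeCase {
    // Uppercases using locale-neutral rules, regardless of the JVM default.
    static String upper(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    // Lowercases using locale-neutral rules, regardless of the JVM default.
    static String lower(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale previous = Locale.getDefault();
        try {
            Locale.setDefault(new Locale("tr", "TR"));
            System.out.println(upper("title")); // prints TITLE
            System.out.println(lower("TITLE")); // prints title
        } finally {
            Locale.setDefault(previous);
        }
    }
}
```

On a pre-1.6 runtime, where Locale.ROOT is not available, the same two methods could pass Locale.ENGLISH instead.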