[Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text?
From: Jeffery B. <jef...@gm...> - 2007-12-11 23:46:38
I'm running into an issue where I'm getting question mark characters in place of quotes, apostrophes, hyphens, etc. I believe this happens because the website uses characters outside those defined by its declared charset (utf-8). Is there a way to correct this in htmlparser?

I started trying to do a simple character replacement on the parsed text, but whenever I do an "(int) string.charAt(n)" for any special character I get 65533, and if I do a "Character.getNumericValue(string.charAt(n))" I get -1, so I'm assuming I'm far too far "downstream" to fix the problem.

Also, I've just been using the Parser.parse method to return NodeLists and have been working my way through the documents that way, rather than trying any of the other htmlparser features (which may already account for this?).

Thanks in advance for any help. I'm really enjoying working with the parser, and thanks to everyone who built it.

Jeff
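For reference, the symptom described above can be reproduced outside the parser. The value 65533 is U+FFFD, the Unicode replacement character, which Java's decoder substitutes for byte sequences that are not valid in the charset being used. The sketch below assumes (this is a guess, not something confirmed in the post) that the page is really encoded as windows-1252 while being decoded as UTF-8. The class name and the sample bytes are hypothetical and chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Decode raw bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "it's" with a windows-1252 curly apostrophe (byte 0x92).
        byte[] raw = {'i', 't', (byte) 0x92, 's'};

        // Assumption: the page claims UTF-8 but was actually written as windows-1252.
        String wrong = decode(raw, StandardCharsets.UTF_8);
        String right = decode(raw, Charset.forName("windows-1252"));

        System.out.println((int) wrong.charAt(2));                      // 65533 (U+FFFD)
        System.out.println(Character.getNumericValue(wrong.charAt(2))); // -1
        System.out.println((int) right.charAt(2));                      // 8217 (U+2019)
    }
}
```

If this is what is happening, the original byte is already lost once U+FFFD appears in the string, so a replacement pass on the parsed text cannot recover it. The -1 from Character.getNumericValue is not an error signal by itself; that method returns -1 for any character that has no numeric value.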