Re: [Htmlparser-user] extracting plain text in original encoding/charset
From: Derrick O. <Der...@Ro...> - 2006-01-28 14:42:27
Jan,

In general, a lot of care has been taken to ensure that the correct character set (according to the web page metadata) is being used. The question marks may simply be an artifact of the System.out.println() call that the sample uses, which writes in the platform default encoding. Have you tried examining the errant characters in a debugger, or writing the strings returned from the StringBean (used by the stringextractor command) to a PrintWriter with an encoding that can handle those characters?

Derrick

Jan wrote:
> Dear Members!
>
> Is it possible using htmlparser to extract plain text in original
> encoding/charset?
>
> I tried the sample stringextractor.cmd.
> It worked nicely, but non-common characters are replaced with question
> marks (?). I would like to keep the original byte sequence.
>
> Thanks,
>
> Jan
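
The PrintWriter suggestion above can be sketched roughly as follows. This is only an illustration using the JDK, not code from htmlparser: the class name EncodingCheck, the helper write(), and the sample text are all made up for the example, and the hard-coded string stands in for whatever the StringBean actually returns. It writes the same string once through a US-ASCII writer (where unmappable characters come out as '?') and once through a UTF-8 writer (where they survive), and prints the code points so the characters can be checked without relying on the console.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {

    // Hypothetical helper: writes text through a PrintWriter that uses the
    // given charset and returns the bytes that were produced.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(buffer, charset))) {
            out.print(text);
        } // closing the writer flushes it into the buffer
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the text extracted from a page.
        String text = "P\u0159\u00edli\u0161";

        // Inspect the characters themselves, independent of any console encoding.
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("U+%04X ", (int) text.charAt(i));
        }
        System.out.println();

        // A charset that cannot represent the characters substitutes '?'.
        byte[] ascii = write(text, StandardCharsets.US_ASCII);
        System.out.println("US-ASCII bytes: " + ascii.length);

        // A charset that can represent them keeps the text intact.
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes:    " + utf8.length);
        System.out.println("round trip ok:  "
                + text.equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

If the code points printed in the first step are already U+003F, the characters were lost before output; if they are correct, only the output encoding needs changing, for example by writing to a file through a writer like the UTF-8 one above.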