Re: [Htmlparser-user] extracting plain text in original encoding/charset

Jan,

In general, a lot of care has been taken to ensure that the correct
character set (as declared in the web page's meta data) is used.
The question marks may simply be an artifact of System.out.println(),
which typically encodes its output in the platform default charset and
substitutes '?' for any character that charset cannot represent.
Have you tried examining the errant characters in a debugger, or
writing the strings returned from the StringBean (used by the
stringextractor command) to a PrintWriter with an encoding that can
handle those characters?
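
As a rough illustration of the PrintWriter suggestion, here is a minimal
sketch that uses only the standard library. The class name EncodingDemo,
the write() helper and the sample text are made up for the example; in
real use the string would be whatever the StringBean returns. It writes
the same string once through US-ASCII, which cannot represent the
accented characters, and once through UTF-8, which can.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of htmlparser.
class EncodingDemo {

    // Writes text through a PrintWriter that encodes with the given
    // charset and returns the bytes that were produced.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset));
        writer.print(text);
        writer.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for a string returned by the StringBean (assumption).
        String text = "P\u0159\u00edli\u0161";

        // US-ASCII has no mapping for three of the characters, so the
        // encoder substitutes '?' for each of them.
        byte[] ascii = write(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // UTF-8 can encode every character, so decoding the bytes
        // gives back the original string unchanged.
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

If the UTF-8 round trip preserves the characters, the strings themselves
are intact and the question marks were introduced only at output time.
Wrapping a FileOutputStream in the same way would let you inspect the
bytes in a file.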

Derrick

Jan wrote:

> Dear Members!
>  
> Is it possible using htmlparser to extract plain text in original 
> encoding/charset?
>  
> I tried the sample stringextractor.cmd.
> It worked nicely, but uncommon characters are replaced with question 
> marks (?). I would like to keep the original byte sequence.
>  
> Thanks,
>  
> Jan