Re: [Htmlparser-user] extracting plain text in original encoding/charset

Dear Derrick,

Thank you very much for the quick reply.

I wrote the string to a file, and the file also contains the question marks.

I would like to have the original HTML text, but without the HTML tags and
HTML entities.
Any conversion to Unicode is undesirable for my purpose.
(I would like to use the plain text for language/encoding identification.)

If htmlparser does not fit my problem, could you recommend something?

Thank you!
Jan

On 1/28/06, Derrick Oswald <Der...@ro...> wrote:
>
> Jan,
>
> In general, a lot of care has been taken to ensure that the correct
> character set (according to the web page meta data) is being used.
> The appearance of question marks may just be an artifact of the
> System.out.println() call that it uses.
> Have you tried examining the errant characters in a debugger or writing
> the strings returned from the StringBean (used by the stringextractor
> command) to a PrintWriter with an encoding that can handle those
> characters?
>
> Derrick
>
> Jan wrote:
>
> > Dear Members!
> >
> > Is it possible using htmlparser to extract plain text in original
> > encoding/charset?
> >
> > I tried the sample stringextractor.cmd.
> > It worked nicely, but non-common characters are replaced with question
> > marks (?). I would like to keep the original byte sequence.
> >
> > Thanks,
> >
> > Jan
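
The suggestion above, writing the extracted strings to a PrintWriter with an
encoding that can represent them, can be sketched with plain JDK classes. This
is only an illustration under stated assumptions: the class name
CharsetRoundTrip, the helper encode(), the sample string, and the choice of
ISO-8859-1 as the page's charset are all hypothetical, and no htmlparser call
is shown. The sample string stands in for whatever the StringBean returns.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of htmlparser.
final class CharsetRoundTrip {

    /** Writes text through a PrintWriter that encodes with the given charset. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the PrintWriter flushes the encoder into the byte sink.
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset))) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for an extracted string; contains two non-ASCII characters.
        String extracted = "na\u00efve caf\u00e9";

        // Assumed page charset: every character is representable, so nothing is lost.
        byte[] lossless = encode(extracted, StandardCharsets.ISO_8859_1);

        // A charset that cannot represent the characters substitutes '?' for them.
        byte[] lossy = encode(extracted, StandardCharsets.US_ASCII);

        System.out.println(Arrays.toString(lossless));
        System.out.println(new String(lossy, StandardCharsets.US_ASCII));
    }
}
```

If the writer's charset matches the one the page was decoded with, the bytes
written should match the original text bytes, provided the page contained no
byte sequences that were invalid in that charset. Question marks in the output
file suggest that the writer fell back to an encoding that could not represent
the characters.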