Re: [Htmlparser-user] extracting plain text in original encoding/charset
From: Jan <jan...@gm...> - 2006-01-28 15:26:39
Dear Derrick,

Thank you very much for the quick reply.

I wrote the string to a file, and the file also contains the question marks.
I would like to have the original HTML text, but without the HTML tags and HTML entities. Any conversion to Unicode is undesirable for my problem (I would like to use the plain text for language/encoding identification).

If htmlparser does not fit my problem, could you recommend something else?

Thank you!

Jan

On 1/28/06, Derrick Oswald <Der...@ro...> wrote:
>
> Jan,
>
> In general, a lot of care has been taken to ensure that the correct
> character set (according to the web page meta data) is being used.
> The appearance of question marks may be just a function of the
> System.out.println() that it's doing.
> Have you tried examining the errant characters in a debugger or writing
> the strings returned from the StringBean (used by the stringextractor
> command) to a PrintWriter with an encoding that can handle those
> characters?
>
> Derrick
>
> Jan wrote:
>
> > Dear Members!
> >
> > Is it possible using htmlparser to extract plain text in original
> > encoding/charset?
> >
> > I tried the sample stringextractor.cmd.
> > It worked nicely, but non-common characters are replaced with question
> > marks (?). I would like to keep the original byte sequence.
> >
> > Thanks,
> >
> > Jan
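
For reference, the question-mark effect discussed in this thread can be reproduced with the JDK alone, without htmlparser. The sketch below is only an illustration under stated assumptions: the class name EncodingDemo, the helper write(), and the sample string are hypothetical and not part of htmlparser. It writes the same Java string through a PrintWriter twice, once with a charset that cannot represent one of the characters and once with one that can. It does not show how StringBean chooses its charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of htmlparser.
public class EncodingDemo {

    // Writes text through a PrintWriter using the given charset
    // and returns the raw bytes that were produced.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset));
        writer.print(text);
        writer.flush();
        return out.toByteArray();
    }

    // Formats bytes as space-separated hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample text (an assumption for illustration): U+0151 followed by 'r'.
        // U+0151 has no representation in ISO-8859-1.
        String text = "\u0151r";

        // The unmappable character is replaced by '?' (0x3F).
        System.out.println(hex(write(text, StandardCharsets.ISO_8859_1))); // 3F 72

        // UTF-8 can represent it, so nothing is lost.
        System.out.println(hex(write(text, StandardCharsets.UTF_8)));      // C5 91 72
    }
}
```

If the file written in the test above was produced with the platform default encoding, the same substitution could explain the question marks. Getting the original byte sequence back would then mean encoding the extracted strings with the same charset the page declared, which only works when the page's bytes were valid in that charset.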