Java uses Unicode. It represents text as UTF-16, i.e. char
is 16 bits, and a String is a sequence of 16-bit values encoding Unicode in UTF-16.
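
To make that concrete, here is a minimal sketch using only the standard
library. The class and method names (Utf16Demo, codeUnits, codePoints) are
mine, made up for illustration. It shows that a character outside the Basic
Multilingual Plane takes two char values but is still one code point.

```java
public class Utf16Demo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) lies inside the Basic Multilingual Plane.
        String bmp = "\u00e9";
        // U+1D11E (musical G clef) lies outside it and needs a surrogate pair.
        String supplementary = new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));                     // prints "1 1"
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // prints "2 1"
    }
}
```
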
Character entity conversion is a way for HTML documents to contain
Unicode characters outside their current encoding, and also to escape the
characters HTML reserves for markup, like the left angle bracket (<) and the
ampersand (&). These entities must be converted to Unicode to extract the
semantic meaning of the page.
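
As a rough illustration of what entity conversion involves, here is a small
self-contained sketch. It is not htmlparser's code: the class EntityDemo, its
decode method, and the five-entry entity table are my own assumptions, and a
real decoder needs the full HTML entity list. Note that the result is an
ordinary Java String, so the decoded characters are necessarily Unicode.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDemo {
    // Deliberately tiny table of named entities (assumption: illustration only).
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("lt", (int) '<');
        NAMED.put("gt", (int) '>');
        NAMED.put("amp", (int) '&');
        NAMED.put("quot", (int) '"');
        NAMED.put("eacute", 0xE9);
    }

    // Matches &name; and the numeric forms &#123; and &#x7B;.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    // Replaces each recognized entity with its Unicode character;
    // unknown or invalid entities are copied through unchanged.
    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            String body = m.group(1);
            Integer cp;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                cp = Integer.parseInt(body.substring(2), 16);
            } else if (body.startsWith("#")) {
                cp = Integer.parseInt(body.substring(1));
            } else {
                cp = NAMED.get(body);
            }
            if (cp == null || !Character.isValidCodePoint(cp)) {
                out.append(m.group());
            } else {
                out.appendCodePoint(cp);
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a &lt; b &amp;&amp; c")); // prints "a < b && c"
    }
}
```
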
So your question is, "Is there a Java program that uses something
besides the String type to store Unicode when parsing HTML?"
I don't think so.
You might want to look at the Translate class in the org.htmlparser.util
package to see if it does what you want.
Jan wrote:
> Dear Experts and Users,
>
> Could anyone say for sure whether htmlparser is capable for html tag
> stripping and html entity conversion, but without Unicode-conversion,
> or not?
>
> If not, what Java-tool could I use?
>
> Thanks,
>
> Jan