Re: [Htmlparser-user] html tag stripping and html entity conversion without Unicode-conversion

Java uses Unicode.  It stores characters in UTF-16 internally, i.e. a char 
is 16 bits, and a String is a sequence of 16-bit values encoding Unicode 
in UTF-16.
Character entity conversion is a way for HTML documents to contain 
Unicode characters outside their current encoding and also to escape the 
reserved characters HTML is based on, like left angle bracket (&lt;) and 
ampersand (&amp;).  These must be converted to Unicode to extract the 
semantic meaning of the page.
So your question is, "Is there a Java program that uses something 
besides the String type to store Unicode when parsing HTML?"
I don't think so.
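
To make the point concrete, here is a rough sketch using only the JDK, 
not htmlparser.  The class name EntityDemo, its decode method and the 
five-entry entity table are made up for illustration; real HTML defines 
far more named entities.  It decodes entity references into a String 
(which is necessarily UTF-16 in memory) and then picks a byte encoding 
only at the point where bytes are produced:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of htmlparser.
final class EntityDemo {
    // Deliberately tiny table; an assumption, not the full HTML entity list.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("lt", (int) '<');
        NAMED.put("gt", (int) '>');
        NAMED.put("amp", (int) '&');
        NAMED.put("quot", (int) '"');
        NAMED.put("eacute", 0xE9);
    }

    // Matches &name;  &#123;  and  &#x1F;  style references.
    private static final Pattern REF =
        Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Replaces every reference we recognise; unknown ones are left untouched.
    static String decode(String html) {
        Matcher m = REF.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            Integer cp = codePoint(m.group(1));
            if (cp == null) {
                out.append(m.group());      // keep the raw text
            } else {
                out.appendCodePoint(cp);    // may emit a surrogate pair
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    // Returns the Unicode code point for a reference body, or null if unknown.
    private static Integer codePoint(String body) {
        try {
            int cp;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                cp = Integer.parseInt(body.substring(2), 16);
            } else if (body.startsWith("#")) {
                cp = Integer.parseInt(body.substring(1));
            } else {
                return NAMED.get(body);
            }
            return Character.isValidCodePoint(cp) ? cp : null;
        } catch (NumberFormatException e) {
            return null;                    // number too large to parse
        }
    }

    public static void main(String[] args) {
        String text = decode("caf&eacute; &lt; 5");
        // The in-memory String is UTF-16; the byte encoding is chosen here.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text + " latin1=" + latin1.length + " utf8=" + utf8.length);
    }
}
```

So the String in the middle is always Unicode, but nothing forces the 
output to be: getBytes (or an OutputStreamWriter) with the charset of 
your choice gives you the bytes in whatever encoding you need, as long 
as that encoding can represent the decoded characters.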

You might want to look at the Translate class in the org.htmlparser.util 
package to see if it does what you want.

Jan wrote:

> Dear Experts and Users,
>  
> Could anyone say for sure whether htmlparser is capable for html tag 
> stripping and html entity conversion, but without Unicode-conversion, 
> or not?
>  
> If not, what Java-tool could I use?
>  
> Thanks,
>  
> Jan