Thread: [Htmlparser-user] html tag stripping and html entity conversion without Unicode-conversion

From: Jan <jan...@gm...> - 2006-02-03 05:48:12

Dear Experts and Users,

Could anyone say for sure whether htmlparser is capable of HTML tag
stripping and HTML entity conversion, but without Unicode conversion, or
not?

If not, what Java-tool could I use?

 Thanks,

Jan

Re: [Htmlparser-user] html tag stripping and html entity conversion without Unicode-conversion

From: Derrick O. <Der...@Ro...> - 2006-02-03 13:13:01

Java uses Unicode.  It stores characters in UTF-16 internally, i.e. a char
is 16 bits, and a String is a sequence of 16-bit values encoding Unicode in
UTF-16.
Character entities are a way for HTML documents to contain
Unicode characters outside their current encoding, and also to escape the
reserved characters HTML syntax is based on, like left angle bracket (&lt;)
and ampersand (&amp;).  These must be converted to Unicode to extract the
semantic meaning of the page.
So your question is, "Is there a Java program that uses something
besides the String type to store Unicode when parsing HTML?"
I don't think so.

You might want to look at the Translate class in the util package to see 
if it does what you want.
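
To make the point about String and entities concrete, here is a rough
sketch using only the JDK (Java 9 or later), not htmlparser itself.  The
class name EntityDemo, its helper methods, and the tiny entity table are
all made up for illustration; the regex-based tag stripping is a crude
stand-in for a real parser.  It decodes entities into a Java String (which
is necessarily Unicode) and then, only at the output stage, encodes the
text into a non-Unicode charset such as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from htmlparser.
class EntityDemo {
    // Tiny illustrative table. HTML defines far more named entities.
    private static final Map<String, Integer> NAMED = Map.of(
            "lt", 0x3C, "gt", 0x3E, "amp", 0x26,
            "quot", 0x22, "eacute", 0xE9, "euro", 0x20AC);

    // Naive tag matcher: good enough for a sketch, not for real-world HTML.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Replaces numeric and known named entities with the characters they
    // stand for. The result is a String, i.e. UTF-16 encoded Unicode.
    static String decodeEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            Integer cp;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                cp = Integer.valueOf(body.substring(2), 16);
            } else if (body.startsWith("#")) {
                cp = Integer.valueOf(body.substring(1));
            } else {
                cp = NAMED.get(body); // null if the name is not in the table
            }
            // Unknown or invalid references are left exactly as written.
            String repl = (cp == null || !Character.isValidCodePoint(cp))
                    ? m.group()
                    : new String(Character.toChars(cp));
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Encodes the decoded text into a legacy 8-bit charset. Characters
    // that ISO-8859-1 cannot represent are replaced with '?'.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String html = "<p>Caf&eacute; &lt;1 &amp; 2&gt; &#8364;5</p>";
        // Strip tags first, so a decoded &lt; is not mistaken for markup.
        String text = decodeEntities(stripTags(html));
        System.out.println(text);
        System.out.println(text.length() + " UTF-16 code units");
        System.out.println(toLatin1(text).length + " bytes in ISO-8859-1");
    }
}
```

The order matters: tags are removed before entities are decoded, otherwise
text such as &lt;b&gt; would turn into something that looks like a tag.
Note also that the euro sign in the sample input has no ISO-8859-1
equivalent, so getBytes substitutes a question mark for it.  That loss is
the price of leaving Unicode at the output stage.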
