Re: [Htmlparser-user] html tag stripping and html entity conversion without Unicode-conversion
From: Derrick O. <Der...@Ro...> - 2006-02-03 13:13:01
Java uses Unicode. It stores characters in UTF-16 internally, i.e. a char is 16 bits and a String is a sequence of 16-bit values encoding Unicode in UTF-16.

Character entity conversion is a way for HTML documents to contain Unicode characters outside their current encoding, and also to escape the characters HTML reserves, like the left angle bracket (<) and the ampersand (&). These must be converted to Unicode to extract the semantic meaning of the page.

So your question is, "Is there a Java program that uses something besides the String type to store Unicode when parsing HTML?" I don't think so.

You might want to look at the Translate class in the util package to see if it does what you want.

Jan wrote:

> Dear Experts and Users,
>
> Could anyone say for sure whether htmlparser is capable for html tag
> stripping and html entity conversion, but without Unicode-conversion,
> or not?
>
> If not, what Java-tool could I use?
>
> Thanks,
>
> Jan
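
To illustrate the point in the reply that entities have to pass through Java's UTF-16 String type, here is a minimal sketch using only the standard library. It is not htmlparser's implementation: the class name EntityDemo, its methods, the naive tag-stripping regex, and the five-entry entity table are all assumptions made for illustration. If the goal is output in some non-Unicode encoding, the decoded String can be re-encoded as the last step, as main() does with ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of htmlparser.
final class EntityDemo {
    // Tiny illustrative subset of named entities (a real table has hundreds).
    private static final Map<String, String> NAMED = Map.of(
            "lt", "<", "gt", ">", "amp", "&", "quot", "\"", "eacute", "\u00e9");

    // Matches &name;, &#123; and &#x7B; style character references.
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    // Naive tag stripping: removes anything between '<' and '>'.
    // It ignores comments, scripts and '>' inside attribute values.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Replaces character references with the Unicode characters they name.
    // Unknown names and invalid code points are left untouched.
    static String decodeEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(2), 16), m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(1)), m.group());
            } else {
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String fromCodePoint(int codePoint, String fallback) {
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : fallback;
    }

    public static void main(String[] args) {
        String html = "<p>Caf&eacute; &amp; cr&#232;me &lt;3</p>";
        // Strip tags first, so that a decoded "&lt;" is never mistaken for markup.
        String text = decodeEntities(stripTags(html));
        System.out.println(text);

        // The String itself is UTF-16; pick the external encoding only on output.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(latin1.length + " bytes as ISO-8859-1, "
                + utf8.length + " bytes as UTF-8");
    }
}
```

Characters that the target charset cannot represent are replaced by getBytes with that charset's default substitute (typically '?'), so the choice of output encoding still limits what survives the round trip.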