[Htmlparser-user] parsing raw downloaded content that's on file in arbitrary encodings

Hi

I am thinking of using htmlparser for a project.
I have the content of URLs available in a file on disk.
The file contains the headers, followed by the rest of the content as
received from the web server (so it's just a series of bytes).
I'll need something that can read and parse the headers, figure out
the encoding of the rest of the content, and then parse the rest of
the content.

I have seen the javadocs and done some digging.
Here is what I think I need to do:
Write my own code to read through the headers and figure out the encoding.
Then call the following:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)
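
A rough sketch of the header-reading step I have in mind, using only
the JDK. The class and method names (RawResponseSketch, bodyOffset,
charsetOf, decodeBody) are made up for illustration and are not part of
htmlparser. It assumes the headers end at the first CRLF CRLF, that the
charset is declared in a Content-Type header, and that ISO-8859-1 is an
acceptable fallback when no charset is declared; real files may break
any of these assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of htmlparser.
final class RawResponseSketch {

    // Matches the charset parameter of a Content-Type header line.
    private static final Pattern CHARSET = Pattern.compile(
            "^content-type:.*?charset\\s*=\\s*\"?([^\\s;\"]+)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Index of the first body byte (just past CRLF CRLF), or -1 if absent.
    static int bodyOffset(byte[] raw) {
        for (int i = 0; i + 3 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n'
                    && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    // Charset named in the headers, or the fallback when none is declared.
    static String charsetOf(String headers, String fallback) {
        Matcher m = CHARSET.matcher(headers);
        return m.find() ? m.group(1) : fallback;
    }

    // Decodes the body bytes with the charset declared in the headers.
    // Charset.forName throws if the declared name is unknown to the JVM.
    static String decodeBody(byte[] raw) {
        int offset = bodyOffset(raw);
        if (offset < 0) {
            throw new IllegalArgumentException("no header/body separator found");
        }
        // Header bytes are decoded as ISO-8859-1 so every byte maps to a char.
        String headers = new String(raw, 0, offset, StandardCharsets.ISO_8859_1);
        String charset = charsetOf(headers, "ISO-8859-1"); // assumed fallback
        return new String(raw, offset, raw.length - offset, Charset.forName(charset));
    }

    public static void main(String[] args) {
        byte[] raw = ("HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=ISO-8859-1\r\n\r\n"
                + "caf\u00e9").getBytes(StandardCharsets.ISO_8859_1);
        String html = decodeBody(raw);
        System.out.println(html.equals("caf\u00e9")); // prints true
        // html and the detected charset name are what I would then pass on
        // to the createParser(String, String) method linked above.
    }
}
```

This does not look at any charset declared inside the HTML itself (for
example in a meta tag), only at the headers.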

The questions I have about this approach are:
1. The 'html' parameter is of type 'String', which I'd think
automatically implies that the string's content is already in Java's
internal format (utf-16?). So what is the point of having the charset
argument? I know utf-16 is an encoding and not a charset, but I don't
understand the relevance of a charset once something is in a 'java
String', which can only be Unicode AFAIK.
It would have made sense to me if the html parameter were a byte array
or some such thing.

2. I guess I could convert the byte buffer to a String myself once
I have the code for encoding detection. But then what would I pass for
the charset? It makes no sense to me in Java to say I have some data
sitting in a 'java String' with charset iso-8859-1. I guess I am just
confused about the need for a charset specification when something is
already in a 'String'.
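
To show what I mean, here is a small JDK-only sketch (the class name
CharsetDemo is made up). As far as I understand it, the charset only
matters at the moment the bytes are decoded; after that, two Strings
decoded from differently encoded bytes are indistinguishable, and a
wrong choice of charset just yields a different String.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class CharsetDemo {

    // The single character U+00E9 as it would arrive on the wire
    // in ISO-8859-1 and in UTF-8.
    static final byte[] LATIN1_BYTES = {(byte) 0xE9};
    static final byte[] UTF8_BYTES = {(byte) 0xC3, (byte) 0xA9};

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Decoded with the matching charset, both give the same String.
        String fromLatin1 = decodeLatin1(LATIN1_BYTES);
        String fromUtf8 = decodeUtf8(UTF8_BYTES);
        System.out.println(fromLatin1.equals(fromUtf8)); // prints true

        // Decoding UTF-8 bytes as ISO-8859-1 gives two chars instead of one.
        String wrong = decodeLatin1(UTF8_BYTES);
        System.out.println(wrong.length()); // prints 2
    }
}
```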

Thanks in advance for any ideas and help.

-Antony Sequeira