[Htmlparser-user] parsing raw downloaded content thats on file in arbitrary encodings
From: Antony S. <ant...@gm...> - 2006-03-04 01:51:48
Hi,

I am thinking of using htmlparser for a project. I have the content of URLs available in files on disk. Each file contains the headers, followed by the rest of the content as received from the web server (so it's just a series of bytes). I need something that can read and parse the headers, figure out the encoding for the rest of the content, and then parse the rest of the content.

I have seen the javadocs and done some digging. Here is what I think I need to do:

Write my own code to read through the headers to figure out the encoding, then call the following:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)

The questions I have on this approach are:

1. The 'html' parameter is of type 'String', so I'd think it automatically implies that the string's content is already in Java's internal format (utf-16 ?). So what is the point of having the charset argument? I know utf-16 is an encoding and not a charset, but I don't understand the relevance of a charset once something is in a Java 'String', which can only be Unicode AFAIK. It would have made sense to me if the html parameter were a byte array or some such thing.

2. I guess I could convert from the byte buffer to a String myself once I have the code for encoding detection. But then what would I pass for the charset? It makes no sense to me in Java to say I have some data sitting in a Java 'String' with charset iso-8859-1.

I guess I am just confused about the need for a charset specification when something is already in a 'String'.

Thanks in advance for any ideas and help.

-Antony Sequeira
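
For illustration, the approach described above (read the headers, work out the encoding, then decode the remaining bytes) might look roughly like the sketch below in plain Java. It makes several assumptions that are not stated in the post: that the headers end at the first blank line (CRLF CRLF), that the encoding is announced in a Content-Type header, and that ISO-8859-1 is an acceptable fallback. The class and method names are hypothetical, and the sketch deliberately uses only the JDK, not htmlparser itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of htmlparser: decodes a stored raw HTTP response.
final class RawResponseDecoder {

    /** Returns the index of the first body byte (just past CRLF CRLF), or -1 if no blank line is found. */
    static int bodyOffset(byte[] raw) {
        for (int i = 0; i + 3 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n' && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** Looks for "charset=..." in a Content-Type header line; returns the fallback if absent or unknown. */
    static Charset charsetFromHeaders(String headers, Charset fallback) {
        for (String line : headers.split("\r\n")) {
            String lower = line.toLowerCase(Locale.ROOT);
            if (!lower.startsWith("content-type:")) {
                continue;
            }
            int idx = lower.indexOf("charset=");
            if (idx < 0) {
                continue;
            }
            String name = line.substring(idx + "charset=".length()).trim();
            int semi = name.indexOf(';');
            if (semi >= 0) {
                name = name.substring(0, semi).trim();
            }
            name = name.replace("\"", "");
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
                return fallback;
            }
        }
        return fallback;
    }

    /** Decodes the body bytes into a String using the charset named in the headers. */
    static String decodeBody(byte[] raw) {
        Charset fallback = StandardCharsets.ISO_8859_1; // assumed default
        int offset = bodyOffset(raw);
        if (offset < 0) {
            // No header/body separator found: treat everything as body.
            return new String(raw, fallback);
        }
        // Header bytes are read as ISO-8859-1 so every byte maps to exactly one char.
        String headers = new String(raw, 0, offset, StandardCharsets.ISO_8859_1);
        Charset cs = charsetFromHeaders(headers, fallback);
        return new String(raw, offset, raw.length - offset, cs);
    }
}
```

The String returned by decodeBody is what would then be handed to the createParser method linked above. What to pass as its charset argument at that point is exactly the open question of this post, so the sketch stops short of that call.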