Re: [Htmlparser-user] parsing raw downloaded content thats on file in arbitrary encodings
From: <lui...@gm...> - 2006-03-04 04:13:54
Hi,

The charset parameter of the constructor Parser(String, String) will be returned when you call getEncoding(). It has no other effect besides this, I believe.

To read text from an InputStream (accessing a file, socket, etc.) a Reader should be used. To create a Reader, an explicit charset should be given (letting the Reader use the system's default is asking for problems...).

Because the creation of the Reader precedes the reading, the text encoding must be known prior to reading it. This is why the HTTP "Content-Type/charset-encoding" header is useful. However, this header is not always correct (consider it a hint), and sometimes it is not even available (!), and then we should consult an oracle...

If the charset used is not the proper charset, then the String can be FIXED by converting it into bytes (with the same charset used for decoding) and then back to a String using the correct charset.

How to tell if THE correct charset was used? Well, for now you can look for an http-equiv meta tag that specifies the charset. If you find such a tag and the charset is the same one you used before, then you may trust your conversion. Otherwise you should choose to believe one of them (the HTTP header or the HTTP-EQUIV tag) and discard the other.

Otherwise, when can someone detect THE correct charset? The short answer: it's not easy and not always possible.

I hope this helps you, Antony.

By the way, I too have a related question for the developers:

I want to decouple the HTMLParser from the URLConnection where the network IO is done. I still want the parser to resolve links against the original URL of the page and to use the HTTP headers to parse the data (gunzipping data and charset decoding). I think that the available constructors for Parser don't allow this decoupling in a straightforward fashion and without losing some of these features.
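
One way to get this decoupling with the existing API is to hand the parser a URLConnection subclass that serves already-downloaded bytes and headers instead of opening a socket. A minimal sketch, under assumptions: StoredConnection and its constructor are hypothetical names, not part of HTMLParser, and it only overrides the methods a consumer would most plausibly call.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical wrapper: exposes stored bytes and headers, performs no network IO. */
final class StoredConnection extends URLConnection {
    private final byte[] body;
    private final Map<String, List<String>> headers;

    StoredConnection(URL url, byte[] body, Map<String, List<String>> headers) {
        super(url); // the URL is kept so relative links can be resolved against it
        this.body = body;
        this.headers = headers;
    }

    @Override
    public void connect() {
        connected = true; // nothing to open, the data is already in memory
    }

    @Override
    public InputStream getInputStream() {
        return new ByteArrayInputStream(body);
    }

    @Override
    public Map<String, List<String>> getHeaderFields() {
        return headers;
    }

    @Override
    public String getHeaderField(String name) {
        // Header names are case-insensitive; return the last value if repeated.
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            List<String> values = e.getValue();
            if (e.getKey() != null && e.getKey().equalsIgnoreCase(name) && !values.isEmpty()) {
                return values.get(values.size() - 1);
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put("Content-Type", Arrays.asList("text/html; charset=ISO-8859-1"));
        StoredConnection conn = new StoredConnection(
                new URL("http://example.com/page.html"),
                "<html></html>".getBytes(StandardCharsets.ISO_8859_1),
                headers);
        System.out.println(conn.getContentType());
        // With HTMLParser on the classpath, the connection could then be handed
        // to the parser through its setConnection() setter:
        //     parser.setConnection(conn);
    }
}
```

getContentType() and getContentEncoding() need no override here, because java.net.URLConnection implements them on top of getHeaderField().
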
My current solution is to extend URLConnection and then use that object to feed the parser.

A perhaps cleaner solution would be to have a constructor taking three args:

  - URL (for link resolving)
  - InputStream for the data
  - HTTP headers

The HTTP headers could be as returned from URLConnection.getHeaderFields() for interoperability:

    public Map<String,List<String>> getHeaderFields();

    Returns an unmodifiable Map of the header fields. The Map keys are
    Strings that represent the response-header field names. Each Map
    value is an unmodifiable List of Strings that represents the
    corresponding field values.

The signature of the constructor I'm proposing is:

    public Parser(String url, InputStream input, Map<String,List<String>> httpHeaders);

I will proceed with extending URLConnection and feeding it into the Parser with the setter setConnection() (I reuse the Parser to parse several documents) until I know of a better solution.

Best Regards
Luís Gomes

On Mar 4, 2006, at 1:51 AM, Antony Sequeira wrote:
> Hi
>
> I am thinking of using htmlparser for a project.
> I have content of urls available in file on disk
> The file contains the headers, followed by the rest of the content as
> received from the webserver (so its just a series of bytes).
> I'll need something that can read and parse the headers, figure out
> the encoding for the rest of the content and then parse the rest of
> the content.
>
> I have seen the javadocs and done some digging.
> Here is what I think I need to do
> Write my own code to read through headers to figure out encoding
> Then call the following
> http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)
>
> The questions I have on this approach is -
> 1. The 'html' parameter is of type 'String', I'd think it would
> automatically imply that strings content is already in java format
> (utf-16 ?) .
> So what is the point of having the charset argument ?
> I know utf-16 is a encoding and not charset, but I don't understand
> the relevance of charset once something is in a 'java String' which
> can only be unicode AFAIK.
> It would have made sense to me if the html parameter was byte array or
> some such thing.
>
> 2. I guess I could convert to String myself from the byte buffer once
> I have the code for encoding detection. But then what would I pass for
> the charset. It makes no sense to me in Java to say I have some data
> sitting in a 'java String' with charset iso-8859-1. I guess I am just
> confused about the need for charset specification when something is
> already in 'String'.
>
> Thanks in advance for any ideas and help.
>
> -Antony Sequeira
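
For reference, the String repair described earlier in this reply (turn the wrongly decoded String back into bytes with the charset that was used for decoding, then decode those bytes with the correct charset) could look roughly like the sketch below. The names CharsetRepair and redecode are made up for illustration and are not part of HTMLParser, and the code assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRepair {
    /**
     * Undoes a decoding done with the wrong charset and decodes again with
     * the right one. Hypothetical helper, for illustration only.
     */
    static String redecode(String garbled, Charset usedForDecoding, Charset correct) {
        // Step 1: recover the raw bytes by encoding with the charset that was
        // (wrongly) used to decode them.
        byte[] rawBytes = garbled.getBytes(usedForDecoding);
        // Step 2: decode those bytes with the charset the content is really in.
        return new String(rawBytes, correct);
    }

    public static void main(String[] args) {
        String original = "Lu\u00EDs"; // contains one non-ASCII character
        byte[] onDisk = original.getBytes(StandardCharsets.UTF_8);

        // Simulate the mistake: UTF-8 bytes read as ISO-8859-1.
        String garbled = new String(onDisk, StandardCharsets.ISO_8859_1);

        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original));
    }
}
```

The round trip only recovers the original bytes when the wrongly used charset maps every byte to its own character, as ISO-8859-1 does. If the first decoding already replaced undecodable bytes with U+FFFD (which can happen when UTF-8 was the wrong guess), the information is gone and the raw bytes have to be read again.
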