Thread: [Htmlparser-user] parsing raw downloaded content thats on file in arbitrary encodings
From: Antony S. <ant...@gm...> - 2006-03-04 01:51:48
Hi,

I am thinking of using htmlparser for a project. I have the content of URLs available in files on disk. Each file contains the headers, followed by the rest of the content as received from the web server (so it's just a series of bytes). I'll need something that can read and parse the headers, figure out the encoding for the rest of the content, and then parse the rest of the content.

I have seen the javadocs and done some digging. Here is what I think I need to do:

Write my own code to read through the headers to figure out the encoding.
Then call the following:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)

The questions I have on this approach are:

1. The 'html' parameter is of type 'String', so I'd think it automatically implies that the string's content is already in Java's internal format (UTF-16?). So what is the point of having the charset argument? I know UTF-16 is an encoding and not a charset, but I don't understand the relevance of a charset once something is in a Java String, which can only be Unicode AFAIK. It would have made sense to me if the html parameter were a byte array or some such thing.

2. I guess I could convert the byte buffer to a String myself once I have the code for encoding detection. But then what would I pass for the charset? It makes no sense to me in Java to say I have some data sitting in a Java String with charset iso-8859-1. I guess I am just confused about the need for a charset specification when something is already in a String.

Thanks in advance for any ideas and help.

-Antony Sequeira
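The first step described above (reading the stored headers to find the encoding before touching the body) could be sketched roughly as follows. This is only an illustration using JDK classes: the class name RawResponse and its methods are hypothetical, and it assumes the file holds an unmodified HTTP response whose header block ends with a blank line (CRLF CRLF).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper: a stored HTTP response split into header text and raw body bytes. */
final class RawResponse {
    final String headers; // header block, decoded as ISO-8859-1 (one char per byte)
    final byte[] body;    // body bytes, still undecoded

    private RawResponse(String headers, byte[] body) {
        this.headers = headers;
        this.body = body;
    }

    /** Splits the raw bytes at the first blank line (CRLF CRLF). */
    static RawResponse split(byte[] raw) {
        for (int i = 0; i + 3 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n' && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
                String headers = new String(raw, 0, i, StandardCharsets.ISO_8859_1);
                byte[] body = Arrays.copyOfRange(raw, i + 4, raw.length);
                return new RawResponse(headers, body);
            }
        }
        throw new IllegalArgumentException("no blank line separating headers from body");
    }

    /** Returns the charset named in the Content-Type header, or the fallback if absent or unknown. */
    Charset charset(Charset fallback) {
        for (String line : headers.split("\r\n")) {
            String lower = line.toLowerCase(Locale.ROOT);
            if (!lower.startsWith("content-type:")) {
                continue;
            }
            int at = lower.indexOf("charset=");
            if (at < 0) {
                continue;
            }
            String name = line.substring(at + "charset=".length());
            int semi = name.indexOf(';');
            if (semi >= 0) {
                name = name.substring(0, semi);
            }
            name = name.replace("\"", "").trim();
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: treat the header as unusable
                return fallback;
            }
        }
        return fallback;
    }
}
```

With the charset in hand, the body can be decoded with new String(r.body, charset), and the resulting String together with the charset name is what the createParser(String, String) call linked above would receive.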
From: <lui...@gm...> - 2006-03-04 04:13:54
Hi,

The charset parameter of the constructor Parser(String, String) will be returned when you call getEncoding(). It has no other effect besides this, I believe.

To read text from an InputStream (accessing a file, socket, etc.) a Reader should be used. To create a Reader, an explicit charset should be given (letting the Reader use the system's default is asking for problems...).

Because the creation of the Reader precedes the reading, the text encoding must be known prior to reading it. This is why the HTTP "Content-Type/charset-encoding" header is useful. However, this header is not always correct (consider it a hint), and sometimes it is not even available (!), and then we have to consult an oracle...

If the charset used is not the proper charset, then the String can be FIXED by converting it into bytes (with the same charset used for decoding) and then back to a String using the correct charset.

How can you tell whether THE correct charset was used? Well, for now you can look for an http-equiv meta tag that specifies the charset. If you find such a tag and its charset is the same one you used before, then you may trust your conversion. Otherwise you should choose to believe one of them (the HTTP header or the HTTP-EQUIV tag) and discard the other.

Beyond that, when can someone detect THE correct charset? The short answer: it's not easy and not always possible.

I hope this helps you, Antony.

By the way, I too have a related question for the developers:

I want to decouple the HTMLParser from the URLConnection where the network IO is done. I still want the parser to resolve links against the original URL of the page and to use the HTTP headers to parse the data (gunzipping data and charset decoding).

I think that the available constructors for Parser don't allow this decoupling in a straightforward fashion and without losing some of these features.

My current solution is to extend URLConnection and then use that object to feed the parser.

A perhaps cleaner solution would be to have a constructor taking three args:
URL (for link resolving)
InputStream for the data
HTTP headers

The HTTP headers could be as returned from URLConnection.getHeaderFields() for interoperability:

    public Map<String,List<String>> getHeaderFields();

Returns an unmodifiable Map of the header fields. The Map keys are Strings that represent the response-header field names. Each Map value is an unmodifiable List of Strings that represents the corresponding field values.

The signature of the constructor I'm proposing is:

    public Parser(String url, InputStream input, Map<String,List<String>> httpHeaders);

Until I know of a better solution, I will proceed with extending URLConnection and feeding it into the Parser with the setter setConnection() (I reuse the Parser to parse several documents).

Best Regards

Luís Gomes
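The repair described above (encode the String back into bytes with the charset that was wrongly used, then decode those bytes with the right one) might look like this in plain Java. The class and method names are made up for the illustration. One caveat that is an assumption about the reader's situation: the round trip only recovers the original bytes when the wrong charset maps every byte to a distinct character, as ISO-8859-1 does; text wrongly decoded as UTF-8 may already contain replacement characters and cannot be restored this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating the "convert back to bytes, then decode again" fix. */
final class Redecode {

    /** Re-encodes text with the charset it was wrongly decoded with, then decodes it correctly. */
    static String redecode(String text, Charset usedForDecoding, Charset correct) {
        byte[] originalBytes = text.getBytes(usedForDecoding);
        return new String(originalBytes, correct);
    }

    public static void main(String[] args) {
        // "café" as sent by the server in UTF-8
        byte[] onDisk = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // decoded with the wrong charset: the two UTF-8 bytes of 'é' become two characters
        String wrong = new String(onDisk, StandardCharsets.ISO_8859_1);

        String fixed = redecode(wrong, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(fixed.equals("caf\u00e9")); // prints true
    }
}
```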
From: Derrick O. <Der...@Ro...> - 2006-03-04 12:12:32
Luís,

I believe what you want to do is possible with the current API.

    Page page = new Page (new InputStreamSource (input, charset));
    page.setUrl (url);
    Parser parser = new Parser (new Lexer (page));

You would use the HTTP headers to figure out if it's gzipped (and use a GZIPInputStream) and determine the charset yourself.

Derrick
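The "figure out if it's gzipped" step mentioned in this reply can be sketched with JDK classes alone. The helper below is hypothetical (BodyStreams and unwrap are not part of htmlparser); it assumes the headers are available in the Map<String,List<String>> shape that URLConnection.getHeaderFields() returns, and it only recognises the literal value "gzip".

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: picks the right stream for the body based on the HTTP headers. */
final class BodyStreams {

    /** Wraps the input in a GZIPInputStream when a Content-Encoding header says "gzip". */
    static InputStream unwrap(InputStream input, Map<String, List<String>> headers)
            throws IOException {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            // getHeaderFields() stores the status line under a null key, so skip it
            if (name == null || !name.equalsIgnoreCase("Content-Encoding")) {
                continue;
            }
            for (String value : entry.getValue()) {
                if (value.trim().equalsIgnoreCase("gzip")) {
                    return new GZIPInputStream(input);
                }
            }
        }
        return input; // not compressed (or an encoding this sketch does not handle)
    }
}
```

The stream returned by unwrap, together with a charset determined separately, would then take the place of input and charset in the Page / InputStreamSource lines quoted from the reply.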
From: Antony S. <ant...@gm...> - 2006-03-07 04:08:14
Attachments:
ByteBufferURL.java
Thank you. I will use your suggested approach if my current approach does not work out.

Currently I have come up with a means of providing a URLConnection backed by a byte array (instead of a TCP connection) and using that connection to construct the parser object. I have attached the code file. It is ugly and very specific to my current experimentation. I use it like this:

    URL urlob = ByteBufferURL.fromByteArray(new URL("http://original url string so relative links get resolved right"), bytearray, bytecontentlength);
    Parser parser = new Parser(urlob.openConnection());

This does not result in any network activity such as resolving or connecting (at least in my limited testing), as desired. The advantage IMO is that it keeps the rest of the code simple (hopefully).

I am responding since this may be useful to Luís Gomes.

I have other unrelated questions that I'll ask in a separate thread.

Thanks for the pointers.

-Antony
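The attached ByteBufferURL.java is not reproduced in this archive, so the following is not that file. It is a minimal sketch of the same general idea using only java.net: a URL that keeps the original address (so relative links can still be resolved against it) but whose openConnection() returns a URLConnection serving bytes from memory. The names ByteArrayConnection and wrap, and the decision to report only the content type and length, are assumptions made for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

/** Hypothetical URLConnection that serves a byte array instead of opening a socket. */
final class ByteArrayConnection extends URLConnection {
    private final byte[] body;
    private final String contentType;

    ByteArrayConnection(URL url, byte[] body, String contentType) {
        super(url);
        this.body = body;
        this.contentType = contentType;
    }

    @Override
    public void connect() {
        connected = true; // nothing to connect to; the data is already in memory
    }

    @Override
    public InputStream getInputStream() {
        return new ByteArrayInputStream(body);
    }

    @Override
    public String getContentType() {
        return contentType;
    }

    @Override
    public int getContentLength() {
        return body.length;
    }

    /** Builds a URL with the original address whose connections read from the given bytes. */
    static URL wrap(URL original, final byte[] body, final String contentType)
            throws IOException {
        URLStreamHandler handler = new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL u) {
                return new ByteArrayConnection(u, body, contentType);
            }
        };
        // the three-argument constructor attaches the custom handler to this URL only
        return new URL(null, original.toExternalForm(), handler);
    }
}
```

A connection obtained from wrap(...).openConnection() could then be handed to new Parser(...) in the way the message shows. Whether the parser needs further header values beyond the content type is not stated in the thread, so that part would need checking against the library.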