[Htmlparser-user] decouple parser from URLConnection
From: Derrick O. <Der...@Ro...> - 2006-03-04 12:12:32
Luís,

I believe what you want to do is possible with the current API.

    Page page = new Page (new InputStreamSource (input, charset));
    page.setUrl (url);
    Parser parser = new Parser (new Lexer (page));

You would use the HTTP headers to figure out if it's gzipped (and use a
GZIPInputStream) and determine the charset yourself.

Derrick

Luís Manuel dos Santos Gomes wrote:

> Hi,
>
> <snip>
>
> By the way, I too have a related question for the developers:
>
> I want to decouple the HTMLParser from the URLConnection where the
> network IO is done.
> I still want the parser to resolve links against the original URL of
> the page and to use the HTTP headers to parse the data (gunzipping
> data and charset decoding).
>
> I think that the available constructors for Parser don't allow this
> decoupling in a straightforward fashion and without losing some of
> these features.
>
> My current solution is to extend URLConnection and then use that
> object to feed the parser.
>
> A perhaps cleaner solution would be to have a constructor taking
> three args:
>   URL (for link resolving)
>   InputStream for the data
>   HTTP headers
>
> The HTTP headers could be as returned from
> URLConnection.getHeaderFields() for interoperability:
>   public Map<String,List<String>> getHeaderFields();
> Returns an unmodifiable Map of the header fields. The Map keys are
> Strings that represent the response-header field names. Each Map
> value is an unmodifiable List of Strings that represents the
> corresponding field values.
>
> The signature of the constructor I'm proposing is:
>   public Parser(String url, InputStream input,
>                 Map<String,List<String>> httpHeaders);
>
> I will proceed with extending URLConnection and feeding it into the
> Parser with the setter setConnection() (I reuse the Parser to parse
> several documents) until I know of a better solution.
>
> Best Regards
>
> Luís Gomes
>
> On Mar 4, 2006, at 1:51 AM, Antony Sequeira wrote:
>
>> Hi
>>
>> I am thinking of using htmlparser for a project.
>> I have the content of URLs available in files on disk.
>> Each file contains the headers, followed by the rest of the content
>> as received from the webserver (so it's just a series of bytes).
>> I'll need something that can read and parse the headers, figure out
>> the encoding for the rest of the content, and then parse the rest of
>> the content.
>>
>> I have seen the javadocs and done some digging.
>> Here is what I think I need to do:
>> Write my own code to read through the headers to figure out the
>> encoding.
>> Then call the following:
>> http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)
>>
>> The questions I have on this approach are:
>> 1. The 'html' parameter is of type 'String'. I'd think that
>> automatically implies that the string's content is already in Java
>> format (utf-16 ?). So what is the point of having the charset
>> argument?
>> I know utf-16 is an encoding and not a charset, but I don't
>> understand the relevance of charset once something is in a 'java
>> String', which can only be unicode AFAIK.
>> It would have made sense to me if the html parameter was a byte
>> array or some such thing.
>>
>> 2. I guess I could convert to String myself from the byte buffer once
>> I have the code for encoding detection. But then what would I pass
>> for the charset? It makes no sense to me in Java to say I have some
>> data sitting in a 'java String' with charset iso-8859-1. I guess I am
>> just confused about the need for a charset specification when
>> something is already in a 'String'.
>>
>> Thanks in advance for any ideas and help.
>>
>> -Antony Sequeira
>>
>> _______________________________________________
>> Htmlparser-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
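
For reference, here is a minimal sketch of the header handling Derrick's reply leaves to the caller: deciding whether the body is gzipped and which charset it uses. It relies only on the JDK. The class name `HeaderSniffer`, its method names, and the fallback charset parameter are hypothetical and not part of HTMLParser. The header map is assumed to have the shape returned by `URLConnection.getHeaderFields()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of HTMLParser.
final class HeaderSniffer {
    private HeaderSniffer() {}

    /** First value of the named header, matched case-insensitively, or null. */
    static String header(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String key = e.getKey(); // URLConnection stores the status line under a null key
            List<String> values = e.getValue();
            if (key != null && key.equalsIgnoreCase(name)
                    && values != null && !values.isEmpty()) {
                return values.get(0);
            }
        }
        return null;
    }

    /** True if Content-Encoding is exactly "gzip" (ignoring case and whitespace). */
    static boolean isGzipped(Map<String, List<String>> headers) {
        String encoding = header(headers, "Content-Encoding");
        return encoding != null && encoding.trim().equalsIgnoreCase("gzip");
    }

    /** The charset parameter of Content-Type, or the fallback if none is given. */
    static String charsetOf(Map<String, List<String>> headers, String fallback) {
        String contentType = header(headers, "Content-Type");
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional double quotes around the value.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    /** Wraps the raw stream in a GZIPInputStream when the headers call for it. */
    static InputStream decode(InputStream raw, Map<String, List<String>> headers)
            throws IOException {
        return isGzipped(headers) ? new GZIPInputStream(raw) : raw;
    }
}
```

The stream returned by `decode` and the name returned by `charsetOf` would then be passed as `input` and `charset` to `new InputStreamSource (input, charset)` in the snippet from Derrick's reply.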