Thread: [Htmlparser-developer] Working out the length of the HTML page being downloaded
From: Sam J. <ga...@yh...> - 2002-12-13 07:32:35
Hi Somik,

Once again, apologies for barraging you with questions this week, but I guess that's what open source is all about, eh?

When you wrote the NeuroGridHTMLParser for me a while back, you added functionality to support getting the full plain text of a page. I've been busy modifying that, and one of the things I'd really like to do is initialize a StringBuffer to an appropriate size, so that the buffer doesn't have to be resized while parsing the page.

My first thought is that I would like to get access to the HTTP headers that would tell me the content length of the incoming HTML page, but looking through the HTMLParser as is, it looks like I can't really access those headers directly.

Is there some other mechanism to determine a document's length before parsing starts, or could we put one in?

Naturally, when reading from a file as opposed to a URL one would call a different underlying method, but it would seem plausible to have a getDocumentSize() method. Or how about access to the underlying File or URLConnection objects?

I'm just thinking out loud, but maybe giving the user the ability to pass in a URLConnection or File object would be best, as then the user could get all the info they need.
I guess some changes would be required to support this, given that currently the HTMLParser opens connections using this private method:

    private HTMLReader openURLConnection() throws HTMLParserException {
        try {
            // It's a web address
            resourceLocn = HTMLLinkProcessor.removeEscapeCharacters(resourceLocn);
            resourceLocn = checkEnding(resourceLocn);
            resourceLocn = HTMLLinkProcessor.fixSpaces(resourceLocn);
            URL url = new URL(resourceLocn);
            URLConnection uc = url.openConnection();
            return new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(), "8859_4")), resourceLocn);
        }
        catch (Exception e) {
            String msg = "HTMLParser.openURLConnection() : Error in opening a URL connection to " + resourceLocn;
            HTMLParserException ex = new HTMLParserException(msg, e);
            feedback.error(msg, ex);
            throw ex;
        }
    }

CHEERS> SAM
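For illustration, here is a minimal sketch of the pre-sizing idea using only java.net.URLConnection from the JDK. The class name BufferSizing, the helper names, and the 16 KB fallback are hypothetical and not part of HTMLParser. It assumes the caller holds the URLConnection, which is exactly the access being asked for above.

```java
import java.net.URLConnection;

class BufferSizing {
    // Fallback capacity when no usable length is reported (hypothetical value).
    static final int DEFAULT_CAPACITY = 16 * 1024;

    /** Maps a reported Content-Length (-1 when unknown) to an initial buffer capacity. */
    static int initialCapacity(int contentLength) {
        return contentLength > 0 ? contentLength : DEFAULT_CAPACITY;
    }

    /** Builds a StringBuffer pre-sized from the connection's Content-Length header. */
    static StringBuffer bufferFor(URLConnection uc) {
        // getContentLength() returns -1 if the header is absent or exceeds Integer.MAX_VALUE.
        return new StringBuffer(initialCapacity(uc.getContentLength()));
    }

    public static void main(String[] args) {
        // No network access here: just show the mapping for an unknown and a known length.
        System.out.println(initialCapacity(-1));
        System.out.println(new StringBuffer(initialCapacity(5000)).capacity());
    }
}
```

Note that Content-Length counts bytes of markup, not characters of extracted text, so for plain-text extraction it is only an upper-bound style estimate, and servers are free to omit it.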
From: Derrick O. <Der...@ro...> - 2002-12-13 13:05:18
Sam,

I've had some success passing in an HTMLReader object that I construct myself from the contents of a URL (which lets you get your own header info from the connection).

However, there is an outstanding issue ([ 649133 ] reader.reset crash in HTMLParser, https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399): a reset() on the reader object at the end of parsing throws an exception for pages longer than 5000 characters. Until it gets fixed, you have to use a workaround like this:

    reader = new HTMLReader (some_reader, some_url);
    parser = new HTMLParser (reader);
    // reset/remark to end of stream
    reader.reset ();
    reader.mark (real_number_of_characters_available_in_reader);
    // proceed with parse

Derrick
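As background to the reset() workaround, the following sketch shows the general mark/reset contract using plain java.io.BufferedReader rather than HTMLReader. It assumes HTMLReader follows the same contract, which this thread does not confirm. The class and method names are hypothetical. The point it illustrates: a mark is only guaranteed to survive if fewer characters than the read-ahead limit are read before reset(), so re-marking with a limit larger than the page length keeps a later reset() valid.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

class MarkResetSketch {
    /** Builds a dummy "page" of n characters. */
    static String pageOf(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    /** Marks the reader, reads to end of stream, then reports whether reset() succeeds. */
    static boolean canResetAfterFullRead(String text, int readAheadLimit) {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        try {
            reader.mark(readAheadLimit);
            while (reader.read() != -1) {
                // consume the whole stream, as a parser would
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            reader.reset();
            return true;
        } catch (IOException e) {
            // The mark was invalidated because more than readAheadLimit chars were read.
            return false;
        }
    }

    public static void main(String[] args) {
        String page = pageOf(6000);
        // A limit smaller than the page may invalidate the mark; the JDK docs say reset "may fail".
        System.out.println("limit 5000: " + canResetAfterFullRead(page, 5000));
        // A limit larger than the page keeps the mark valid.
        System.out.println("limit 6001: " + canResetAfterFullRead(page, page.length() + 1));
    }
}
```

The limit is deliberately one more than the page length: a limit exactly equal to the number of characters read is not guaranteed to preserve the mark.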