[Htmlparser-developer] Working out the length of the HTML page being downloaded
From: Sam J. <ga...@yh...> - 2002-12-13 07:32:35
Hi Somik,

Once again, apologies for barraging you with questions this week, but I guess that's what open source is all about, eh?

When you wrote the NeuroGridHTMLParser for me a while back, you added functionality to support getting the full plain text of a page. I've been busy modifying that, and one of the things I'd really like to do is initialize a StringBuffer to an appropriate size, so that the buffer doesn't have to be resized while parsing the page.

My first thought is that I would like to get access to the HTTP headers that would tell me the content length of the incoming HTML page, but looking through the HTMLParser as it stands, it looks like I can't access those headers directly. Is there some other mechanism to determine a document's length before parsing starts, or could we put one in? Naturally, when reading from a file as opposed to a URL one would call a different underlying method, but it would seem plausible to have a getDocumentSize() method. Or how about access to the underlying File or URLConnection objects? I'm just thinking out loud, but maybe giving the user the ability to pass in a URLConnection or File object would be best, as then the user could get all the info they need.
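To make the sizing idea concrete, here is a minimal sketch of what a getDocumentSize()-style helper might compute. It uses only the standard java.net and java.io calls (URLConnection.getContentLength() and File.length()); the class name BufferSizing, the method names, and the two constants are assumptions made up for illustration and are not part of HTMLParser. Note that the reported length is in bytes and may be missing altogether (getContentLength() returns -1 when the server sends no Content-Length header), so it is only a hint for the initial capacity.

```java
import java.io.File;
import java.net.URLConnection;

// Hypothetical helper, not part of HTMLParser: turns a reported document
// length into a sensible initial StringBuffer capacity.
class BufferSizing {
    // Assumed fallback when the length is unknown (illustrative value).
    static final int DEFAULT_CAPACITY = 16 * 1024;
    // Assumed upper bound so a bogus header cannot trigger a huge allocation.
    static final int MAX_CAPACITY = 8 * 1024 * 1024;

    // Clamp a reported length (in bytes) into [1, MAX_CAPACITY],
    // falling back to DEFAULT_CAPACITY when it is unknown or zero.
    static int initialCapacity(long reportedLength) {
        if (reportedLength <= 0) {
            return DEFAULT_CAPACITY;
        }
        return (int) Math.min(reportedLength, (long) MAX_CAPACITY);
    }

    // For a URL: getContentLength() returns -1 if no length was reported.
    static int capacityFor(URLConnection uc) {
        return initialCapacity(uc.getContentLength());
    }

    // For a local file: File.length() returns 0 if the file does not exist.
    static int capacityFor(File file) {
        return initialCapacity(file.length());
    }
}
```

With something like this, the text buffer could be created as new StringBuffer(BufferSizing.capacityFor(uc)). Since the plain text of a page is normally shorter than its HTML, the byte count will tend to overestimate the space needed, which is harmless for this purpose.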
I guess some changes would be required to support this, given that currently the HTMLParser opens connections using this private method:

    private HTMLReader openURLConnection() throws HTMLParserException {
        try {
            // Its a web address
            resourceLocn=HTMLLinkProcessor.removeEscapeCharacters(resourceLocn);
            resourceLocn=checkEnding(resourceLocn);
            resourceLocn=HTMLLinkProcessor.fixSpaces(resourceLocn);
            URL url = new URL(resourceLocn);
            URLConnection uc = url.openConnection();
            return new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"8859_4")),resourceLocn);
        }
        catch (Exception e) {
            String msg="HTMLParser.openURLConnection() : Error in opening a URL connection to "+resourceLocn;
            HTMLParserException ex = new HTMLParserException(msg,e);
            feedback.error(msg,ex);
            throw ex;
        }
    }

Cheers,
Sam
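For illustration, the "pass in a URLConnection" idea might look roughly like the sketch below. It is written against the standard library only and does not touch HTMLParser's own classes: ConnectionText and readAll are hypothetical names, the 16 KB fallback is an arbitrary assumption, and the charset is left as a parameter rather than hard-coded. Because the caller creates the connection, the caller can also inspect its headers before any reading starts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URLConnection;

// Hypothetical sketch, not HTMLParser code: reads everything from a
// caller-supplied connection into a buffer presized from its headers.
class ConnectionText {
    static String readAll(URLConnection uc, String charset) throws IOException {
        // -1 means the length was not reported; fall back to an assumed 16 KB.
        int length = uc.getContentLength();
        StringBuffer text = new StringBuffer(length > 0 ? length : 16 * 1024);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(uc.getInputStream(), charset))) {
            char[] chunk = new char[4096];
            int n;
            // Append in chunks; the buffer still grows if the hint was too small.
            while ((n = in.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        }
        return text.toString();
    }
}
```

A caller would do something like URLConnection uc = new URL(address).openConnection(); and then String page = ConnectionText.readAll(uc, "ISO-8859-1");. The same method works for a local file opened through a file: URL, which is one possible way to cover both cases with a single entry point.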