[Htmlparser-developer] Working out the length of the HTML page being downloaded
From: Sam J. <ga...@yh...> - 2002-12-13 07:32:35
Hi Somik,
Once again, apologies for barraging you with questions this week, but
I guess that's what open source is all about, eh?
When you wrote the NeuroGridHTMLParser for me a while back you added
functionality to support getting the full plain text of a page. I've
been busy modifying that and one of the things I'd really like to do is
initialize a StringBuffer to an appropriate size, so that the buffer
doesn't have to get resized while parsing the page.
My first thought is that I would like to get access to the HTTP headers
that would tell me the content length of the incoming HTML page, but
looking through the HTMLParser as is, it looks like I can't really
access those headers directly.
Is there some other mechanism to determine a document's length before
parsing starts, or could we put one in?
Naturally when reading from a file as opposed to a url one would call a
different underlying method, but it would seem plausible to have a
getDocumentSize() method. Or how about access to the underlying File or
URLConnection objects?
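
As a rough sketch of the sizing idea (the class DocumentSize, the method
capacityFor and the fallback value are hypothetical, not part of
HTMLParser): URLConnection.getContentLength() returns -1 when the server
sends no Content-Length header, and File.length() returns 0 for a missing
file, so a default is needed in both cases. Both report bytes rather than
characters, so the result is only an estimate of the text length.

```java
import java.io.File;
import java.net.URLConnection;

public class DocumentSize {
    // Hypothetical fallback used when the real size is unknown.
    static final int DEFAULT_CAPACITY = 16 * 1024;

    // Turn a reported byte length into a usable initial buffer capacity.
    static int capacityFor(long reportedLength) {
        if (reportedLength <= 0 || reportedLength > Integer.MAX_VALUE) {
            return DEFAULT_CAPACITY;
        }
        return (int) reportedLength;
    }

    // getContentLength() is -1 when the header is absent.
    static int capacityFor(URLConnection uc) {
        return capacityFor((long) uc.getContentLength());
    }

    // length() is 0L when the file does not exist.
    static int capacityFor(File file) {
        return capacityFor(file.length());
    }
}
```

The buffer would then be created with something like
new StringBuffer(DocumentSize.capacityFor(uc)) before parsing starts.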
I'm just thinking out loud, but maybe giving the user the ability to
pass in a URLConnection or File object would be best, as then the
user could get all the info they need. I guess some changes would be
required to support this, given that the HTMLParser currently opens
connections using this private method:
    private HTMLReader openURLConnection() throws HTMLParserException {
        try {
            // Its a web address
            resourceLocn=HTMLLinkProcessor.removeEscapeCharacters(resourceLocn);
            resourceLocn=checkEnding(resourceLocn);
            resourceLocn=HTMLLinkProcessor.fixSpaces(resourceLocn);
            URL url = new URL(resourceLocn);
            URLConnection uc = url.openConnection();
            return new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"8859_4")),resourceLocn);
        }
        catch (Exception e) {
            String msg="HTMLParser.openURLConnection() : Error in opening a URL connection to "+resourceLocn;
            HTMLParserException ex = new HTMLParserException(msg,e);
            feedback.error(msg,ex);
            throw ex;
        }
    }
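
One possible shape for the "pass in a URLConnection" variant is sketched
below. This is only an illustration under assumptions: ConnectionSketch,
openReader and readAll are hypothetical names, not existing HTMLParser
API, and a plain Reader stands in for HTMLReader. The point is that the
caller opens the connection, so the headers stay available for sizing the
buffer before any text is read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URLConnection;

public class ConnectionSketch {
    // Hypothetical entry point: the caller owns the connection and can
    // still query its headers (e.g. getContentLength()) afterwards.
    static Reader openReader(URLConnection uc, String charset) throws IOException {
        return new BufferedReader(new InputStreamReader(uc.getInputStream(), charset));
    }

    // Read everything into a buffer pre-sized from the caller's estimate.
    // A negative or tiny estimate falls back to StringBuffer's default of 16.
    static String readAll(Reader in, int expectedLength) throws IOException {
        StringBuffer text = new StringBuffer(Math.max(expectedLength, 16));
        char[] chunk = new char[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            text.append(chunk, 0, n);
        }
        return text.toString();
    }
}
```

A caller would open the connection itself, pass it to openReader, and
hand uc.getContentLength() to readAll as the size estimate.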
CHEERS> SAM