Thread: [Htmlparser-developer] Working out the length of the HTML page being downloaded

Hi Somik,

Once again, apologies for barraging you with questions this week, but
I guess that's what open source is all about, eh?

When you wrote the NeuroGridHTMLParser for me a while back you added
functionality to support getting the full plain text of a page. I've
been busy modifying that and one of the things I'd really like to do is
initialize a StringBuffer to an appropriate size, so that the buffer
doesn't have to get resized while parsing the page.

My first thought is that I would like to get access to the HTTP headers
that would tell me the content length of the incoming HTML page, but
looking through the HTMLParser as is, it looks like I can't really
access those headers directly.

Is there some other mechanism to determine a document's length before
parsing starts, or could we put one in?

Naturally, when reading from a file as opposed to a URL one would call a
different underlying method, but it would seem plausible to have a
getDocumentSize() method. Or how about access to the underlying File or
URLConnection objects?
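
To make that concrete, here is a rough sketch of the sort of thing I
mean. DocumentSizeSketch, documentSize() and newTextBuffer() are names
I've made up for illustration, not anything in HTMLParser today; the
only real calls are URLConnection.getContentLength() and File.length().
Bear in mind the size is in bytes of HTML, not characters of plain text,
so it is only a rough hint for the buffer, and getContentLength()
returns -1 when the server doesn't send the header.

```java
import java.io.File;
import java.net.URLConnection;

// Hypothetical helper, not part of HTMLParser: works out a size hint for
// a document and uses it to pre-size the plain-text buffer.
class DocumentSizeSketch {
    // Fallback capacity when the size is unknown (an arbitrary choice).
    static final int DEFAULT_CAPACITY = 4096;

    // Size in bytes from the Content-Length header, or -1 if not known.
    static int documentSize(URLConnection connection) {
        return connection.getContentLength();
    }

    // Size in bytes of a local file, or -1 if it is not a regular file
    // or is too large to express as an int.
    static int documentSize(File file) {
        long length = file.length();
        if (!file.isFile() || length > Integer.MAX_VALUE) {
            return -1;
        }
        return (int) length;
    }

    // Buffer sized from the hint, falling back to the default capacity
    // when the hint is missing or zero.
    static StringBuffer newTextBuffer(int documentSize) {
        int capacity = documentSize > 0 ? documentSize : DEFAULT_CAPACITY;
        return new StringBuffer(capacity);
    }
}
```

With something like this the parser could call
newTextBuffer(documentSize(uc)) before it starts collecting text, and
still behave sensibly when no length is available.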

I'm just thinking out loud here, but maybe giving the user the ability to
pass in a URLConnection or File object would be best, as then the
user could get all the info they need. I guess some changes would be
required to support this, given that the HTMLParser currently opens
connections using this private method:

private HTMLReader openURLConnection() throws HTMLParserException {
    try {
        // It's a web address: clean up the location before opening it
        resourceLocn = HTMLLinkProcessor.removeEscapeCharacters(resourceLocn);
        resourceLocn = checkEnding(resourceLocn);
        resourceLocn = HTMLLinkProcessor.fixSpaces(resourceLocn);
        URL url = new URL(resourceLocn);
        // The connection (and its headers) never leaves this method
        URLConnection uc = url.openConnection();
        return new HTMLReader(
            new BufferedReader(new InputStreamReader(uc.getInputStream(), "8859_4")),
            resourceLocn);
    }
    catch (Exception e) {
        String msg = "HTMLParser.openURLConnection() : Error in opening a URL connection to "
            + resourceLocn;
        HTMLParserException ex = new HTMLParserException(msg, e);
        feedback.error(msg, ex);
        throw ex;
    }
}
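
For comparison, here is a minimal sketch of the "pass in a
URLConnection" idea. SuppliedConnectionSketch, openReader() and
readAll() are hypothetical names of mine, and I've used a plain
BufferedReader rather than HTMLReader so the example stands alone. The
point is only that the caller opens the connection, so the caller can
look at whatever headers it likes before handing it over.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URLConnection;

// Hypothetical sketch, not HTMLParser code: the caller owns the
// connection and passes it in instead of the parser opening it privately.
class SuppliedConnectionSketch {
    // Wraps a connection the caller already opened; the caller keeps its
    // reference and so keeps access to the headers.
    static Reader openReader(URLConnection connection, String charsetName)
            throws IOException {
        return new BufferedReader(
            new InputStreamReader(connection.getInputStream(), charsetName));
    }

    // Reads everything into a buffer pre-sized from the caller's hint
    // (for example the connection's content length, if it was sent).
    static String readAll(Reader reader, int sizeHint) throws IOException {
        StringBuffer text = new StringBuffer(sizeHint > 0 ? sizeHint : 4096);
        char[] chunk = new char[1024];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            text.append(chunk, 0, n);
        }
        return text.toString();
    }
}
```

A caller would do url.openConnection(), read uc.getContentLength(), and
then pass both the connection and that number along.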

CHEERS> SAM
