Endless loop while scanning html page

Brought to you by: andyc2, mguillem

#2 Endless loop while scanning html page

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2009-08-24

Created: 2009-08-24

Creator: Daniela Dusak

Private: No

Hello all,

our company recently updated the htmlunit library and, with it, the nekohtml library. With the new version I have had problems with one particular web page, where HtmlScanner ran into an endless loop while trying to scan the page. Since it was only that one page and I couldn't find any real difference from the other pages, I spent quite a while debugging the HtmlScanner code. I finally came up with the simple solution of adding a check to the read() method so that it also reloads the buffer when fCurrentEntity.length is 0:

/** Reads a single character. */
protected int read() throws IOException {
    if (DEBUG_BUFFER) {
        System.out.print("(read: ");
        printBuffer();
        System.out.println();
    }

    if (fCurrentEntity.offset == fCurrentEntity.length || fCurrentEntity.length == 0) {
        if (load(0) == -1) {
            if (DEBUG_BUFFER) {
                System.out.println(")read: -> -1");
            }
            return -1;
        }
    }
    char c = fCurrentEntity.buffer[fCurrentEntity.offset++];
    fCurrentEntity.characterOffset++;
    fCurrentEntity.columnNumber++;
    if (DEBUG_BUFFER) {
        System.out.print(")read: ");
        printBuffer();
        System.out.print(" -> ");
        System.out.print(c);
        System.out.println();
    }
    return c;
} // read():int

That seems to fix the problem for me without breaking anything else.
I don't know if this list is the right place to post this, but I thought it might be useful in case somebody else runs into the same problem. Also, if any developers read this post, I hope they will review my fix and tell me whether it is correct.
I should probably mention that the scanner was trying to parse a comment at the end of the page, and while doing so (in scanComment) it encountered EOF, which then somehow got lost.
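To illustrate why the extra length check can matter, here is a small toy model, not the actual nekohtml code. It assumes (this is only a guess at the mechanism, I have not confirmed it) that after EOF the entity can end up with length 0 while offset is not 0. The class name ReadGuardSketch, the checkEmpty flag and the simulateLostEof() helper are all made up for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Toy model of a buffered character source; NOT the nekohtml implementation. */
public class ReadGuardSketch {
    private final Reader in;
    private final boolean checkEmpty; // true = condition with the extra length check
    final char[] buffer = new char[8];
    int offset;
    int length;

    ReadGuardSketch(String text, boolean checkEmpty) {
        this.in = new StringReader(text);
        this.checkEmpty = checkEmpty;
    }

    /** Refills the buffer from the start; returns -1 at end of input. */
    int load() throws IOException {
        int count = in.read(buffer, 0, buffer.length);
        offset = 0;
        length = Math.max(count, 0);
        return count;
    }

    /** Reads a single character, or -1 at end of input. */
    int read() throws IOException {
        boolean exhausted = offset == length || (checkEmpty && length == 0);
        if (exhausted && load() == -1) {
            return -1;
        }
        return buffer[offset++];
    }

    /** Hypothetical bad state: the buffer is empty but offset is not 0. */
    void simulateLostEof() {
        length = 0;
        offset = 1;
    }

    public static void main(String[] args) throws IOException {
        for (boolean patched : new boolean[] {false, true}) {
            ReadGuardSketch r = new ReadGuardSketch("ab", patched);
            r.read();
            r.read(); // both characters consumed, underlying reader is at EOF
            r.simulateLostEof();
            System.out.println((patched ? "patched" : "original") + " read() -> " + r.read());
        }
    }
}
```

In this model the original condition sees offset != length, skips the reload and hands back a stale character from the buffer, so the caller never sees -1. With the extra length == 0 test the reload happens and read() returns -1.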

Hope this helps.

Best Regards,

Daniela Dusak
