Hello all,
our company recently updated the htmlunit library and, with it, the nekohtml library. With the new version I had problems with one particular web page, where HtmlScanner ran into an endless loop while trying to scan it. Since only that one page was affected and I couldn't find any real difference from the other pages, I spent quite a while debugging the HtmlScanner code. I finally came up with a simple solution: an additional check in the read() method that also reloads the buffer when fCurrentEntity.length is 0.
    /** Reads a single character. */
    protected int read() throws IOException {
        if (DEBUG_BUFFER) {
            System.out.print("(read: ");
            printBuffer();
            System.out.println();
        }
        // Added: "|| fCurrentEntity.length == 0", so an empty buffer also triggers a reload.
        if (fCurrentEntity.offset == fCurrentEntity.length || fCurrentEntity.length == 0) {
            if (load(0) == -1) {
                if (DEBUG_BUFFER) {
                    System.out.println(")read: -> -1");
                }
                return -1;
            }
        }
        char c = fCurrentEntity.buffer[fCurrentEntity.offset++];
        fCurrentEntity.characterOffset++;
        fCurrentEntity.columnNumber++;
        if (DEBUG_BUFFER) {
            System.out.print(")read: ");
            printBuffer();
            System.out.print(" -> ");
            System.out.print(c);
            System.out.println();
        }
        return c;
    } // read():int
That seems to fix the problem for me without everything else going haywire.
I don't know if this list is the right place to post this, but I thought it might be useful in case somebody else runs into the same problem. Also, if any developers read this list, I hope they will look at my fix and tell me whether it is correct.
I should maybe mention that the scanner was trying to parse a comment at the end of the page, and while doing so (in scanComment) it encountered EOF, which then somehow got lost.
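To show what the extra check changes, here is a minimal standalone sketch. It is not NekoHTML code: the class ReadGuardSketch, its load() method and the demo() helper are hypothetical stand-ins, and the state it sets up (length is 0 while offset is not) is only my assumption about what the buffer might look like after the EOF was hit in scanComment. In that state the original condition (offset == length) is false, so read() would hand out a stale character instead of reporting the end of input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for the scanner's current entity; not NekoHTML code. */
class ReadGuardSketch {
    final char[] buffer = new char[8];
    int offset;              // index of the next character to hand out
    int length;              // number of valid characters in the buffer
    private final Reader in;

    ReadGuardSketch(Reader in) {
        this.in = in;
    }

    /** Refills the buffer from index 0; returns -1 at end of input. */
    int load() throws IOException {
        int count = in.read(buffer, 0, buffer.length);
        offset = 0;
        length = (count == -1) ? 0 : count;
        return count;
    }

    /** Reads one character; guardEmpty switches the extra "length == 0" check on. */
    int read(boolean guardEmpty) throws IOException {
        if (offset == length || (guardEmpty && length == 0)) {
            if (load() == -1) {
                return -1;
            }
        }
        return buffer[offset++];
    }

    /** Returns { result without the guard, result with the guard }. */
    static int[] demo() {
        try {
            ReadGuardSketch s = new ReadGuardSketch(new StringReader("ab"));
            // Drain the input: two characters, then EOF.
            while (s.read(true) != -1) { /* consume */ }

            // Assumed inconsistent state: buffer marked empty, offset not reset.
            s.offset = 1;
            s.length = 0;
            int withoutGuard = s.read(false); // stale character from the old buffer

            s.offset = 1;
            s.length = 0;
            int withGuard = s.read(true);     // reload is attempted, EOF is reported

            return new int[] { withoutGuard, withGuard };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("without guard: " + r[0] + ", with guard: " + r[1]);
    }
}
```

In this sketch the unguarded read returns a leftover character while the guarded one returns -1. Whether the real scanner reaches exactly this state is something a developer would have to confirm.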
Hope this helps.
Best Regards,
Daniela Dusak