[Htmlparser-user] Daily bugs ... and one little fix:)
From: R. <ced...@fr...> - 2002-07-17 10:37:51
When I parse this URL:

www.revues.org/calenda/articles/1083.html

parsing takes more than 40 seconds, so I looked for what was hurting performance.

My first fix works around the problem by preventing it from occurring. In HTMLReader.java:
------------------------------
    protected boolean readNextLine() {
        boolean skipLine = true;
        if (posInLine!=-1 && !(line != null && node.elementEnd()+1>=line.length())) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) != ' ') {
                    skipLine = false;
                    break;
                }
            }
        }
        return skipLine;
    }

Then I read the surrounding sources and realized it would be better to patch HTMLStringNode.java. The solution is to switch to state 1 when you reach the end of a string made up only of spaces.

    if (state==1) {
        text+=input.charAt(i);
    }
    //patch beginning here
    if (state==0 && i==input.length()-1) state=1;
    //patch ending here
    if (state==1 && i==input.length()-1) {
        input = reader.getNextLine();
        ///.....

I think the second solution is better. I hope this fix helps you, Somik, to patch the code in the next integration release.

Today I found another bug :)

http://www.cybergeo.presse.fr/sommaire/sisterra/ind15.htm

The final ">" is missing from the closing title tag:

    <TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE

=> null pointer exception

If I remember correctly, you have already fixed this problem for the IMG tag. I hope this one can be patched the same way.

Regards,
Cedric.
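
PS: To make the HTMLStringNode.java idea easier to follow, here is a minimal, self-contained sketch of the same state logic. It is only a simplified model, not the library's actual code: the class name WhitespaceLineSketch, the collectText method, the `patched` flag and the use of an Iterator in place of the reader are all assumptions made for illustration. State 0 means "still skipping leading spaces" and state 1 means "collecting text"; without the patch, a line containing only spaces leaves the scan in state 0 and the next line is never fetched.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical, simplified model of the text scan; not the actual HTMLStringNode code.
class WhitespaceLineSketch {

    // Collects text up to the first '<', reading further lines from 'reader' as needed.
    // 'patched' switches the proposed fix on or off so both behaviours can be compared.
    static String collectText(Iterator<String> reader, boolean patched) {
        String input = reader.hasNext() ? reader.next() : null;
        StringBuilder text = new StringBuilder();
        int state = 0; // 0 = skipping leading spaces, 1 = collecting text
        int i = 0;
        while (input != null && i < input.length()) {
            char c = input.charAt(i);
            if (c == '<') {
                break; // a tag starts here, so the text node ends
            }
            if (state == 0 && c != ' ') {
                state = 1;
            }
            if (state == 1) {
                text.append(c);
            }
            boolean atEnd = (i == input.length() - 1);
            // The patch: a line made only of spaces must still move to state 1,
            // otherwise the "fetch next line" branch below is never reached.
            if (patched && state == 0 && atEnd) {
                state = 1;
            }
            if (state == 1 && atEnd) {
                input = reader.hasNext() ? reader.next() : null;
                i = -1; // restart at the beginning of the new line
            }
            i++;
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String[] lines = {"   ", "hello <b>"};
        System.out.println("[" + collectText(Arrays.asList(lines).iterator(), false) + "]");
        System.out.println("[" + collectText(Arrays.asList(lines).iterator(), true) + "]");
    }
}
```

In this model the unpatched scan stops on the space-only line and returns an empty string, while the patched scan moves on to the following line and returns its text. A zero-length line is not covered by the sketch.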