[Htmlparser-user] Daily bugs ... and one little fix:)


When I parse this URL:
www.revues.org/calenda/articles/1083.html
parsing takes more than 40 seconds, so I went looking for what was 
hurting performance.

My first fix avoids the problem by skipping lines that contain only spaces.

In HTMLReader.java:
------------------------------
protected boolean readNextLine()
{
   // Assume the line can be skipped until a non-space character is found.
   boolean skipLine = true;
   if (posInLine!=-1 && !(line != null && node.elementEnd()+1>=line.length()))
   {
     for (int i = 0; i < line.length(); i++)
     {
       // One non-space character is enough to keep the line.
       if (line.charAt(i) != ' ')
       {
         skipLine = false;
         break;
       }
     }
   }
   return skipLine;
}

Then I read the surrounding sources and realized it would be a better idea 
to patch HTMLStringNode.java instead.
The solution is to switch to state 1 when you reach the end of a string of 
spaces.

if (state==1)
{
   text+=input.charAt(i);
}
//patch beginning here
if (state==0 && i==input.length()-1)
   state=1;
//patch ending here
if (state==1 && i==input.length()-1)
{
   input = reader.getNextLine();
///.....
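
To make the transition easier to see in isolation, here is a small toy model. It is only a sketch under assumptions, not the real HTMLStringNode code: I assume state 0 means "still skipping leading spaces" and state 1 means "collecting text", and the class name StateOneSketch, the method collect and the patched flag are all made up for the illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class StateOneSketch {
    // Toy model, NOT the real HTMLStringNode. Assumed meaning of the states:
    // 0 = skipping leading spaces, 1 = collecting characters.
    static String collect(List<String> lines, boolean patched) {
        Iterator<String> reader = lines.iterator();
        if (!reader.hasNext()) return "";
        String input = reader.next();
        StringBuilder text = new StringBuilder();
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state == 0 && c != ' ') state = 1;
            if (state == 1) text.append(c);
            // The transition proposed in the post: a run of spaces that
            // reaches the end of the line still moves the scanner to state 1.
            if (patched && state == 0 && i == input.length() - 1) state = 1;
            // In state 1, the end of the line pulls in the next line.
            if (state == 1 && i == input.length() - 1 && reader.hasNext()) {
                input = reader.next();
                i = -1; // the loop increment restarts at index 0
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("   ", "abc");
        System.out.println("[" + collect(lines, false) + "]");
        System.out.println("[" + collect(lines, true) + "]");
    }
}
```

In this toy model, a first line made only of spaces leaves the unpatched scanner in state 0, so it never moves on to the next line and returns an empty string. With the extra transition it continues and returns "abc". The real parser may behave differently in the unpatched case.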

I think the second solution is the better one. I hope this fix helps you, 
Somik, to patch the code for the next integration release.

Today I found another bug :)
http://www.cybergeo.presse.fr/sommaire/sisterra/ind15.htm
The final ">" is missing from the closing TITLE tag:
<TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE
=> null pointer exception
If I remember correctly, you have already fixed this problem for the IMG 
tag. I hope the same patch will work here.
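
As an illustration of the kind of guard I mean, here is a minimal sketch. It is not taken from the htmlparser sources: the class TagEndSketch and the method tagContents are hypothetical names, and I do not know where the real null pointer exception is thrown. The idea is only that a tag with no closing ">" is read up to the end of the line instead of producing null.

```java
class TagEndSketch {
    // Hypothetical helper, not part of htmlparser. 'start' is the index of
    // the '<' that opens the tag. Returns the text between '<' and '>'.
    static String tagContents(String line, int start) {
        int end = line.indexOf('>', start);
        if (end == -1) {
            // Unterminated tag: stop at the end of the line, never return null.
            end = line.length();
        }
        return line.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String line = "<TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE";
        System.out.println(tagContents(line, line.lastIndexOf('<')));
    }
}
```

With the line from the page above, the unterminated closing tag is read as "/TITLE", the same result as for a well-formed "</TITLE>".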

Regards,

Cedric.