htmlparser-developer Mailing List for HTML Parser (Page 29)

Brought to you by: derrickoswald

htmlparser-developer — The developer mailing list of the htmlparser project


[Htmlparser-developer] General Bug Behavior

From: Claude D. <CD...@ar...> - 2002-08-01 18:17:56

We've found three documents over the last few days that cause the
HTMLParser to hang. I will make sure they get into the bug database, but
the issue centers on what should happen when the parser encounters
ill-formed HTML. I would propose that the correct behavior is to throw
an exception if the parser is unable to handle the syntax, but right
now it just hangs. Clearly, more investigation is required to determine
whether it's in a loop or waiting on input. Since I'm not sure what
a fix would entail, I thought it worth raising the issue as a general
design question: what should be done when the parser encounters
malformed HTML that goes beyond the realm of reasonable recovery?

BTW: The documents we encountered that hung the parser had the following
artifacts:

1) Inclusion of the "<!-->" pattern, which is technically invalid comment
syntax.
2) Inclusion of the "<html><head><TITLE>" pattern twice at the beginning
of the document.
3) Two opening "<TITLE>" tags with only one ending "</TITLE>" tag.

From our point of view, a hang is devastating in that it does not allow
the application to move forward. An exception would be ideal in that it
would identify the problem without breaking the application.
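Until the parser itself can raise an exception, a caller can at least turn a hang into one by bounding the parse with a timeout. The sketch below is only an illustration under assumptions: it uses java.util.concurrent from a current JVM (which postdates this thread), and the names BoundedParse, runWithTimeout and ParseTimeoutException are hypothetical, not part of HTMLParser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedParse {
    /** Hypothetical exception type; not part of HTMLParser. */
    static final class ParseTimeoutException extends Exception {
        ParseTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs parseTask on a worker thread and gives up after timeoutMillis. */
    static <T> T runWithTimeout(Callable<T> parseTask, long timeoutMillis)
            throws ParseTimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-parse");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<T> result = executor.submit(parseTask);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a tight loop that never checks for
            // interruption will keep spinning in the background.
            result.cancel(true);
            throw new ParseTimeoutException("parse exceeded " + timeoutMillis + " ms", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

This only contains the damage: the application moves on, but the underlying loop still needs a real fix in the parser.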

Re: [Htmlparser-developer] Bug in parseParameters() - Kaarle, need your help

From: Kaarle K. <kaa...@ik...> - 2002-08-01 04:20:01

At 11:17 1.8.2002 +0900, you wrote:
>Dear Kaarle,
>
>I made the modification and I wrote one testcase for it and it looks like
>OK now.
>
>Wow - you're fast! All testcases are passing! Thanks a ton. By the way, your 
>parseParameters() method is really a key method in the parser - so I am 
>really interested in profiling it to see how we can optimize. It 
>will be great to collaborate on this. By the way, there are two flags that I 
>see - isApo and isAmp. I guess the former is to flag an apostrophe, but 
>what is the latter? Also, if I were to replace t and st with some names, 
>what would you suggest?

isApo waits for the next ' sign and
isAmp waits for the next " sign. I guess isAmp should be called something else
(isCitation?)

I guess t stands for temp. Perhaps it could be e.g. item.
st should perhaps be token, but then
the current token should be renamed to something like tokenSet.


>
>Quite a lot of changes in HTMLParser since I last time looked at it.
>I guess they have to do with all the bad html syntax there has been on the
>list lately.
>Oh yes, a lot of them are due to bug fixes, and some great suggestions 
>from the community. I have received some particularly fine suggestions 
>from Sam Joseph and Claude Duguay. Sam's idea of providing data extraction 
>methods like toHTML(), toPlainString(), took usability to the next level.
>
>Claude's suggestions, if implemented, will truly make this parser 
>professional :). That's next on our agenda.
>
>Once again - thanks so much for your quick action on this bug. By the way, 
>could you flag this bug as fixed on the htmlparser page with some comment, 
>for archiving purposes? (You are a developer, so you can log in and go to 
>the htmlparser bugs page from 
>http://htmlparser.sourceforge.net ).

OK. I wrote there something. Hope that was what you meant.

Kaarle


>
>Regards,
>Somik
>>----- Original Message -----
>>From: Kaarle Kaila <kaa...@ik...>
>>To: htm...@li...
>>
>>Sent: Thursday, August 01, 2002 4:35 AM
>>Subject: Re: [Htmlparser-developer] Bug in parseParameters() - Kaarle, 
>>need your help
>>
>>At 11:04 31.7.2002 +0900, you wrote:
>> >Hi Kaarle,
>> >     I am hoping you will have some time to help us on bug report 588885.
>> > You would have already got the mail from Bugzilla - there seems to be a
>> > bug in parseParameters() in dealing with spaces before =. I am wondering
>> > if I introduced this bug recently, or if this was always there.
>> >     Thanks in advance.
>> >
>>hi,
>>
>>Quite a lot of changes in HTMLParser since I last time looked at it.
>>I guess they have to do with all the bad html syntax there has been on the
>>list lately.
>>
>>I made the modification and I wrote one testcase for it and it looks like
>>OK now.
>>
>>regards
>>Kaarle
>>
>> >Cheers,
>> >Somik
>> >
>> >
>>
>>---------------------------------------------
>>Kaarle Kaila
>>http://www.iki.fi/kaila
>>mailto:kaa...@ik...
>>tel: +358 50 3725844
>>
>>
>>
>>
>>-------------------------------------------------------
>>This sf.net email is sponsored by: Dice - The leading online job board
>>for high-tech professionals. Search and apply for tech jobs today!
>>http://seeker.dice.com/seeker.epl?rel_code=31
>>_______________________________________________
>>Htmlparser-developer mailing list
>>Htm...@li...
>>https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>
>---------------------------------------------
>Kaarle Kaila
>http://www.iki.fi/kaila
>mailto:kaa...@ik...
>tel: +358 50 3725844

Re: [Htmlparser-developer] Bug in parseParameters() - Kaarle, need your help

From: Somik R. <so...@ya...> - 2002-08-01 02:24:29

Dear Kaarle,

I made the modification and I wrote one testcase for it and it looks like
OK now.

Wow - you're fast! All testcases are passing! Thanks a ton. By the way, your
parseParameters() method is really a key method in the parser - so I am
really interested in profiling it to see how we can optimize. It
will be great to collaborate on this. By the way, there are two flags that I
see - isApo and isAmp. I guess the former is to flag an apostrophe, but
what is the latter? Also, if I were to replace t and st with some
names, what would you suggest?

Quite a lot of changes in HTMLParser since I last time looked at it.
I guess they have to do with all the bad html syntax there has been on the
list lately.

Oh yes, a lot of them are due to bug fixes, and some great suggestions
from the community. I have received some particularly fine suggestions
from Sam Joseph and Claude Duguay. Sam's idea of providing data
extraction methods like toHTML(), toPlainString(), took usability to the
next level.

Claude's suggestions, if implemented, will truly make this parser
professional :). That's next on our agenda.

Once again - thanks so much for your quick action on this bug. By the way,
could you flag this bug as fixed on the htmlparser page with some
comment, for archiving purposes? (You are a developer, so you can log in
and go to the htmlparser bugs page from
http://htmlparser.sourceforge.net ).

Regards,
Somik
  ----- Original Message -----
  From: Kaarle Kaila
  To: htm...@li...
  Sent: Thursday, August 01, 2002 4:35 AM
  Subject: Re: [Htmlparser-developer] Bug in parseParameters() - Kaarle, need your help

  At 11:04 31.7.2002 +0900, you wrote:
  >Hi Kaarle,
  >     I am hoping you will have some time to help us on bug report 588885.
  > You would have already got the mail from Bugzilla - there seems to be a
  > bug in parseParameters() in dealing with spaces before =. I am wondering
  > if I introduced this bug recently, or if this was always there.
  >     Thanks in advance.
  >
  hi,

  Quite a lot of changes in HTMLParser since I last time looked at it.
  I guess they have to do with all the bad html syntax there has been on the
  list lately.

  I made the modification and I wrote one testcase for it and it looks like
  OK now.

  regards
  Kaarle

  >Cheers,
  >Somik
  >
  >

  ---------------------------------------------
  Kaarle Kaila
  http://www.iki.fi/kaila
  mailto:kaa...@ik...
  tel: +358 50 3725844


Re: [Htmlparser-developer] Bug in parseParameters() - Kaarle, need your help

From: Kaarle K. <kaa...@ik...> - 2002-07-31 19:38:11

At 11:04 31.7.2002 +0900, you wrote:
>Hi Kaarle,
>     I am hoping you will have some time to help us on bug report 588885. 
> You would have already got the mail from Bugzilla - there seems to be a 
> bug in parseParameters() in dealing with spaces before =. I am wondering 
> if I introduced this bug recently, or if this was always there.
>     Thanks in advance.
>
hi,

Quite a lot of changes in HTMLParser since I last time looked at it.
I guess they have to do with all the bad html syntax there has been on the 
list lately.

I made the modification and I wrote one testcase for it and it looks like 
OK now.

regards
Kaarle

>Cheers,
>Somik
>
>

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

[Htmlparser-developer] Bug in parseParameters() - Kaarle, need your help

From: Somik R. <so...@ya...> - 2002-07-31 02:11:24

Hi Kaarle,
    I am hoping you will have some time to help us on bug report 588885.
You would have already got the mail from Bugzilla - there seems to be a
bug in parseParameters() in dealing with spaces before =. I am
wondering if I introduced this bug recently, or if this was always
there.
    Thanks in advance.

Cheers,
Somik
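For readers wondering what "spaces before =" means in practice, here is a minimal sketch of an attribute reader that skips whitespace on both sides of the equals sign. It is not the project's parseParameters() method: the class and method names are hypothetical, the input is assumed to be only the attribute portion of a tag (no tag name, no closing ">"), and the two flags are named inSingleQuote and inDoubleQuote purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AttributeSketch {
    /** Parses "name = value" pairs, tolerating whitespace around '='. */
    static Map<String, String> parseAttributes(String s) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int i = 0;
        int n = s.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;
            int nameStart = i;
            while (i < n && !Character.isWhitespace(s.charAt(i)) && s.charAt(i) != '=') i++;
            if (i == nameStart) {
                if (i < n) i++; // stray '=' with no name: skip it so the loop always advances
                continue;
            }
            String name = s.substring(nameStart, i);
            while (i < n && Character.isWhitespace(s.charAt(i))) i++; // spaces before '='
            if (i >= n || s.charAt(i) != '=') {
                attrs.put(name, ""); // attribute without a value
                continue;
            }
            i++; // consume '='
            while (i < n && Character.isWhitespace(s.charAt(i))) i++; // spaces after '='
            if (i >= n) {
                attrs.put(name, "");
                break;
            }
            char c = s.charAt(i);
            boolean inSingleQuote = c == '\'';
            boolean inDoubleQuote = c == '"';
            if (inSingleQuote || inDoubleQuote) {
                int valueStart = ++i;
                while (i < n && s.charAt(i) != c) i++; // read up to the matching quote
                attrs.put(name, s.substring(valueStart, i));
                if (i < n) i++; // consume the closing quote
            } else {
                int valueStart = i;
                while (i < n && !Character.isWhitespace(s.charAt(i))) i++;
                attrs.put(name, s.substring(valueStart, i));
            }
        }
        return attrs;
    }
}
```

Every branch either consumes at least one character or leaves the loop, so malformed input cannot stall it.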

Re: [Htmlparser-developer] Bug Report

From: Somik R. <so...@ya...> - 2002-07-31 00:53:37

Hi Claude,
    I will take a look at this as soon as I get some time. One request
-- could you open a bug report from http://htmlparser.sourceforge.net

Cheers,
Somik
  ----- Original Message -----
  From: Claude Duguay
  To: htm...@li...
  Sent: Wednesday, July 31, 2002 5:31 AM
  Subject: [Htmlparser-developer] Bug Report

  We've found a number of documents, from the same site, that use a
  convention in the source documents that the browsers seem to deal with
  well enough but that HTMLParser hangs on. While it's arguable this is
  not valid HTML these documents should probably not cause hanging
  behavior.

  The "<!-->" sequence (not including quotes) is apparently at fault. If
  the parser recognized and ignored these, I think this would help. I've
  attached a document that causes this hanging behavior.

[Htmlparser-developer] Bug Report

From: Claude D. <CD...@ar...> - 2002-07-30 20:31:17

Attachments: tech_chat_archives.html

We've found a number of documents, from the same site, that use a
convention in the source documents that the browsers seem to deal with
well enough but that HTMLParser hangs on. While it's arguable this is
not valid HTML these documents should probably not cause hanging
behavior.

The "<!-->" sequence (not including quotes) is apparently at fault. If
the parser recognized and ignored these, I think this would help. I've
attached a document that causes this hanging behavior.
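One way to recognize the sequence is to treat "<!-->" (and "<!--->") as an empty comment, which is also how the tokenizer in the current HTML standard recovers from it. The following is a sketch only, with hypothetical names; it is not the HTMLRemarkNode code. Throwing on an unterminated comment is an assumption chosen so that the method can never loop.

```java
final class CommentSketch {
    /**
     * Expects a comment opener "<!--" at index start and returns the index
     * just past the end of the comment. Throws rather than looping when the
     * comment is never closed.
     */
    static int skipComment(String s, int start) {
        if (!s.startsWith("<!--", start)) {
            throw new IllegalArgumentException("no comment at index " + start);
        }
        int body = start + 4;
        // "<!-->" and "<!--->" are treated as empty comments.
        if (s.startsWith(">", body)) return body + 1;
        if (s.startsWith("->", body)) return body + 2;
        int end = s.indexOf("-->", body);
        if (end < 0) {
            throw new IllegalStateException("unterminated comment starting at index " + start);
        }
        return end + 3;
    }
}
```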

[Htmlparser-developer] Integration Release 1.2-2002_07_28 is out

From: Somik R. <so...@ya...> - 2002-07-28 07:26:51

Hi Folks,
    This week's integration release is out - 1.2-2002_07_28.

    This contains some major bug fixes. They are :
[1] Fixed bug in HTMLParser.openConnection(), mistaking files for urls if
they contain "http" or "www" anywhere.
[2] Updated HTMLEndTag, this was accidentally left out in the previous
release.
[3] Fixed Bug 586062 - relative links bug - if first char is a slash, then
the subdirectories of the url need to be ignored.
[4] Fixed Bug 586222 - HTMLRemarkNode bug - if a line with a remark node
contains a string before it, the string is ignored.
[5] Fixed major bug - allowing auto-correction of malformed tags. Current
code is very robust. Fix allowed removal of strictness vector concept,
making the design simpler.
[6] Fixed bug 586756 - in HTMLRemarkNode, if there are empty lines only, the
finite state machine would crash
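
The rule behind item [3] is the standard one for resolving links: a link that starts with a slash replaces the whole path of the page URL, while any other relative link is resolved against the page's directory. A small sketch using java.net.URI from the JDK shows the difference; the class name and the example.com URLs are made up for illustration, and this is not the code used in the release.

```java
import java.net.URI;

final class RelativeLinkSketch {
    /** Resolves a link found in a page against that page's URL. */
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }
}
```

With the page http://example.com/docs/guide/index.html, the link /images/logo.png resolves to http://example.com/images/logo.png, whereas images/logo.png resolves to http://example.com/docs/guide/images/logo.png.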

My thanks to John Zook and Cedric Rosa for bug reports and suggestions.
By the way, the strictness vector concept has been removed as I mentioned in
point [5] - this is probably the most important fix in this release. The
parser now begins to show some intelligence - it can auto-correct tags and
put inverted commas at the right places. All test cases are passing, and I
have put in an intensive amount of testing.

Tags like :
[1] <Meta name="sdsd" value="sdsds"">
[2] <Meta name="sdsd" value="sdsd"sds">
[3] <Meta name="sadd" value="sdsd " sdsd  sds ">

can be handled now. In cases 2 and 3, the parser corrects them to
<Meta name="sdsd" value="sdsdsds"> and
<Meta name="sadd" value="sdsd  sdsd  sds "> respectively.

We can also handle tags of a fourth kind :
[4] <crazy tag="</I>" dfkdlkfld=dfdf>

The criterion now is: if there is a begin tag within the inverted commas,
then we shall expect an end tag, and not think it's an error. This is a
fundamental change in the parsing automaton in HTMLTag.java.

Regards,
Somik

[Htmlparser-developer] Need Advice - peculiar bug

From: Somik R. <so...@ya...> - 2002-07-25 15:54:10

Hi Folks,
    As you know, Cedric Rosa has been giving some nice bug reports - and
I was working on those today. A major problem has been solved - we can
now parse tags which are incorrectly ended, and those that have tags in
inverted commas in them.
    However, one problem remains. Although parsing doesn't crash, when
the tag was incorrectly ended, I am removing all the inverted commas
from it, simply because I don't know which inverted comma was the wrong one.
e.g.
<Meta name="sdsd" value="sdsds"">
<Meta name="sdsd" value="sdsd"sds">
<Meta name="sadd" value="sdsd " sdsd  sds ">

This leads to complications. In the third case, if all inverted commas
are removed, parseParameters cannot pick up the entire string in value
(because of the spaces). The parser needs to be intelligent enough to
know which inverted comma was the erroneous one - the same way as we can
tell.

Can anyone suggest some logic that we could formulate to do this?

Regards,
Somik
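
One possible heuristic, offered only as a sketch: treat an inverted comma as the real closing one only if what follows it looks like the rest of a tag, that is, the end of the tag or the start of another name=value pair. Any other inverted comma is assumed to be stray and is dropped. The class and method names are hypothetical, and this is not the logic that ended up in HTMLTag.java. It has a known weakness: a valueless attribute right after the value (such as checked) would make the genuine closing quote look stray.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteRepair {
    // What may legitimately follow a closing quote: the end of the tag,
    // or the beginning of another name=value pair.
    private static final Pattern AFTER_CLOSE =
            Pattern.compile("\\s*(/?>|[A-Za-z][\\w:-]*\\s*=)");

    /**
     * Reads a double-quoted attribute value whose opening quote is at index
     * open, dropping quotes that do not look like the closing one.
     */
    static String readQuotedValue(String tag, int open) {
        StringBuilder value = new StringBuilder();
        for (int i = open + 1; i < tag.length(); i++) {
            char c = tag.charAt(i);
            if (c != '"') {
                value.append(c);
                continue;
            }
            Matcher rest = AFTER_CLOSE.matcher(tag).region(i + 1, tag.length());
            if (i + 1 == tag.length() || rest.lookingAt()) {
                return value.toString(); // genuine closing quote
            }
            // Otherwise the quote is stray: skip it and keep reading.
        }
        return value.toString(); // no closing quote found; return what was read
    }
}
```

Applied to the value attribute of the three tags above, this keeps the spaces in the third case because the middle inverted comma is followed by plain text rather than by ">" or a name and "=".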

Re: [Htmlparser-developer] Bug found

From: Somik R. <so...@ya...> - 2002-07-23 23:05:20

Hi Cedric,
    This is related to the bug fix done earlier, whereby META was added
to the strictness list, so META tags need to be well formed. If I take it
out, this will work but the previous bug will reappear when parsing weird
meta tags like:
<META NAME="Description" CONTENT="Ethnoburb </I>versus Chinatown: Two Types
of Urban Ethnic Communities in Los Angeles">

Though it might be possible to effect a fix for both.... got to think
harder..
By the way, can you please open bug reports for both the bugs in the site
database - it will then be easy for us to track the bugs and refer to them
with links to the site.

Thanks again for your great testing!

Cheers,
Somik
****************************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo,
Shibuya-ku, Tokyo,
150-0012,
JAPAN
Tel  :  +81-3-54752646
Fax : +81-3-5449-4870
Website : www.kizna.com
Mail : so...@ki...
***********************************************************************************
C makes it easy to shoot yourself in the foot. C++ makes it harder, but
when you do, it blows away your whole leg.
- Bjarne Stroustrup
***********************************************************************************

----- Original Message -----
From: "Cédric Rosa" <ced...@fr...>
To: <htm...@li...>
Sent: Wednesday, July 24, 2002 12:48 AM
Subject: [Htmlparser-developer] Bug found


> Hello, I've just tried the new integration release and here are the daily
> bugs found:
>
> I think this bug is new and comes from our weekly fixes:
> When a "meta tag" contains an odd number of " characters, the program goes
> into an infinite loop.
>
> Examples on these pages:
> www.cybergeo.presse.fr/actualit/nouvparu/crendus/doriercr2.htm
> www.cybergeo.presse.fr/culture/vendina/vendina.htm
> www.cybergeo.presse.fr/REVGEO/ttsavoir/joly.htm
>
> Regards,
>
> Cedric.
>
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

[Htmlparser-developer] Bug found

From: R. <ced...@fr...> - 2002-07-23 15:48:46

Hello, I've just tried the new integration release and here are the daily
bugs found:

I think this bug is new and comes from our weekly fixes:
When a "meta tag" contains an odd number of " characters, the program goes
into an infinite loop.

Examples on these pages:
www.cybergeo.presse.fr/actualit/nouvparu/crendus/doriercr2.htm
www.cybergeo.presse.fr/culture/vendina/vendina.htm
www.cybergeo.presse.fr/REVGEO/ttsavoir/joly.htm

Regards,

Cedric.

[Htmlparser-developer] Integration Release 1.2-2002_07_21 is out

From: Somik R. <so...@ya...> - 2002-07-21 06:14:23

Hi Folks,
    Integration Release 1.2-2002_07_21 is out. You can get it from
http://htmlparser.sourceforge.net. This release contains four bug fixes
- thanks a lot to Cedric Rosa for contributing the bug reports and some
of the fixes.

    As an aside, I had been very busy with the open sourcing of another
project - a synchronization collaboration server. Just in case folks are
interested, check www.kizna.org - this is a commercial grade server -
which we've used to build real-time applications (like Auctions, Chats,
games, etc..). It has a very simple API - not requiring any knowledge of
protocols, etc. And there is support too :)

    I am hoping to release some more apps - like a distributed pair
programming plugin over Eclipse which might be of interest to those who
believe in XP.

Regards,
Somik
**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo,
Shibuya-ku, Tokyo,
150-0012, JAPAN
Phone : +81-3-5475-2646
Fax     : +81-3-3445-9089
Web   : http://www.kizna.com
Mail    : so...@ki...
**********************************

Re: [Htmlparser-developer] Malformed End Tag [Was:Daily bugs ... and one little fix:)]

From: Somik R. <so...@ya...> - 2002-07-20 01:54:10

Hi Cedric,
    This is fixed. You can check out the latest parser from CVS.
    Or wait till tomorrow, I will make the next integration release.
    Thank you for your good work on the bug reports. By the way, I would be
glad to give you CVS access so you can directly check in bug fixes. Send
me your sourceforge id.

Cheers,
Somik
----- Original Message -----
  From: Somik Raha
  To: htm...@li...
  Sent: Friday, July 19, 2002 5:48 PM
  Subject: [Htmlparser-developer] Malformed End Tag [Was:Daily bugs ... and one little fix:)]


  Hi Cedric,

  Today, I've found another bug :)
  http://www.cybergeo.presse.fr/sommaire/sisterra/ind15.htm
  The last ">" is missing in the title markup.
  <TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE
  => null pointer exception
  If I remember correctly, you have already fixed this problem for the IMG
  markup. I hope this patch will be similar.

  I think this is a different bug altogether - probably in HTMLEndTag. Will
  try fixing it soon.
  Thanks for finding this.

  Regards,
  Somik

[Htmlparser-developer] Fixed HTMLTagScanner bug [Was: euh ... another fix]

From: Somik R. <so...@ya...> - 2002-07-19 09:04:17

Hi Cedric
    Thanks yet again for a good bug report. This fix has been
incorporated. As of now, only one test case fails - that of the unended
title tag. I've got to duplicate some stuff in HTMLTag into HTMLEndTag.

Regards,
Somik
  ----- Original Message -----
  From: Cédric Rosa
  To: htm...@li...
  Sent: Wednesday, July 17, 2002 8:10 PM
  Subject: [Htmlparser-user] euh ... another fix


  Hello,

  To test with: www.revues.org/calenda/articles/1379.html
  ... <br>  <>PROGRAMME</b><br> ..
  => String Index out of range : 0

  In HTMLTagScanner.java:
  -------------------------------------
  public static String absorbLeadingBlanks(String s)
  {
     String temp = new String(s);
     // Here we add a check for "temp.length()!=0" to prevent a bug with
     // empty markup.
     while (temp.length()!=0 && temp.charAt(0)==' ')
     {
       temp = temp.substring(1,temp.length());
     }
     return temp;
  }

  I know my bug reports and fixes may not seem very useful (because these bugs
  almost never happen), but they contribute to the software's stability. I
  hope my contributions help you.

  Regards,

  Cedric.
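
The same guard can be written without creating a new string on every iteration, by counting the leading blanks first and taking a single substring. This is an equivalent sketch under a hypothetical class name, not the version committed to HTMLTagScanner.java.

```java
final class LeadingBlanks {
    /** Returns s without its leading space characters; safe for empty input. */
    static String absorbLeadingBlanks(String s) {
        int i = 0;
        // The length check comes first, so charAt is never called on an
        // empty or all-blank string past its end.
        while (i < s.length() && s.charAt(i) == ' ') {
            i++;
        }
        return s.substring(i);
    }
}
```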



  -------------------------------------------------------
  This sf.net email is sponsored by:ThinkGeek
  Welcome to geek heaven.
  http://thinkgeek.com/sf
  _______________________________________________
  Htmlparser-user mailing list
  Htm...@li...
  https://lists.sourceforge.net/lists/listinfo/htmlparser-user

[Htmlparser-developer] Malformed End Tag [Was:Daily bugs ... and one little fix:)]

From: Somik R. <so...@ya...> - 2002-07-19 08:55:03

Hi Cedric,

Today, I've found another bug :)
http://www.cybergeo.presse.fr/sommaire/sisterra/ind15.htm
The last ">" is missing in the title markup.
<TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE
=> null pointer exception
If I remember correctly, you have already fixed this problem for the IMG
markup. I hope this patch will be similar.

I think this is a different bug altogether - probably in HTMLEndTag. Will
try fixing it soon.
Thanks for finding this.

Regards,
Somik

[Htmlparser-developer] Re: Daily bugs ... and one little fix:)

From: Somik R. <so...@ya...> - 2002-07-19 08:44:48

When I parse this url:
www.revues.org/calenda/articles/1083.html
Parsing this file takes more than 40 seconds, so I searched for the
problem that reduces performance.

First, I tried to fix this problem by preventing it from appearing.

In HTMLReader.java:
------------------------------
protected boolean readNextLine()
{
   // The line is skipped unless a non-space character is found in it.
   boolean skipLine = true;
   if (posInLine!=-1 && !(line != null && node.elementEnd()+1>=line.length()))
   {
     for (int i = 0; i < line.length(); i++)
     {
       if (line.charAt(i) != ' ')
       {
         skipLine = false;
         break;
       }
     }
   }
   return skipLine;
}

Then I read the surrounding sources and noticed it would be a better idea to
patch HTMLStringNode.java.
The solution is to go to state 1 when you reach the end of a string of
spaces.

if (state==1)
{
   text+=input.charAt(i);
}
// patch beginning here
if (state==0 && i==input.length()-1)
   state=1;
// patch ending here
if (state==1 && i==input.length()-1)
{
   input = reader.getNextLine();
///.....

I think the second solution is better. I hope this fix will help you, Somik,
to patch the code in the next integration release.

This fix is incorporated. Thanks. I've written a test case to trap this
bug.

Regards,
Somik

[Htmlparser-developer] Re: Another bug

From: Somik R. <so...@ya...> - 2002-07-19 08:10:33

Hi Cedric,
    This was a very good bug report. This turned out to be a deep bug -
but easy to fix. HTMLParser does auto correction of tags when inverted
commas are not provided. However, this can conflict with certain tags
where they are provided. So to provide some intelligence into the
parser - there is this feature of "strictness".

    This allows you to tell the parser when to be strict and when not to
be. This makes sense in situations when you know that the html coder
would not make a mistake, and if he does, browsers like IE would crash.
Examples of such tags would be INPUT - for applets, if you are providing
complex params, they must be within inverted commas or it confuses the
browser. I have added the META tag also to this strictness list.

    Also, there was an issue with HTMLTag.java itself related to this
report.
    Thank you very much for this bug report - you can try the
StringExtractor on the url you gave, the entire text comes out cleanly.
(Check out from CVS and build, or wait for the next release)

Cheers,
Somik
  ----- Original Message -----
  From: Cédric Rosa
  To: htm...@li...
  Sent: Tuesday, July 16, 2002 7:38 PM
  Subject: [Htmlparser-user] Another bug


  Hi,

  When I parse this url: www.cybergeo.presse.fr\culture\weili\weili.htm
  no text is found.

  With my daily bug reports, you might think that I want to break your
  software lol ... excuse me for testing with "space" url :)

  Cedric.



  -------------------------------------------------------
  This sf.net email is sponsored by: Jabber - The world's fastest growing
  real-time communications platform! Don't just IM. Build it in!
  http://www.jabber.com/osdn/xim
  _______________________________________________
  Htmlparser-user mailing list
  Htm...@li...
  https://lists.sourceforge.net/lists/listinfo/htmlparser-user

RE: [Htmlparser-developer] Re: Final Statistics from Trek Run

From: Claude D. <CD...@ar...> - 2002-07-12 04:08:33

You may need to have your unit tests cover a larger set. I've often found
the JavaDoc set useful for smaller tests. There are about 8000 documents in
there with a variety of sizes, though they are not necessarily
representative of the larger ecology of the Internet. The real trick is to
put a threshold on the unit test that flags you if you ever make a change
that slows things down, at which point you can evaluate whether the
tradeoff between a new feature or refactoring choice is worth the
performance hit.

You've done a pretty exceptional job and should be proud of the work you've
done. Personally, I couldn't be more pleased that our product is 15%+
faster thanks to your design and implementation. Thanks!

  -----Original Message-----
  From: Somik Raha [mailto:so...@ya...]
  Sent: Thu 7/11/2002 7:23 PM
  To: htm...@li...
  Cc:
  Subject: Re: [Htmlparser-developer] Re: Final Statistics from Trek Run

  > The 1.2 numbers are based on the 0707 build.

  Ok, I will profile some more and try to remove any other bottlenecks. I was
  also thinking of making a head scanner. That would allow me to remove the
  title and meta scanners from the registered list, and add them only when
  they are really needed (on encountering the head tag).

  Regards,
  Somik

Re: [Htmlparser-developer] Re: Final Statistics from Trek Run

From: Somik R. <so...@ya...> - 2002-07-12 02:23:47

> The 1.2 numbers are based on the 0707 build.

Ok, I will profile some more and try to remove any other bottlenecks. I was
also thinking of making a head scanner. That would allow me to remove the
title and meta scanners from the registered list, and add them only when
they are really needed (on encountering the head tag).

Regards,
Somik
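
The head scanner idea can be pictured as a registry that activates the title and meta scanners only while the parser is inside the head element. The sketch below is a toy model under assumptions: the class, its method and the use of plain tag names in place of scanner objects are all hypothetical, and it does not reflect the parser's actual scanner registration API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class HeadScannerSketch {
    // Names of the tags that currently have a scanner registered.
    private final Set<String> active = new HashSet<>();

    HeadScannerSketch() {
        active.add("HEAD"); // the head scanner itself is always registered
    }

    /** Returns true if a scanner is registered for the tag at this point. */
    boolean scan(String tagName) {
        String name = tagName.toUpperCase(Locale.ROOT);
        if (name.equals("/HEAD")) {
            // Leaving the head: title and meta scanners are no longer needed.
            active.remove("TITLE");
            active.remove("META");
            return true;
        }
        if (!active.contains(name)) {
            return false;
        }
        if (name.equals("HEAD")) {
            // Entering the head: register the scanners that only apply here.
            active.add("TITLE");
            active.add("META");
        }
        return true;
    }
}
```

The saving would come from body tags being checked against a shorter list of registered scanners.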

RE: [Htmlparser-developer] Re: Final Statistics from Trek Run

From: Claude D. <CD...@ar...> - 2002-07-12 02:05:25

The 1.2 numbers are based on the 0707 build.

  -----Original Message-----
  From: Somik Raha [mailto:so...@ya...]
  Sent: Thu 7/11/2002 3:44 PM
  To: htm...@li...; htm...@li...
  Cc:
  Subject: [Htmlparser-developer] Re: Final Statistics from Trek Run


  Hi Claude

  Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
  Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
  Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

  Which ver of 1.2 is this (is it the latest) ? The previous one had serious
  issues with string allocations, but the latest ought to be faster for
  bigger files than 1.1.

  Regards,
  Somik

[Htmlparser-developer] Re: Final Statistics from Trek Run

From: Somik R. <so...@ya...> - 2002-07-12 01:06:02

Hi Claude

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)


Which ver of 1.2 is this (is it the latest)? The previous one had
serious issues with string allocations, but the latest ought to be
faster for bigger files than 1.1.

Regards,
Somik
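
Worked out from the figures above, the 1.2 run is about 3.9% slower than 1.1 in docs/sec (3.1665058 / 3.2943877 is roughly 0.961), and both runs are roughly 34 to 35 times faster than Swing. A check of this kind could be automated along the following lines. The class name, method names and tolerance values are assumptions for illustration, not part of the project's test suite.

```java
final class ThroughputCheck {
    /** Relative change in throughput; negative means the current run is slower. */
    static double relativeChange(double baselineDocsPerSec, double currentDocsPerSec) {
        return (currentDocsPerSec - baselineDocsPerSec) / baselineDocsPerSec;
    }

    /** True if the current run is slower than the baseline by more than tolerance. */
    static boolean isRegression(double baselineDocsPerSec, double currentDocsPerSec,
                                double tolerance) {
        return relativeChange(baselineDocsPerSec, currentDocsPerSec) < -tolerance;
    }
}
```

With a 5% tolerance the 1.2 figure would pass; with a 2% tolerance it would be flagged for a closer look.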

[Htmlparser-developer] Re: Final Statistics from Trek Run

From: Somik R. <so...@ya...> - 2002-07-11 22:42:37

MessageThe SWT is not a contender for replacing Swing. It may be an =
alternative, applicable in many circumstaces, but a quick look at the =
Sun's Swing connection should dissuade you from assuming that few people =
are using Swing.=20

LOL! I was asking for trouble with that comment :). I guess its just me =
that finds Swing unbearably slow.

I would not endorse trying to make HTMLParser Swing-compatible. These =
are different animals and should stay that way. The notion of providing =
a SAX-like interface is interesting but you should look instead toward =
XML pull-parsers, which are the high-performance alternatives now =
surfacing more widely. There is a JSR =
(http://www.jcp.org/jsr/detail/173.jsp) that is trying to unify a good =
interface for pull-parsing (they're calling it a Streaming API). You'll =
find this link especially intersting (http://www.xmlpull.org/).

I will look into this advice seriously (will start by educating myself =
on XML Pull-parsers).=20

HTMLParser has two fundamental strengths. 1) It's easy to use and
extend. 2) It's lightning fast.
Don't lose sight of these distinctions. The whole XML community is
struggling to achieve these goals and hasn't quite gotten there yet.
There's much to learn from XML, but they are largely moving in this
direction.

It's interesting that this should come up - the other day someone was
asking me whether the HTMLParser might be used for parsing XML.

BTW: JTidy is a serious performance bottleneck in a high-performance
application.

Good to know that :), haven't checked it out myself yet.
It's great to have a knowledgeable person like you join this parser
community. It will be of great value in taking the final steps towards
stabilizing the API of the parser. The next integration releases will
focus on incorporating your suggestions regarding exception
handling. The first week of Sep might be a realistic date for the
release of 1.2 (unless I get loads of time or help).

Regards,
Somik

  ----- Original Message -----
  From: Claude Duguay
  To: htm...@li...
  Sent: Friday, July 12, 2002 1:29 AM
  Subject: RE: [Htmlparser-user] Final Statistics from Trek Run


  The SWT is not a contender for replacing Swing. It may be an
  alternative, applicable in many circumstances, but a quick look at
  Sun's Swing Connection should dissuade you from assuming that few
  people are using Swing.

  HTMLParser has two fundamental strengths. 1) It's easy to use and
  extend. 2) It's lightning fast.

  Don't lose sight of these distinctions. The whole XML community is
  struggling to achieve these goals and hasn't quite gotten there yet.
  There's much to learn from XML, but they are largely moving in this
  direction.

  BTW: JTidy is a serious performance bottleneck in a high-performance
  application.

  -----Original Message-----
  From: Somik Raha [mailto:so...@ya...]
  Sent: Thursday, July 11, 2002 2:25 AM
  To: htm...@li...
  Subject: Re: [Htmlparser-user] Final Statistics from Trek Run


    Hi Craig,
    For example, the renderer built into Swing's JEditorPane expects
    callbacks resulting from well-formed HTML with certain (sometimes
    arbitrary) characteristics. (For example, a
    <head><title>X</title></head> section must exist, and X cannot be
    null).
    It is possible that the formatting of the input HTML into a
    structure with these characteristics reduces the parser's
    performance in order to produce a better render.

    Indeed - perhaps a good idea would be to rewrite JEditorPane :) -
    make an open source version, which is better designed. Swing
    compatibility is a real pain - we gave up on that not so far back :).
    On the other hand, I was thinking that SAX compliance would be
    feasible and worth it - I doubt many people are considering Swing
    for graphics these days, especially with the SWT being out there.
    But the SAX mechanism is quite popular and it's worth being able to
    just switch parsers.

    Of course, whether you need to take these considerations into
    account depends entirely on your application. The htmlparser seems
    to lean more toward the extraction of information rather than its
    representation, and the latter is so fraught with ambiguities as to
    make it a task of a different order altogether.

    So true. As you mailed some time back, JTidy does a good job of
    that.

    Regards,
    Somik
      ----- Original Message -----
      From: Craig Raw
      To: htm...@li...
      Sent: Thursday, July 11, 2002 5:35 PM
      Subject: [Htmlparser-user] RE: [Htmlparser-developer] Final
      Statistics from Trek Run


      Just a point to notice on these tests. The htmlparser, for all its
      merits, is not a direct functional replacement for the Swing
      parser.

      For example, the renderer built into Swing's JEditorPane expects
      callbacks resulting from well-formed HTML with certain (sometimes
      arbitrary) characteristics. (For example, a
      <head><title>X</title></head> section must exist, and X cannot be
      null).
      It is possible that the formatting of the input HTML into a
      structure with these characteristics reduces the parser's
      performance in order to produce a better render.

      Of course, whether you need to take these considerations into
      account depends entirely on your application. The htmlparser seems
      to lean more toward the extraction of information rather than its
      representation, and the latter is so fraught with ambiguities as
      to make it a task of a different order altogether.

      -craig

      -----Original Message-----
      From: htm...@li...
      [mailto:htm...@li...] On Behalf Of
      Somik Raha
      Sent: 11 July 2002 02:19 AM
      To: htm...@li...;
      htm...@li...
      Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final
      Statistics from Trek Run

      Hi Claude,
      Thanks a ton for all these tests. Do you think you could write an
      article on this that we could put up ?

      Regards
      Somik




RE: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

From: Claude D. <CD...@ar...> - 2002-07-11 16:17:46

We're not quite done yet... ;-)

Here are some numbers that reflect the differences with the larger
files. This set is 57,952 files (6,256,488,243 bytes), many of which are
several megabyte log file dumps to HTML (average file size for this set
is 107,959 bytes). These are especially problematic for the Swing
parser:

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Note that this run was done on a single box with no other parallel runs.
Also, there was a variance of about 1000 files between runs that is
reflected in the speed numbers. But I provided the average in the
paragraph above, so you will not get exact results from recalculating
from those numbers. Still, everything needs to be looked at in
perspective.
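
The relationship between the minutes and docs/sec figures above can be
checked with a few lines of arithmetic. This is only a sketch; the class
and method names are hypothetical. It assumes docs/sec is simply documents
divided by elapsed seconds, which reproduces the Swing figure from the
57,952-file count and suggests the two HTMLParser runs saw somewhat more
files, in line with the variance described above.

```java
// Hypothetical helper for checking the figures quoted in this thread.
class ThroughputSketch {

    // Documents per second for a run that took the given number of minutes.
    static double docsPerSecond(long documents, double minutes) {
        return documents / (minutes * 60.0);
    }

    // Inverse: how many documents a reported rate implies for a given run time.
    static long impliedDocuments(double docsPerSecond, double minutes) {
        return Math.round(docsPerSecond * minutes * 60.0);
    }

    public static void main(String[] args) {
        // Swing run: file count and minutes as reported in the message.
        System.out.printf("Swing: %.7f docs/sec%n", docsPerSecond(57_952, 10_305));
        // HTMLParser runs: back out the document counts from the reported rates.
        System.out.println("1.1 run: " + impliedDocuments(3.2943877, 294) + " documents");
        System.out.println("1.2 run: " + impliedDocuments(3.1665058, 311) + " documents");
    }
}
```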

Notable here is that the 1.2 version seems to be a tiny bit slower on
big files. This is almost certainly due to string reallocation as
contiguous content gets larger, which can happen in any application that
works heavily with string objects. It might be worth looking at whether
this is addressable. Overall, though, HTMLParser 1.2 is clearly an
improvement over the most commonly used Java/HTML parser (i.e., Swing)
in use today ;-).
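
As a general illustration of the string reallocation effect described
above (not HTMLParser's actual code, which this thread does not show),
the sketch below contrasts repeated String concatenation with a growable
buffer. The class and method names are hypothetical, and StringBuilder
requires Java 5 or later; StringBuffer plays the same role on older JDKs.

```java
import java.util.Arrays;

// Hypothetical demo class; not taken from HTMLParser.
class StringGrowthSketch {

    // Each += builds a new String by copying everything accumulated so far,
    // so total copying grows roughly quadratically with the content size.
    static String concatNaive(String[] chunks) {
        String out = "";
        for (String chunk : chunks) {
            out += chunk;
        }
        return out;
    }

    // StringBuilder appends into a buffer that is enlarged only occasionally,
    // so most appends copy just the new chunk.
    static String concatBuffered(String[] chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] chunks = new String[5_000];
        Arrays.fill(chunks, "<td>cell</td>");

        long t0 = System.nanoTime();
        String naive = concatNaive(chunks);
        long t1 = System.nanoTime();
        String buffered = concatBuffered(chunks);
        long t2 = System.nanoTime();

        System.out.println("same result: " + naive.equals(buffered));
        System.out.println("naive ms: " + (t1 - t0) / 1_000_000
                + ", buffered ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Timings will vary by JVM and machine, so the printed numbers are only
indicative; the point is how the gap widens as the content grows.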

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Wednesday, July 10, 2002 5:19 PM
To: htm...@li...;
htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final
Statistics from Trek Run


Hi Claude,
    Thanks a ton for all these tests. Do you think you could write an
article on this that we could put up?

Regards
Somik

Re: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

From: Somik R. <so...@ya...> - 2002-07-11 00:19:15

Hi Claude,
    Thanks a ton for all these tests. Do you think you could write an
article on this that we could put up?

Regards
Somik

RE: [Htmlparser-developer] Final Statistics from Trek Run

From: Claude D. <CD...@ar...> - 2002-07-10 16:24:51

Note also that these tests were run in parallel on the same Solaris box.
A single instance can often run significantly faster. These tests were
done to test relative speed between versions, keeping all other factors
constant.

-----Original Message-----
From: Claude Duguay
Sent: Wednesday, July 10, 2002 8:58 AM
To: htm...@li...;
htm...@li...
Subject: [Htmlparser-developer] Final Statistics from Trek Run

The latest version of the HTMLParser (20020707) appears to deliver good
performance over the Swing parser and previous HTMLParser versions.
These tests were done in context (using our application, which converts
HTML documents, among others, into a normalized form and transmits the
result as XML to a server over TCP/IP). We have subtracted the
transmission time from these numbers, but a small amount of imprecision
is probable given preprocessing and file I/O that gets done up front.
Given the size of the tests (more than half a million documents), these
elements should be negligible. Note that this set includes a large number
of small documents and we know from earlier tests that the Swing parser
slows down dramatically as documents get larger, while the HTMLParser
does not.

Total Documents processed: 642,077
Average Document Size: 4,043

Average Number of Documents Per Second for:

Swing Parser (Java 1.3.1): 2.797185195
HTMLParser 1.1 Production Version: 2.558727723
HTMLParser 1.2 Early integration build: 2.585632061
HTMLParser 1.2 (build 20020707): 3.224910367

Conclusions: The HTMLParser 1.2 is now about 15% faster than the Swing
parser on Swing's home turf (Swing does best with smaller HTML files).
With larger files, we have seen improvements as high as 35 times the
speed of the Swing parser.
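
The percentages in the conclusions follow from dividing one docs/sec
figure by another. A minimal sketch with hypothetical names: the
small-file ratio uses the table above, while the large-file ratios use
the figures from the larger-file run reported elsewhere in this thread,
which may or may not be the runs the "35 times" remark refers to.

```java
// Hypothetical helper; the rates are the docs/sec figures quoted in this thread.
class SpeedupSketch {

    // Ratio of two throughput figures; values above 1.0 mean the candidate is faster.
    static double speedup(double candidateDocsPerSec, double baselineDocsPerSec) {
        return candidateDocsPerSec / baselineDocsPerSec;
    }

    public static void main(String[] args) {
        // Small-file set: HTMLParser 1.2 (build 20020707) vs. the Swing parser.
        double small = speedup(3.224910367, 2.797185195);
        // Large-file set: HTMLParser 1.1 and 1.2 vs. the Swing parser.
        double large11 = speedup(3.2943877, 0.0937279);
        double large12 = speedup(3.1665058, 0.0937279);

        System.out.printf("small files, 1.2 vs Swing: %.1f%% faster%n", (small - 1.0) * 100.0);
        System.out.printf("large files, 1.1 vs Swing: %.1fx%n", large11);
        System.out.printf("large files, 1.2 vs Swing: %.1fx%n", large12);
    }
}
```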

14 messages has been excluded from this view by a project administrator.
