htmlparser-developer Mailing List for HTML Parser (Page 29)
Brought to you by:
derrickoswald
From: Claude D. <CD...@ar...> - 2002-08-01 18:17:56

We've found three documents over the last few days that cause the HTMLParser to hang. I will make sure they get into the bug database, but the issue centers on what should happen when the parser encounters ill-formed HTML. I would propose that the correct behavior is to throw an exception if the parser is unable to handle the syntax, but right now it just hangs. Clearly, more investigation is required to determine whether it's in a loop or waiting on input. Since I'm not sure what a fix would entail, I thought it worth raising the issue as a general design question: what should be done when the parser encounters malformed HTML that goes beyond the realm of reasonable recovery?

BTW: The documents we encountered that hung the parser had the following artifacts:

1) Inclusion of the "<!-->" pattern, which is technically an invalid comment syntax.
2) Inclusion of the "<html><head><TITLE>" pattern twice at the beginning of the document.
3) Two opening "<TITLE>" tags with only one ending "</TITLE>" tag.

From our point of view, a hang is devastating in that it does not allow the application to move forward. An exception would be ideal in that it would identify the problem without breaking the application.
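To make the fail-fast proposal concrete, here is a minimal sketch of a comment skipper that reports malformed input instead of looping or waiting. It is not HTMLParser code: `CommentSkipper` and `MalformedHtmlException` are hypothetical names, and treating "<!-->" as an empty comment is an assumption about the desired recovery.

```java
/** Sketch only: skips an HTML comment and fails fast instead of hanging. */
final class CommentSkipper {

    /** Hypothetical unchecked exception; HTMLParser's own exception types may differ. */
    static final class MalformedHtmlException extends RuntimeException {
        MalformedHtmlException(String message) {
            super(message);
        }
    }

    /**
     * Returns the index just past the comment that starts at {@code start}.
     * Assumption: the degenerate "<!-->" is accepted as an empty comment.
     */
    static int skipComment(String html, int start) {
        if (!html.startsWith("<!--", start)) {
            throw new MalformedHtmlException("No comment at offset " + start);
        }
        if (html.startsWith("<!-->", start)) {
            return start + "<!-->".length();
        }
        int end = html.indexOf("-->", start + "<!--".length());
        if (end < 0) {
            // Unterminated comment: report it rather than keep scanning or reading.
            throw new MalformedHtmlException("Unterminated comment at offset " + start);
        }
        return end + "-->".length();
    }
}
```

A caller can catch the exception, log the offending document, and move on to the next one, which is the behavior the message asks for.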
From: Kaarle K. <kaa...@ik...> - 2002-08-01 04:20:01

At 11:17 1.8.2002 +0900, you wrote:
>Dear Kaarle,
>
>>I made the modification and I wrote one testcase for it and it looks like
>>OK now.
>
>Wow - you're fast! All testcases are passing! Thanks a ton. By the way, your
>parseParameters() method is really a key method in the parser - so I am
>really interested in profiling it to see how we can optimize. It
>will be great to collaborate on this. By the way, there are two flags that I
>see - isApo and isAmp. I guess the former is to flag an apostrophe, but
>what is the latter? Also, if I were to replace t and st with some names,
>what would you suggest?

isApo waits for the next ' sign and isAmp waits for the next " sign. I guess isAmp should be called something else (isCitation?).

I guess t stands for temp. Perhaps it could be e.g. item. st should perhaps be token, but then the current token should be renamed to something like tokenSet.

>>Quite a lot of changes in HTMLParser since I last time looked at it.
>>I guess they have to do with all the bad html syntax there has been on the
>>list lately.
>
>Oh yes, a lot of them are due to bug fixes, and some great suggestions
>from the community. I have received some particularly fine suggestions
>from Sam Joseph and Claude Duguay. Sam's idea of providing data extraction
>methods like toHTML(), toPlainString(), took usability to the next level.
>
>Claude's suggestions, if implemented, will truly make this parser
>professional :). That's next on our agenda.
>
>Once again - thanks so much for your quick action on this bug. By the way,
>could you flag this bug as fixed on the htmlparser page with some comment,
>for archiving purposes? (You are a developer, so you can log in and go to
>the htmlparser bugs page from http://htmlparser.sourceforge.net ).

OK. I wrote there something. Hope that was what you meant.

Kaarle

>Regards,
>Somik
From: Somik R. <so...@ya...> - 2002-08-01 02:24:29

Dear Kaarle,

> I made the modification and I wrote one testcase for it and it looks like
> OK now.

Wow - you're fast! All testcases are passing! Thanks a ton. By the way, your parseParameters() method is really a key method in the parser - so I am really interested in profiling it to see how we can optimize. It will be great to collaborate on this. By the way, there are two flags that I see - isApo and isAmp. I guess the former is to flag an apostrophe, but what is the latter? Also, if I were to replace t and st with some names, what would you suggest?

> Quite a lot of changes in HTMLParser since I last time looked at it.
> I guess they have to do with all the bad html syntax there has been on the
> list lately.

Oh yes, a lot of them are due to bug fixes, and some great suggestions from the community. I have received some particularly fine suggestions from Sam Joseph and Claude Duguay. Sam's idea of providing data extraction methods like toHTML(), toPlainString(), took usability to the next level.

Claude's suggestions, if implemented, will truly make this parser professional :). That's next on our agenda.

Once again - thanks so much for your quick action on this bug. By the way, could you flag this bug as fixed on the htmlparser page with some comment, for archiving purposes? (You are a developer, so you can log in and go to the htmlparser bugs page from http://htmlparser.sourceforge.net ).

Regards,
Somik
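The isApo and isAmp flags discussed in this thread record whether the scanner is currently inside a single-quoted or a double-quoted run. The sketch below shows that idea in isolation. It is not the actual parseParameters() code; the class name and the longer variable names (`inSingleQuote`, `inDoubleQuote`, `item`) are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: splits tag contents on spaces, except inside quotes. */
final class QuoteAwareSplitter {

    static List<String> split(String tagContents) {
        List<String> tokens = new ArrayList<>();
        StringBuilder item = new StringBuilder();
        boolean inSingleQuote = false; // plays the role of isApo
        boolean inDoubleQuote = false; // plays the role of isAmp
        for (char c : tagContents.toCharArray()) {
            // A quote character only toggles its own flag when the other kind is not open.
            if (c == '\'' && !inDoubleQuote) {
                inSingleQuote = !inSingleQuote;
            } else if (c == '"' && !inSingleQuote) {
                inDoubleQuote = !inDoubleQuote;
            }
            if (c == ' ' && !inSingleQuote && !inDoubleQuote) {
                // A space outside quotes ends the current item.
                if (item.length() > 0) {
                    tokens.add(item.toString());
                    item.setLength(0);
                }
            } else {
                item.append(c);
            }
        }
        if (item.length() > 0) {
            tokens.add(item.toString());
        }
        return tokens;
    }
}
```

With this structure a double quote inside a single-quoted value (or the reverse) is kept as ordinary text, which is why two separate flags are needed rather than one.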
From: Kaarle K. <kaa...@ik...> - 2002-07-31 19:38:11

At 11:04 31.7.2002 +0900, you wrote:
>Hi Kaarle,
> I am hoping you will have some time to help us on bug report 588885.
> You would have already got the mail from Bugzilla - there seems to be a
> bug in parseParameters() in dealing with spaces before =. I am wondering
> if I introduced this bug recently, or if this was always there.
> Thanks in advance.

hi,

Quite a lot of changes in HTMLParser since I last time looked at it. I guess they have to do with all the bad html syntax there has been on the list lately.

I made the modification and I wrote one testcase for it and it looks like OK now.

regards
Kaarle

>Cheers,
>Somik

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Somik R. <so...@ya...> - 2002-07-31 02:11:24

Hi Kaarle,

I am hoping you will have some time to help us on bug report 588885. You would have already got the mail from Bugzilla - there seems to be a bug in parseParameters() in dealing with spaces before =. I am wondering if I introduced this bug recently, or if this was always there.

Thanks in advance.

Cheers,
Somik
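One way to picture the "spaces before =" problem is as a normalization step that runs before attributes are split: outside quotes, spaces on either side of "=" are dropped so that `HREF = "x"` and `HREF="x"` look the same to later code. The sketch below is an assumption about how such a step could look, not the fix that was actually committed; `EqualsNormalizer` is a hypothetical name.

```java
/** Sketch only: removes spaces around '=' when outside quoted values. */
final class EqualsNormalizer {

    static String normalize(String tagContents) {
        StringBuilder out = new StringBuilder();
        char quote = 0; // 0 means "not inside a quoted value"
        int i = 0;
        while (i < tagContents.length()) {
            char c = tagContents.charAt(i);
            if (quote != 0) {
                // Inside a quoted value everything is copied verbatim.
                if (c == quote) {
                    quote = 0;
                }
                out.append(c);
                i++;
            } else if (c == '"' || c == '\'') {
                quote = c;
                out.append(c);
                i++;
            } else if (c == '=') {
                // Drop spaces already copied before '='.
                while (out.length() > 0 && out.charAt(out.length() - 1) == ' ') {
                    out.setLength(out.length() - 1);
                }
                out.append('=');
                i++;
                // Skip spaces that follow '='.
                while (i < tagContents.length() && tagContents.charAt(i) == ' ') {
                    i++;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

A regression test for the bug could then assert that the spaced and unspaced forms of the same tag produce identical output.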
From: Somik R. <so...@ya...> - 2002-07-31 00:53:37

Hi Claude,

I will take a look at this as soon as I get some time. One request -- could you open a bug report from http://htmlparser.sourceforge.net

Cheers,
Somik

----- Original Message -----
From: Claude Duguay
To: htm...@li...
Sent: Wednesday, July 31, 2002 5:31 AM
Subject: [Htmlparser-developer] Bug Report

We've found a number of documents, from the same site, that use a convention in the source documents that the browsers seem to deal with well enough but that HTMLParser hangs on. While it's arguable this is not valid HTML, these documents should probably not cause hanging behavior. The "<!-->" sequence (not including quotes) is apparently at fault. If the parser recognized and ignored these, I think this would help. I've attached a document that causes this hanging behavior.
From: Claude D. <CD...@ar...> - 2002-07-30 20:31:17

We've found a number of documents, from the same site, that use a convention in the source documents that the browsers seem to deal with well enough but that HTMLParser hangs on. While it's arguable this is not valid HTML, these documents should probably not cause hanging behavior. The "<!-->" sequence (not including quotes) is apparently at fault. If the parser recognized and ignored these, I think this would help. I've attached a document that causes this hanging behavior.
From: Somik R. <so...@ya...> - 2002-07-28 07:26:51

Hi Folks,

This week's integration release is out - 1.2-2002_07_28. This contains some major bug fixes. They are:

[1] Fixed bug in HTMLParser.openConnection(), which mistook files for URLs if they contained "http" or "www" anywhere.
[2] Updated HTMLEndTag; this was accidentally left out of the previous release.
[3] Fixed bug 586062 - relative links bug - if the first char is a slash, then the subdirectories of the URL need to be ignored.
[4] Fixed bug 586222 - HTMLRemarkNode bug - if a line with a remark node contains a string before it, the string is ignored.
[5] Fixed major bug - allowing auto-correction of malformed tags. Current code is very robust. The fix allowed removal of the strictness vector concept, making the design simpler.
[6] Fixed bug 586756 - in HTMLRemarkNode, if there are only empty lines, the finite state machine would crash.

My thanks to John Zook and Cedric Rosa for bug reports and suggestions.

By the way, the strictness vector concept has been removed, as I mentioned in point [5] - this is probably the most important fix in this release. The parser now begins to show some intelligence - it can auto-correct tags and put inverted commas in the right places. All test cases are passing, and I have put in an intensive amount of testing. Tags like:

[1] <Meta name="sdsd" value="sdsds"">
[2] <Meta name="sdsd" value="sdsd"sds">
[3] <Meta name="sadd" value="sdsd " sdsd sds ">

can be handled now. In cases 2 and 3 the parser corrects them to

<Meta name="sdsd" value="sdsdsds">
<Meta name="sadd" value="sdsd sdsd sds ">

respectively. We can also handle tags of a fourth kind:

[4] <crazy tag="</I>" dfkdlkfld=dfdf>

The criterion now is: if there is a begin tag within the inverted commas, then we shall expect an end tag, and not think it's an error. This is a fundamental change in the parsing automaton in HTMLTag.java.

Regards,
Somik
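The rule behind fix [3] (a link that starts with a slash replaces the whole path of the page URL, while other relative links are resolved against the page's directory) can be illustrated with the standard `java.net.URI` class. This is only a stand-in for the behavior described; it is not how HTMLParser implements link resolution, and the `example.com` URLs are made up.

```java
import java.net.URI;

/** Sketch only: relative link resolution using the JDK, not HTMLParser code. */
final class RelativeLinks {

    /** Resolves {@code link} as it would be interpreted on the page at {@code pageUrl}. */
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/docs/guide/index.html";
        // Leading slash: the subdirectories of the page URL are ignored.
        System.out.println(resolve(page, "/images/logo.gif"));
        // No leading slash: resolved relative to the page's directory.
        System.out.println(resolve(page, "images/logo.gif"));
    }
}
```

The first call yields a URL directly under the host, and the second one stays under `/docs/guide/`.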
From: Somik R. <so...@ya...> - 2002-07-25 15:54:10

Hi Folks,

As you know, Cedric Rosa has been giving some nice bug reports - and I was working on those today. A major problem has been solved - we can now parse tags which are incorrectly ended, and those that have tags in inverted commas in them.

However, one problem remains. Although parsing doesn't crash, when the tag was incorrectly ended, I am removing all the inverted commas from it, simply because I don't know which inverted comma was the wrong one, e.g.

<Meta name="sdsd" value="sdsds"">
<Meta name="sdsd" value="sdsd"sds">
<Meta name="sadd" value="sdsd " sdsd sds ">

This leads to complications. In the third case, if all inverted commas are removed, parseParameters cannot pick up the entire string in value (because of the spaces). The parser needs to be intelligent enough to know which inverted comma was the erroneous one - the same way as we can tell. Can anyone suggest some logic that we could formulate to do this?

Regards,
Somik
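One possible heuristic for the question above: inside a value, treat a double quote as the closing one only if nothing follows it, or if what follows looks like the start of another `name=` attribute. Any other quote inside a value is considered stray and dropped. The sketch below is only an illustration of that idea, not the logic the parser adopted; `QuoteRepair` is a hypothetical name, and the heuristic has known gaps (for example, a valueless attribute after a quoted value would be misread as part of the value).

```java
/** Sketch only: drops stray double quotes inside attribute values. */
final class QuoteRepair {

    /** Repairs the text between '<' and '>' of a single tag. */
    static String repair(String tag) {
        StringBuilder out = new StringBuilder();
        boolean inValue = false;
        for (int i = 0; i < tag.length(); i++) {
            char c = tag.charAt(i);
            if (c != '"') {
                out.append(c);
            } else if (!inValue) {
                inValue = true;            // opening quote
                out.append(c);
            } else if (closesValue(tag, i)) {
                inValue = false;           // genuine closing quote
                out.append(c);
            }
            // Otherwise: a stray quote inside a value, silently dropped.
        }
        return out.toString();
    }

    /** True if the quote at quoteIndex is followed by end of tag or by "name=". */
    private static boolean closesValue(String s, int quoteIndex) {
        int i = quoteIndex + 1;
        while (i < s.length() && s.charAt(i) == ' ') {
            i++;
        }
        if (i == s.length()) {
            return true;                   // only spaces (or nothing) follow
        }
        int nameStart = i;
        while (i < s.length() && " \"=".indexOf(s.charAt(i)) < 0) {
            i++;
        }
        if (i == nameStart) {
            return false;                  // no attribute name follows
        }
        while (i < s.length() && s.charAt(i) == ' ') {
            i++;
        }
        return i < s.length() && s.charAt(i) == '=';
    }
}
```

Applied to the three tags in the message, this keeps `value="sdsds"` for the first, produces `value="sdsdsds"` for the second, and keeps the whole spaced string inside one pair of quotes for the third.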
From: Somik R. <so...@ya...> - 2002-07-23 23:05:20

Hi Cedric,

This is related to the bug fix done earlier, whereby META was added to the strictness list, so META tags need to be well formed. If I take it out, this will work, but the previous bug will reappear when parsing weird meta tags like:

<META NAME="Description" CONTENT="Ethnoburb </I>versus Chinatown: Two Types of Urban Ethnic Communities in Los Angeles">

Though it might be possible to effect a fix for both.... got to think harder..

By the way, can you please open bug reports for both the bugs in the site database - it will then be easy for us to track the bugs and refer to them with links to the site. Thanks again for your great testing!

Cheers,
Somik

****************************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo, Shibuya-ku, Tokyo, 150-0012, JAPAN
Tel : +81-3-54752646
Fax : +81-3-5449-4870
Website : www.kizna.com
Mail : so...@ki...
****************************************
C makes it easy to shoot yourself in the foot. C++ makes it harder, but when you do, it blows away your whole leg. - Bjarne Stroustrup
****************************************

----- Original Message -----
From: "Cédric Rosa" <ced...@fr...>
To: <htm...@li...>
Sent: Wednesday, July 24, 2002 12:48 AM
Subject: [Htmlparser-developer] Bug found

> Hello, I've just tried the new integration release, and here are today's bugs:
>
> I think this bug is new and comes from our weekly fixes:
> When a meta tag contains an odd number of " characters, the program goes into
> an infinite loop.
>
> Examples are on these pages:
> www.cybergeo.presse.fr/actualit/nouvparu/crendus/doriercr2.htm
> www.cybergeo.presse.fr/culture/vendina/vendina.htm
> www.cybergeo.presse.fr/REVGEO/ttsavoir/joly.htm
>
> Regards,
>
> Cedric.
From: Cédric R. <ced...@fr...> - 2002-07-23 15:48:46

Hello, I've just tried the new integration release, and here are today's bugs:

I think this bug is new and comes from our weekly fixes: when a meta tag contains an odd number of " characters, the program goes into an infinite loop.

Examples are on these pages:
www.cybergeo.presse.fr/actualit/nouvparu/crendus/doriercr2.htm
www.cybergeo.presse.fr/culture/vendina/vendina.htm
www.cybergeo.presse.fr/REVGEO/ttsavoir/joly.htm

Regards,
Cedric.
From: Somik R. <so...@ya...> - 2002-07-21 06:14:23

Hi Folks,

Integration Release 1.2-2002_07_21 is out. You can get it from http://htmlparser.sourceforge.net. This release contains four bug fixes - thanks a lot to Cedric Rosa for contributing the bug reports and some of the fixes.

As an aside, I had been very busy with the open sourcing of another project - a synchronization collaboration server. Just in case folks are interested, check www.kizna.org - this is a commercial grade server which we've used to build real-time applications (like auctions, chats, games, etc.). It has a very simple API, not requiring any knowledge of protocols, etc. And there is support too :)

I am hoping to release some more apps - like a distributed pair programming plugin over Eclipse, which might be of interest to those who believe in XP.

Regards,
Somik

**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo, Shibuya-ku, Tokyo, 150-0012, JAPAN
Phone : +81-3-5475-2646
Fax : +81-3-3445-9089
Web : http://www.kizna.com
Mail : so...@ki...
**********************************
From: Somik R. <so...@ya...> - 2002-07-20 01:54:10

Hi Cedric,

This is fixed. You can check out the latest parser from CVS, or wait till tomorrow, when I will make the next integration release.

Thank you for your good work on the bug reports. By the way, I would be glad to give you CVS access so you can directly check in bug fixes. Send me your sourceforge id.

Cheers,
Somik

----- Original Message -----
From: Somik Raha
To: htm...@li...
Sent: Friday, July 19, 2002 5:48 PM
Subject: [Htmlparser-developer] Malformed End Tag [Was: Daily bugs ... and one little fix:)]

Hi Cedric,

> Today, I've found another bug :)
> http://www.cybergeo.presse.fr/sommaire/sisterra/ind15.htm
> The last ">" is missing in the title tag:
> <TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE
> => null pointer exception
> If I remember, you have already fixed this problem for the IMG tag. Hope
> this patch will be the same.

I think this is a different bug altogether - probably in HTMLEndTag. Will try fixing it soon. Thanks for finding this.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-07-19 09:04:17

Hi Cedric,

Thanks yet again for a good bug report. This fix has been incorporated. As of now, only one test case fails - that of the unended title tag. I've got to duplicate some stuff from HTMLTag into HTMLEndTag.

Regards,
Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Wednesday, July 17, 2002 8:10 PM
Subject: [Htmlparser-user] euh ... another fix

Hello,

To test with: www.revues.org/calenda/articles/1379.html

    ... <br> <>PROGRAMME</b><br> ..

=> String Index out of range : 0

In HTMLTagScanner.java:
-------------------------------------
    public static String absorbLeadingBlanks(String s) {
        String temp = new String(s);
        // here we add a check for "temp.length()!=0" to prevent a bug with an empty tag.
        while (temp.length()!=0 && temp.charAt(0)==' ') {
            temp = temp.substring(1,temp.length());
        }
        return temp;
    }

I know my bug reports and my fixes are not very useful (because these bugs almost never happen), but they help to increase the software's stability. I hope my contribution helps you.

Regards,
Cedric.
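The patched method above guards against the empty string by checking the length before calling `charAt(0)`. An equivalent way to express the same guard, shown here only as a sketch and not as the code that went into HTMLTagScanner, is to advance an index and take a single substring at the end, which also avoids allocating a new string per leading blank. `LeadingBlanks` is a hypothetical holder class.

```java
/** Sketch only: an index-based equivalent of the patched absorbLeadingBlanks(). */
final class LeadingBlanks {

    static String absorbLeadingBlanks(String s) {
        int start = 0;
        // The bounds check comes first, so an empty or all-blank string is safe.
        while (start < s.length() && s.charAt(start) == ' ') {
            start++;
        }
        return s.substring(start);
    }
}
```

A test for the reported case would pass the empty string, which is what the scanner sees for a tag like `<>`, and expect an empty result rather than an exception.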
From: Somik R. <so...@ya...> - 2002-07-19 08:55:03

Hi Cedric,

> Today, I've found another bug :)
> http://www.cybergeo.presse.fr/sommaire/sisterra/ind15.htm
> The last ">" is missing in the title tag:
> <TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE
> => null pointer exception
> If I remember, you have already fixed this problem for the IMG tag. Hope
> this patch will be the same.

I think this is a different bug altogether - probably in HTMLEndTag. Will try fixing it soon. Thanks for finding this.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-07-19 08:44:48

> When I parse this url: www.revues.org/calenda/articles/1083.html
> Parsing this file takes more than 40 seconds, so I looked for what might be
> hurting performance.
>
> First, I tried to fix this problem by preventing it from occurring.
> In HTMLReader.java:
> ------------------------------
>     protected boolean readNextLine() {
>         boolean skipLine = true;
>         if (posInLine!=-1 && !(line != null && node.elementEnd()+1>=line.length())) {
>             for (int i = 0; i < line.length(); i++) {
>                 if (line.charAt(i) != ' ') {
>                     skipLine = false;
>                     break;
>                 }
>             }
>         }
>         return skipLine;
>     }
>
> Then I read the surrounding sources and noticed that it would be a better idea
> to patch HTMLStringNode.java. The solution is to go to state 1 when you are at
> the end of a string of spaces.
>
>     if (state==1) {
>         text+=input.charAt(i);
>     }
>     // patch beginning here
>     if (state==0 && i==input.length()-1) state=1;
>     // patch ending here
>     if (state==1 && i==input.length()-1) {
>         input = reader.getNextLine();
>         ///.....
>
> I think the second solution is better. I hope this fix will help you, Somik,
> to patch the code in the next integration release.

This fix is incorporated. Thanks. I've written a test case to trap this bug.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-07-19 08:10:33

Hi Cedric,

This was a very good bug report. This turned out to be a deep bug - but easy to fix. HTMLParser does auto-correction of tags when inverted commas are not provided. However, this can conflict with certain tags where they are provided. So to provide some intelligence in the parser, there is this feature of "strictness".

This allows you to tell the parser when to be strict and when not to be. This makes sense in situations when you know that the html coder would not make a mistake, and if he does, browsers like IE would crash. Examples of such tags would be INPUT - for applets, if you are providing complex params, they must be within inverted commas or it confuses the browser. I have added the META tag also to this strictness list.

Also, there was an issue with HTMLTag.java itself related to this report. Thank you very much for this bug report - you can try the StringExtractor on the url you gave; the entire text comes out cleanly. (Check out from CVS and build, or wait for the next release.)

Cheers,
Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Tuesday, July 16, 2002 7:38 PM
Subject: [Htmlparser-user] Another bug

Hi,

When I parse this url: www.cybergeo.presse.fr\culture\weili\weili.htm no text is found.

With my daily bug reports, you might think that I want to break your software lol ... excuse me for testing with "space" url :)

Cedric.
From: Claude D. <CD...@ar...> - 2002-07-12 04:08:33

You may need to have your unit tests cover a larger set. I've often found the JavaDoc set useful for smaller tests. There are about 8000 documents in there with a variety of sizes, though they are not necessarily representative of the larger ecology of the Internet. The real trick is to put a threshold on the unit test that flags you if you ever make a change that slows things down, at which point you can evaluate whether the tradeoff between a new feature or refactoring choice is worth the performance hit.

You've done a pretty exceptional job and should be proud of the work you've done. Personally, I couldn't be more pleased that our product is 15%+ faster thanks to your design and implementation. Thanks!

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thu 7/11/2002 7:23 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] Re: Final Statistics from Trek Run

> The 1.2 numbers are based on the 0707 build.

Ok, I will profile some more and try to remove any other bottlenecks. I was also thinking of making a head scanner. That would allow me to remove the title and meta scanners from the registered list, and add them only when they are really needed (on encountering the head tag).

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-07-12 02:23:47

> The 1.2 numbers are based on the 0707 build.

Ok, I will profile some more and try to remove any other bottlenecks. I was also thinking of making a head scanner. That would allow me to remove the title and meta scanners from the registered list, and add them only when they are really needed (on encountering the head tag).

Regards,
Somik
From: Claude D. <CD...@ar...> - 2002-07-12 02:05:25

The 1.2 numbers are based on the 0707 build.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thu 7/11/2002 3:44 PM
To: htm...@li...; htm...@li...
Subject: [Htmlparser-developer] Re: Final Statistics from Trek Run

Hi Claude

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Which ver of 1.2 is this (is it the latest) ? The previous one had serious issues with string allocations, but the latest ought to be faster for bigger files than 1.1.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-07-12 01:06:02

Hi Claude

> Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
> Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
> Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Which version of 1.2 is this (is it the latest)? The previous one had serious issues with string allocations, but the latest ought to be faster than 1.1 for bigger files.

Regards,
Somik
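For a quick sense of scale, the throughput figures quoted above can be turned into ratios. The sketch below only restates the three docs/sec values from the message; the class and method names are illustrative.

```java
/** Sketch only: ratios derived from the docs/sec figures quoted in the message. */
final class TrekRunRatios {
    static final double SWING_DOCS_PER_SEC = 0.0937279;
    static final double V11_DOCS_PER_SEC = 3.2943877;
    static final double V12_DOCS_PER_SEC = 3.1665058;

    /** How many times faster HTMLParser 1.1 was than the Swing parser. */
    static double speedupOverSwing() {
        return V11_DOCS_PER_SEC / SWING_DOCS_PER_SEC;
    }

    /** Throughput change from 1.1 to the tested 1.2 build, in percent (negative means slower). */
    static double changeFrom11To12Percent() {
        return (V12_DOCS_PER_SEC / V11_DOCS_PER_SEC - 1.0) * 100.0;
    }
}
```

On these numbers HTMLParser 1.1 comes out roughly 35 times faster than Swing, and the tested 1.2 build is about 3.9% slower than 1.1, which is the regression the question about the 1.2 version is probing.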
From: Somik R. <so...@ya...> - 2002-07-11 22:42:37
|
MessageThe SWT is not a contender for replacing Swing. It may be an = alternative, applicable in many circumstaces, but a quick look at the = Sun's Swing connection should dissuade you from assuming that few people = are using Swing.=20 LOL! I was asking for trouble with that comment :). I guess its just me = that finds Swing unbearably slow. I would not endorse trying to make HTMLParser Swing-compatible. These = are different animals and should stay that way. The notion of providing = a SAX-like interface is interesting but you should look instead toward = XML pull-parsers, which are the high-performance alternatives now = surfacing more widely. There is a JSR = (http://www.jcp.org/jsr/detail/173.jsp) that is trying to unify a good = interface for pull-parsing (they're calling it a Streaming API). You'll = find this link especially intersting (http://www.xmlpull.org/). I will look into this advice seriously (will start by educating myself = on XML Pull-parsers).=20 HTMLParser has two fundamental strengths. 1) It's easy to use and = extend. 2) It's lightning fast. Don't lose sight of these distinctions. The whole XML community is = strugling to achieve these goals and hasn't quite gotten there yet. = There's much to learn from XML, but they are laregely moving in this = direction. Its interesting that this should come up - the other day someone was = suggesting to me if the HTMLParser might not be used for parsing XML.. BTW: JTidy is a serious performance bottleneck in a high-performance = application. Good to know that :), havent checked it out myself yet. Its great to have a knowledgable person like you join this parser = community. It will be of great value in taking the final steps towards = stabilizing the API of the parser. The next integration releases would = focus on incorporating your suggestions, regarding the exception = handling. Maybe first week of Sep might be a realistic date for the = release of 1.2 (unless I get loads of time or help). 
Regards, Somik ----- Original Message -----=20 From: Claude Duguay=20 To: htm...@li...=20 Sent: Friday, July 12, 2002 1:29 AM Subject: RE: [Htmlparser-user] Final Statistics from Trek Run The SWT is not a contender for replacing Swing. It may be an = alternative, applicable in many circumstaces, but a quick look at the = Sun's Swing connection should dissuade you from assuming that few people = are using Swing. =20 HTMLParser has two fundamental strengths. 1) It's easy to use and = extend. 2) It's lightning fast. =20 Don't lose sight of these distinctions. The whole XML community is = strugling to achieve these goals and hasn't quite gotten there yet. = There's much to learn from XML, but they are laregely moving in this = direction. =20 BTW: JTidy is a serious performance bottleneck in a high-performance = application. =20 -----Original Message----- From: Somik Raha [mailto:so...@ya...]=20 Sent: Thursday, July 11, 2002 2:25 AM To: htm...@li... Subject: Re: [Htmlparser-user] Final Statistics from Trek Run Hi Craig, For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be = null). It is possible that the formatting of the input HTML into a = structure with these characteristics reduces the parser's performance in order = to produce a better render. =20 Indeed - perhaps a good idea would be to rewrite JEditorPane :) - = make an open source version, which is better designed. Swing = compatibility is a real pain - we gave up on that not so far back :). On = the other hand, I was thinking that SAX compliance would be feasible and = worth it - I doubt if many people are considering Swing for graphics = these days, especially with the SWT being out there. But the SAX = mechanism is quite popular and its worth being able to just switch = parsers. 
Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information rather than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

So true. Like you had mailed sometime back, JTidy does a good job of that.

Regards,
Somik

----- Original Message -----
From: Craig Raw
To: htm...@li...
Sent: Thursday, July 11, 2002 5:35 PM
Subject: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Just a point to notice on these tests. The htmlparser, for all its merits, is not a direct functional replacement for the Swing parser.

For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be null). It is possible that the formatting of the input HTML into a structure with these characteristics reduces the parser's performance in order to produce a better render.

Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information rather than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

-craig

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Somik Raha
Sent: 11 July 2002 02:19 AM
To: htm...@li...; htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik |
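For readers unfamiliar with the pull-parsing style recommended in the thread above, here is a minimal sketch. It assumes the `javax.xml.stream` (StAX) API, which is what JSR 173 eventually standardized; it is not part of HTMLParser, it only handles well-formed XML, and the class and method names (`PullParsingSketch`, `titles`) are hypothetical, chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of HTMLParser.
class PullParsingSketch {

    /** Collects the text of every <title> element by pulling events one at a time. */
    static List<String> titles(String xml) {
        List<String> found = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The caller asks for the next event; nothing is pushed to a callback.
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        // Reads the element's text and leaves the reader on its end tag.
                        found.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Ill-formed input surfaces as an exception rather than a hang.
            throw new IllegalStateException("Could not parse input", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>Trek Run</title></head><body/></html>";
        System.out.println(titles(doc));
    }
}
```

The difference from SAX is who drives the loop: with SAX the parser calls back into handler methods, whereas here the application decides when to advance and can stop as soon as it has what it needs.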
From: Claude D. <CD...@ar...> - 2002-07-11 16:17:46
|
We're not quite done yet... ;-)

Here are some numbers that reflect the differences with the larger files. This set is 57,952 files (6,256,488,243 bytes), many of which are several-megabyte log file dumps to HTML (average file size for this set is 107,959 bytes). These are especially problematic for the Swing parser:

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Note that this run was done on a single box with no other parallel runs. Also, there was a variance of about 1000 files between runs, which is reflected in the speed numbers. I gave only the average in the paragraph above, so you will not get exact results by recalculating from those numbers. Still, everything needs to be looked at in perspective.

Notable here is that the 1.2 version seems to be a tiny bit slower on big files. This is almost certainly due to string reallocation as contiguous content gets larger, which can happen in any application that works heavily with string objects. It might be worth looking at whether this is addressable. Overall, though, HTMLParser 1.2 is clearly an improvement over the most commonly used Java/HTML parser (i.e. Swing) in use today ;-).

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Wednesday, July 10, 2002 5:19 PM
To: htm...@li...; htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik |
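If string reallocation is indeed the cause suspected in the message above (this has not been verified against the HTMLParser source), the general effect can be sketched as follows. The class and method names are hypothetical and the code is not taken from HTMLParser. It uses `StringBuilder`, which arrived in Java 5; on the Java 1.3 runtime used for these tests the equivalent would be `StringBuffer`.

```java
// Hypothetical illustration class; not HTMLParser code.
class TextAccumulationSketch {

    /** Repeated concatenation: every += builds a new String and copies all text so far. */
    static String concatNaive(String[] chunks) {
        String out = "";
        for (String chunk : chunks) {
            out += chunk;
        }
        return out;
    }

    /** One buffer sized up front, so appends never force the backing array to grow. */
    static String concatPresized(String[] chunks) {
        int total = 0;
        for (String chunk : chunks) {
            total += chunk.length();
        }
        StringBuilder buffer = new StringBuilder(total);
        for (String chunk : chunks) {
            buffer.append(chunk);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String[] chunks = {"<html>", "<body>", "log line", "</body>", "</html>"};
        // Both approaches produce the same text; they differ only in how much copying they do.
        System.out.println(concatNaive(chunks).equals(concatPresized(chunks)));
    }
}
```

The copying cost of the naive version grows with the amount of contiguous text already accumulated, which is consistent with a slowdown that only shows up on large files. Whether this applies to the 1.2 code would need profiling.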
From: Somik R. <so...@ya...> - 2002-07-11 00:19:15
|
Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik |
From: Claude D. <CD...@ar...> - 2002-07-10 16:24:51
|
Note also that these tests were run in parallel on the same Solaris box. A single instance can often run significantly faster. These tests were done to test relative speed between versions, keeping all other factors constant.

-----Original Message-----
From: Claude Duguay
Sent: Wednesday, July 10, 2002 8:58 AM
To: htm...@li...; htm...@li...
Subject: [Htmlparser-developer] Final Statistics from Trek Run

The latest version of the HTMLParser (20020707) appears to deliver better performance than the Swing parser and previous HTMLParser versions. These tests were done in context (using our application, which converts HTML documents, among others, into a normalized form and transmits the result as XML to a server over TCP/IP). We have subtracted the transmission time from these numbers, but a small amount of imprecision is probable given preprocessing and file I/O that gets done up front. Given the size of the tests (more than half a million documents), these elements should be negligible. Note that this set includes a large number of small documents, and we know from earlier tests that the Swing parser slows down dramatically as documents get larger, while the HTMLParser does not.

Total Documents processed: 642,077
Average Document Size: 4,043

Average Number of Documents Per Second for:

Swing Parser (Java 1.3.1): 2.797185195
HTMLParser 1.1 Production Version: 2.558727723
HTMLParser 1.2 Early integration build: 2.585632061
HTMLParser 1.2 (build 20020707): 3.224910367

Conclusions: The HTMLParser 1.2 is now about 15% faster than the Swing parser on Swing's home turf (Swing does best with smaller HTML files). With larger files, we have seen improvements as high as 35 times the speed of the Swing parser.

|
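The relative-speed figures quoted in this thread can be rechecked with a few lines of arithmetic. The sketch below is only a worked calculation on the published numbers; the class and method names are hypothetical, and it assumes "faster" means the ratio of documents-per-second rates.

```java
// Hypothetical helper for rechecking the thread's figures.
class ThroughputCheck {

    /** Documents per second for a run that processed `docs` files in `minutes` minutes. */
    static double docsPerSecond(long docs, long minutes) {
        return docs / (minutes * 60.0);
    }

    /** How much faster rate `a` is than rate `b`, as a percentage of `b`. */
    static double percentFaster(double a, double b) {
        return (a / b - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Small-file run: HTMLParser 1.2 (build 20020707) vs. the Swing parser.
        System.out.printf("vs Swing: %.1f%% faster%n",
                percentFaster(3.224910367, 2.797185195));
        // Same run: 1.2 (build 20020707) vs. the 1.1 production version.
        System.out.printf("vs 1.1:   %.1f%% faster%n",
                percentFaster(3.224910367, 2.558727723));
        // Large-file run: 57,952 files in 10,305 minutes for Swing.
        System.out.printf("Swing, large files: %.7f docs/sec%n",
                docsPerSecond(57952, 10305));
    }
}
```

The first ratio comes out at roughly 15%, matching the stated conclusion. The Swing large-file rate also reproduces from the file count and elapsed minutes, while the HTMLParser large-file rates do not reproduce exactly, as the poster warns, because the file count varied between runs.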