htmlparser-user Mailing List for HTML Parser (Page 93)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-08-01 02:37:21
|
> While using the parser, if I need to call HTMLTag.parseParameters(), I get a Hashtable. Will the keys in this hashtable be in the exact same case as they are in the HTML page or will they be in a standard form, all upper case or all lower case? Since in my project, my team works on HTMLs created by another team this could be in any case, however before conversion I would need to be able to read these attributes irrespective of case. Any ideas?

Good question. Typically, you should not call parseParameters() directly. Instead you should call:

    HTMLTag.getParameter("KEY");

The reason is that parseParameters() is a computation-intensive method - you will take a solid scalability hit if your tag keeps calling it for parsing the table. The parsing should be done only once, and the table created is maintained in the tag object. You can get the table too, if you wish, by calling HTMLTag.getParsed().

Coming to the case of the key - it is ALWAYS converted to UPPERCASE by parseParameters(). This allows you to forget about case issues and deal with parameters uniformly.

Regards, Somik |
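The parse-once, upper-case-keys behaviour described in the reply can be sketched as follows. This is a self-contained illustration, not the HTMLParser source: the class `CachedAttributeTag`, its `parseCount` field and the naive whitespace-based attribute splitting are hypothetical. Only the method name `getParameter` and the upper-casing of keys come from the message. As an added assumption, the sketch also upper-cases the name passed to the lookup, so callers may use any case.

```java
import java.util.Hashtable;
import java.util.Locale;

/** Hypothetical stand-in for a tag object; not the library's HTMLTag class. */
class CachedAttributeTag {
    private final String tagText;
    private Hashtable<String, String> parsed; // built lazily, at most once
    int parseCount = 0;                       // exposed only to show that parsing happens once

    CachedAttributeTag(String tagText) {
        this.tagText = tagText;
    }

    /** Naive parser for name="value" pairs without embedded spaces; keys are stored in upper case. */
    private Hashtable<String, String> parseAttributes() {
        parseCount++;
        Hashtable<String, String> table = new Hashtable<>();
        String[] tokens = tagText.trim().split("\\s+");
        // Token 0 is the tag name, so attribute tokens start at index 1.
        for (int i = 1; i < tokens.length; i++) {
            int eq = tokens[i].indexOf('=');
            if (eq <= 0) {
                continue; // not a name=value pair
            }
            String key = tokens[i].substring(0, eq).toUpperCase(Locale.ROOT);
            String value = tokens[i].substring(eq + 1).replace("\"", "");
            table.put(key, value);
        }
        return table;
    }

    /** Looks up an attribute regardless of the case used in the page or by the caller. */
    String getParameter(String name) {
        if (parsed == null) {
            parsed = parseAttributes(); // the expensive step runs only on first use
        }
        return parsed.get(name.toUpperCase(Locale.ROOT));
    }
}
```

With this shape, repeated lookups on the same tag reuse the cached table, which is the scalability point made in the reply.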
From: Somik R. <so...@ya...> - 2002-08-01 02:33:09
|
> I am evaluating the HTMLParser for suitability in my project. I have gone through the Javadocs and observed that there is support only for some tags. For example IMAGE, LINK, SCRIPT etc. I wanted to know that suppose I want to support another tag say <INPUT> will I need to write my own tag-scanner pair? And if I need to write it how do I do it?

You could do it in two ways.

[1] Handle it directly in your application, like this:

    HTMLNode node;
    HTMLTag tag;
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        if (node instanceof HTMLTag) {
            tag = (HTMLTag) node;
            if (tag.getText().indexOf("INPUT") == 0) {
                // It's an input tag
                // Your code here
                // The conditional above can be made more robust..
            }
        }
    }

[2] Write your own scanner-tag pair. This is very easy for the input tag, as no additional parsing is needed. Have an HTMLInputTagScanner that extends HTMLTagScanner. Implement evaluate() - when should this tag scanner activate? That is, when the string contains INPUT in a certain location (the first location). HTMLTagScanner has some utility methods, like absorbLeadingBlanks(), which you should use to make the checking simpler and more robust.

The scan method is given control when your evaluate() has returned true. You have to create an object of type HTMLInputTag (extends HTMLTag), and this is really very simple. Not much in your tag changes, so use the interface to extract data and create the input tag object. To see how easy this can be, look at HTMLMetaTagScanner.

Finally, make sure that you don't have HTMLFormScanner registered, because HTMLFormScanner automatically picks up the input tags. Actually, we have taken HTMLFormScanner out of the default registry, because it's very hard to auto-correct a complete form block - we don't know how to predict when a form has ended. So this shouldn't be a problem at all for you. The class is just there for people who need to parse forms (some of the users of the parser are using it).

Cheers, Somik |
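The activation check described for evaluate() (skip leading blanks, then look for INPUT at the first position) might look roughly like this. It is a sketch under assumptions: the thread does not show the real HTMLTagScanner signatures, so the class `InputTagCheck` and both method signatures here are hypothetical, and the case-insensitive comparison is an extra robustness choice rather than something the message specifies.

```java
/** Hypothetical helper mirroring the check described above; not the library's scanner class. */
class InputTagCheck {

    /** Drops leading spaces, as absorbLeadingBlanks() is described as doing. */
    static String absorbLeadingBlanks(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == ' ') {
            i++;
        }
        return s.substring(i);
    }

    /** Returns true when the tag text (without angle brackets) names an INPUT tag. */
    static boolean evaluate(String tagText) {
        String t = absorbLeadingBlanks(tagText);
        // Case-insensitive prefix match; returns false if t has fewer than 5 characters.
        if (!t.regionMatches(true, 0, "INPUT", 0, 5)) {
            return false;
        }
        // The name must end here or be followed by whitespace, so "INPUTX" is rejected.
        return t.length() == 5 || Character.isWhitespace(t.charAt(5));
    }
}
```

Checking the character after the name is one way to make the `indexOf("INPUT") == 0` test from option [1] more robust.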
From: <dha...@or...> - 2002-07-31 11:40:06
|
Hi, I am trying to write my own scanner class. In the scan method, what should ideally be returned? The Javadoc says that the return type is an HTMLNode; however, I am not clear as to exactly what should be returned. Can anyone help me out here? Bye, Dhaval |
From: <dha...@or...> - 2002-07-31 10:49:02
|
Hi, While using the parser, if I need to call HTMLTag.parseParameters(), I get a Hashtable. Will the keys in this hashtable be in the exact same case as they are in the HTML page or will they be in a standard form, all upper case or all lower case? Since in my project, my team works on HTMLs created by another team this could be in any case, however before conversion I would need to be able to read these attributes irrespective of case. Any ideas? Regards, Dhaval Udani Senior Analyst M-Line, QPEG OrbiTech Solutions Ltd. +91-22-8290019 Extn. 1457 |
From: <dha...@or...> - 2002-07-31 09:33:45
|
Hi everyone, I am evaluating the HTMLParser for suitability in my project. I have gone through the Javadocs and observed that there is support only for some tags. For example IMAGE, LINK, SCRIPT etc. I wanted to know that suppose I want to support another tag say <INPUT> will I need to write my own tag-scanner pair? And if I need to write it how do I do it? Regards, Dhaval Udani Senior Analyst M-Line, QPEG OrbiTech Solutions Ltd. +91-22-8290019 Extn. 1457 |
From: Cheng J. <c....@sm...> - 2002-07-31 04:14:11
|
Dear Somik Raha, I found a web page which the parser could not process. When I try to parse this page, the parser goes into an endless loop: http://www.bton.ac.uk/ Could you tell me how to deal with this kind of page? Best wishes, Cheng Jun 2002-07-31 |
From: Somik R. <so...@ya...> - 2002-07-28 07:26:51
|
Hi Folks,

This week's integration release is out - 1.2-2002_07_28. This contains some major bug fixes. They are:

[1] Fixed bug in HTMLParser.openConnection(), mistaking files for URLs if they contain "http" or "www" anywhere.
[2] Updated HTMLEndTag; this was accidentally left out in the previous release.
[3] Fixed Bug 586062 - relative links bug - if the first char is a slash, then the subdirectories of the URL need to be ignored.
[4] Fixed Bug 586222 - HTMLRemarkNode bug - if a line with a remark node contains a string before it, the string is ignored.
[5] Fixed major bug - allowing auto-correction of malformed tags. Current code is very robust. The fix allowed removal of the strictness vector concept, making the design simpler.
[6] Fixed Bug 586756 - in HTMLRemarkNode, if there are empty lines only, the finite state machine would crash.

My thanks to John Zook and Cedric Rosa for bug reports and suggestions.

By the way, the strictness vector concept has been removed, as I mentioned in point [5] - this is probably the most important fix in this release. The parser now begins to show some intelligence: it can auto-correct tags and put inverted commas at the right places. All test cases are passing, and I have put in an intensive amount of testing. Tags like:

[1] <Meta name="sdsd" value="sdsds"">
[2] <Meta name="sdsd" value="sdsd"sds">
[3] <Meta name="sadd" value="sdsd " sdsd sds ">

can be handled now. In cases 2 and 3, the parser corrects them to <Meta name="sdsd" value="sdsdsds"> and <Meta name="sadd" value="sdsd sdsd sds "> respectively. We can also handle tags of a fourth kind:

[4] <crazy tag="</I>" dfkdlkfld=dfdf>

The criterion now is: if, within the inverted commas, there is a begin tag, then we shall expect an end tag, and not think it's an error. This is a fundamental change in the parsing automaton in HTMLTag.java.

Regards, Somik |
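The rule behind fix [3] (a link starting with a slash is resolved against the host, ignoring the page's subdirectories) can be illustrated with the JDK's own `java.net.URI` resolution. This is only a sketch of the expected behaviour, not the parser's code; the class name `RelativeLinkDemo` and the example.com URLs are made up for the illustration.

```java
import java.net.URI;

/** Illustrates root-relative versus directory-relative link resolution using the JDK. */
class RelativeLinkDemo {

    /** Resolves a link found in a page against that page's URL. */
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/docs/guide/index.html";
        // Leading slash: the /docs/guide/ subdirectories are ignored.
        System.out.println(resolve(page, "/images/logo.gif"));
        // prints http://www.example.com/images/logo.gif

        // No leading slash: the link is resolved inside the page's directory.
        System.out.println(resolve(page, "images/logo.gif"));
        // prints http://www.example.com/docs/guide/images/logo.gif
    }
}
```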
From: Somik R. <so...@ya...> - 2002-07-21 06:14:22
|
Hi Folks,

Integration Release 1.2-2002_07_21 is out. You can get it from http://htmlparser.sourceforge.net. This release contains four bug fixes - thanks a lot to Cedric Rosa for contributing the bug reports and some of the fixes.

As an aside, I had been very busy with the open sourcing of another project - a synchronization collaboration server. Just in case folks are interested, check www.kizna.org - this is a commercial grade server which we've used to build real-time applications (like auctions, chats, games, etc.). It has a very simple API, not requiring any knowledge of protocols, etc. And there is support too :)

I am hoping to release some more apps, like a distributed pair programming plugin over Eclipse, which might be of interest to those who believe in XP.

Regards, Somik

**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo, Shibuya-ku, Tokyo, 150-0012, JAPAN
Phone : +81-3-5475-2646
Fax : +81-3-3445-9089
Web : http://www.kizna.com
Mail : so...@ki...
********************************** |
From: Somik R. <so...@ya...> - 2002-07-18 00:30:48
|
Hi Cedric,

I really appreciate all your bug reports. It's just that Kizna Corporation is going through a major rollout - we are open sourcing our main product, a collaboration server, and that is taking up all my time at present. As soon as our server rolls out, most of our remaining products will also be open source.

I will definitely look at your reports tomorrow (our release date is today).

One request - it's a little difficult to track bug reports through mail. Can you enter the reports from the site (http://htmlparser.sourceforge.net)?

I also want to mention that the bug report you gave yesterday was very good. In the resulting bug fix, I have further optimized the String parsing, and I think there will be some performance improvement. I will provide detailed info later.

Also, Cedric, can you consider joining the developer list? We can keep the user list free for doubts on the API, and hammer out the bugs in the dev list.

Cheers, Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Wednesday, July 17, 2002 8:10 PM
Subject: [Htmlparser-user] euh ... another fix

Hello,

To test with: www.revues.org/calenda/articles/1379.html

    ... <br> <>PROGRAMME</b><br> ..
    => String Index out of range : 0

In HTMLTagScanner.java:

    public static String absorbLeadingBlanks(String s) {
        String temp = new String(s);
        // here we add a check for "temp.length()!=0" to prevent a bug with empty mark out.
        while (temp.length() != 0 && temp.charAt(0) == ' ') {
            temp = temp.substring(1, temp.length());
        }
        return temp;
    }

I know my bugs report and my fixes are not useful (because bugs quasi never happen) but they contribute to increase the software stability. I hope my contribution help you.

Regards, Cedric.

-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven. http://thinkgeek.com/sf
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
From: R. <ced...@fr...> - 2002-07-17 11:09:59
|
Hello,

To test with: www.revues.org/calenda/articles/1379.html

    ... <br> <>PROGRAMME</b><br> ..
    => String Index out of range : 0

In HTMLTagScanner.java:

    public static String absorbLeadingBlanks(String s) {
        String temp = new String(s);
        // here we add a check for "temp.length()!=0" to prevent a bug with an empty tag ("<>").
        while (temp.length() != 0 && temp.charAt(0) == ' ') {
            temp = temp.substring(1, temp.length());
        }
        return temp;
    }

I know my bug reports and fixes may not seem very useful (because these bugs almost never happen), but they help to increase the software's stability. I hope my contribution helps you.

Regards, Cedric. |
From: R. <ced...@fr...> - 2002-07-17 10:37:51
|
When I parse this URL: www.revues.org/calenda/articles/1083.html

Parsing this file takes more than 40 seconds, so I searched for the problem that might be reducing performance. First, I began by fixing the problem by preventing it from appearing.

In HTMLReader.java:

    protected boolean readNextLine() {
        boolean skipLine = true;
        if (posInLine != -1 && !(line != null && node.elementEnd() + 1 >= line.length())) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) != ' ') {
                    skipLine = false;
                    break;
                }
            }
        }
        return skipLine;
    }

Then I read the surrounding sources and noticed that it would be a better idea to patch HTMLStringNode.java. The solution is to go into state 1 when you are at the end of a string of spaces.

    if (state == 1) {
        text += input.charAt(i);
    }
    // patch beginning here
    if (state == 0 && i == input.length() - 1) state = 1;
    // patch ending here
    if (state == 1 && i == input.length() - 1) {
        input = reader.getNextLine();
        ///.....

I think the second solution is better. I hope this fix will help you, Somik, to patch the code in the next integration release.

Today, I've found another bug :) http://www.cybergeo.presse.fr/sommaire/sisterra/ind15.htm

The last ">" is missing in the title tag:

    <TITLE>SISTEMA TERRA, VOL. VI , No. 1-3, December 1997</TITLE
    => null pointer exception

If I remember correctly, you have already fixed this problem for the IMG tag. I hope this patch will be the same.

Regards, Cedric. |
From: Somik R. <so...@ya...> - 2002-07-17 01:57:49
|
Hi Cedric,

I've fixed this bug (in HTMLStringNode.java). The fix should be in the next integration release. If you are in a hurry, you can check out from CVS and build.

I've also added StringExtractor under parserapplications - it seems to be a common app for a lot of people.

Regards, Somik

----- Original Message -----
From: Cédric Rosa
To: so...@ya...
Sent: Tuesday, July 16, 2002 5:16 PM
Subject: Re: Fw: [Htmlparser-user] Microsoft's ugly web page generation and parsing

Hi Somik,

First, excuse me for the big file included in my mail. "ref6.htm" is the document I obtain when crawling with wget. I tried crawling with your software and it works better. wget may include some newlines or other characters in the saved file.

As you can see in the file "logpb.log", when I parse the file directly your parser is almost perfect, but the <![endif]> tags are still there. "logpb2.log" contains the log from parsing the copy of "ref6.htm" that is on disk.

Thanks a ton for your excellent support.

Regards, Cédric. |
From: R. <ced...@fr...> - 2002-07-16 10:38:48
|
Hi, When I parse this URL: www.cybergeo.presse.fr\culture\weili\weili.htm no text is found. With my daily bug reports, you might think that I want to break your software, lol ... excuse me for testing with "space" URLs :) Cedric. |
From: Somik R. <so...@ya...> - 2002-07-16 00:19:06
|
Hi Cedric,

I couldn't figure out your bug report. On parsing the pages, the output seemed (prima facie) to be correct. Can you specifically give the input that we should try with, and what the actual output should be, and also post what you are getting? Alternatively, tell me which lines in the page are not being parsed correctly. Thanks.

Regards, Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Monday, July 15, 2002 11:41 PM
Subject: [Htmlparser-user] Microsoft's ugly web page generation and parsing

Hello,

Simply try to parse this ugly document for example: www.cevipof.msh-paris.fr\moment\ref6.htm (2,6Mo !!!!!)

The text which result from the parse contains lines like :

    "" <![endif]--><!--[if supportFields]>"
    "v\:* {behavior:url(#default#VML);}"
    "mso-font-pitch:variable;"

I think there is a problem in text detection when several tags are imbricated. The solution will be maybe to skip all text after "<!--" until "-->".

I don't have time to patch the code. If someone can fix this problem, it will be fantastic.

Thanks by advance, Cedric Rosa. |
From: R. <ced...@fr...> - 2002-07-15 14:40:45
|
Hello,

Simply try to parse this ugly document, for example: www.cevipof.msh-paris.fr\moment\ref6.htm (2.6 MB!)

The text that results from the parse contains lines like:

    "" <![endif]--><!--[if supportFields]>"
    "v\:* {behavior:url(#default#VML);}"
    "mso-font-pitch:variable;"

I think there is a problem in text detection when several tags are nested. The solution may be to skip all text after "<!--" until "-->".

I don't have time to patch the code. If someone can fix this problem, it would be fantastic.

Thanks in advance, Cedric Rosa. |
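The suggestion of skipping everything from "<!--" to "-->" can be sketched on a plain string as below. This is a minimal illustration of the idea only, not the fix that went into the parser; the class and method names are hypothetical, and treating an unterminated comment as running to the end of the input is an assumption of this sketch.

```java
/** Hypothetical pre-filter that drops HTML comment spans before text extraction. */
class CommentStripper {

    /** Removes every "<!-- ... -->" span; an unterminated comment swallows the rest of the input. */
    static String stripComments(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (pos < html.length()) {
            int start = html.indexOf("<!--", pos);
            if (start < 0) {
                out.append(html, pos, html.length()); // no more comments: keep the tail
                break;
            }
            out.append(html, pos, start);             // keep the text before the comment
            int end = html.indexOf("-->", start + 4);
            if (end < 0) {
                break;                                // comment never closes: drop the rest
            }
            pos = end + 3;                            // resume just after "-->"
        }
        return out.toString();
    }
}
```

Applied to Word-generated markup such as `text<!--[if supportFields]>field<![endif]-->more`, the conditional block disappears and `textmore` remains. Style rules like `mso-font-pitch:variable;` that sit outside comments would need separate handling.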
From: Somik R. <so...@ya...> - 2002-07-12 01:06:02
|
Hi Claude

> Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
> Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
> Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Which version of 1.2 is this (is it the latest)? The previous one had serious issues with string allocations, but the latest ought to be faster for bigger files than 1.1.

Regards, Somik |
From: Somik R. <so...@ya...> - 2002-07-11 22:42:37
|
> The SWT is not a contender for replacing Swing. It may be an alternative, applicable in many circumstances, but a quick look at Sun's Swing Connection should dissuade you from assuming that few people are using Swing.

LOL! I was asking for trouble with that comment :). I guess it's just me that finds Swing unbearably slow.

> I would not endorse trying to make HTMLParser Swing-compatible. These are different animals and should stay that way. The notion of providing a SAX-like interface is interesting but you should look instead toward XML pull-parsers, which are the high-performance alternatives now surfacing more widely. There is a JSR (http://www.jcp.org/jsr/detail/173.jsp) that is trying to unify a good interface for pull-parsing (they're calling it a Streaming API). You'll find this link especially interesting (http://www.xmlpull.org/).

I will look into this advice seriously (will start by educating myself on XML pull-parsers).

> HTMLParser has two fundamental strengths. 1) It's easy to use and extend. 2) It's lightning fast. Don't lose sight of these distinctions. The whole XML community is struggling to achieve these goals and hasn't quite gotten there yet. There's much to learn from XML, but they are largely moving in this direction.

It's interesting that this should come up - the other day someone was asking me whether the HTMLParser might be used for parsing XML..

> BTW: JTidy is a serious performance bottleneck in a high-performance application.

Good to know that :), haven't checked it out myself yet.

It's great to have a knowledgeable person like you join this parser community. It will be of great value in taking the final steps towards stabilizing the API of the parser. The next integration releases will focus on incorporating your suggestions regarding the exception handling. Maybe the first week of Sep might be a realistic date for the release of 1.2 (unless I get loads of time or help).

Regards, Somik |
From: Claude D. <CD...@ar...> - 2002-07-11 16:29:59
|
The SWT is not a contender for replacing Swing. It may be an alternative, applicable in many circumstances, but a quick look at Sun's Swing Connection should dissuade you from assuming that few people are using Swing.

I would not endorse trying to make HTMLParser Swing-compatible. These are different animals and should stay that way. The notion of providing a SAX-like interface is interesting but you should look instead toward XML pull-parsers, which are the high-performance alternatives now surfacing more widely. There is a JSR (http://www.jcp.org/jsr/detail/173.jsp) that is trying to unify a good interface for pull-parsing (they're calling it a Streaming API). You'll find this link especially interesting (http://www.xmlpull.org/).

HTMLParser has two fundamental strengths. 1) It's easy to use and extend. 2) It's lightning fast.

Don't lose sight of these distinctions. The whole XML community is struggling to achieve these goals and hasn't quite gotten there yet. There's much to learn from XML, but they are largely moving in this direction.

BTW: JTidy is a serious performance bottleneck in a high-performance application.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thursday, July 11, 2002 2:25 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Final Statistics from Trek Run

Hi Craig,

> For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be null). It is possible that the formatting of the input HTML into a structure with these characteristics reduces the parser's performance in order to produce a better render.

Indeed - perhaps a good idea would be to rewrite JEditorPane :) - make an open source version, which is better designed. Swing compatibility is a real pain - we gave up on that not so far back :). On the other hand, I was thinking that SAX compliance would be feasible and worth it - I doubt if many people are considering Swing for graphics these days, especially with the SWT being out there. But the SAX mechanism is quite popular and it's worth being able to just switch parsers.

> Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information rather than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

So true. Like you had mailed sometime back, JTidy does a good job of that.

Regards, Somik |
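For readers unfamiliar with the pull-parsing style mentioned above, here is a small sketch of what it looks like in code. It assumes the `javax.xml.stream` package (StAX), which is the streaming API that came out of JSR 173 and ships with the JDK; it handles well-formed XML only and says nothing about how HTMLParser itself works or should work. The class `PullParseDemo` and its method are hypothetical names for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Pull parsing: the application asks for the next event instead of receiving callbacks. */
class PullParseDemo {

    /** Returns the names of all start elements, in document order. */
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The caller drives the loop and decides which events to act on.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        names.add(reader.getLocalName());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Malformed input is reported as an unchecked exception in this sketch.
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return names;
    }
}
```

The contrast with SAX is that control stays with the caller's loop, which is the design point the message is recommending.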
From: Claude D. <CD...@ar...> - 2002-07-11 16:17:45
|
We're not quite done yet... ;-)

Here are some numbers that reflect the differences with the larger files. This set is 57,952 files (6,256,488,243 bytes), many of which are several-megabyte log file dumps to HTML (average file size for this set is 107,959 bytes). These are especially problematic for the Swing parser:

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Note that this run was done on a single box with no other parallel runs. Also, there was a variance of about 1000 files between runs, which is reflected in the speed numbers. The file count and average size given above are averages across the runs, so you will not get exactly these docs/sec figures by recalculating from them. Still, everything needs to be looked at in perspective.

Notable here is that the 1.2 version seems to be a tiny bit slower on big files. This is almost certainly due to string reallocation as contiguous content gets larger, which can happen in any application that works heavily with string objects. It might be worth looking at whether this is addressable. Overall, though, HTMLParser 1.2 is clearly an improvement over the most commonly used Java/HTML parser (i.e. Swing) in use today ;-). |
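The string reallocation effect mentioned above can be illustrated with a minimal sketch. This is not HTMLParser's code: the class and method names are hypothetical, and `StringBuilder` is used for brevity (it arrived in Java 5; in 2002 the equivalent would have been `StringBuffer`). It only shows why accumulating a large contiguous text with `+=` copies more and more data, and how a presized buffer avoids that.

```java
public class ContentAccumulator {

    // Naive accumulation: every += allocates a new String and copies
    // everything accumulated so far, so total copying grows roughly
    // quadratically with the size of the contiguous content.
    static String concatNaive(String[] chunks) {
        String content = "";
        for (String chunk : chunks) {
            content += chunk;
        }
        return content;
    }

    // Buffered accumulation: the total length is computed first, so the
    // builder's backing array is allocated once and never has to grow.
    static String concatBuffered(String[] chunks) {
        int total = 0;
        for (String chunk : chunks) {
            total += chunk.length();
        }
        StringBuilder content = new StringBuilder(total);
        for (String chunk : chunks) {
            content.append(chunk);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        String[] chunks = {"<pre>", "log line 1\n", "log line 2\n", "</pre>"};
        // Both strategies produce the same text; only the allocation
        // behavior differs.
        System.out.println(concatNaive(chunks).equals(concatBuffered(chunks)));
    }
}
```

Whether this is what actually slows the 1.2 build on large files would have to be confirmed by profiling; the sketch only shows the general pattern.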
From: Claude D. <CD...@ar...> - 2002-07-11 16:00:46
|
There's no question that Swing and HTMLParser are designed for different purposes. Swing doesn't build much of an internal representation if you plug in your own callback. I think that's handled by the EditorKit(s). I think it's more forgiving than you're suggesting. We've used it on millions of files (well above 12 million distinct files that reflect real-world ill-formedness) and it's handled these situations well enough.

Still, because Swing offers this callback mechanism, the parser tends to be used in cases where something like HTMLParser would be a much better choice. |
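For readers unfamiliar with the Swing callback mechanism discussed above, here is a minimal sketch of plugging a custom callback into the Swing parser. `ParserDelegator` and `HTMLEditorKit.ParserCallback` are standard JDK classes; the class name `SwingTextExtractor` and the sample markup are made up for illustration, and this is not the code used in the benchmark.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTextExtractor {

    // Collects the text runs reported by the Swing HTML parser.
    static List<String> extractText(Reader html) throws IOException {
        final List<String> texts = new ArrayList<>();

        // Only the callbacks we care about are overridden; the parser
        // itself builds no document model when used this way.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                texts.add(new String(data));
            }
        };

        // The third argument tells the parser to ignore charset
        // declarations in the document instead of throwing on a change.
        new ParserDelegator().parse(html, callback, true);
        return texts;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Trek</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

Because the callback receives events rather than a tree, it is tempting to use this parser for pure extraction jobs, which is the use case the message argues HTMLParser suits better.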
From: Somik R. <so...@ya...> - 2002-07-11 09:31:42
|
Hi Craig,

> For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be null). It is possible that the formatting of the input HTML into a structure with these characteristics reduces the parser's performance in order to produce a better render.

Indeed - perhaps a good idea would be to rewrite JEditorPane :) - make an open source version, which is better designed. Swing compatibility is a real pain - we gave up on that not so long ago :). On the other hand, I was thinking that SAX compliance would be feasible and worth it - I doubt many people are considering Swing for graphics these days, especially with SWT being out there. But the SAX mechanism is quite popular, and it's worth being able to just switch parsers.

> Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information rather than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

So true. As you mailed some time back, JTidy does a good job of that.

Regards,
Somik |
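To make the "just switch parsers" point concrete, here is a small sketch of what SAX compliance buys. The handler below depends only on the standard `org.xml.sax` interfaces, so any parser exposing an `XMLReader` could drive it. The class name `SaxLinkCollector` and the sample markup are hypothetical, the JDK's built-in XML parser stands in for the parser, and nothing here implies that HTMLParser offers a SAX interface at this point in the thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxLinkCollector extends DefaultHandler {

    final List<String> links = new ArrayList<>();

    // Called by the parser for every start tag; we keep href values of <a>.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Works with any XMLReader: swapping parsers means changing only the
    // reader passed in, not the handler.
    static List<String> collect(XMLReader reader, String markup) throws Exception {
        SaxLinkCollector handler = new SaxLinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // The JDK's default SAX parser; it requires well-formed input.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String xhtml = "<html><body><a href=\"http://example.com/\">x</a></body></html>";
        System.out.println(collect(reader, xhtml));
    }
}
```

The JDK parser used here rejects ill-formed markup, which is exactly the gap an HTML-tolerant SAX front end would fill.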
From: Craig R. <cr...@qu...> - 2002-07-11 09:12:35
|
Just a point to note on these tests. The htmlparser, for all its merits, is not a direct functional replacement for the Swing parser.

For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be null.) It is possible that the formatting of the input HTML into a structure with these characteristics reduces the parser's performance in order to produce a better render.

Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information rather than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

-craig |
From: Somik R. <so...@ya...> - 2002-07-11 00:18:43
|
Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik |
From: Claude D. <CD...@ar...> - 2002-07-10 16:24:51
|
Note also that these tests were run in parallel on the same Solaris box. A single instance can often run significantly faster. These tests were done to test relative speed between versions, keeping all other factors constant. |
From: Claude D. <CD...@ar...> - 2002-07-10 15:58:21
|
The latest version of the HTMLParser (20020707) appears to deliver good performance over the Swing parser and previous HTMLParser versions. These tests were done in context (using our application, which converts HTML documents, among others, into a normalized form and transmits the result as XML to a server over TCP/IP). We have subtracted the transmission time from these numbers, but a small amount of imprecision is probable given preprocessing and file I/O that gets done up front. Given the size of the tests (more than a half million documents), these elements should be negligible. Note that this set includes a large number of small documents, and we know from earlier tests that the Swing parser slows down dramatically as documents get larger, while the HTMLParser does not.

Total Documents processed: 642,077
Average Document Size: 4,043

Average Number of Documents Per Second for:

Swing Parser (Java 1.3.1): 2.797185195
HTMLParser 1.1 Production Version: 2.558727723
HTMLParser 1.2 Early integration build: 2.585632061
HTMLParser 1.2 (build 20020707): 3.224910367

Conclusions: The HTMLParser 1.2 is now about 15% faster than the Swing parser on Swing's home turf (Swing does best with smaller HTML files). With larger files, we have seen improvements as high as 35 times the speed of the Swing parser. |
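The "about 15% faster" conclusion can be checked directly from the documents-per-second figures above. The following is a small sketch of that arithmetic; the class and method names are hypothetical, and the only inputs are the numbers quoted in the message.

```java
public class ThroughputCheck {

    // Ratio of two throughputs: values above 1.0 mean the candidate is faster.
    static double speedup(double candidateDocsPerSec, double baselineDocsPerSec) {
        return candidateDocsPerSec / baselineDocsPerSec;
    }

    // The same comparison expressed as "percent faster than the baseline".
    static double percentFaster(double candidateDocsPerSec, double baselineDocsPerSec) {
        return (speedup(candidateDocsPerSec, baselineDocsPerSec) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double swing = 2.797185195;          // Swing Parser (Java 1.3.1)
        double htmlParser11 = 2.558727723;   // HTMLParser 1.1 Production Version
        double htmlParser12 = 3.224910367;   // HTMLParser 1.2 (build 20020707)

        System.out.printf("1.2 vs Swing: %.1f%% faster%n",
                percentFaster(htmlParser12, swing));
        System.out.printf("1.2 vs 1.1:   %.1f%% faster%n",
                percentFaster(htmlParser12, htmlParser11));
    }
}
```

The ratio of build 20020707 to the Swing parser comes out a little above 1.15, which is consistent with the stated 15%.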