Thread: RE: [Htmlparser-developer] Final Statistics from Trek Run
From: Claude D. <CD...@ar...> - 2002-07-10 16:24:51
Note also that these tests were run in parallel on the same Solaris box. A single instance can often run significantly faster. These tests were done to compare relative speed between versions, keeping all other factors constant.

-----Original Message-----
From: Claude Duguay
Sent: Wednesday, July 10, 2002 8:58 AM
To: htm...@li...; htm...@li...
Subject: [Htmlparser-developer] Final Statistics from Trek Run

The latest version of the HTMLParser (20020707) appears to outperform both the Swing parser and previous HTMLParser versions. These tests were done in context (using our application, which converts HTML documents, among others, into a normalized form and transmits the result as XML to a server over TCP/IP). We have subtracted the transmission time from these numbers, but a small amount of imprecision is probable given preprocessing and file I/O that gets done up front. Given the size of the tests (more than half a million documents), these elements should be negligible. Note that this set includes a large number of small documents, and we know from earlier tests that the Swing parser slows down dramatically as documents get larger, while the HTMLParser does not.

Total Documents processed: 642,077
Average Document Size: 4,043

Average Number of Documents Per Second for:

Swing Parser (Java 1.3.1):                 2.797185195
HTMLParser 1.1 Production Version:         2.558727723
HTMLParser 1.2 Early integration build:    2.585632061
HTMLParser 1.2 (build 20020707):           3.224910367

Conclusions: The HTMLParser 1.2 is now about 15% faster than the Swing parser on Swing's home turf (Swing does best with smaller HTML files). With larger files, we have seen improvements as high as 35 times the speed of the Swing parser.
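
The "about 15%" figure in the conclusions follows directly from the documents-per-second numbers reported above. Here is a minimal sketch of that arithmetic; the class and method names are invented for illustration and are not part of HTMLParser.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of HTMLParser: compares the reported throughputs.
final class ThroughputComparison {

    /** Ratio of two throughputs; a value above 1.0 means the candidate is faster. */
    static double speedup(double candidateDocsPerSec, double baselineDocsPerSec) {
        return candidateDocsPerSec / baselineDocsPerSec;
    }

    /** The same comparison expressed as "percent faster than the baseline". */
    static double percentFaster(double candidateDocsPerSec, double baselineDocsPerSec) {
        return (speedup(candidateDocsPerSec, baselineDocsPerSec) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Documents per second, copied from the message above.
        double latest = 3.224910367; // HTMLParser 1.2 (build 20020707)

        Map<String, Double> baselines = new LinkedHashMap<>();
        baselines.put("Swing Parser (Java 1.3.1)", 2.797185195);
        baselines.put("HTMLParser 1.1 Production Version", 2.558727723);
        baselines.put("HTMLParser 1.2 Early integration build", 2.585632061);

        for (Map.Entry<String, Double> entry : baselines.entrySet()) {
            System.out.printf(Locale.ROOT, "%-40s build 20020707 is %.1f%% faster%n",
                    entry.getKey(), percentFaster(latest, entry.getValue()));
        }
    }
}
```

By the same arithmetic, build 20020707 comes out roughly 26% faster than the 1.1 production version and roughly 25% faster than the early 1.2 integration build.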
From: Somik R. <so...@ya...> - 2002-07-11 22:42:37
> The SWT is not a contender for replacing Swing. It may be an alternative, applicable in many circumstances, but a quick look at Sun's Swing Connection should dissuade you from assuming that few people are using Swing.

LOL! I was asking for trouble with that comment :). I guess it's just me that finds Swing unbearably slow.

> I would not endorse trying to make HTMLParser Swing-compatible. These are different animals and should stay that way. The notion of providing a SAX-like interface is interesting but you should look instead toward XML pull-parsers, which are the high-performance alternatives now surfacing more widely. There is a JSR (http://www.jcp.org/jsr/detail/173.jsp) that is trying to unify a good interface for pull-parsing (they're calling it a Streaming API). You'll find this link especially interesting (http://www.xmlpull.org/).

I will look into this advice seriously (will start by educating myself on XML pull-parsers).

> HTMLParser has two fundamental strengths. 1) It's easy to use and extend. 2) It's lightning fast.
> Don't lose sight of these distinctions. The whole XML community is struggling to achieve these goals and hasn't quite gotten there yet. There's much to learn from XML, but they are largely moving in this direction.

It's interesting that this should come up - the other day someone was asking me whether the HTMLParser might be used for parsing XML.

> BTW: JTidy is a serious performance bottleneck in a high-performance application.

Good to know that :), haven't checked it out myself yet.

It's great to have a knowledgeable person like you join this parser community. It will be of great value in taking the final steps towards stabilizing the API of the parser. The next integration releases will focus on incorporating your suggestions regarding the exception handling. The first week of Sep might be a realistic date for the release of 1.2 (unless I get loads of time or help).
Regards,
Somik

----- Original Message -----
From: Claude Duguay
To: htm...@li...
Sent: Friday, July 12, 2002 1:29 AM
Subject: RE: [Htmlparser-user] Final Statistics from Trek Run

The SWT is not a contender for replacing Swing. It may be an alternative, applicable in many circumstances, but a quick look at Sun's Swing Connection should dissuade you from assuming that few people are using Swing.

HTMLParser has two fundamental strengths. 1) It's easy to use and extend. 2) It's lightning fast.

Don't lose sight of these distinctions. The whole XML community is struggling to achieve these goals and hasn't quite gotten there yet. There's much to learn from XML, but they are largely moving in this direction.

BTW: JTidy is a serious performance bottleneck in a high-performance application.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thursday, July 11, 2002 2:25 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Final Statistics from Trek Run

Hi Craig,

> For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be null). It is possible that the formatting of the input HTML into a structure with these characteristics reduces the parser's performance in order to produce a better render.

Indeed - perhaps a good idea would be to rewrite JEditorPane :) - make an open source version, which is better designed. Swing compatibility is a real pain - we gave up on that not so far back :). On the other hand, I was thinking that SAX compliance would be feasible and worth it - I doubt if many people are considering Swing for graphics these days, especially with the SWT being out there. But the SAX mechanism is quite popular and it's worth being able to just switch parsers.
> Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information rather than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

So true. As you mailed some time back, JTidy does a good job of that.

Regards,
Somik

----- Original Message -----
From: Craig Raw
To: htm...@li...
Sent: Thursday, July 11, 2002 5:35 PM
Subject: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Just a point to notice on these tests. The htmlparser, for all its merits, is not a direct functional replacement for the Swing parser.

For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be null). It is possible that the formatting of the input HTML into a structure with these characteristics reduces the parser's performance in order to produce a better render.

Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information rather than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

-craig

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Somik Raha
Sent: 11 July 2002 02:19 AM
To: htm...@li...; htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?
Regards
Somik
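
The pull-parsing style recommended in the message above can be illustrated with the API that JSR 173 eventually produced, javax.xml.stream (StAX), which ships with Java SE 6 and later. This is a minimal sketch and not HTMLParser code: the class name PullParseSketch and the method elementNames are hypothetical, and the input is assumed to be well-formed XML rather than arbitrary HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of pull parsing with StAX; not part of HTMLParser.
final class PullParseSketch {

    /** Returns the local names of all start tags, in document order. */
    static List<String> elementNames(String wellFormedXml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(wellFormedXml));
            try {
                // The caller pulls events one at a time instead of
                // receiving callbacks from the parser.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        names.add(reader.getLocalName());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Malformed input ends up here; real HTML often would.
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>X</title></head><body/></html>";
        System.out.println(elementNames(doc)); // prints [html, head, title, body]
    }
}
```

The difference from SAX is who drives the loop: with SAX the parser pushes events into handler callbacks, while here the application asks for the next event when it is ready and can stop early without extra machinery.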
From: Somik R. <so...@ya...> - 2002-07-11 00:19:15
Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik