Thread: RE: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run
From: Claude D. <CD...@ar...> - 2002-07-11 16:00:46
There's no question that Swing and HTMLParser are designed for different purposes. Swing doesn't build much of an internal representation if you plug in your own callback; I think that's handled by the EditorKit(s). I think it's more forgiving than you're suggesting. We've used it on millions of files (well above 12 million distinct files that reflect real-world ill-formedness) and it has handled these situations well enough.

Still, because Swing offers this callback mechanism, the parser tends to be used in cases where something like HTMLParser would be a much better choice.

-----Original Message-----
From: Craig Raw [mailto:cr...@qu...]
Sent: Thursday, July 11, 2002 1:36 AM
To: htm...@li...
Subject: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Just a point to notice on these tests. The htmlparser, for all its merits, is not a direct functional replacement for the Swing parser.

For example, the renderer built into Swing's JEditorPane expects callbacks resulting from well-formed HTML with certain (sometimes arbitrary) characteristics. (For example, a <head><title>X</title></head> section must exist, and X cannot be null.) It is possible that formatting the input HTML into a structure with these characteristics reduces the parser's performance in order to produce a better render.

Of course, whether you need to take these considerations into account depends entirely on your application. The htmlparser seems to lean more toward the extraction of information than its representation, and the latter is so fraught with ambiguities as to make it a task of a different order altogether.

-craig

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Somik Raha
Sent: 11 July 2002 02:19 AM
To: htm...@li...; htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik
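For readers unfamiliar with the callback mechanism mentioned at the top of this message, here is a minimal sketch of plugging a custom callback into the Swing parser so that no document model is built. It uses the standard `javax.swing.text.html` classes; the class name `SwingCallbackSketch`, the `TextCollector` helper, and the sample input are made up for illustration and are not taken from the test runs discussed in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingCallbackSketch {

    /** Hypothetical callback: collects text and counts start tags, nothing more. */
    static class TextCollector extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();
        int startTags = 0;

        @Override
        public void handleText(char[] data, int pos) {
            // Called once per run of character data found by the parser.
            text.append(data).append(' ');
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            startTags++;
        }
    }

    /** Runs the Swing parser over the given HTML and returns the collected text. */
    static String extractText(String html) {
        TextCollector collector = new TextCollector();
        try (StringReader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declaration in the input.
            new ParserDelegator().parse(reader, collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.text.toString().trim();
    }

    public static void main(String[] args) {
        // Deliberately ill-formed input: the <b> element is never closed.
        System.out.println(extractText("<p>Hello <b>world</p>"));
    }
}
```

The callback receives events only, so whatever structure the application needs has to be assembled by the callback itself, which is the usage pattern described above.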
From: Claude D. <CD...@ar...> - 2002-07-11 16:17:45
We're not quite done yet... ;-)

Here are some numbers that reflect the differences with the larger files. This set is 57,952 files (6,256,488,243 bytes), many of which are several-megabyte log file dumps to HTML (average file size for this set is 107,959 bytes). These are especially problematic for the Swing parser:

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Note that this run was done on a single box with no other parallel runs. Also, there was a variance of about 1,000 files between runs, which is reflected in the speed numbers. But I provided the average in the paragraph above, so recalculating from those numbers will not reproduce the results exactly. Still, everything needs to be looked at in perspective.

Notable here is that the 1.2 version seems to be a tiny bit slower on big files. This is almost certainly due to string reallocation as contiguous content gets larger, which can happen in any application that works heavily with string objects. It might be worth looking at whether this is addressable. Overall, though, HTMLParser 1.2 is clearly an improvement over the most commonly used Java/HTML parser (i.e., Swing) in use today ;-).

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Wednesday, July 10, 2002 5:19 PM
To: htm...@li...; htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik
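The docs/sec figures in this message can be sanity-checked with a few lines of arithmetic. The sketch below is only a reader's check, not part of the original test harness; the class and method names are invented for illustration, and it assumes the quoted 57,952-file count for every run.

```java
public class ThroughputCheck {

    /** Documents per second for a run of the given length in minutes. */
    static double docsPerSecond(long docs, long minutes) {
        return docs / (minutes * 60.0);
    }

    /** How many times faster the shorter run is than the longer one. */
    static double speedup(long slowMinutes, long fastMinutes) {
        return (double) slowMinutes / fastMinutes;
    }

    public static void main(String[] args) {
        long docs = 57_952; // file count quoted in the message

        System.out.printf("Swing:          %.7f docs/sec%n", docsPerSecond(docs, 10_305));
        System.out.printf("HTMLParser 1.1: %.7f docs/sec%n", docsPerSecond(docs, 294));
        System.out.printf("HTMLParser 1.2: %.7f docs/sec%n", docsPerSecond(docs, 311));

        System.out.printf("1.1 vs Swing:   %.1fx%n", speedup(10_305, 294));
        System.out.printf("1.2 vs Swing:   %.1fx%n", speedup(10_305, 311));
    }
}
```

With the quoted count, the Swing figure reproduces as reported, while the two HTMLParser figures come out slightly lower than reported (about 3.285 and 3.106), which is consistent with the stated variance of about 1,000 files between runs. On elapsed time alone, both HTMLParser versions finish the set more than 30 times faster than Swing.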
From: Somik R. <so...@ya...> - 2002-07-12 01:06:02
Hi Claude,

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Which version of 1.2 is this (is it the latest)? The previous one had serious issues with string allocations, but the latest ought to be faster than 1.1 on bigger files.

Regards,
Somik
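As a general illustration of the string allocation cost discussed in this thread, the sketch below contrasts repeated `String` concatenation with a presized buffer. It is not HTMLParser code and makes no claim about what either 1.2 build actually does; the class and method names are hypothetical. `StringBuilder` is used here for brevity, and `StringBuffer` would be the equivalent on the Java versions current in 2002.

```java
public class StringGrowthSketch {

    /** Naive accumulation: each += allocates a new String and copies all prior content. */
    static String concatNaive(String[] chunks) {
        String out = "";
        for (String chunk : chunks) {
            out += chunk; // total copying grows roughly quadratically with content size
        }
        return out;
    }

    /** Buffered accumulation: one buffer, sized up front, so no reallocation while appending. */
    static String concatBuffered(String[] chunks) {
        int total = 0;
        for (String chunk : chunks) {
            total += chunk.length();
        }
        StringBuilder sb = new StringBuilder(total);
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] chunks = new String[20_000];
        java.util.Arrays.fill(chunks, "<td>log line</td>");

        long t0 = System.nanoTime();
        String a = concatNaive(chunks);
        long t1 = System.nanoTime();
        String b = concatBuffered(chunks);
        long t2 = System.nanoTime();

        System.out.println("same result: " + a.equals(b));
        System.out.println("naive ms:    " + (t1 - t0) / 1_000_000);
        System.out.println("buffered ms: " + (t2 - t1) / 1_000_000);
    }
}
```

The gap between the two approaches widens as the contiguous content grows, which is the kind of behavior that would show up mainly on multi-megabyte files like those in this test set.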