[Htmlparser-developer] Performance Statistics
From: Claude D. <CD...@ar...> - 2002-07-08 22:08:34
Here are the latest numbers from runs we've done using the Trek data set. These are mostly small HTML documents used as baselines in IR (Information Retrieval), slightly cleaned up for HTML processing:

Total number of documents: 642,077
Total original document size (in bytes): 2,596,104,858

Comparison (times include local socket transmission of output documents, which may account for as much as 20-25% of the total time spent):

Swing parser - 4715 minutes total time, average number of documents per second: 2.269625309
HTMLParser 1.1 - 5065 minutes total time, average number of documents per second: 2.112790392
HTMLParser 1.2 (pre-optimization) - 5026 minutes total time, average number of documents per second: 2.129184905

Previous reports that the 1.2 version was slower changed as more data was processed. It was, in fact, only slightly slower than 1.1. If Somik's recent changes improve performance as much as we expect, subsequent numbers should be even better. I thought it would be nice to share these numbers. I will post numbers from a run with the latest optimizations within a few days.

Note that the HTMLTitleScanner, HTMLMetaTagScanner and HTMLScriptScanner are being used in this set of tests, and each element is tested with "instanceof" to catch the key tag information relevant to our application. The HTMLScriptScanner is there only to make sure we skip over any scripts.
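
For anyone who wants to check the throughput figures, here is a minimal sketch of the arithmetic: documents per second is the total document count divided by the run time in seconds. The class and method names (ThroughputCheck, docsPerSecond) are made up for illustration and are not part of HTMLParser or of our test harness; the inputs are the totals quoted above.

```java
public class ThroughputCheck {

    /** Average documents per second for a run lasting the given number of minutes. */
    public static double docsPerSecond(long documents, long minutes) {
        // Multiply by 60.0 so the division is done in floating point.
        return documents / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long documents = 642_077L; // total number of documents in the data set

        // Parser labels and total run times in minutes, as quoted in this message.
        String[] parsers = {
            "Swing parser",
            "HTMLParser 1.1",
            "HTMLParser 1.2 (pre-optimization)"
        };
        long[] minutes = {4715, 5065, 5026};

        for (int i = 0; i < parsers.length; i++) {
            System.out.printf("%-34s %.9f docs/sec%n",
                    parsers[i], docsPerSecond(documents, minutes[i]));
        }
    }
}
```

Since the run times include the socket transmission of output documents, these rates describe the whole pipeline, not the parsers in isolation.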