Thread: RE: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run
From: Claude D. <CD...@ar...> - 2002-07-11 16:17:46
|
We're not quite done yet... ;-)

Here are some numbers that reflect the differences with the larger files. This set is 57,952 files (6,256,488,243 bytes), many of which are multi-megabyte log file dumps to HTML (average file size for this set is 107,959 bytes). These are especially problematic for the Swing parser:

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Note that this run was done on a single box with no other parallel runs. Also, there was a variance of about 1000 files between runs, and that variance is reflected in the speed numbers. Because I provided the average in the paragraph above, you will not get exact results by recalculating from those numbers. Still, everything needs to be looked at in perspective.

Notable here is that the 1.2 version seems to be a tiny bit slower on big files. This is almost certainly due to string reallocation as contiguous content gets larger, which can happen in any application that works heavily with string objects. It might be worth looking at whether this is addressable. Overall, though, HTMLParser 1.2 is clearly an improvement over the most commonly used Java HTML parser in use today (i.e., Swing) ;-).

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Wednesday, July 10, 2002 5:19 PM
To: htm...@li...; htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-developer] Final Statistics from Trek Run

Hi Claude,

Thanks a ton for all these tests. Do you think you could write an article on this that we could put up?

Regards
Somik |
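For anyone rechecking the figures in the message above, here is a minimal Java sketch of the docs/sec arithmetic. The class and method names are made up for illustration and are not part of HTMLParser. Using the averaged count of 57,952 files for every run closely matches the posted Swing rate, while the HTMLParser rates come out slightly different from the posted ones, which is consistent with the roughly 1000-file variance between runs.

```java
// Hypothetical helper, not part of HTMLParser: recomputes documents per second
// from a document count and a wall-clock time given in minutes.
class ThroughputCheck {
    static double docsPerSecond(long documents, long minutes) {
        return documents / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long files = 57_952L; // averaged file count quoted in the message
        System.out.printf("Swing:          %.7f docs/sec%n", docsPerSecond(files, 10_305));
        System.out.printf("HTMLParser 1.1: %.7f docs/sec%n", docsPerSecond(files, 294));
        System.out.printf("HTMLParser 1.2: %.7f docs/sec%n", docsPerSecond(files, 311));
    }
}
```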
From: Claude D. <CD...@ar...> - 2002-07-12 02:05:25
|
The 1.2 numbers are based on the 0707 build.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thu 7/11/2002 3:44 PM
To: htm...@li...; htm...@li...
Subject: [Htmlparser-developer] Re: Final Statistics from Trek Run

Hi Claude

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Which ver of 1.2 is this (is it the latest)? The previous one had serious issues with string allocations, but the latest ought to be faster for bigger files than 1.1.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-07-12 02:23:47
|
> The 1.2 numbers are based on the 0707 build.

Ok, I will profile some more and try to remove any other bottlenecks. I was also thinking of making a head scanner. That would allow me to remove the title and meta scanners from the registered list, and add them only when they are really needed (on encountering the head tag).

Regards,
Somik |
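The head scanner idea above could look roughly like the following sketch. This is not HTMLParser code: `LazyScannerRegistry` and its methods are hypothetical names, scanners are represented only by tag names, and unregistering the title and meta scanners again at the closing head tag is an added assumption that the message does not mention.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: only a head scanner is registered up front, and the
// title/meta scanners are added when a <head> tag is actually encountered.
class LazyScannerRegistry {
    private final Set<String> registered = new LinkedHashSet<>();

    LazyScannerRegistry() {
        registered.add("HEAD");
    }

    // Called when the parser sees an opening tag.
    void onTagStart(String tagName) {
        if ("HEAD".equalsIgnoreCase(tagName)) {
            registered.add("TITLE");
            registered.add("META");
        }
    }

    // Called when the parser sees a closing tag (assumption: scanners are
    // dropped again once the head section ends).
    void onTagEnd(String tagName) {
        if ("HEAD".equalsIgnoreCase(tagName)) {
            registered.remove("TITLE");
            registered.remove("META");
        }
    }

    boolean isRegistered(String tagName) {
        return registered.contains(tagName.toUpperCase(Locale.ROOT));
    }

    int size() {
        return registered.size();
    }
}
```

The intended benefit is that tags in the document body are checked against a shorter list of scanners, since the title and meta scanners are only active while the head section is being read.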
From: Claude D. <CD...@ar...> - 2002-07-12 04:08:33
|
You may need to have your unit tests cover a larger set. I've often found the JavaDoc set useful for smaller tests. There are about 8000 documents in there with a variety of sizes, though they are not necessarily representative of the larger ecology of the Internet. The real trick is to put a threshold on the unit test that flags you if you ever make a change that slows things down, at which point you can evaluate whether a new feature or refactoring choice is worth the performance hit.

You've done a pretty exceptional job and should be proud of the work you've done. Personally, I couldn't be more pleased that our product is 15%+ faster thanks to your design and implementation. Thanks!

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thu 7/11/2002 7:23 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] Re: Final Statistics from Trek Run

> The 1.2 numbers are based on the 0707 build.

Ok, I will profile some more and try to remove any other bottlenecks. I was also thinking of making a head scanner. That would allow me to remove the title and meta scanners from the registered list, and add them only when they are really needed (on encountering the head tag).

Regards,
Somik |
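A threshold check of the kind described above might be sketched as follows. Everything here is illustrative: `PerformanceGuard`, the baseline value, and the stand-in workload are assumptions, and a real test would time the parser over a fixed document set such as the JavaDoc files.

```java
// Hypothetical sketch of a performance threshold for a unit test.
class PerformanceGuard {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // True if the elapsed time is within baseline * (1 + tolerance).
    static boolean withinThreshold(long elapsedMillis, long baselineMillis, double tolerance) {
        return elapsedMillis <= baselineMillis * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        long baselineMillis = 2_000; // assumed baseline recorded from an earlier run
        long elapsed = timeMillis(() -> {
            // Stand-in workload; a real test would parse the document set here.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append("<p>").append(i).append("</p>");
            }
        });
        if (!withinThreshold(elapsed, baselineMillis, 0.10)) {
            throw new AssertionError("Slower than baseline: " + elapsed + " ms");
        }
        System.out.println("Within threshold: " + elapsed + " ms");
    }
}
```

Wall-clock timings are noisy, so the tolerance and the baseline would need tuning for the machine that runs the tests.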
From: Somik R. <so...@ya...> - 2002-07-12 01:06:02
|
Hi Claude

Time for Swing (in minutes): 10,305 (0.0937279 docs/sec)
Time for HTMLParser 1.1 (in minutes): 294 (3.2943877 docs/sec)
Time for HTMLParser 1.2 (in minutes): 311 (3.1665058 docs/sec)

Which ver of 1.2 is this (is it the latest)? The previous one had serious issues with string allocations, but the latest ought to be faster for bigger files than 1.1.

Regards,
Somik |
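As a general illustration of the string allocation issue mentioned above (not HTMLParser's actual code), the sketch below contrasts repeated `String` concatenation with a pre-sized `StringBuilder`. `ConcatSketch` and its methods are hypothetical names. `StringBuilder` arrived in Java 5; on the JVMs of 2002 the equivalent would have been `StringBuffer`.

```java
import java.util.Arrays;

// Hypothetical sketch of why accumulating large content with += gets slow.
class ConcatSketch {
    // Each += allocates a new String and copies everything accumulated so far,
    // so total copying grows roughly quadratically with the content size.
    static String concatNaive(String[] chunks) {
        String result = "";
        for (String chunk : chunks) {
            result += chunk;
        }
        return result;
    }

    // The builder is sized once up front, so appends never reallocate.
    static String concatBuffered(String[] chunks) {
        int total = 0;
        for (String chunk : chunks) {
            total += chunk.length();
        }
        StringBuilder sb = new StringBuilder(total);
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] chunks = new String[5_000];
        Arrays.fill(chunks, "<td>log line</td>");

        long t0 = System.nanoTime();
        String a = concatNaive(chunks);
        long t1 = System.nanoTime();
        String b = concatBuffered(chunks);
        long t2 = System.nanoTime();

        System.out.println("naive:    " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("buffered: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("same result: " + a.equals(b));
    }
}
```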