[Htmlparser-developer] Fw: [Htmlparser-user] Testing/feedback, question
From: Somik R. <so...@ya...> - 2002-06-26 01:39:24
----- Original Message -----
From: Somik Raha
To: htm...@li...
Sent: Wednesday, June 26, 2002 10:13 AM
Subject: Re: [Htmlparser-user] Testing/feedback, question
Dear Claude,
Great mail to read. By the way, as I understand it, you used v1.1 for these
tests. However, I have made some special optimizations in v1.2, particularly
to improve scalability. The string node parser now creates only one
HTMLStringNode object for continuous text. So if you had 10,000 lines, v1.1
would create 10,000 objects, while v1.2 would create only one. The other
scanners have also been optimized.
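As a rough illustration of the difference described above, here is a minimal
sketch in present-day Java. It is not the HTMLParser source: TextNode,
onePerLine and coalesced are hypothetical names made up for this example, and
the real HTMLStringNode class may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

public class TextCoalescingDemo {

    /** Hypothetical stand-in for a text node; NOT HTMLParser's HTMLStringNode. */
    public static final class TextNode {
        public final String text;

        public TextNode(String text) {
            this.text = text;
        }
    }

    /** One node per input line: 10,000 lines give 10,000 objects. */
    public static List<TextNode> onePerLine(List<String> lines) {
        List<TextNode> nodes = new ArrayList<>();
        for (String line : lines) {
            nodes.add(new TextNode(line));
        }
        return nodes;
    }

    /** One node for the whole run of continuous text, however many lines it has. */
    public static List<TextNode> coalesced(List<String> lines) {
        List<TextNode> nodes = new ArrayList<>();
        if (lines.isEmpty()) {
            return nodes;
        }
        StringBuilder text = new StringBuilder();
        for (String line : lines) {
            text.append(line).append('\n');
        }
        nodes.add(new TextNode(text.toString()));
        return nodes;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            lines.add("line " + i);
        }
        System.out.println("one per line: " + onePerLine(lines).size() + " nodes");
        System.out.println("coalesced:    " + coalesced(lines).size() + " nodes");
    }
}
```

The text content is the same either way; only the number of node objects
allocated changes.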
I think this would result in a substantial improvement in your test
results.
By the way, do you think you could write an article about your tests? We
could put it up on the HTMLParser page.
Also, send me your SourceForge ID. I'd like to add you as a developer to
this project, so that you can check improvements directly into CVS.
Regards,
Somik
----- Original Message -----
From: "Claude Duguay" <CD...@ar...>
To: <htm...@li...>
Sent: Wednesday, June 26, 2002 2:57 AM
Subject: RE: [Htmlparser-user] Testing/feedback, question
Here are some test results I thought you may be interested in:
We ran about 58k files through our conversion process using both the
old, Swing-based HTML parser and the new HTMLParser solution yesterday.
Some of these files are not HTML and are routed to other parsers, but
this particular set of files was especially problematic with the Swing
parser.
The exact nature of the Swing parser problem is a reallocation of buffer
space with too small an increment deep down inside the parser code. In
effect, some ungodly low number (4-8) of bytes is allocated each time the
string grows, causing an array copy of an ever-larger string each time.
This is problematic when handling files with large text content
between a specific set of tags, such as large log listings between <PRE>
tags.
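To show why such a small increment hurts, here is a minimal sketch in Java that
only counts how many elements get copied while a buffer grows one character at
a time. It is a model of the behaviour described above, not the Swing parser
code: the class and method names are made up for this example, and the increment
of 8 is taken from the "4-8" estimate. Doubling the capacity is shown purely as
a common alternative for comparison.

```java
public class BufferGrowthDemo {

    /**
     * Counts elements copied while appending n chars one at a time,
     * growing the buffer by a fixed increment whenever it is full.
     */
    public static long copiesWithFixedIncrement(int n, int increment) {
        long copies = 0;
        int capacity = increment;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {
                copies += size;          // everything stored so far is copied
                capacity += increment;   // room for only a few more chars
            }
        }
        return copies;
    }

    /** Same count, but the capacity doubles whenever the buffer is full. */
    public static long copiesWithDoubling(int n, int initialCapacity) {
        long copies = 0;
        int capacity = initialCapacity;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {
                copies += size;
                capacity *= 2;
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        int n = 1000000; // e.g. a large log listing between <PRE> tags
        System.out.println("fixed +8: " + copiesWithFixedIncrement(n, 8) + " element copies");
        System.out.println("doubling: " + copiesWithDoubling(n, 8) + " element copies");
    }
}
```

With a fixed increment the total copy work grows roughly with the square of the
text length, while with doubling it stays proportional to the length. That is
consistent with single large files taking hours.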
Using the old (Swing) parser, we processed 57952 documents, encountered
67 errors, ran in 10305 minutes (several days), with an original
aggregate file size of 6,252,739,014 bytes and a converted document
collection size around 761,653,928 bytes.
Using the new (HTMLParser) parser, we processed 58113 documents,
encountered 69 errors, ran in 294 minutes, with an original aggregate
file size of 6,256,488,243 bytes and a converted document collection
size around 431,198,296 bytes.
While this is not a conclusive test - there are clearly discrepancies
between the two conversion runs that need to be resolved, such as
different output size counts, which are attributable to changes we have
made - the timing difference is impressive:
Going from 10305 minutes to 294 minutes is just over 35 times faster.
This is mostly attributable to the problematic files in this test set,
which took on the order of hours to process each. Yet clearly the
HTMLParser solution overcomes a serious bug in the Swing parser (which
cannot be patched by anyone but Sun or its Java license holders - given
the way the Java license agreement is written).
Note that the same low-level reallocation of string resources in the
Swing parser is less problematic in cases where less text is found
between each tag, but the performance differences should still be
significant when taken over a large set of files. I will share what I can as
we learn more.
-------------------------------------------------------
This sf.net email is sponsored by: Jabber Inc.
Don't miss the IM event of the season | Special offer for OSDN members!
JabConf 2002, Aug. 20-22, Keystone, CO http://www.jabberconf.com/osdn
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user