[Htmlparser-developer] Fw: [Htmlparser-user] Testing/feedback, question
From: Somik R. <so...@ya...> - 2002-06-26 01:39:24
----- Original Message -----
From: Somik Raha
To: htm...@li...
Sent: Wednesday, June 26, 2002 10:13 AM
Subject: Re: [Htmlparser-user] Testing/feedback, question

Dear Claude,

Great mail to read. By the way, as I understand it, you used v1.1 for these tests. However, I have made some special optimizations in v1.2, particularly to improve scalability. The string node parser now creates only one HTMLStringNode object for continuous text. So if you had 10,000 lines, v1.1 would create 10,000 objects, while v1.2 would create only one. The other scanners have also been optimized. I think this would result in a substantial improvement in your test results.

By the way, do you think you could write an article about your tests? We could put it up on the HTMLParser page. Also, send me your SourceForge id; I'd like to add you as a developer on this project, so that you can check in improvements directly to CVS.

Regards,
Somik

----- Original Message -----
From: "Claude Duguay" <CD...@ar...>
To: <htm...@li...>
Sent: Wednesday, June 26, 2002 2:57 AM
Subject: RE: [Htmlparser-user] Testing/feedback, question

Here are some test results I thought you may be interested in:

Yesterday we ran about 58k files through our conversion process using both the old, Swing-based HTML parser and the new HTMLParser solution. Some of these files are not HTML and are routed to other parsers, but this particular set of files was especially problematic with the Swing parser.

The exact nature of the Swing parser problem is a reallocation of buffer space with too small an increment deep down inside the parser code. In effect, some ungodly low number (4-8) of bytes is allocated each time the string grows, causing an array copy of an ever-longer string each time. This is problematic when handling files with large text content between a specific set of tags, such as large log listings between <PRE> tags.
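[Editorial illustration, not part of the original message.] The reallocation problem described above can be sketched as follows. This is a minimal model only, not the Swing parser's actual code: the class and method names are hypothetical, and the increment of 8 is simply taken from the "4-8 bytes" estimate in the message. It counts how many characters must be copied while a buffer fills up, first when the buffer grows by a small fixed increment and then when it doubles in capacity.

```java
public class BufferGrowthDemo {

    /**
     * Counts characters copied while appending totalChars one at a time
     * to a buffer that grows by a fixed increment whenever it is full.
     */
    public static long copiesWithFixedIncrement(int totalChars, int increment) {
        long copied = 0;
        int capacity = increment;
        for (int size = 0; size < totalChars; size++) {
            if (size == capacity) {
                copied += size;          // existing contents move to the new array
                capacity += increment;   // small, constant growth step
            }
        }
        return copied;
    }

    /**
     * Same count, but the buffer doubles its capacity whenever it is full.
     */
    public static long copiesWithDoubling(int totalChars, int initialCapacity) {
        long copied = 0;
        int capacity = initialCapacity;
        for (int size = 0; size < totalChars; size++) {
            if (size == capacity) {
                copied += size;          // existing contents move to the new array
                capacity *= 2;           // geometric growth step
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // hypothetical size of one large text run, e.g. a log inside <PRE>
        System.out.println("fixed increment of 8: " + copiesWithFixedIncrement(n, 8));
        System.out.println("doubling from 8:      " + copiesWithDoubling(n, 8));
    }
}
```

With a fixed increment the copy count grows roughly with the square of the text length (about n^2 / (2 * increment)), while with doubling it stays below 2n. That difference is consistent with individual large files taking hours, although the real parser's behaviour may differ in detail.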
Using the old (Swing) parser, we processed 57952 documents, encountered 67 errors, and ran in 10305 minutes (several days), with an original aggregate file size of 6,252,739,014 bytes and a converted document collection size of around 761,653,928 bytes.

Using the new (HTMLParser) parser, we processed 58113 documents, encountered 69 errors, and ran in 294 minutes, with an original aggregate file size of 6,256,488,243 bytes and a converted document collection size of around 431,198,296 bytes.

While this is not a conclusive test - there are clearly discrepancies between the two conversion runs that need to be resolved, such as different output size counts, which are attributable to changes we have made - the timing difference is impressive: going from 10305 minutes to 294 minutes is just over 35 times faster. This is mostly attributable to the problematic files in this test set, which each took on the order of hours to process. Yet clearly the HTMLParser solution overcomes a serious bug in the Swing parser (which cannot be patched by anyone but Sun or its Java license holders, given the way the Java license agreement is written).

Note that the same low-level reallocation of string resources in the Swing parser is less problematic in cases where less text is found between each tag, but the performance differences should still be significant taken over a large set of files.

I will share what I can as we learn more.

-------------------------------------------------------
This sf.net email is sponsored by: Jabber Inc.
Don't miss the IM event of the season | Special offer for OSDN members!
JabConf 2002, Aug. 20-22, Keystone, CO http://www.jabberconf.com/osdn
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user