htmlparser-announce Mailing List for HTML Parser (Page 3)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2003-03-24 01:22:12
Hi Folks,

This week's integration release has two important fixes:

Integration build 1.3 - 20030323
--------------------------------
[1] Fixed bug 702547 - single quotes parsed more robustly now
[2] Fixed bug 702614 - empty tags handled correctly now. Tag now has a method isEmptyXmlTag().

#2 refers to tags like <tag/>.

Thanks to Joe Robbins for a fine bug report that helped in putting in the fix for #1 faster. Thanks also to Marc Novakowski for the other report.

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the script scanning mechanism. The parser can currently handle script tags like:

<script>
<!--
code here
-->
</script>

But when the tags are like:

<script>
code here
</script>

the parser is unable to identify the code and treats it like regular tags. Such pages are quite widespread and ought to be supported. I was curious if anyone has ideas on solving this, given the existing design - fresh ideas often lead to a better perspective. If you have some ideas, feel free to join the developer list (http://lists.sourceforge.net/lists/listinfo/htmlparser-developer) and post.

Regards,
Somik
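
One possible direction for the script question above, offered only as a sketch: once an opening <script> tag has been seen, treat everything up to the next </script> as raw text instead of handing it to the tag scanners. The class and method names below are hypothetical and are not part of the parser; the sketch also ignores corner cases such as a '>' inside an attribute value.

```java
// Hypothetical sketch; not the parser's actual script handling.
class ScriptBodySketch {

    // Case-insensitive indexOf that never changes string length (unlike toLowerCase()).
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the raw text between the first <script ...> tag and the next </script>,
    // or null when no complete script element is present.
    static String extractScriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) return null;
        int bodyStart = html.indexOf('>', open);   // end of the opening tag
        if (bodyStart < 0) return null;
        bodyStart++;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) return null;
        // Everything in between is kept verbatim, even if it contains '<' or tag-like text.
        return html.substring(bodyStart, close);
    }

    public static void main(String[] args) {
        String page = "<html><script>if (a < b) { document.write('<b>hi</b>'); }</script></html>";
        System.out.println(extractScriptBody(page));
    }
}
```

The point of the sketch is only that the body is located by searching for the end tag rather than by tokenizing it, so code without the <!-- --> wrapper is never mistaken for markup.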
From: Somik R. <so...@ya...> - 2003-03-16 21:36:46
Hi Folks,

This is a major milestone release. A massive refactoring has been completed (it took two weeks), which has brought all the robust error handling cases into CompositeTagScanner. This means all tags that have children will be able to do error correction uniformly. Form tag (and table tags too) should be robust. Table tags are not yet in the standard set of scanners (you still need to add them manually). They should make the cut next week.

We have a new method - registerDomScanners() in Parser - that allows you to build html dom objects.

Interesting fact: as a result of the refactorings, the LOC of the scanners package has reduced from 1553 to 1355 (I was surprised at the digits).

Documentation has been updated - we've started putting up answers by our list members to common questions. Please feel free to update the Wiki and improve it. No login is required.

From the change log:

Integration build 1.3 - 20030316
--------------------------------
[1] Added method finishedParsing() to NodeVisitor
[2] LinkScanner uses CompositeTagScanner.scan()
[3] BulletScanner added
[4] FormScanner uses CompositeTagScanner.scan()
[5] AppletScanner uses CompositeTagScanner.scan()

We highly recommend an upgrade to this version.

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-03-03 03:52:40
Hi Folks,

In this week's release, the change log is:

Integration build 1.3 - 20030302
--------------------------------
[1] Fixed bug in LinkScanner
[2] Cleaned up StringNode interface
[3] Cleaned up RemarkNode interface
[4] Refactored Parser, created ParserHelper

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-02-24 18:11:59
I was trying to integrate the changes of the latest parser with some existing projects at work - and of course, I had to modify the code to use the new API. I have some suggestions, as I know many of you will be facing the same issue.

I use Eclipse, and I hope most of you use a decent IDE that supports refactoring. Get the parser into your IDE, and let all your other project code refer to it (that's how it is set up in my IDE). Then, rename Parser to HTMLParser using your refactoring tool. Rename it back to Parser, and all your existing code will automatically get fixed. Do this for the other renamed classes, like HTMLNode/Node, and within minutes it should be done.

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-02-24 06:15:44
Hi Folks,

This week's release is out. I've finally taken heed of all the feedback I had been receiving about the terrible naming convention, and have removed "HTML" from all class names. In addition, HTMLEnumeration is now NodeIterator and SimpleEnumeration is SimpleNodeIterator. HTMLParser is just Parser.

This is a big step, so to make it easy for everyone, there have been no major bug fixes that will require you to upgrade right away. I apologize in advance for the inconvenience caused - I hope you don't curse me too much for having to modify your programs. I had the option of doing it in stages, forcing you to modify some small thing in every release, or getting it over with in one sweep. I chose the latter because there were too many changes, and suffering over a long period of time didn't make sense. Hopefully, once you have migrated to the new names, you will appreciate not having to type "HTML" each time.

The BodyScanner contributed by Dhaval Udani is finally in (Dhaval - sorry for the delay).

The interesting part is that the documentation accompanying the package is now the latest one on the site - it has been ripped off a Php Wiki. I am thinking that the ripping program might be useful for those who wish to provide wiki content as offline documentation (any feedback on this is welcome).

From the change log:

Integration build 1.3 - 20030223
--------------------------------
[1] Modification of documentation packaging - the new documentation is actually produced by a tiny program that converts wiki pages into documentation (works with PhpWiki)
[2] Inclusion of BodyScanner, BodyTag
[3] HTMLVisitor is now NodeVisitor - and has an extra param to visit itself
[4] HTMLParser is now Parser. No class has HTML prefix anymore.
[5] HTMLEnumeration is now NodeIterator, SimpleEnumeration is SimpleNodeIterator

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-02-16 04:33:26
Hi Folks,

Integration release 1.3-20030215 is out. From the change log:

Integration build 1.3 - 20030215
--------------------------------
[1] Added HtmlScanner
[2] Removed Table, Div and Span from registry of scanners, can still be added individually
[3] Reference test directory of project home page to maybe cure some sporadic errors in BeanTest.
[4] Added setAttribute method
[5] Cleaned up HTMLNode interface (removed TYPE, getType() and print())

With HtmlScanner, you can now get the entire page - sort of a DOM model in a Html object. Useful for testing.

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-02-03 07:21:35
Hi Folks,

Integration release 1.3-20030202 is out. From the change log:

Integration build 1.3 - 20030202
--------------------------------
[1] Renamed HTMLCompositeTagScanner to CompositeTagScanner
[2] Renamed HTMLTag.getParameter() to HTMLTag.getAttribute()
[3] Added TableScanner
[4] Added HtmlPage
[5] Added SpanScanner
[6] Added assertType in HTMLParserTestCase
[7] Added TextExtractingVisitor
[8] Added non-recursive visiting (flag in HTMLVisitor)
[9] Added DivScanner
[10] Modified collectInto to use NodeList
[11] Added collectInto(NodeList, Class)
[12] CompositeTagScanner can handle single xml-like tags e.g. <div/>
[13] Fixed bug 678969 - StringParser was not going into ignore mode on encountering double quotes
[14] Added LabelScanner

Dhaval Udani has contributed LabelScanner. (He has also contributed a BodyScanner, which will make it into next week's release.)

We've shipped this time with two tests failing - both tests replicate the same bug - 677874 - "mishandling of double quotes". I made this release for the following reasons:

[1] This bug is not a new addition but was always there - it's a deep bug in AttributeParser (previously known as ParameterParser) - and it might take a little time to fix
[2] There are a lot of new additions which we'd like to get out there - we finally have a table scanner!
[3] Important bug fixes have been made which further stabilize the parser's performance (and at least one user was desperately waiting for the fix)

Notable addition - HTMLNode.collectInto() has a new mode of operation - using the class type. Suppose you need to get to a node (e.g. images) that is within a composite (like a table); you can do:

    NodeList imageList = new NodeList();
    tableTag.collectInto(imageList, HTMLImageTag.class);

You can also do this directly from the parser - like so:

    HTMLNode node [] = parser.extractAllNodesThatAre(HTMLLinkTag.class);

And here's some more news - we now have our own wiki (finally!). Go to http://htmlparser.sourceforge.net/docs/

This is a free-for-all wiki. It is a little too much for me to write the entire documentation on my own, so I'd highly appreciate it if the user/developer community pitches in - that would be a great benefit for the community. The current documentation on the site is already obsolete, and I am going to take it down soon (hopefully by the next release).

Regards,
Somik
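
To illustrate the class-based collectInto() idea described in this message, here is a minimal, self-contained sketch. It does not use the parser's own classes: Node, TableTag, ImageTag and LinkTag below are stand-ins invented for the example, and a plain java.util.List plays the role of NodeList.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; these are not the htmlparser classes.
class CollectIntoSketch {

    static class Node {
        final List<Node> children = new ArrayList<>();

        Node add(Node child) {
            children.add(child);
            return this;
        }

        // Collecting parameter: the caller supplies the list, and every node of the
        // requested type found at or below this node is appended to it.
        void collectInto(List<Node> collected, Class<? extends Node> type) {
            if (type.isInstance(this)) {
                collected.add(this);
            }
            for (Node child : children) {
                child.collectInto(collected, type);
            }
        }
    }

    static class TableTag extends Node {}
    static class ImageTag extends Node {}
    static class LinkTag extends Node {}

    public static void main(String[] args) {
        // A table holding a link that wraps an image, plus a second image.
        Node table = new TableTag()
                .add(new LinkTag().add(new ImageTag()))
                .add(new ImageTag());

        List<Node> images = new ArrayList<>();
        table.collectInto(images, ImageTag.class);
        System.out.println(images.size() + " image nodes collected");
    }
}
```

Passing the list in, rather than returning a new one from each node, lets one list be filled across a whole subtree without intermediate copies.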
From: Somik R. <so...@ya...> - 2003-01-25 23:41:45
Hi Folks,

The next integration release is out. From the change log:

Integration build 1.3 - 20030125
--------------------------------
[1] HTMLCompositeTagScanner now takes an array of match strings
[2] toHTML(HTMLRenderer ...) was replaced by UrlModifyingVisitor
[3] Fixed NullPointerException in HTMLScriptTag.toString()
[4] Fixed bug in HTMLStringNode (breaking up empty lines into separate string nodes)
[5] Fixed thread safety issue and introduced parser helpers
[6] Fixed bug 664404 - spewing incorrect line breaks in HTMLRemarkNode.toHTML()
[7] Added assertXmlEquals() in HTMLParserTestCase
[8] Added better option tag support
[9] Replaced instanceof with getType() mechanism - much faster
[10] Incorporated NodeList instead of Vector in HTMLCompositeTag
[11] Added HTMLRemarkNode support in Visitor
[12] Fixed bug 673379 (infinite loop on encountering links like ".someurl.html")

Among the notable additions is assertXmlEquals() - this is present to enable us to perform xml testing. This method actually creates the parser and performs a node for node comparison.

Reconstruction has improved a lot - you will find that the parser now does not add unnecessary line breaks, and preserves the html as it came in.

One significant addition is the use of NodeList instead of Vector. The integration has been performed, so there should be a significant performance increase - check http://htmlparser.sourceforge.net/performance/simpleEnumerationPerformance.html

In the coming week, we will be setting up a wiki on sourceforge, where we can collaboratively create documentation - hopefully that will finally take the burden out of the documentation process.

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-01-13 04:50:14
Hi Folks,

This week's integration release is out. This release has significant contributions from Derrick Oswald and Josh Kerievsky. Derrick is building a nice UI for the parser - and making tons of improvements. Thanks to Josh's insight, we have done some major refactorings on the scanners, resulting in a massive drop in code duplication.

Here are some statistics - the scanners package in the last release had 1693 lines of code. In the current release, this has dropped to 1300 lines of code. We have a new class, HTMLCompositeTagScanner, which does the hard work of picking up child tags. Most scanners use this code. HTMLTagScanner too does some useful work - and from this release, new scanners don't need to override evaluate() or scan(). Take a look at the refactored scanner code and you might be surprised by its size and simplicity.

Here's the change log:

Integration build 1.3 - 20030112
--------------------------------
[1] Assume charset is correct for JVM's without Charset class to check it
[2] Beanize the parser
[3] Switch to swingui junit runner by default
[4] Half baked beans
[5] Fix javadoc warnings in JDK 1.4
[6] Added StringFindingVisitor + test code + new visitors packages
[7] Fixed bug 659723, but HTMLStringNode is not thread-safe anymore.
[8] JDK 1.2 compilability
[9] Modified HTMLEnumeration interface (made less verbose)
[10] Added HTMLCompositeTagScanner
[11] Refactored following scanners to use HTMLCompositeTagScanner:
     (i) HTMLStyleScanner
     (ii) HTMLSelectScanner
     (iii) HTMLFrameSetScanner
     (iv) HTMLTitleScanner
     (v) HTMLTextAreaScanner
     (vi) HTMLScriptScanner
     (vii) HTMLFrameSetScanner
[12] Made StringNode the last parse attempt, so now Reader tries in this order: remark, tag, endtag, string (this will return more HTMLStringNode objects than it did before).
[13] Improve speed by performing tag/string triage based on '<' as next character.
[14] Refactored HTMLTagScanner. The following scanners use refactored code:
     (i) HTMLBaseHREFScanner
     (ii) HTMLDoctypeScanner
     (iii) HTMLFrameScanner
     (iv) HTMLJspScanner
     (v) HTMLMetaTagScanner

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-12-29 08:09:57
Hi Folks,

The integration release for this week is out. You can download it from http://htmlparser.sourceforge.net

Integration build 1.3 - 20021228
--------------------------------
[1] Added URLConnection constructors to HTMLParser
[2] Honour charset parameter on HTTP header and in HTML meta tag
[3] Following tags now inherit from HTMLCompositeTag:
    (i) HTMLFormTag
    (ii) HTMLLinkTag
    (iii) HTMLSelectTag
    (iv) HTMLFrameSetTag
    (v) HTMLTitleTag
    (vi) HTMLTextAreaTag
    (vii) HTMLStyleTag
    (viii) HTMLScriptTag
    (ix) HTMLAppletTag
[4] Performed Refactoring "Introduce Parameter Object" on HTMLTag, HTMLCompositeTag, HTMLLinkTag, HTMLFormTag
[5] Refactored HTMLFormTag, pulling up the search methods into HTMLCompositeTag
[6] Added HTMLVector, which can return HTMLSimpleEnumeration - a no-exception flavor of HTMLEnumeration
[7] Refactored HTMLEnumeration - created new interface - HTMLPeekingEnumeration

Notes: HTMLVector is not yet integrated with the tags. That should happen in the next release.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-12-22 03:24:26
Hi Folks,

Finally, after 8 months of hard work, we have the next production release of the parser. 1.2 has tons of bug fixes and features. The change log difference between 1.2 and 1.1 is too big to be listed in this mail - check the change log when you are downloading (it's also in the download package).

Documentation has been considerably improved (the Sample programs would be the place to start). There's a section on the patterns in action as well. You can modify the rendering process for links and images, as well as provide collecting parameters to pick up the nodes that you wish (currently images and links are supported).

Below is the change log (as compared to last week's integration release):

Production Release 1.2
----------------------
[1] Rewrote HTMLLinkProcessor.extract() so URL class does all the heavy lifting
[2] Partially fixed bug 654746 - HTMLLinkScanner error, code review needed
[3] Rendering bug fixed - allowing uniform rendering for links and images
[4] Fixed bug 655917, made HTMLParameterParser.parseParameters() thread-safe
[5] Refactored HTMLFormTag (introduced POST and GET static members)
[6] Bug fixed in HTMLFormTag.getInputTag() (NullPointerException when input tag has no name)
[7] Added ability to get textarea tag from HTMLFormTag.
[8] Added search capability in HTMLFormTag
[9] Fixed bug 655627 - JSP tags with < sign (for loops) were not being parsed correctly
[10] Fixed bug 655603 - JSP tags within src of script not recognized correctly when using single apostrophes
[11] Fixed bug 655580 - JSP tags within title tags not recognized correctly
[12] Fixed bug 655599 - Erroneous end-of-line characters were being added in string nodes
[13] Fixed bug 656870 - HTMLFormScanner goes into infinite loop if a previous link has not been closed

Thanks to Derrick Oswald and Dhaval Udani for their work on the last few releases. Thanks to Joe Robins for pointing out an important bug in HTMLFormScanner. A special mention for Dhaval - all his bug reports come with testcases, making it really easy for us to reproduce the bugs and fix them.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-12-15 09:29:43
Hi Folks,

Candidate 6 is out, and there are some goodies in this one. Thanks to Derrick Oswald and Leslie Rohde (our two new developers) who have put in their time.

From the Change Log:

Integration Build 1.2 - 20021215
--------------------------------
[1] Modified API of HTMLImageTag (refactored name of image loc), HTMLLinkTag (added getters)
[2] Fixed bug 650457 - removeEscapeCharacters() incorrect
[3] Fixed bug 652263 - HTMLParser and null feedback
[4] Changed encoding used from 8859_4 to 8859_1
[5] HTMLRemarkNode returns string data in toPlainTextString() (This is a rollback)
[6] Fixed bug 652746 - HTMLFormTag gets links correctly now
[7] Fixed bug 653720 - HTMLNode uses sun specific class
[8] Improved StringExtractor parser application
[9] Major design improvement, implemented Collection-Parameter pattern - in HTMLNode.collectInto()
[10] Fixed reset crash bug. Reader providers have to explicitly call mark and reset now. This is now documented in HTMLParser.java.
[11] Fixed bug 649269 in HTMLLinkTag.isHttpLink(), now correctly identifies relative links as Http links.

A major API improvement has occurred - HTMLNode now has a new method, collectInto(), which uses a collection parameter to collect nodes. A sample program demonstrating this feature is at: http://htmlparser.sourceforge.net/samples/linksEmbedded.html

Thanks to everyone who participated in the discussions and architecture changes.

There has been a rollback as well: we've taken out the mark and reset mechanism, and this is now the responsibility of the reader supplier.

Cheers,
Somik
From: Somik R. <so...@ya...> - 2002-12-09 01:28:26
Hi Folks,

This week's release is Candidate 5. We've had talented developers joining us over the weekend, so you can expect improvements in quality in the coming weeks. Hopefully, we should have our production release ready by New Year's.

From the change log:

Integration Build 1.2 - 20021208
---------------------------------
[1] Fixed bug in base href scanner - would always expect href
[2] Refactored HTMLFormScanner
[3] Refactored HTMLRenderer to use the Visitor pattern - enabling connections with links and images
[4] HTMLStringNode returns a blank string in toPlainTextString()
[5] HTMLFormTag returns string information in toPlainTextString()

#5 is an important fix, as now we won't lose any meaningful string info contained inside forms when we issue calls like node.toPlainTextString(). Get the latest release from http://htmlparser.sourceforge.net

The site update is continuing at an even pace. There is a new section on writing tests for HTMLParser. We're also trying to introduce a philosophy called "Communicate with TestCases". If you've found a bug, write a testcase for it, and submit that in your report. Of course, you don't have to do this, but if you do, we'd be able to make the fix much faster (and be more motivated to make the fix). Writing a testcase for the parser is super simple - you can check the philosophy and an example on the documentation page: http://htmlparser.sourceforge.net/design/index.html

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-12-02 02:56:54
Hi Folks,

Candidate Release 4 is out. This actually contains a few minor API changes which won't affect your application, but have been done to improve the OO design of the system. HTMLFormScanner has been improved. The major work in this release went into refactoring 201 testcases, so as to make them more readable and follow the Once-And-Only-Once paradigm. Well, the package size dropped about 12KB (after zipping), so you can estimate how much refactoring was done. All tests are passing.

From the Change Log:

Integration Build 1.2 - 20021201
--------------------------------
[1] Refactored HTMLNode, API improved, now HTMLNode stores nodeBegin and nodeEnd.
[2] Refactored Testing framework - to reduce the code size substantially.
[3] HTMLFormScanner improved to include Input, TextArea, Select and Option scanners within

You can get it from http://htmlparser.sourceforge.net

There's an all-new Contributors Page (linked from the main site). Just in case I missed anybody, or you have info to add, please let me know.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-11-26 06:38:12
Hi Folks,

Candidate 3 is out. You can get it from http://htmlparser.sourceforge.net

The website is getting an overhaul, though this is in progress. You will find a new samples page. If anyone wishes to contribute a simple program to add to the catalog, please feel free to come forward.

From the change log, in this release:

Integration Build 1.2 - 20021125
--------------------------------
[1] Incorporated Bug Fix for HTMLLinkProcessor to parse dynamic urls
[2] Refactored package names to org.htmlparser
[3] Added documentation
[4] Can handle url with spaces in it
[5] Fixed bug 643352 - going into infinite loop on bad img within link
[6] Refactored HTMLLinkTag - unnecessary boolean variables removed

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-11-09 18:44:13
Hi Folks,

Candidate Release 2 is out. Changes are:

[1] Updated javadoc
[2] Added support for multiple calls to elements() [sequentially, not in parallel]

The latter means you can complete one round of parsing and then make another call to HTMLParser.elements() to begin another, without needing to recreate the parser object.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-10-31 12:12:38
Hi Folks,

HTMLParser 20021031 (C1) is out. This is candidate release 1. If there are no issues, then this will become a production release.

There are bug fixes in this release, and some improvements. The most important improvement is allowing renderers to be plugged in, so as to allow customization of the functionality of toHTML(). Check the javadoc of com.kizna.html.HTMLNode.

Feedback will help us finalize this version, and is eagerly awaited.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-10-16 10:59:16
Hi Folks,

Integration release 1.2-20021016 is out. You can get it from http://htmlparser.sourceforge.net

Here's the change log:

Integration Build 1.2 - 20021016
--------------------------------
[1] Fixed bug 621117 - JSP tags not recognized if within string node
[2] Fixed bug 617228 - Links with > symbol in query strings were not being recognized.
[3] build.xml completely automatic - no manual changes needed before running
[4] build.xml included in release package, inside src.zip
[5] Refactored HTMLTag - design modified, introduced HTMLTagParser helper class
[6] Optimized scanning process - 20% faster now

There have been some refactorings and optimizations in this release. Most notably, the scanners are not enumerated sequentially anymore. Instead, they are stored inside hashtables, and are identified by the first word that occurs in a tag (in uppercase). Now, we have a default implementation of evaluate() which returns true, and most of the scanners don't override this if their evaluation is simply based on matching the first word. However, if the matching logic is complex, then evaluate() should be overridden. An additional method has been introduced in HTMLTagScanner which all scanners have to override - getID() - which will be used to register the scanner into the hashtable (called only once) inside addScanner().

In addition, feedback is being incorporated - you will find feedback if you run the testcases.

The performance improvement is substantial - on running com.kizna.htmlTests.PerformanceTest.java, I could see a reduction of 500 ms (with all scanners registered), from 2500 ms to 2000 ms (run on the MySQL installation guide page).

For developers (or folks who want to join) - the build script has been included in the distribution (it is a whole lot more powerful now - autodetects code version, etc.). Making your package ready for distribution is exceedingly simple now, so do go ahead and explore.

Regards,
Somik
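
The lookup scheme described in this message can be sketched roughly as follows. This is an illustration under assumptions, not the parser's code: the class names are invented for the example, the evaluate() signature is a guess, and a HashMap stands in for the hashtable mentioned above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the registry idea; names do not mirror the real classes.
class ScannerRegistrySketch {

    abstract static class TagScanner {
        // Key under which the scanner is registered, e.g. "A" or "IMG".
        abstract String getID();

        // Default: a match on the first word of the tag is sufficient.
        boolean evaluate(String tagContents) {
            return true;
        }
    }

    static class ImageScanner extends TagScanner {
        @Override
        String getID() {
            return "IMG";
        }
    }

    // More complex matching: only anchors that carry an href are accepted.
    static class LinkScanner extends TagScanner {
        @Override
        String getID() {
            return "A";
        }

        @Override
        boolean evaluate(String tagContents) {
            return tagContents.toLowerCase(Locale.ROOT).contains("href");
        }
    }

    private final Map<String, TagScanner> scanners = new HashMap<>();

    void addScanner(TagScanner scanner) {
        // getID() is consulted once, at registration time.
        scanners.put(scanner.getID(), scanner);
    }

    // tagContents is the text between '<' and '>', e.g. "img src=logo.gif".
    TagScanner findScanner(String tagContents) {
        String trimmed = tagContents.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String firstWord = trimmed.split("\\s+")[0].toUpperCase(Locale.ROOT);
        TagScanner scanner = scanners.get(firstWord);   // one lookup instead of a loop
        return (scanner != null && scanner.evaluate(trimmed)) ? scanner : null;
    }

    public static void main(String[] args) {
        ScannerRegistrySketch registry = new ScannerRegistrySketch();
        registry.addScanner(new ImageScanner());
        registry.addScanner(new LinkScanner());
        System.out.println(registry.findScanner("img src=logo.gif") != null);
        System.out.println(registry.findScanner("a name=top") != null);
    }
}
```

With this shape, the cost of finding a scanner no longer grows with the number of registered scanners, which is consistent with the speed-up reported in the message.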
From: Somik R. <so...@ya...> - 2002-10-02 03:18:41
Hi Folks,

The latest integration release of HTMLParser has some bug fixes, but the biggest improvement is the addition of a base ref scanner. Now, pages with base ref urls can be easily picked up, and images and links resolved accordingly.

You can download it from http://htmlparser.sourceforge.net

Regards,
Somik
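
As an illustration of what resolving links against a base href involves, here is a small sketch using only the standard java.net.URI class. It is not the scanner's implementation, and the URLs are made-up examples.

```java
import java.net.URI;

// Illustrative sketch; the base ref scanner's actual code may work differently.
class BaseHrefSketch {

    // Resolves a link found in a page against the page's <base href="..."> value.
    static String resolve(String baseHref, String link) {
        return URI.create(baseHref).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/";             // hypothetical base href
        System.out.println(resolve(base, "images/logo.gif")); // relative to the base directory
        System.out.println(resolve(base, "../index.html"));   // climbs one level
        System.out.println(resolve(base, "/top.gif"));        // relative to the host root
        System.out.println(resolve(base, "http://other.example.org/a.html")); // already absolute
    }
}
```

Note that URI.create() rejects strings that are not valid URIs (for example, links containing raw spaces), so real-world pages would need some cleanup first.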
From: Somik R. <so...@ya...> - 2002-09-01 03:48:33
Hi Folks,

2002_08_31 is out. Changes:

[1] Feedback integrated into the API. Not yet functional - but will be over the next few releases. The API change has been put in early. This is the last planned change in the API for production release 1.2.
[2] End of Line String implemented across all scanners.

You can download it from http://htmlparser.sourceforge.net

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-08-26 01:46:18
Hi Folks,

Integration Release 1.2-2002_08_26 is out. The major improvement is that handling of newline characters is now totally customizable. You can set your own line separator ("\n" or "\r\n"), or have the parser auto-detect it from your JVM. This is useful when you perform platform-specific reconstructions using toHTML(). So, you won't see the funny characters that occur due to cross-platform incompatibility of end-of-line characters.

For the complete change log, check the download page.

Regards,
Somik
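
A rough sketch of the configurable line separator idea, assuming a renderer that joins output lines: the class and its setLineSeparator()/toHtml() methods are hypothetical and only mimic the behaviour described above, with System.lineSeparator() supplying the JVM default.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the parser's real setter and rendering code are not shown here.
class LineSeparatorSketch {

    // Default: whatever the running JVM reports for the platform.
    private String lineSeparator = System.lineSeparator();

    void setLineSeparator(String separator) {
        this.lineSeparator = separator;
    }

    // Rebuilds the text with the configured end-of-line string between lines.
    String toHtml(List<String> lines) {
        return String.join(lineSeparator, lines);
    }

    public static void main(String[] args) {
        LineSeparatorSketch renderer = new LineSeparatorSketch();
        renderer.setLineSeparator("\r\n");   // force Windows-style line endings
        String html = renderer.toHtml(Arrays.asList("<html>", "<body>", "</body>", "</html>"));
        System.out.println(html.length());
    }
}
```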
From: Somik R. <so...@ya...> - 2002-08-10 08:12:31
Hi Folks,

The next integration release (v1.2-2002-08-11) is out. It has significant bug fixes and API changes. Check http://htmlparser.sourceforge.net

Regards,
Somik

**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo,
Shibuya-ku, Tokyo,
150-0012, JAPAN
Phone : +81-3-5475-2646
Fax : +81-3-3445-9089
Web : http://www.kizna.com
Mail : so...@ki...
**********************************
From: Somik R. <so...@ya...> - 2002-08-04 07:07:14
Hi Folks,

HTMLParser 1.2-2002_08_04 is out. Major API changes have occurred - chained exception handling, which will allow applications to handle exceptions. Lots of important bug fixes done.

Note: 1 known bug still exists in parseParameters(), so you would see two failing testcases, but this bug is minor, and will be fixed in the next release. We would appreciate feedback on the API changes in the user list. Check http://htmlparser.sourceforge.net.

Regards,
Somik
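
For readers unfamiliar with chained exceptions, the general Java idiom looks roughly like this. The exception class and the openPage() method are invented for the sketch and are not the parser's API; the low-level failure is simulated.

```java
import java.io.IOException;

// Generic illustration of exception chaining; all names here are hypothetical.
class ChainedExceptionSketch {

    static class SketchParserException extends Exception {
        SketchParserException(String message, Throwable cause) {
            super(message, cause);   // keeps the original failure as the cause
        }
    }

    static void openPage(String location) throws SketchParserException {
        try {
            // Simulated low-level failure.
            throw new IOException("connection refused: " + location);
        } catch (IOException e) {
            // Wrap rather than swallow, so the application sees both levels.
            throw new SketchParserException("could not open " + location, e);
        }
    }

    public static void main(String[] args) {
        try {
            openPage("http://example.com/");
        } catch (SketchParserException e) {
            System.out.println(e.getMessage());
            System.out.println("caused by: " + e.getCause());
        }
    }
}
```

The application catches one library-level exception type but can still inspect getCause() to decide how to react to the underlying problem.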
From: Somik R. <so...@ya...> - 2002-07-28 07:20:37
Hi Folks,

This release contains a lot of important bug fixes. You can get it from http://htmlparser.sourceforge.net

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-07-21 06:06:03
Hi Folks,

A new integration release is out - 2002-07-21. It contains 4 bug fixes, and the code is refactored and a bit more optimized.

Regards,
Somik