htmlparser-developer Mailing List for HTML Parser (Page 19)
Brought to you by:
derrickoswald
From: Derrick O. <Der...@ro...> - 2003-01-20 13:20:35
|
Somik, My instincts would say to use the simplest mechanism possible. In this case it would be instanceof, since the getType() way involves extra fields and accessor methods. But what problem are you trying to solve? Is it the "if (node instanceof HTMLLinkTag)" that seems to be needed everywhere? Perhaps HTMLNode should have a "getLink()" method that returns null but is overridden in HTMLLinkTag? Similarly, rationalization of toString(), getPlainTextString(), getHTML() and any required new methods to return appropriate renditions of the text within the node could eliminate the instanceof operations in StringExtractor and elsewhere. My $0.02 worth. Derrick Somik Raha wrote: >Hi Derrick, > It was really nice to read your reply. I tried a more accurate test (no, >I didnt include instanceof HTMLNode, as our matches are at most one level >up). The results (attached graph) show that it is almost the same - there is >no perceivable improvement in this case. I guess if one goes a couple of >layers up, the benefits would start to show. > > Which brings me to the next question - knowing that we have no >perceptible improvement to gain, should we recommend the use of the >object-oriented way ? > >Regards, >Somik > >----- Original Message ----- >From: "Derrick Oswald" <Der...@ro...> >To: <htm...@li...> >Sent: Saturday, January 18, 2003 6:36 AM >Subject: Re: [Htmlparser-developer] Java Performance question > > |
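Derrick's suggestion in the message above (a default getLink() on the base node that returns null and is overridden in the link tag) could look roughly like the sketch below. It is only an illustration under stated assumptions: BaseNode, LinkNode, TextOnlyNode and LinkCollector are hypothetical stand-ins, not the real HTMLNode and HTMLLinkTag classes, and the sketch uses generics, which the 2003 code base could not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HTMLNode: by default a node has no link.
class BaseNode {
    /** Returns the link target, or null when this node is not a link. */
    String getLink() {
        return null;
    }
}

// Hypothetical stand-in for HTMLLinkTag: overrides the default.
class LinkNode extends BaseNode {
    private final String href;

    LinkNode(String href) {
        this.href = href;
    }

    @Override
    String getLink() {
        return href;
    }
}

// Hypothetical plain text node: simply inherits the null default.
class TextOnlyNode extends BaseNode {
}

class LinkCollector {
    /** Collects link targets without any instanceof test or cast. */
    static List<String> collectLinks(List<BaseNode> nodes) {
        List<String> links = new ArrayList<>();
        for (BaseNode node : nodes) {
            String link = node.getLink();
            if (link != null) {
                links.add(link);
            }
        }
        return links;
    }
}
```

The caller never asks what kind of node it holds; the trade-off is that the base class grows one method per such question.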
From: Somik R. <so...@ya...> - 2003-01-20 07:19:32
|
Hi Derrick, It was really nice to read your reply. I tried a more accurate test (no, I didnt include instanceof HTMLNode, as our matches are at most one level up). The results (attached graph) show that it is almost the same - there is no perceivable improvement in this case. I guess if one goes a couple of layers up, the benefits would start to show. Which brings me to the next question - knowing that we have no perceptible improvement to gain, should we recommend the use of the object-oriented way ? Regards, Somik ----- Original Message ----- From: "Derrick Oswald" <Der...@ro...> To: <htm...@li...> Sent: Saturday, January 18, 2003 6:36 AM Subject: Re: [Htmlparser-developer] Java Performance question > Somik, > > I think there are a couple of reasons. First is your instanceof test is > always immediately succeeding. The penalty for instanceof is when it has > to walk the inheritance heirarchy (usually all the way up to Object) to > determine failure, which would happen often if you were trying to > determine what to do with an unknown node type. Second, your getType() > involves a virtual method call that would normally not be done more than > once. 
That is, you would typically get the unknown type once and compare > it to each of the final types you are aware of, which would effectively > move line 45 of the second dissassembly below (generated by "javap -c > org.htmlparser.tests.InstanceofPerformanceTest") out of the kernel loop > and replace it with a "lload 8" probably: > > Method void doInstanceofTest(long[], int, long) > <snip> > 35 lconst_0 // for (i = 0 > 36 lstore 7 > 38 goto 57 > 41 aload_0 // this > 42 getfield #10 <Field org.htmlparser.HTMLNode node> // get > InstancofPerformanceTest 'node' member variable > 45 instanceof #21 <Class org.htmlparser.tags.HTMLTag> // node > instanceof HTMLTag > 48 ifeq 51 // { } > 51 lload 7 // i++ > 53 lconst_1 > 54 ladd > 55 lstore 7 > 57 lload 7 // i < numTimes > 59 lload_3 > 60 lcmp > 61 iflt 41 // repeat > </snip> > > > Method void doGetTypeTest(long[], int, long) > <snip> > 35 lconst_0 // for (i = 0 > 36 lstore 7 > 38 goto 59 > 41 aload_0 // this > 42 getfield #10 <Field org.htmlparser.HTMLNode node> // get > InstancofPerformanceTest 'node' member variable > 45 invokevirtual #23 <Method java.lang.String getType()> // > getType() virtual method call > 48 ldc #24 <String "NODE"> // 'retrieve' String "NODE" from > HTMLNode, but since it's final it's a local copy > 50 if_acmpne 53 // == > 53 lload 7 // i++ > 55 lconst_1 > 56 ladd > 57 lstore 7 > 59 lload 7 // i < numTimes > 61 lload_3 > 62 lcmp > 63 iflt 41 // repeat > </snip> > > A fairer test might be: > > type = node.getType(); > for (... > if (type == "BOGUS") > {} > else if (type == "FAKE") > {} > else if (type == "NODE") > {} > > vs. > > for (... 
> if (node instanceof HTMLFrameTag) > {} > else if (node instanceof HTMLFormTag) > {} > else if (node instanceof HTMLNode) > {} > > Derrick > > > Somik Raha wrote: > > > Hi Folks, > > I was tinkering around with instanceof and under the impression > > that it causes a performance hit, I tried replacing it with a > > polymorphic mechanism - by which HTMLNode has a method getType(), and > > so do the other basic nodes. A match is then attempted like so : > > if (node.getType()==HTMLTag.TYPE) > > > > instead of > > > > if (node instanceof HTMLTag) > > > > I have taken care that getType does not do object creation - it is a > > static object. One would expect the former to be faster. > > But in a performance test (InstanceofPerformanceTest in > > org.htmlparser.tests) - I find the opposite behaviour. > > Here's a graph showing the response of instanceof in blue and > > getType()==HTMLTag.TYPE in pink - > > http://htmlparser.sourceforge.net/design/pics/performance.gif > > > > Does anyone have explanations ? > > > > Regards, > > Somik > > > > > > ------------------------------------------------------- > This SF.NET email is sponsored by: Thawte.com - A 128-bit supercerts will > allow you to extend the highest allowed 128 bit encryption to all your > clients even if they use browsers that are limited to 40 bit encryption. > Get a guide here:http://ads.sourceforge.net/cgi-bin/redirect.pl?thaw0030en > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: Derrick O. <Der...@ro...> - 2003-01-18 14:29:44
|
Somik,

I think there are a couple of reasons. First, your instanceof test always succeeds immediately. The penalty for instanceof comes when it has to walk the inheritance hierarchy (usually all the way up to Object) to determine failure, which would happen often if you were trying to determine what to do with an unknown node type. Second, your getType() involves a virtual method call that would normally not be done more than once. That is, you would typically get the unknown type once and compare it to each of the final types you are aware of, which would effectively move line 45 of the second disassembly below (generated by "javap -c org.htmlparser.tests.InstanceofPerformanceTest") out of the kernel loop and replace it with a "lload 8" probably:

Method void doInstanceofTest(long[], int, long)
<snip>
  35 lconst_0    // for (i = 0
  36 lstore 7
  38 goto 57
  41 aload_0     // this
  42 getfield #10 <Field org.htmlparser.HTMLNode node>    // get InstanceofPerformanceTest 'node' member variable
  45 instanceof #21 <Class org.htmlparser.tags.HTMLTag>    // node instanceof HTMLTag
  48 ifeq 51     // { }
  51 lload 7     // i++
  53 lconst_1
  54 ladd
  55 lstore 7
  57 lload 7     // i < numTimes
  59 lload_3
  60 lcmp
  61 iflt 41     // repeat
</snip>

Method void doGetTypeTest(long[], int, long)
<snip>
  35 lconst_0    // for (i = 0
  36 lstore 7
  38 goto 59
  41 aload_0     // this
  42 getfield #10 <Field org.htmlparser.HTMLNode node>    // get InstanceofPerformanceTest 'node' member variable
  45 invokevirtual #23 <Method java.lang.String getType()>    // getType() virtual method call
  48 ldc #24 <String "NODE">    // 'retrieve' String "NODE" from HTMLNode, but since it's final it's a local copy
  50 if_acmpne 53    // ==
  53 lload 7     // i++
  55 lconst_1
  56 ladd
  57 lstore 7
  59 lload 7     // i < numTimes
  61 lload_3
  62 lcmp
  63 iflt 41     // repeat
</snip>

A fairer test might be:

type = node.getType();
for (...
    if (type == "BOGUS")
    {}
    else if (type == "FAKE")
    {}
    else if (type == "NODE")
    {}

vs.

for (...
    if (node instanceof HTMLFrameTag)
    {}
    else if (node instanceof HTMLFormTag)
    {}
    else if (node instanceof HTMLNode)
    {}

Derrick

Somik Raha wrote:
> Hi Folks,
> I was tinkering around with instanceof and under the impression
> that it causes a performance hit, I tried replacing it with a
> polymorphic mechanism - by which HTMLNode has a method getType(), and
> so do the other basic nodes. A match is then attempted like so :
> if (node.getType()==HTMLTag.TYPE)
>
> instead of
>
> if (node instanceof HTMLTag)
>
> I have taken care that getType does not do object creation - it is a
> static object. One would expect the former to be faster.
> But in a performance test (InstanceofPerformanceTest in
> org.htmlparser.tests) - I find the opposite behaviour.
> Here's a graph showing the response of instanceof in blue and
> getType()==HTMLTag.TYPE in pink -
> http://htmlparser.sourceforge.net/design/pics/performance.gif
>
> Does anyone have explanations ?
>
> Regards,
> Somik |
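The "fairer test" Derrick outlines above can be written out as a small runnable comparison. This is a sketch, not the project's InstanceofPerformanceTest: BenchNode, BenchFormTag, BenchFrameTag and DispatchComparison are hypothetical names, and a hand-rolled timing loop like this one is easily distorted by JIT warm-up, so any numbers it prints are rough at best (a harness such as JMH would be the safer choice today).

```java
// Hypothetical node hierarchy standing in for HTMLNode and two of its subclasses.
class BenchNode {
    static final String TYPE = "NODE";
    String getType() { return TYPE; }
}

class BenchFormTag extends BenchNode {
    static final String TYPE = "FORM";
    @Override String getType() { return TYPE; }
}

class BenchFrameTag extends BenchNode {
    static final String TYPE = "FRAME";
    @Override String getType() { return TYPE; }
}

class DispatchComparison {
    /** Classifies with an instanceof chain; the most general case comes last. */
    static int classifyByInstanceof(BenchNode node) {
        if (node instanceof BenchFrameTag) return 0;
        else if (node instanceof BenchFormTag) return 1;
        else return 2;
    }

    /** Classifies with one virtual call followed by reference comparisons. */
    static int classifyByType(BenchNode node) {
        String type = node.getType(); // fetched once, then compared repeatedly
        // Reference comparison is safe here only because every TYPE is a
        // compile-time constant and therefore interned.
        if (type == BenchFrameTag.TYPE) return 0;
        else if (type == BenchFormTag.TYPE) return 1;
        else return 2;
    }

    /** Returns {instanceofNanos, getTypeNanos, checksum}. */
    static long[] time(BenchNode node, long iterations) {
        long checksum = 0; // keeps the loops from being optimized away entirely
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            checksum += classifyByInstanceof(node);
        }
        long instanceofNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            checksum += classifyByType(node);
        }
        long typeNanos = System.nanoTime() - start;
        return new long[] { instanceofNanos, typeNanos, checksum };
    }
}
```

Passing a plain BenchNode exercises the worst case for both chains, since every earlier branch has to fail before the last one is reached.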
From: Somik R. <so...@ya...> - 2003-01-18 08:33:19
|
Hi Folks, I was tinkering around with instanceof and under the impression that it causes a performance hit, I tried replacing it with a polymorphic mechanism - by which HTMLNode has a method getType(), and so do the other basic nodes. A match is then attempted like so : if (node.getType()==HTMLTag.TYPE) instead of if (node instanceof HTMLTag) I have taken care that getType does not do object creation - it is a static object. One would expect the former to be faster. But in a performance test (InstanceofPerformanceTest in org.htmlparser.tests) - I find the opposite behaviour. Here's a graph showing the response of instanceof in blue and getType()==HTMLTag.TYPE in pink - http://htmlparser.sourceforge.net/design/pics/performance.gif Does anyone have explanations ? Regards, Somik |
From: Somik R. <so...@ya...> - 2003-01-16 17:50:00
|
Hi Dhaval, Writing a scanner is now a trivial task. You can go ahead and do it. You dont need to write scan or evaluate methods anymore. Simply derive from HTMLCompositeTagScanner, provide the getID(), and the factory method for creating your tag. Take a look at the new scanner code (HTMLTitleScanner, HTMLScriptScanner,...) Cheers, Somik --- dha...@or... wrote: > Hi all, > > Is anyone writing or planning to write a <BODY> > tag-scanner pair. I need > it in my work so I thought if anyone else is doing > it then I won't > duplicate the effort otherwise I have to do it. Do > let me know. > > Regards, > > Dhaval Udani > Senior Analyst > M-Line, QPEG > OrbiTech Solutions Ltd. > +91-22-28290019 Extn. 1457 > > > __________________________________________________ Do you Yahoo!? Yahoo! Mail Plus - Powerful. Affordable. Sign up now. http://mailplus.yahoo.com |
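The recipe Somik describes above (derive from the composite scanner, supply getID() and a factory method) is a template-method arrangement, which the sketch below illustrates for a BODY tag. Everything here is an assumption made for illustration: CompositeScannerSketch, BodyTagSketch and BodyScannerSketch are hypothetical, and the real HTMLCompositeTagScanner signatures are not shown in this thread and may well differ.

```java
import java.util.List;

// Hypothetical, much simplified stand-in for HTMLCompositeTagScanner: the
// base class owns the scanning logic and calls back into two small hooks.
abstract class CompositeScannerSketch {
    /** Tag names this scanner is responsible for. */
    abstract String[] getID();

    /** Factory method: builds the tag object from the collected children. */
    abstract Object createTag(String name, List<String> children);

    /** Shared logic that subclasses no longer need to write themselves. */
    final Object scan(String name, List<String> children) {
        for (String id : getID()) {
            if (id.equalsIgnoreCase(name)) {
                return createTag(id, children);
            }
        }
        return null; // not a tag this scanner handles
    }
}

// Hypothetical tag class for <BODY>.
class BodyTagSketch {
    final List<String> children;

    BodyTagSketch(List<String> children) {
        this.children = children;
    }
}

// The only code a new scanner has to provide in this arrangement.
class BodyScannerSketch extends CompositeScannerSketch {
    @Override
    String[] getID() {
        return new String[] { "BODY" };
    }

    @Override
    Object createTag(String name, List<String> children) {
        return new BodyTagSketch(children);
    }
}
```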
From: <dha...@or...> - 2003-01-16 17:06:07
|
Hi all, Is anyone writing or planning to write a <BODY> tag-scanner pair. I need it in my work so I thought if anyone else is doing it then I won't duplicate the effort otherwise I have to do it. Do let me know. Regards, Dhaval Udani Senior Analyst M-Line, QPEG OrbiTech Solutions Ltd. +91-22-28290019 Extn. 1457 |
From: Somik R. <so...@ya...> - 2003-01-14 23:31:35
|
Hi, It seems to be a printing bug in HTMLScriptTag - the parser works fine (if you use it in your application). Regards, Somik --- agente007 <e-a...@ex...> wrote: > > The integration 1.3 of HTMLParser don't work. > (The 1.2 work!) > > Appears: > > E:\DOCTORADO\htmlparser\bin>java -jar > ..\lib\htmlparser.jar http://www.yahoo.com > > HTMLParser v1.3 (Integration Build Jan 12, 2003) > Parsing http://www.yahoo.com > Begin Tag : html; begins at : 0; ends at : 5 > Begin Tag : head; begins at : 0; ends at : 5 > TITLE: Yahoo! > java.lang.NullPointerException > at > org.htmlparser.tags.HTMLScriptTag.toString(HTMLScriptTag.java:97) > at > org.htmlparser.HTMLNode.print(HTMLNode.java:91) > at > org.htmlparser.HTMLParser.parse(HTMLParser.java:974) > at > org.htmlparser.HTMLParser.main(HTMLParser.java:1086) > > > > What happend? > > Regards > > JJ > > _______________________________________________ > Join Excite! - http://www.excite.com > The most personalized portal on the Web! > |
From: agente007 <e-a...@ex...> - 2003-01-14 23:27:11
|
The 1.3 integration build of HTMLParser doesn't work. (1.2 works!)

This appears:

E:\DOCTORADO\htmlparser\bin>java -jar ..\lib\htmlparser.jar http://www.yahoo.com

HTMLParser v1.3 (Integration Build Jan 12, 2003)
Parsing http://www.yahoo.com
Begin Tag : html; begins at : 0; ends at : 5
Begin Tag : head; begins at : 0; ends at : 5
TITLE: Yahoo!
java.lang.NullPointerException
    at org.htmlparser.tags.HTMLScriptTag.toString(HTMLScriptTag.java:97)
    at org.htmlparser.HTMLNode.print(HTMLNode.java:91)
    at org.htmlparser.HTMLParser.parse(HTMLParser.java:974)
    at org.htmlparser.HTMLParser.main(HTMLParser.java:1086)

What happened?

Regards
JJ |
From: Joshua K. <jo...@in...> - 2003-01-13 15:27:36
|
Folks, Of late, Somik and I have been using the HTML Parser for various tasks related to XML documents, not HTML. We have found that the parser's scanners make our life easier than it would be if we were to use Xerces. Yet I find it awkward for our code to interact with classes like HTMLNode when we are in fact working with XML nodes. I wonder if it makes sense to make the parser more generic so that it is clear that it is useful for both HTML and XML. I'd welcome any feedback on this idea. best regards jk |
From: Somik R. <so...@ya...> - 2003-01-13 04:50:15
|
Hi Folks, This week's integration release is out. This release has significant contributions from Derrick Oswald and Josh Kerievsky. Derrick is building a nice UI for the parser - and making tons of improvements. Thanks to Josh's insight, we have done some major refactorings on the scanners - resulting in a massive drop in code duplication. Here are some statistics - the scanners package in the last release had 1693 lines of code. In the current release, this has dropped to 1300 lines of code. We have a new class HTMLCompositeTagScanner which does the hard work of picking up child tags. Most scanners use this code. HTMLTagScanner too does some useful work - and from this release, new scanners don't need to override evaluate() or scan(). Take a look at the refactored scanner code and you might be surprised by its size and simplicity.

Here's the change log :

Integration build 1.3 - 20030112
--------------------------------
[1] Assume charset is correct for JVM's without Charset class to check it
[2] Beanize the parser
[3] Switch to swingui junit runner by default
[4] Half baked beans
[5] Fix javadoc warnings in JDK 1.4
[6] Added StringFindingVisitor + test code + new visitors packages
[7] Fixed bug 659723, but HTMLStringNode is not thread-safe anymore.
[8] JDK 1.2 compilability
[9] Modified HTMLEnumeration interface (made less verbose)
[10] Added HTMLCompositeTagScanner
[11] Refactored following scanners to use HTMLCompositeTagScanner :
     (i) HTMLStyleScanner
     (ii) HTMLSelectScanner
     (iii) HTMLFrameSetScanner
     (iv) HTMLTitleScanner
     (v) HTMLTextAreaScanner
     (vi) HTMLScriptScanner
     (vii) HTMLFrameSetScanner
[12] Made StringNode the last parse attempt, so now Reader tries in this order: remark, tag, endtag, string (this will return more HTMLStringNode objects than it did before).
[13] Improve speed by performing tag/string triage based on '<' as next character.
[14] Refactored HTMLTagScanner. The following scanners use refactored code:
     (i) HTMLBaseHREFScanner
     (ii) HTMLDoctypeScanner
     (iii) HTMLFrameScanner
     (iv) HTMLJspScanner
     (v) HTMLMetaTagScanner

Regards, Somik |
From: Derrick O. <Der...@ro...> - 2003-01-09 04:11:55
|
Kaarle, I've removed the reference to getPath() in HTMLLinkProcessor and removed the BeanInfo classes. It compiles under JDK 1.2.2 and 1.3.0 on NT and JDK 1.4.1_01 on Linux. Unit tests are all clear except for 21 failures when using JDK 1.2.2 on NT. These are mostly due to attribute rearrangement by toHTML() in various tags and URL resolution problems in HTMLLinkProcessor. Derrick Kaarle Kaila wrote: > As I looked at the code in CVS yesterday it did not compile OK with > JDK 1.3 > I needed to do it with JDK 1.4.1 > > Kaarle > |
From: Kaarle K. <kaa...@ik...> - 2003-01-08 19:33:47
|
At 09:41 8.1.2003 -0800, you wrote: >Hi Dhaval, Holger, > > > I agree with Holger out here. Our base Java version > > to support should be > > JDK 1.2. > >I agree about JDK 1.2. I think its high time to move >on. Some folks were using this with JDK 1.1 ages back >in an applet. As I looked at the code in CVS yesterday it did not compile OK with JDK 1.3 I needed to do it with JDK 1.4.1 Kaarle >But about the Collections issue - did you take a look >at HTMLVector ? Having our own vector class will >enable us to avoid the performance hit in class >casting. > >Regards, >Somik > >__________________________________________________ >Do you Yahoo!? >Yahoo! Mail Plus - Powerful. Affordable. Sign up now. >http://mailplus.yahoo.com > > >------------------------------------------------------- >This SF.NET email is sponsored by: >SourceForge Enterprise Edition + IBM + LinuxWorld = Something 2 See! >http://www.vasoftware.com >_______________________________________________ >Htmlparser-developer mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer --------------------------------------------- Kaarle Kaila http://www.iki.fi/kaila mailto:kaa...@ik... tel: +358 50 3725844 |
From: Somik R. <so...@ya...> - 2003-01-08 17:41:32
|
Hi Dhaval, Holger, > I agree with Holger out here. Our base Java version > to support should be > JDK 1.2. I agree about JDK 1.2. I think it's high time to move on. Some folks were using this with JDK 1.1 ages back in an applet. But about the Collections issue - did you take a look at HTMLVector ? Having our own vector class will enable us to avoid the performance hit in class casting. Regards, Somik |
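The idea behind a project-specific vector, as raised in the message above, is that a container typed to the node class hands back nodes directly, so callers need no cast. The sketch below shows the principle only: SketchNode and NodeVectorSketch are hypothetical names, and this is not the HTMLVector class from the source tree. On current Java a generic List gives the same convenience in source code, although the compiler still inserts a cast in the bytecode.

```java
// Hypothetical node type standing in for HTMLNode.
class SketchNode {
    final String text;

    SketchNode(String text) {
        this.text = text;
    }
}

// A growable array typed to SketchNode, so elementAt() needs no cast and,
// unlike java.util.Vector, no method is synchronized.
class NodeVectorSketch {
    private SketchNode[] items = new SketchNode[10];
    private int size = 0;

    /** Appends a node, doubling the backing array when it is full. */
    void add(SketchNode node) {
        if (size == items.length) {
            SketchNode[] bigger = new SketchNode[items.length * 2];
            System.arraycopy(items, 0, bigger, 0, size);
            items = bigger;
        }
        items[size++] = node;
    }

    /** Returns the node at the given index, already typed. */
    SketchNode elementAt(int index) {
        if (index < 0 || index >= size) {
            throw new ArrayIndexOutOfBoundsException(index);
        }
        return items[index];
    }

    int size() {
        return size;
    }
}
```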
From: <dha...@or...> - 2003-01-08 14:00:36
|
I agree with Holger out here. Our base Java version to support should be JDK 1.2. -----Original Message----- From: Holger.Stenzhorn [mailto:Hol...@xt...] Sent: Wednesday, January 08, 2003 7:23 PM To: htmlparser-developer Subject: AW: [Htmlparser-developer] Latest code Problems Hi! ...stupid question: Do we really need to still support JDK 1.1? I know that relying on the new features of 1.4 would blow many current apps, so we should not do this for the time being. But the integration of e.g. the Collections framework introduced in JDK 1.2 (which is a long time ago) into HTMLParser should be possible and done. So to add my opinion to the poll: I would like to drop support for JDK 1.1. Holger -----Original Message----- From: Derrick Oswald [mailto:Der...@ro...] Sent: Wednesday, January 8, 2003 14:11 To: htm...@li... Subject: Re: [Htmlparser-developer] Latest code Problems <snip> > > [2] HTMLLinkBeanInfo and HTMLTextBeanInfo dont compile. Are you > relying on something in JDK 1.4 ? Bytway, htmlparser is JDK 1.1 > compliant. I am not sure if that should change, but then again, it > really depends on the users of the parser. 1.1 compatibility is news to me. This has probably been broken for a while (see ArrayList in ChainedException and Iterator in Translate). Version 1.1 is usually mandated by old browser JVM support, or legacy (unsupported) operating systems. I don't think it's an issue here since it's not running as an applet. The use of the Vector class (required under JDK 1.1) is a bit of a performance hit since the class is synchronized. For now just delete those 'BeanInfo' files and it should compile OK, but I'll see if I can fix it for you, and then a decision can be made about continuing 1.1 support. </snip> |
From: Holger S. <Hol...@xt...> - 2003-01-08 13:53:01
|
Hi! ...stupid question: Do we really need to still support JDK 1.1? I know that relying on the new features of 1.4 would blow many current apps, so we should not do this for the time being. But the integration of e.g. the Collections framework introduced in JDK 1.2 (which is a long time ago) into HTMLParser should be possible and done. So to add my opinion to the poll: I would like to drop support for JDK 1.1. Holger -----Original Message----- From: Derrick Oswald [mailto:Der...@ro...] Sent: Wednesday, January 8, 2003 14:11 To: htm...@li... Subject: Re: [Htmlparser-developer] Latest code Problems <snip> > > [2] HTMLLinkBeanInfo and HTMLTextBeanInfo dont compile. Are you > relying on something in JDK 1.4 ? Bytway, htmlparser is JDK 1.1 > compliant. I am not sure if that should change, but then again, it > really depends on the users of the parser. 1.1 compatibility is news to me. This has probably been broken for a while (see ArrayList in ChainedException and Iterator in Translate). Version 1.1 is usually mandated by old browser JVM support, or legacy (unsupported) operating systems. I don't think it's an issue here since it's not running as an applet. The use of the Vector class (required under JDK 1.1) is a bit of a performance hit since the class is synchronized. For now just delete those 'BeanInfo' files and it should compile OK, but I'll see if I can fix it for you, and then a decision can be made about continuing 1.1 support. </snip> |
From: Derrick O. <Der...@ro...> - 2003-01-08 13:06:11
|
Somik Raha wrote: > Hi Derrick, Everyone, > > I checked out the latest code, and I found the following problems : > [1] testHTTPCharset and testHTMLCharset still fail. This time the > failures are : Could not open http://www.ibm.co.jp (I tried changing > it to www.ibm.com/jp <http://www.ibm.com/jp> but that didnt help. I'm baffled, it runs through for me (all 294 tests, thank you). Is anyone else having trouble running these tests? > > [2] HTMLLinkBeanInfo and HTMLTextBeanInfo dont compile. Are you > relying on something in JDK 1.4 ? Bytway, htmlparser is JDK 1.1 > compliant. I am not sure if that should change, but then again, it > really depends on the users of the parser. 1.1 compatibility is news to me. This has probably been broken for a while (see ArrayList in ChainedException and Iterator in Translate). Version 1.1 is usually mandated by old browser JVM support, or legacy (unsupported) operating systems. I don't think it's an issue here since it's not running as an applet. The use of the Vector class (required under JDK 1.1) is a bit of a performance hit since the class is synchronized. For now just delete those 'BeanInfo' files and it should compile OK, but I'll see if I can fix it for you, and then a decision can be made about continuing 1.1 support. > > [3] I am wondering if org.htmlparser.beans should exist outside the > main htmlparser module - in a module of its own within the htmlparser > project. That way the parser workspace could be focussed on the > parsing functionality. What do you think ? Hmm. There's a lot of overhead in adding another configuration item. Any other developers have opinions on this? > > [4] I've fixed the bug that was being caught by testExtractLinkBug2 - > based on Sam's suggestion, but I think more work needs to be done. > Currently, HTMLStringNode has turned thread-unsafe due to its > customizability. Since it is static, thread-safety goes out the > window. 
The obvious refactoring would be to have non-static class(es) > for the basic automata - which would have a 1:1 mapping with the > parser instance. > > Regards, > Somik > > |
From: Holger S. <Hol...@xt...> - 2003-01-08 10:36:47
|
Hi! First of all: Thanx for all your comments! Second, my comments to your comments :-) - Logging: I have been using the Jakarta Log4J and also the Commons Logging for some time now and my experience with that was very good so far. It is easy and intuitive to use and also quite powerful. But the point Claude is making in his mail about depending on other projects is also true, so his proposal of a feedback utility class is good in my view and would provide a nice facade to the outside world. Question: Java 1.4, as you all know, actually provides a built-in logging facility. HTMLParser is targeted also at Java version 1.2 and 1.3, so the usage of this built-in facility is prohibitive, right? - Naming Convention: I actually wrote the same thing about get/setURL last week to Somik. I would expect the getURL() method to return a URL object just as the standard Java classes do (e.g. java.net.URI, java.net.HttpURLConnection, ...). So either do split up the functions as you propose or change the function altogether to let it return a URL object that can encapsulate both a filename and a URL string (and parse that one for correctness directly when generating the object). - Bean Pattern and Parse Methods: I actually thought of using that pattern too since I use it a lot in other code too. The reason why I propose the parse(XXX) methods is conformity: All standard XML parsers like javax.xml.parsers.DocumentBuilder/SAXParser or org.jdom.input.DOMBuilder/SAXBuilder use the same or very similar API usage patterns. In this way users that deploy our HTMLParser and some XML parser in their work (like I do for example) would have a very homogenous way of accessing the APIs. What is also important to note here: The parse method would be only a facade to the users of the HTMLParser. Internally I would also apply the bean pattern that you propose. So I think there would be not much code duplication at all, if any.
Well, if I look at your code snippet, then there is not much difference to my API proposal, actually only one line would change:

parser = new HTMLParser();
parser.registerScanners();
while (<more>)
{
    parser.setCharset("<encoding>");
    parser.setXXXX(<whatever>);
    parser.parse(<what>);
    enumeration = parser.getResultHTMLEnumeration();
}

Still one more addition to the above: Just planting in the parse() methods in the HTMLParser code as it is right now would be indeed a misnomer. That is why I think a refactoring should take place. Well, this refactoring would be a good thing to do anyways whether you add the parse() methods or not.

- HTMLVector and Visitors (to Somik): I did already take a brief look at. I will dig deeper into it as soon as possible. Perhaps I can readily trash some of my ideas if I looked more carefully at that stuff. :-)

But still :What do you think about that?

Holger

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Wednesday, January 8, 2003 03:03
To: htm...@li...
Subject: Re: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API

Holger Stenzhorn wrote:
<snip>
>According to my idea you would have to do the following:
>First you create one HTMLParser object by calling the empty constructor:
>- HTMLParser()
>(This single HTMLParser object can be reused in consecutive parsing actions.)
>
I believe you can do this now (see my recent submission 'Beanize the parser', described below).

>Third, you can add one or more (instead of only one as right now) feedbacks by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).
>
The feedback object was under consideration for replacement by the generic logging facade provided by Jakarta, http://jakarta.apache.org/commons/logging.html which does allow for multiple 'loggers'.

>
>Then you would use one of the following parse methods:
>- void parse(java.lang.String string)
>- void parse(java.io.File file)
>- void parse(java.io.InputStream inputStream)
>- void parse(java.io.Reader reader)
>- void parse(java.net.URL url)
>- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)
>(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)
>
The HTMLParser setXXX() methods, i.e. setURL(), setConnection() and setReader(), provide the facility you want, so I would suggest using this same 'bean' pattern instead of the misnomer parse(), because it really isn't parsed till later.

Following this naming convention, the existing setURL() which handles file names as well as URLs should probably be broken up into two methods, setFileName() and setURLString(), but it's very handy to have a single method that understands both for command line interpretation. Resist the temptation to overload it [as in setURL(URL url)], or you'll break a very useful bean pattern. I might suggest the current setURL() be renamed to setSource().

The parse(String) method you mention presumably takes HTML text and wraps it in a reader like HTMLParserTestCase.createParser() does. This should be called setHTML(). So we have:

setSource("http://..." or "/usr/local")
setURLString("http://...")
setFileName("/usr/local/...")
setHTML("<html><head>...")
setFile(new File("/usr/local/.."))
setInputStream(new BufferedInputStream(<stream>))
setReader(new FileReader("/usr/local/.."))
setURL(new URL("http://..."))
setConnection(url.openConnection())

I would suggest that all these channel through a common initialization method to avoid repeating the same code over and over and to ensure correctly resetting all necessary things.

For reuse, all of these methods would need to set field resourceLocn somehow so that a stale source is not used in warning messages, so a setResourceLocation() that just sets the field is probably needed. And most would need to set the encoding in order to correctly convert raw bytes into characters. Since setEncoding() resets the current reader or connection to handle a charset directive in the HTML header, a setCharset() method that just sets the character encoding probably is needed (or vice versa). That would mean the typical re-usage would then be:

parser = new HTMLParser();
parser.registerScanners();
while (<more>)
{
    parser.setResourceLocation("<where>");
    parser.setCharset("<encoding>");
    parser.setXXXX(<whatever>);
    enumeration = parser.getResultHTMLEnumeration();
}

However, since the HTMLParser object is fairly light weight, it may be better to just create another one whenever it's needed and if you're really concerned about memory churn, just move the scanners into place:

parser = new HTMLParser();
parser.registerScanners();
scanners = parser.getScanners();
while (<more>)
{
    parser = new HTMLParser();
    parser.setScanners(scanners);
    ... |
From: Somik R. <so...@ya...> - 2003-01-08 05:44:53
|
Hi Holger,

>Finally you would get the results with:
>- java.util.List getResultList() that returns a List containing HTMLNode objects
>Returning simply a List is good in my opinion since this integrates the HTMLParser nicely into the standard Java collections
>framework. It also makes it future-proof for the later applicability of Generics found in Java 1.5.

This is a good suggestion. But the drawback of this approach is that we have to keep casting to the objects we want. I feel there is a significant performance improvement to be had by creating our own "list" object. You will find HTMLVector already in the source, but not yet integrated with the code (that requires a bit of work), which addresses this issue. Would you like to take that up?

>The solution for retrieving results with getResultXXX() methods would also allow to simply add some more and different result
>retriever methods, e.g.
>- org.htmlparser.util.HTMLEnumeration getResultHTMLEnumeration() or
>- org.htmlparser.util.HTMLTree getResultHTMLTree() that would retrieve an (to be programmed) HTMLTree (similar to a w3c
>Document)

Getting the results from the parser is a very important area, and we've been adding some visitors which we've found very useful. HTMLTree sounds really interesting. It would be nice if you could also check out the existing visitors.

Regards,
Somik

********************************************
Somik Raha
Extreme Programmer and Coach
Industrial Logic, Inc.
so...@in...
http://industriallogic.com
Voice : 510-540-8336
Fax : 510-540-8936
********************************************
Periodic reassessment means looking at things which are taken for granted, things which seem beyond doubt. Periodic reassessment means challenging all assumptions. It is not a matter of reassessing something because there is a need to reassess it; there may be no need at all. It is a matter of reassessing something simply because it is there and has not been assessed for a long time. It is a deliberate and quite unjustified attempt to look at things in a new way.
--- Edward De Bono in Lateral Thinking, Chapter 5, The Use of Lateral Thinking |
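The trade-off raised in the reply above, casting every element read from a plain List versus using a purpose-built typed container, can be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchNode`, `SketchNodeList` and `CastingSketch` are hypothetical names, not the real HTMLNode or HTMLVector classes, whose APIs are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node; not the real HTMLNode class.
class SketchNode {
    private final String text;

    SketchNode(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

// Hypothetical typed container: callers get SketchNode back without casting.
class SketchNodeList {
    private SketchNode[] items = new SketchNode[4];
    private int size = 0;

    void add(SketchNode node) {
        if (size == items.length) {
            // Grow the backing array when it is full.
            SketchNode[] bigger = new SketchNode[items.length * 2];
            System.arraycopy(items, 0, bigger, 0, size);
            items = bigger;
        }
        items[size++] = node;
    }

    SketchNode elementAt(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return items[index];
    }

    int size() {
        return size;
    }
}

class CastingSketch {
    // Pre-generics style: a raw List hands back Object, so each read needs a cast.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String firstTextFromRawList() {
        List raw = new ArrayList();
        raw.add(new SketchNode("hello"));
        SketchNode node = (SketchNode) raw.get(0);
        return node.getText();
    }

    // Typed container: the element type is part of the method signature.
    static String firstTextFromTypedList() {
        SketchNodeList typed = new SketchNodeList();
        typed.add(new SketchNode("hello"));
        return typed.elementAt(0).getText();
    }

    public static void main(String[] args) {
        System.out.println(firstTextFromRawList());
        System.out.println(firstTextFromTypedList());
    }
}
```

Whether the typed container is measurably faster is not something this sketch demonstrates; it only shows where the casts sit. With the generics that arrived in Java 1.5, a `List<SketchNode>` removes the cast from the source code as well.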
From: Somik R. <so...@ya...> - 2003-01-08 05:29:33
|
Hi Derrick, Everyone,

I checked out the latest code and found the following problems:

[1] testHTTPCharset and testHTMLCharset still fail. This time the failure is: Could not open http://www.ibm.co.jp (I tried changing it to www.ibm.com/jp, but that didn't help).

[2] HTMLLinkBeanInfo and HTMLTextBeanInfo don't compile. Are you relying on something in JDK 1.4? By the way, htmlparser is JDK 1.1 compliant. I am not sure if that should change, but then again, it really depends on the users of the parser.

[3] I am wondering if org.htmlparser.beans should exist outside the main htmlparser module, in a module of its own within the htmlparser project. That way the parser workspace could be focused on the parsing functionality. What do you think?

[4] I've fixed the bug that was being caught by testExtractLinkBug2, based on Sam's suggestion, but I think more work needs to be done. Currently, HTMLStringNode has turned thread-unsafe due to its customizability. Since that customization is static, thread-safety goes out the window. The obvious refactoring would be to have non-static class(es) for the basic automata, which would have a 1:1 mapping with the parser instance.

Regards,
Somik |
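The refactoring described in point [4] above, moving customization out of static state and into objects owned by one parser instance, might look something like the sketch below. All class names here are hypothetical and chosen for illustration; none of them is the real HTMLStringNode or HTMLParser.

```java
// Hypothetical classes for illustration only; none of these exist in htmlparser.

// Variant 1: the setting lives in a static field, so it is shared by every
// caller and every thread in the JVM.
class StaticConfigNode {
    static boolean trimText = false;

    static String render(String text) {
        return trimText ? text.trim() : text;
    }
}

// Variant 2: the setting lives in an object owned by exactly one parser.
class NodeSettings {
    private final boolean trimText;

    NodeSettings(boolean trimText) {
        this.trimText = trimText;
    }

    String render(String text) {
        return trimText ? text.trim() : text;
    }
}

class SettingsOwningParser {
    private final NodeSettings settings; // 1:1 with this parser instance

    SettingsOwningParser(boolean trimText) {
        this.settings = new NodeSettings(trimText);
    }

    String parse(String text) {
        return settings.render(text);
    }
}

class StaticStateSketch {
    public static void main(String[] args) {
        // Two parsers with different settings do not affect each other.
        SettingsOwningParser trimming = new SettingsOwningParser(true);
        SettingsOwningParser verbatim = new SettingsOwningParser(false);
        System.out.println("[" + trimming.parse("  text  ") + "]");
        System.out.println("[" + verbatim.parse("  text  ") + "]");

        // With the static variant, a change made for one caller is seen by all.
        StaticConfigNode.trimText = true;
        System.out.println("[" + StaticConfigNode.render("  text  ") + "]");
    }
}
```

Because `NodeSettings` is immutable and reachable only through its owning parser, two parsers running on different threads cannot change each other's behaviour, which is the property the static variant loses.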
From: Somik R. <so...@ya...> - 2003-01-08 05:00:04
|
Hi Folks,

We had a new developer join us last week. I am sorry, I ought to have written earlier, but was very busy. Here's a brief bio of Holger in his own words.

"...very briefly about my programming background: I have been developing software in Java since the beginning of 1997 in various positions, first as a research assistant while being a student of computational linguistics at the Saarland University and at the DFKI (the German Research Center for Artificial Intelligence), then as a visitor at the CSLI (Center for the Study of Language and Information) at Stanford University and finally as software engineer at XtraMind Technologies. As my diploma thesis I implemented a system for generating natural language called XtraGen that is based on Java and XML-technologies. At my work positions I have been involved (among others) in two projects that combined artificial intelligence methods for information retrieval, extraction and presentation with web technologies (Mietta and Mietta-II)."

Holger - welcome to the dev team of htmlparser.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-01-08 04:57:24
|
> I've dropped changes to use the junit swing runner by default.
> I hope that's OK.
> If you want the old behaviour (awt runner) use:
> java org.htmlparser.tests.AllTests -awt

No problem. I have stopped using the test UI as Eclipse comes with pretty neat JUnit support. AllTests is also not required there, since Eclipse can pick up all tests from directories recursively. But of course, we'd have to keep AllTests as not everyone uses Eclipse.

Regards
Somik |
From: Claude D. <CD...@ar...> - 2003-01-08 04:25:43
|
If you replace the Feedback mechanism with the logging facade provided by Jakarta, you will be coupled to a third party library. It would be better if you simply provided a feedback utility class that could be used to redirect output to the logging API instead. Be wary of over-engineering. It is a common pitfall of most developers and leads to unnecessary complexity and tight coupling rather than library independence and good APIs.

-----Original Message-----
From: Derrick Oswald
Sent: Tue 1/7/2003 6:03 PM
Subject: Re: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API |
From: Derrick O. <Der...@ro...> - 2003-01-08 01:58:32
|
Holger Stenzhorn wrote:

<snip>

>According to my idea you would have to do the following:
>First you create one HTMLParser object by calling the empty constructor:
>- HTMLParser()
>(This single HTMLParser object can be reused in consecutive parsing actions.)
>

I believe you can do this now (see my recent submission 'Beanize the parser', described below).

>Third, you can add one or more (instead of only one as right now) feedbacks by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).
>

The feedback object was under consideration for replacement by the generic logging facade provided by Jakarta, http://jakarta.apache.org/commons/logging.html, which does allow for multiple 'loggers'.

>
>Then you would use one of the following parse methods:
>- void parse(java.lang.String string)
>- void parse(java.io.File file)
>- void parse(java.io.InputStream inputStream)
>- void parse(java.io.Reader reader)
>- void parse(java.net.URL url)
>- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)
>(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)
>

The HTMLParser setXXX() methods, i.e. setURL(), setConnection() and setReader(), provide the facility you want, so I would suggest using this same 'bean' pattern instead of the misnomer parse(), because it really isn't parsed till later.

Following this naming convention, the existing setURL() which handles file names as well as URLs should probably be broken up into two methods, setFileName() and setURLString(), but it's very handy to have a single method that understands both for command line interpretation. Resist the temptation to overload it [as in setURL(URL url)], or you'll break a very useful bean pattern. I might suggest the current setURL() be renamed to setSource().

The parse(String) method you mention presumably takes HTML text and wraps it in a reader like HTMLParserTestCase.createParser() does. This should be called setHTML().

So we have:
    setSource("http://..." or "/usr/local")
    setURLString("http://...")
    setFileName("/usr/local/...")
    setHTML("<html><head>...")
    setFile(new File("/usr/local/.."))
    setInputStream(new BufferedInputStream())
    setReader(new FileReader("/usr/local/.."))
    setURL(new URL("http://..."))
    setConnection(url.getConnection())

I would suggest that all these channel through a common initialization method to avoid repeating the same code over and over and to ensure correctly resetting all necessary things.

For reuse, all of these methods would need to set the field resourceLocn somehow so that a stale source is not used in warning messages, so a setResourceLocation() that just sets the field is probably needed. And most would need to set the encoding in order to correctly convert raw bytes into characters. Since setEncoding() resets the current reader or connection to handle a charset directive in the HTML header, a setCharset() method that just sets the character encoding probably is needed (or vice versa). That would mean the typical re-usage would then be:

parser = new HTMLParser();
parser.registerScanners();
while (<more>)
{
    parser.setResourceLocation("<where>");
    parser.setCharset("<encoding>");
    parser.setXXXX(<whatever>);
    enumeration = parser.getResultHTMLEnumeration();
}

However, since the HTMLParser object is fairly lightweight, it may be better to just create another one whenever it's needed and if you're really concerned about memory churn, just move the scanners into place:

parser = new HTMLParser();
parser.registerScanners();
scanners = parser.getScanners();
while (<more>)
{
    parser = new HTMLParser();
    parser.setScanners(scanners);
    ... |
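The suggestion above that every setXXX() method should channel through one common initialization method could be sketched along these lines. This is a minimal sketch, not the HTMLParser implementation: the class `SourceBeanSketch`, its `init()` helper, the placeholder location strings and the `readAll()` method are all assumptions made for illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical bean for illustration; not the real HTMLParser class.
class SourceBeanSketch {
    private Reader reader;
    private String resourceLocation;
    private int charactersRead; // stands in for per-document state

    // Each setter only adapts its argument, then delegates to init().
    public void setHTML(String html) {
        init(new StringReader(html), "<in-memory HTML>");
    }

    public void setReader(Reader newReader) {
        init(newReader, "<reader>");
    }

    public void setFile(File file) {
        try {
            init(new FileReader(file), file.getPath());
        } catch (FileNotFoundException e) {
            throw new IllegalArgumentException("cannot open " + file.getPath(), e);
        }
    }

    // Single place where all per-document state is reset, so a reused
    // bean never reports a stale location or a stale counter.
    private void init(Reader newReader, String location) {
        this.reader = newReader;
        this.resourceLocation = location;
        this.charactersRead = 0;
    }

    public String getResourceLocation() {
        return resourceLocation;
    }

    public int getCharactersRead() {
        return charactersRead;
    }

    // Placeholder for the later parsing step: drains the current source.
    public String readAll() {
        StringBuilder text = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
                charactersRead++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        SourceBeanSketch bean = new SourceBeanSketch();
        bean.setHTML("<html><head></head></html>");
        System.out.println(bean.getResourceLocation() + ": " + bean.readAll());
        bean.setReader(new StringReader("<p>second document</p>"));
        System.out.println(bean.getResourceLocation() + ": " + bean.readAll());
    }
}
```

Keeping one setter per argument type with a distinct name, rather than overloading a single name, matches the bean-pattern advice in the message; the private `init()` is what makes reuse safe.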
From: Holger S. <Hol...@xt...> - 2003-01-07 13:12:53
|
Hi everybody!

I am the new kid on the developer block because I joined the HTMLParser just last week. And now, as my first deed I would like to propose some changes to the API in the main HTMLParser class. Since these changes are quite far-reaching in my opinion, I kindly ask you for some comments on these propositions.

First of all, the current status quo of the HTMLParser is:
As the first thing you have to create a new HTMLParser each time you want to parse from some new HTML source, be it a file, a URL, etc. Then you register the scanners. And then you retrieve the HTMLNodes by calling the elements() method. If you want to parse another document the whole procedure starts from the beginning.

According to my idea you would have to do the following:
First you create one HTMLParser object by calling the empty constructor:
- HTMLParser()
(This single HTMLParser object can be reused in consecutive parsing actions.)

Second, you register the scanners the same way as it is done now by calling registerScanners().

Third, you can add one or more (instead of only one as right now) feedbacks by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).

Then you would use one of the following parse methods:
- void parse(java.lang.String string)
- void parse(java.io.File file)
- void parse(java.io.InputStream inputStream)
- void parse(java.io.Reader reader)
- void parse(java.net.URL url)
- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)
(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)

Finally you would get the results with:
- java.util.List getResultList() that returns a List containing HTMLNode objects
Returning simply a List is good in my opinion since this integrates the HTMLParser nicely into the standard Java collections framework. It also makes it future-proof for the later applicability of Generics found in Java 1.5.

The solution for retrieving results with getResultXXX() methods would also allow to simply add some more and different result retriever methods, e.g.
- org.htmlparser.util.HTMLEnumeration getResultHTMLEnumeration() or
- org.htmlparser.util.HTMLTree getResultHTMLTree() that would retrieve an (to be programmed) HTMLTree (similar to a w3c Document)
etc.

The implementation that would turn all of the above into real code can be done in two distinct, consecutive steps:
- First step: Add the methods to the existing HTMLParser class and fit them into the class by changing the rest of the class only minimally and (most importantly) only internally. This could be done fairly quickly.
- Second step: Refactor the HTMLParser, but keep the existing interfaces to the outside world (e.g. the existing constructors) and deprecate them.

Bye and thanks in advance for your comments,
Holger

--------------------------------------------------------
Holger Stenzhorn
Software Engineer
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 302-5100
Fax: +49 (681) 302-5109
ho...@xt...
www.xtramind.com
-------------------------------------------------------- |
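The proposal above, one reusable parser object whose results can be fetched through several getResultXXX() methods, could be sketched as follows. This is an illustration under stated assumptions rather than the proposed HTMLParser API: `ReusableParserSketch` is a hypothetical class, the "parsing" is a whitespace split standing in for real work, and the code uses generics, which postdate the JDK 1.1 target discussed in the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical class for illustration; not the real HTMLParser API.
class ReusableParserSketch {
    private final List<String> nodes = new ArrayList<String>();

    // Stand-in for real parsing: splits the input on whitespace.
    // Results of any earlier parse are discarded so the object can be reused.
    void parse(String text) {
        nodes.clear();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        for (String token : trimmed.split("\\s+")) {
            nodes.add(token);
        }
    }

    // Returns a read-only snapshot, so callers cannot disturb parser state.
    List<String> getResultList() {
        return Collections.unmodifiableList(new ArrayList<String>(nodes));
    }

    // The same results through the older Enumeration interface.
    Enumeration<String> getResultEnumeration() {
        return Collections.enumeration(new ArrayList<String>(nodes));
    }

    public static void main(String[] args) {
        ReusableParserSketch parser = new ReusableParserSketch();
        parser.parse("<html> <body>");
        System.out.println(parser.getResultList());
        parser.parse("<p>"); // reuse: earlier results are replaced
        System.out.println(parser.getResultList());
    }
}
```

Both result methods read from the same internal storage, so adding another view later (for example a tree) would not change how parsing itself works.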
From: agente007 <e-a...@ex...> - 2003-01-05 22:13:56
|
Hello.

I cannot download the file htmlparser1_3_20021228.zip from https://sourceforge.net/project/showfiles.php?group_id=24399&release_id=129477

When I click the link, the address www.lop.com appears. What is happening?

Regards,
JuanJo |