htmlparser-developer Mailing List for HTML Parser (Page 16)
Brought to you by:
derrickoswald
From: Derrick O. <Der...@ro...> - 2003-04-09 21:59:23
|
Ling,

The StringExtractor gets every StringNode. If you don't want the comments (script), try this:

    import org.htmlparser.beans.StringBean;

    public class TryBeans
    {
        public static void main (String[] args)
        {
            StringBean sb = new StringBean ();
            sb.setURL ("http://www.cnnfn.com/2001/11/29/companies/enron/");
            System.out.println (sb.getStrings ());
        }
    }

See http://htmlparser.sourceforge.net/docs/index.php/JavaBeans for more details.

Derrick

Mr LING MA wrote:
>When I try to use htmlparser stringextractor on page:
>
>http://www.cnnfn.com/2001/11/29/companies/enron/
>
>the comment tags below is also outputted. Can this
>be an error of style tag or comment tag?
>
>Thanks
>
>Ling Ma
>
>OUTPUT after extracted tag:
><!--
>adSetTarget('_top');
>htmlAdWH( (new
><snip>
|
From: Mr L. MA <law...@ya...> - 2003-04-09 20:35:07
|
When I try to use htmlparser stringextractor on page:

http://www.cnnfn.com/2001/11/29/companies/enron/

the comment tags below are also output. Can this be an error of the style tag or the comment tag?

Thanks

Ling Ma

OUTPUT after extracted tag:
<!--
adSetTarget('_top');
htmlAdWH( (new Array(93106768,93108498,93108099,93108099))[document.adoffset||0] , 160, 600);
//-->
160AD end right column top popunder ad
generic/popunder_launch.720x300
<!--
if (document.adPopupFile) {
  if (document.adPopupInterval == null) {
    document.adPopupInterval = 0;
  }
  if (document.adPopunderInterval == null) {
    document.adPopunderInterval = document.adPopupInterval;
  }
  if (document.adPopupDomain != null) {
    adSetPopDm(document.adPopupDomain);
  }
  adSetPopupWH(93165927, 720, 300, document.adPopupFile, document.adPopunderInterval, 20, 50, -1);
}
// -->
CNNmoney contact us | magazine customer service | <a href="/" class="footerlink">advertising</a> | site map | CNN/Money glossary | press room OTHER NEWS: CNN | SI | Fortune | Business 2.0 | Time © 2003 Cable News Network LP, LLLP. An AOL Time Warner Company ALL RIGHTS RESERVED.Terms under which this service is provided to you. privacy policy Reprints of site stories are available.endclickprintexclude
<!--
var clickExpire = "-1";
if(window.btnDone) btnDone();
//-->
|
From: Somik R. <so...@ya...> - 2003-04-05 20:02:22
|
Hi Folks,

This week's integration release is out. From the change log:

Integration Build 1.3 - 20030405
--------------------------------
[1] Fixed bug 712888 (scanning nested custom tags)
[2] Redesigned assertXmlEquals()
[3] Fixed bug in Parser.removeScanner()
[4] Fixed unnecessary addition of ACTION attribute in Form tag
[5] Fixed Bullet scanner out of memory exception
[6] Replaced scanner HashTable with Map

Regards,
Somik
|
From: Joseph R. <jmr...@tg...> - 2003-04-04 19:10:20
|
Somik Raha wrote:
> The difference is subtle. NodeIterator throws
> exceptions. SimpleNodeIterator does not. This was because
> a SimpleNodeIterator used inside a collection would
> not need to throw parser exception, as it is not
> parsing - just iterating. However, a NodeIterator
> requires its implementations to throw exceptions
> depending on the parse.
> One solution to your problem could be - have
> NodeList return a NodeIterator as well. What do you
> think ?

You mean have a second method which would return a NodeIterator instead of a SimpleNodeIterator? That would work. Then I could just use that method instead of the existing one when I wanted to recurse, and all would be fine.

_____________________________________________________________
Joe Robins                     Tel: 212-918-5057
Thaumaturgix, Inc.             Fax: 212-918-5001
19 W. 44th St., 18th Floor     Email: jmr...@tg...
New York, NY 10036             http://www.tgix.com

thau'ma-tur-gy, n. the working of miracles.
|
From: Marc N. <ma...@ke...> - 2003-04-04 18:54:06
|
Thanks Somik! Now that I have CVS with SSL working from Idea, I'll try to help out wherever possible.

Marc

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thursday, April 03, 2003 10:00 PM
To: HTMLParser Developer List
Subject: [Htmlparser-developer] Please welcome Marc Novakowski

Hi Folks,

We've had a new developer join us this week - Marc Novakowski. Here's a brief bio:

Originally from Canada, Marc has been living in the San Francisco Bay Area since 1999 working for various startup companies. He is currently a senior engineer for a company called Kenamea, working on Java development on projects involving secure, reliable messaging systems. He has a BSc in Computer Engineering from the University of Alberta, and has been a professional programmer for over ten years.

On his experience with htmlparser, in his own words...

I'm using the parser as part of a framework that translates "custom" XML tags embedded in HTML into special javascript code. The javascript is essentially a UI construction and event model system. What this allows us to do is use the XML "language" to build up a library of components that can be instantiated in your application very easily using simple XML. It also means that the app developer doesn't have to know javascript, only the XML syntax. Actually, we're building an IDE as well so they won't really have to know the XML syntax either -- they'll be able to just drag and drop components into their application.

The main reason I'm not using a standard XML parser such as xerces is that it would die horribly on the HTML. Even if I were to manually extract the "custom" tags and parse only that, it is possible that there is some javascript or HTML as children of the custom tags which would cause xerces to die. When I saw how extensible the htmlparser library was, it was a no-brainer. So far it's been very easy to add new "tags" and "scanners" and define what sort of behavior the new tags have.

The way I've been finding the latest set of bugs is by replacing the "htmlparser.jar" I use in my component system with the latest jar and running the tests for my own code. When those tests break, I try to figure out what in htmlparser has changed and log a bug report. Now that I have developer access, I can add new test cases that reveal these problems (and maybe even fix them!).

Marc -> Welcome to the dev team!

Cheers,
Somik
|
From: Somik R. <so...@ya...> - 2003-04-04 06:02:54
|
Dhaval Udani wrote:
> The redundancy is acceptable. It would be useful for searches. My point is
> different. Input and Textarea are child nodes of Form and are stored as a
> NodeList. Similarly Option is a child of Select and should be stored as a
> NodeList instead of List as is at present.

I see what you mean, and I agree with you. Please feel free to make this modification. (By the way, be sure to check out the latest version from CVS; there have been some changes.)

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2003-04-04 06:01:17
|
Dhaval Udani wrote:
> However I believe that the JSP tag need not be parsed separately, but if it
> is passed as a part of the <a> tag itself it would be alright. What I am
> trying to say is that the toHtml() method called on this tag should be able
> to correctly output the text as described above. If that is being achieved I
> think we are ok.

The two arguments against are:
[1] Browsers don't render such tags
[2] It's horribly difficult to parse the attributes out

Does anyone want to take a shot at modifying the AttributeParser (we already have a failing test for this)? (Not a challenge, but a request.)

Cheers,
Somik
|
From: Somik R. <so...@ya...> - 2003-04-04 05:59:26
|
Hi Folks,

We've had a new developer join us this week - Marc Novakowski. Here's a brief bio:

Originally from Canada, Marc has been living in the San Francisco Bay Area since 1999 working for various startup companies. He is currently a senior engineer for a company called Kenamea, working on Java development on projects involving secure, reliable messaging systems. He has a BSc in Computer Engineering from the University of Alberta, and has been a professional programmer for over ten years.

On his experience with htmlparser, in his own words...

I'm using the parser as part of a framework that translates "custom" XML tags embedded in HTML into special javascript code. The javascript is essentially a UI construction and event model system. What this allows us to do is use the XML "language" to build up a library of components that can be instantiated in your application very easily using simple XML. It also means that the app developer doesn't have to know javascript, only the XML syntax. Actually, we're building an IDE as well so they won't really have to know the XML syntax either -- they'll be able to just drag and drop components into their application.

The main reason I'm not using a standard XML parser such as xerces is that it would die horribly on the HTML. Even if I were to manually extract the "custom" tags and parse only that, it is possible that there is some javascript or HTML as children of the custom tags which would cause xerces to die. When I saw how extensible the htmlparser library was, it was a no-brainer. So far it's been very easy to add new "tags" and "scanners" and define what sort of behavior the new tags have.

The way I've been finding the latest set of bugs is by replacing the "htmlparser.jar" I use in my component system with the latest jar and running the tests for my own code. When those tests break, I try to figure out what in htmlparser has changed and log a bug report. Now that I have developer access, I can add new test cases that reveal these problems (and maybe even fix them!).

Marc -> Welcome to the dev team!

Cheers,
Somik
|
From: Derrick O. <Der...@ro...> - 2003-04-03 12:28:18
|
I'm working on the bean test that's failing.... just very slowly ;)
The threading test still fails. I'll look at that too... eventually.

Derrick

Somik Raha wrote:
> Hi Derrick,
> I was wondering if you've had the time to look into the failing
> tests. By the way, I have been meaning to ask you - did you try the
> threading test again with the integration release ? It's not failed
> once on my end - though I don't have a multiprocessor machine to test with.
>
> Regards,
> Somik
|
From: <dha...@or...> - 2003-04-03 07:48:30
|
>> Also no attributes not specified in the tag originally should be
>> displayed as a result of the toHtml() call. For example, the following
>> is happening:
>> <FORM></FORM>
>> is reproduced as
>> <FORM ACTION=""></FORM>
>> It should be correctly reproduced as
>> <FORM></FORM>
>
> If this is so, it is definitely wrong. Can you write a testcase for this ?
> I am a bit surprised, because FormTag does not even have a toHtml() - it
> uses CompositeTag's toHtml().

I have logged it already with another bug at #713907. It's not really a test case, since I am trying to depict another abnormal behaviour as well. Do check it out.

Regards,
Dhaval
|
From: <dha...@or...> - 2003-04-03 07:36:09
|
I think we need to support all types of tags, especially JSP tags embedded within HTML.

However I believe that the JSP tag need not be parsed separately, but if it is passed as a part of the <a> tag itself it would be alright. What I am trying to say is that the toHtml() method called on this tag should be able to correctly output the text as described above. If that is being achieved I think we are ok.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457

-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Thursday, April 03, 2003 12:22 PM
To: htmlparser-developer
Cc: somik
Subject: [Htmlparser-developer] Jsp within attributes - do we need to support?

Hi Folks,

It would be good to have some feedback on the latest failing test - testJspWithinAttributes(). Tags of the form:

<a href="<%=Application("sURL")%>/literature/index.htm">

Should we support these tags ?

Regards,
Somik
|
From: <dha...@or...> - 2003-04-03 07:36:07
|
Hi Somik,

>> The SelectTag class has a List of OptionTags underneath it. However the
>> FormTag has a NodeList of InputTags and TextArea tags. I think these 2
>> should be synchronized for consistency.
>
> Can you explain more ? FormTag does have the standard children's list, in
> addition to which, it has the input tags and textareas. There is
> redundancy, but it's there to provide helpful searches.

[Udani, Dhaval H.] The redundancy is acceptable. It would be useful for searches. My point is different. Input and Textarea are child nodes of Form and are stored as a NodeList. Similarly Option is a child of Select and should be stored as a NodeList instead of List as is at present.

>> For example, the following is happening:
>> <FORM></FORM>
>> is reproduced as
>> <FORM ACTION=""></FORM>
>> It should be correctly reproduced as
>> <FORM></FORM>
>
> If this is so, it is definitely wrong. Can you write a testcase for this ?
> I am a bit surprised, because FormTag does not even have a toHtml() - it
> uses CompositeTag's toHtml().

[Udani, Dhaval H.] Will do so immediately.

Regards,
Dhaval
|
From: Somik R. <so...@ya...> - 2003-04-03 06:50:10
|
Hi Folks,

It would be good to have some feedback on the latest failing test - testJspWithinAttributes(). Tags of the form:

<a href="<%=Application("sURL")%>/literature/index.htm">

Should we support these tags ?

Regards,
Somik
|
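A minimal sketch of why this tag form is awkward for an attribute parser, assuming a simple quote-matching strategy: the JSP expression contains its own double quotes, so a scan that stops at the next quote cuts the value short. The class and method names below are hypothetical illustrations, not the htmlparser AttributeParser.

```java
// Hypothetical illustration only; this is not the htmlparser AttributeParser.
final class JspAttributeSketch {

    /** Naive approach: the value ends at the first double quote after the opening one. */
    static String naiveValue(String tag, int openQuote) {
        int end = tag.indexOf('"', openQuote + 1);
        return end < 0 ? null : tag.substring(openQuote + 1, end);
    }

    /** JSP-aware approach: skip over any <% ... %> section before looking for the closing quote. */
    static String jspAwareValue(String tag, int openQuote) {
        int i = openQuote + 1;
        while (i < tag.length()) {
            if (tag.startsWith("<%", i)) {
                int jspEnd = tag.indexOf("%>", i + 2);
                if (jspEnd < 0) {
                    return null; // unterminated JSP section
                }
                i = jspEnd + 2;
            } else if (tag.charAt(i) == '"') {
                return tag.substring(openQuote + 1, i);
            } else {
                i++;
            }
        }
        return null; // no closing quote found
    }

    public static void main(String[] args) {
        String tag = "<a href=\"<%=Application(\"sURL\")%>/literature/index.htm\">";
        int openQuote = tag.indexOf('"');
        System.out.println(naiveValue(tag, openQuote));    // prints <%=Application(
        System.out.println(jspAwareValue(tag, openQuote)); // prints <%=Application("sURL")%>/literature/index.htm
    }
}
```

The second method only handles double-quoted values and does not consider nested or malformed JSP sections; it is meant to show the shape of the problem rather than a complete fix.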
From: Somik R. <so...@ya...> - 2003-04-03 06:48:26
|
Hi Derrick,

I was wondering if you've had the time to look into the failing tests. By the way, I have been meaning to ask you - did you try the threading test again with the integration release ? It's not failed once on my end - though I don't have a multiprocessor machine to test with.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2003-04-03 06:46:20
|
Hi Dhaval,

> I was checking out the code of form scanner and I saw that it contained a
> list of all the INPUT tags and all the TEXTAREA tags. In addition we need
> to add the list of SELECT tags also out here..

Thanks for catching this. Done.

> The SelectTag class has a List of OptionTags underneath it. However the
> FormTag has a NodeList of InputTags and TextArea tags. I think these 2
> should be synchronized for consistency.

Can you explain more ? FormTag does have the standard children's list, in addition to which, it has the input tags and textareas. There is redundancy, but it's there to provide helpful searches.

> Also no attributes not specified in the tag originally should be displayed
> as a result of the toHtml() call. For example, the following is happening:
> <FORM></FORM>
> is reproduced as
> <FORM ACTION=""></FORM>
> It should be correctly reproduced as
> <FORM></FORM>

If this is so, it is definitely wrong. Can you write a testcase for this ? I am a bit surprised, because FormTag does not even have a toHtml() - it uses CompositeTag's toHtml().

> Also I was wondering if it would be possible to store attributes in
> Hashtable in the case in which they are present on the page and hence
> reproduce them similarly. This will minimise the difference between an
> input HTML and a parsed output HTML. Only during comparisons or get
> operations we can synchronize the keys to either upper/lower case for
> comparison.

Hmm... that is not a bad idea. Though, we have to ask Kaarle -- he handles the AttributeParser.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2003-04-03 04:08:50
|
Hi Joseph,

The difference is subtle. NodeIterator throws exceptions. SimpleNodeIterator does not. This was because a SimpleNodeIterator used inside a collection would not need to throw parser exception, as it is not parsing - just iterating. However, a NodeIterator requires its implementations to throw exceptions depending on the parse.

One solution to your problem could be - have NodeList return a NodeIterator as well. What do you think ?

Regards,
Somik

--- Joseph Robins <jmr...@tg...> wrote:
> After finding myself poking through more and more of the code for the
> HTML parser, and needing to fix bugs, I've decided to join this list.
> And I figure that there's no better way to join the list than with a
> question. :-) (I've looked through the archives, and I don't see the
> answer in there. Sorry if this was discussed before and I missed it.)
>
> Is there a reason that the iterators that Parser.elements() and
> CompositeTag.children() are different classes, and are incompatible? I
> wanted to write some code along the lines of:
>
> --------------------------------------------------------------
>
> Parser parser = new Parser(url);
> NodeIterator iter = parser.elements();
> doParse(iter);
>
> ...
>
> private void doParse(NodeIterator iter) {
>     while(iter.hasMoreNodes()) {
>         Node node = iter.nextNode();
>         doStuff();
>         if(node instanceof CompositeTag) {
>             doParse(((CompositeTag)node).children());
>         }
>     }
> }
>
> --------------------------------------------------------------
>
> Unfortunately, because the iterators are different (and don't even share
> a superclass), I can't do this, and have to duplicate my doParse method
> with two different signatures.
>
> This seems like a natural thing to want to do. For example, when
> parsing a page, a form tag might contain a lot of other elements (text,
> links, etc.) in it that we want to get, and the only way to do that is
> to iterate inside.
>
> _____________________________________________________________
> Joe Robins                     Tel: 212-918-5057
> Thaumaturgix, Inc.             Fax: 212-918-5001
> 19 W. 44th St., 18th Floor     Email: jmr...@tg...
> New York, NY 10036             http://www.tgix.com
>
> thau'ma-tur-gy, n. the working of miracles.
|
From: Joseph R. <jmr...@tg...> - 2003-04-02 19:37:38
|
After finding myself poking through more and more of the code for the HTML parser, and needing to fix bugs, I've decided to join this list. And I figure that there's no better way to join the list than with a question. :-) (I've looked through the archives, and I don't see the answer in there. Sorry if this was discussed before and I missed it.)

Is there a reason that the iterators that Parser.elements() and CompositeTag.children() are different classes, and are incompatible? I wanted to write some code along the lines of:

--------------------------------------------------------------

Parser parser = new Parser(url);
NodeIterator iter = parser.elements();
doParse(iter);

...

private void doParse(NodeIterator iter) {
    while(iter.hasMoreNodes()) {
        Node node = iter.nextNode();
        doStuff();
        if(node instanceof CompositeTag) {
            doParse(((CompositeTag)node).children());
        }
    }
}

--------------------------------------------------------------

Unfortunately, because the iterators are different (and don't even share a superclass), I can't do this, and have to duplicate my doParse method with two different signatures.

This seems like a natural thing to want to do. For example, when parsing a page, a form tag might contain a lot of other elements (text, links, etc.) in it that we want to get, and the only way to do that is to iterate inside.

_____________________________________________________________
Joe Robins                     Tel: 212-918-5057
Thaumaturgix, Inc.             Fax: 212-918-5001
19 W. 44th St., 18th Floor     Email: jmr...@tg...
New York, NY 10036             http://www.tgix.com

thau'ma-tur-gy, n. the working of miracles.
|
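One way the recursion above could be written once is to adapt the non-throwing child iteration to the throwing interface, so a single method walks both levels. The sketch below uses hypothetical stand-in types (CheckedIter, SketchNode, ParseFailure); none of them are the real htmlparser classes, and the real library may resolve this differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the htmlparser types.
final class IteratorSketch {

    /** Stand-in for a checked failure raised while parsing. */
    static class ParseFailure extends Exception {
        ParseFailure(String message) { super(message); }
    }

    /** Iterator that may fail mid-parse (plays the role of the throwing iterator). */
    interface CheckedIter {
        boolean hasMoreNodes() throws ParseFailure;
        SketchNode nextNode() throws ParseFailure;
    }

    /** A node with a name and already-parsed children (a composite when non-empty). */
    static class SketchNode {
        final String name;
        final List<SketchNode> children;
        SketchNode(String name, SketchNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Adapts an in-memory list so it can be walked through the checked interface. */
    static CheckedIter over(List<SketchNode> nodes) {
        final Iterator<SketchNode> it = nodes.iterator();
        return new CheckedIter() {
            public boolean hasMoreNodes() { return it.hasNext(); }
            public SketchNode nextNode() { return it.next(); }
        };
    }

    /** One recursive walk serves both the top level and nested children. */
    static void collect(CheckedIter iter, List<String> out) throws ParseFailure {
        while (iter.hasMoreNodes()) {
            SketchNode node = iter.nextNode();
            out.add(node.name);
            if (!node.children.isEmpty()) {
                collect(over(node.children), out);
            }
        }
    }

    /** Convenience entry point that wraps the checked failure for callers that cannot handle it. */
    static List<String> walk(List<SketchNode> top) {
        List<String> names = new ArrayList<>();
        try {
            collect(over(top), names);
        } catch (ParseFailure e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        SketchNode form = new SketchNode("FORM", new SketchNode("INPUT"), new SketchNode("TEXTAREA"));
        System.out.println(walk(Arrays.asList(new SketchNode("A"), form))); // prints [A, FORM, INPUT, TEXTAREA]
    }
}
```

The adapter direction matters: an iterator that never throws can always be presented as one that might, but not the reverse, which is why the sketch widens the child iteration rather than narrowing the parse iteration.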
From: <dha...@or...> - 2003-04-02 13:51:40
|
Hi,

I was checking out the code of form scanner and I saw that it contained a list of all the INPUT tags and all the TEXTAREA tags. In addition we need to add the list of SELECT tags also out here..

The SelectTag class has a List of OptionTags underneath it. However the FormTag has a NodeList of InputTags and TextArea tags. I think these 2 should be synchronized for consistency.

Also no attributes not specified in the tag originally should be displayed as a result of the toHtml() call.

For example, the following is happening:
<FORM></FORM>

is reproduced as

<FORM ACTION=""></FORM>

It should be correctly reproduced as
<FORM></FORM>

Also I was wondering if it would be possible to store attributes in Hashtable in the case in which they are present on the page and hence reproduce them similarly. This will minimise the difference between an input HTML and a parsed output HTML. Only during comparisons or get operations we can synchronize the keys to either upper/lower case for comparison.

I would be happy to take up any activity once we decide on its feasibility.

Regards,
Dhaval
|
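The last suggestion, keeping attribute names as written on the page while comparing them case-insensitively, could look roughly like the sketch below. It swaps in a TreeMap with String.CASE_INSENSITIVE_ORDER rather than a Hashtable, which is an assumption made here for brevity; the class and method names are hypothetical and not part of htmlparser.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the htmlparser attribute storage.
final class AttributeCaseSketch {

    /** Builds a map that compares keys ignoring case but keeps the first spelling stored. */
    static Map<String, String> newAttributeMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    /** Writes the attributes back out using the spelling that was stored. */
    static String render(String tagName, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder("<").append(tagName);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey()).append("=\"").append(e.getValue()).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = newAttributeMap();
        attrs.put("Action", "/search");
        attrs.put("method", "get");
        System.out.println(attrs.get("ACTION"));   // prints /search
        System.out.println(render("form", attrs)); // prints <form Action="/search" method="get">
    }
}
```

One trade-off of this choice: a TreeMap iterates in sorted key order, not in the order the attributes appeared on the page, so reproducing the input exactly would also need the original ordering to be recorded somewhere.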
From: Derrick O. <Der...@ro...> - 2003-03-31 08:19:26
|
Somik,

Sorry, same error as before.
Now testJspWithinAttributes is also complaining (in two places).

Derrick

Somik Raha wrote:
>Hi Derrick,
> Thanks a lot.. I have removed the boolean, and made it totally
>thread-safe now (I think).
> Could you try again?
>
>Regards,
>Somik
|
From: Somik R. <so...@ya...> - 2003-03-31 04:43:54
|
Hi Folks,

This week's integration release is packed with goodies! From the change log:

Integration Build 1.3 - 20030330
--------------------------------
[1] fixed bug (an enhancement really) 694477 quotes in content-type header
[2] fix bug #699886 and #707447 by using a buffered stream reader with infinite mark
[3] fixed bug in CompositeTagScanner, filter not being set correctly
[4] fixed thread safety issue in TagParser (bug 711073)
[5] fixed out of memory error when parsing custom composite tags (bug 709152)
[6] fixed bug 701159, 696455 - redesigned script scanner. Javascript parsing is now much more robust.

As you can see, a lot of bug fixes have gone in. There are three major fixes.

The first, by Derrick Oswald (#2), addresses the charset issue. The parser should now be able to handle different charsets dynamically. We hope you can test this and give us feedback.

The second big change is a redesign of the way Javascript is handled by the parser. It had been riddled with problems for some time, so we've changed its internals. The new implementation is much more robust, and hopefully we can get some feedback on that too.

There were some thread safety issues (thanks to Joe Robins for reporting this). These have been addressed in this release, and the parser should be totally thread-safe now.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2003-03-31 04:21:08
|
Seems like this mail was more for thinking aloud. After sending this, I figured that I could still use the scanner, and forget the automaton part, but do the scanning by hand and not rely on NodeReader. That did the trick; the solution was simple. And thanks to the existing testcases (they're a lifesaver), the design change is complete. We now support javascript like never before!

Cheers,
Somik

----- Original Message -----
From: Somik Raha
To: HTMLParser Developer List
Sent: Sunday, March 30, 2003 4:59 PM
Subject: [Htmlparser-developer] Redesigning Javascript support

Hi Folks,

The Achilles heel of the parser has been the ScriptScanner, and never more evident than the last two weeks. The system has been screaming for redesign - if only I had the ears :). I am planning to take out the ScriptScanner - instead, a finite automaton that triggers on <script and finishes on </script> is in order. The scanner approach breaks down for javascript code, especially when the script tries to render html.

Just wanted to let you know that this change is coming up.. May not be with this integration release, but probably in the next one.

Regards,
Somik
|
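The idea of scanning by hand from <script to </script>, without tokenizing what lies between, can be sketched as follows. This is an illustration under stated assumptions, not the redesigned ScriptScanner: the class and method names are hypothetical, and the start tag is assumed to contain no '>' inside an attribute value.

```java
// Hypothetical illustration only; this is not the htmlparser ScriptScanner.
final class ScriptScanSketch {

    /** Case-insensitive indexOf that works on the original string, so offsets stay valid. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the raw text between the first <script ...> start tag and the next </script,
     * or null when no complete script element is present. Markup inside the script body
     * is returned untouched rather than parsed as tags.
     */
    static String firstScriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) {
            return null;
        }
        int startTagEnd = html.indexOf('>', open);
        if (startTagEnd < 0) {
            return null;
        }
        int bodyStart = startTagEnd + 1;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) {
            return null;
        }
        return html.substring(bodyStart, close);
    }

    public static void main(String[] args) {
        String page = "<p>x</p><SCRIPT language=\"javascript\">document.write(\"<b>hi</b>\");</SCRIPT>";
        System.out.println(firstScriptBody(page)); // prints document.write("<b>hi</b>");
    }
}
```

Because the body is taken as one opaque span, HTML written by the script (the <b> tags here) never reaches the tag scanners, which is the failure mode the message describes.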
From: Somik R. <so...@ya...> - 2003-03-31 04:20:30
|
Hi Derrick,

Thanks a lot.. I have removed the boolean, and made it totally thread-safe now (I think).
Could you try again?

Regards,
Somik

----- Original Message -----
From: "Derrick Oswald" <Der...@ro...>
To: <htm...@li...>
Sent: Sunday, March 30, 2003 7:31 PM
Subject: [Htmlparser-developer] testThreadSafety error

> Somik,
>
> Running testThreadSafety() can give different errors or no errors at all
> depending on the run (typical of race conditions).
> The errors might only crop up on SMP systems (I've got a dual CPU).
>
> When it fails, the actual link appears empty even though the link text
> is not (see the debug outputs below):
> id = 5
> link = ""
> linkText = " Server: sf-web2 \t SourceForge.net: Modify: 711073 - HTMLTagParser not threadsafe as a static variable in HTMLTag\t\t\tfunction help_window(helpurl) {\t\tHelpWin = window.open( \'http://sourceforge.net\' + helpurl,\'HelpWindow\',\'scrollbars=yes,resizable=yes,toolbar=no,height=400,width=400\');\t}\t// \t\t\t This is temp javascript for the jump button. If we could actually have a jump script on the server side that would be ideal \tfunction jump(targ,selObj,restore){ //v3.0\tif (selObj.options[selObj.selectedIndex].value) \t\teval(targ+\".location=\'\"+selObj.options[selObj.selectedIndex].value+\"\'\");\tif (restore) selObj.selectedIndex=0;\t}\t//"
> result = false
>
> Here's some example JUnit dumps:
> ********************************************************************************************************
>
> junit.framework.AssertionFailedError: Thread 55, link 1:
>
> EXPECTED result has 35 extra characters at the end. They are :
> Position : 0 , Code = 104
> Position : 1 , Code = 116
> Position : 2 , Code = 116
> Position : 3 , Code = 112
> <snip>
> Position : 33 , Code = 109
> Position : 34 , Code = 108
> Mismatch of strings at char posn 0
>
> String Expected upto mismatch =
>
> String Actual upto mismatch =
>
> String Expected MISMATCH CHARACTER = h, code = 104
>
> **** COMPLETE STRING EXPECTED ****
> http://normallink.com/sometext.html
>
> **** COMPLETE STRING ACTUAL***
>
> at org.htmlparser.tests.ParserTestCase.assertStringEquals(ParserTestCase.java:125)
> at org.htmlparser.tests.parserHelperTests.TagParserTest.testThreadSafety(TagParserTest.java:186)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>
> ********************************************************************************************************
>
> junit.framework.AssertionFailedError: Thread 26, link 2:
>
> EXPECTED result has 116 extra characters at the end. They are :
> Position : 0 , Code = 47
> Position : 1 , Code = 99
> Position : 2 , Code = 103
> Position : 3 , Code = 105
> <snip>
> Position : 114 , Code = 109
> Position : 115 , Code = 108
> Mismatch of strings at char posn 0
>
> String Expected upto mismatch =
>
> String Actual upto mismatch =
>
> String Expected MISMATCH CHARACTER = /, code = 47
>
> **** COMPLETE STRING EXPECTED ****
>
> /cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report1.html
>
> **** COMPLETE STRING ACTUAL***
>
> at org.htmlparser.tests.ParserTestCase.assertStringEquals(ParserTestCase.java:125)
> at org.htmlparser.tests.parserHelperTests.TagParserTest.testThreadSafety(TagParserTest.java:191)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>
> Hope this helps,
>
> Derrick
>
> Somik Raha wrote:
> > > testThreadSafety
> >
> > Thanks for reporting this - on my end this one's passing. I had left
> > one last variable in TagParser - and I thought it would affect thread
> > safety. So I rigged up that test, but surprisingly it passed every
> > time on my end. Can you send me the failure message ? I might need to
> > rework TagParser again.
|
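The symptom in this thread (an empty link on some runs, mostly on multiprocessor machines) is typical of parse state held in a shared variable. A common remedy, sketched below under the assumption that the shared state can simply become local, is to keep every intermediate value on the calling thread's stack. The class is a hypothetical illustration and not the TagParser fix that was actually committed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; this is not the htmlparser TagParser.
final class ThreadSafeParseSketch {

    /** Extracts the href value using only local variables, so concurrent calls cannot interfere. */
    static String extractHref(String tag) {
        String marker = "href=\"";
        int start = tag.indexOf(marker);
        if (start < 0) {
            return "";
        }
        start += marker.length();
        int end = tag.indexOf('"', start);
        return end < 0 ? "" : tag.substring(start, end);
    }

    /** Calls extractHref from several threads and counts results that differ from the expected link. */
    static int countMismatches(int threads, final int callsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final String expected = "http://example.invalid/page" + t + ".html";
                final String tag = "<a href=\"" + expected + "\">";
                Callable<Integer> task = () -> {
                    int bad = 0;
                    for (int i = 0; i < callsPerThread; i++) {
                        if (!expected.equals(extractHref(tag))) {
                            bad++;
                        }
                    }
                    return bad;
                };
                pending.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : pending) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(8, 10_000)); // prints mismatches: 0
    }
}
```

A test of this shape passing on a single-CPU machine says little on its own, which matches the experience reported here; the guarantee comes from the absence of shared mutable state, not from the run.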
From: Derrick O. <Der...@ro...> - 2003-03-31 03:24:06
|
Somik,

Running testThreadSafety() can give different errors or no errors at all depending on the run (typical of race conditions). The errors might only crop up on SMP systems (I've got a dual CPU). When it fails, the actual link appears empty even though the link text is not (see the debug outputs below):

id = 5
link = ""
linkText = " Server: sf-web2 \t SourceForge.net: Modify: 711073 - HTMLTagParser not threadsafe as a static variable in HTMLTag\t\t\tfunction help_window(helpurl) {\t\tHelpWin = window.open( \'http://sourceforge.net\' + helpurl,\'HelpWindow\',\'scrollbars=yes,resizable=yes,toolbar=no,height=400,width=400\');\t}\t// \t\t\t This is temp javascript for the jump button. If we could actually have a jump script on the server side that would be ideal \tfunction jump(targ,selObj,restore){ //v3.0\tif (selObj.options[selObj.selectedIndex].value) \t\teval(targ+\".location=\'\"+selObj.options[selObj.selectedIndex].value+\"\'\");\tif (restore) selObj.selectedIndex=0;\t}\t//"
result = false

Here's some example JUnit dumps:

********************************************************************************************************
junit.framework.AssertionFailedError: Thread 55, link 1:
EXPECTED result has 35 extra characters at the end.
They are :
Position : 0 , Code = 104
Position : 1 , Code = 116
Position : 2 , Code = 116
Position : 3 , Code = 112
<snip>
Position : 33 , Code = 109
Position : 34 , Code = 108
Mismatch of strings at char posn 0
String Expected upto mismatch =
String Actual upto mismatch =
String Expected MISMATCH CHARACTER = h, code = 104
**** COMPLETE STRING EXPECTED ****
http://normallink.com/sometext.html
**** COMPLETE STRING ACTUAL***

at org.htmlparser.tests.ParserTestCase.assertStringEquals(ParserTestCase.java:125)
at org.htmlparser.tests.parserHelperTests.TagParserTest.testThreadSafety(TagParserTest.java:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

********************************************************************************************************
junit.framework.AssertionFailedError: Thread 26, link 2:
EXPECTED result has 116 extra characters at the end.
They are :
Position : 0 , Code = 47
Position : 1 , Code = 99
Position : 2 , Code = 103
Position : 3 , Code = 105
<snip>
Position : 114 , Code = 109
Position : 115 , Code = 108
Mismatch of strings at char posn 0
String Expected upto mismatch =
String Actual upto mismatch =
String Expected MISMATCH CHARACTER = /, code = 47
**** COMPLETE STRING EXPECTED ****
/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report1.html
**** COMPLETE STRING ACTUAL***

at org.htmlparser.tests.ParserTestCase.assertStringEquals(ParserTestCase.java:125)
at org.htmlparser.tests.parserHelperTests.TagParserTest.testThreadSafety(TagParserTest.java:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

Hope this helps,
Derrick

Somik Raha wrote:

> testThreadSafety
>
> Thanks for reporting this - on my end this one's passing. I had left
> one last variable in TagParser- and I thought it would affect Thread
> safety. So I rigged up that test, but surprisingly it passed every
> time on my end. Can you send me the failure message ? I might need to
> rework TagParser again.
|
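The symptom reported above (an empty link where a full URL was expected, only on some runs) is what one would expect when parser scratch state is shared between threads. The following is a minimal sketch of the general remedy and of a multi-threaded check, not the actual TagParser or TagParserTest code. The class `ThreadSafetySketch` and the methods `extractHref` and `countMismatches` are hypothetical names invented for illustration; only the two expected link strings are taken from the JUnit dumps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names come from the HTML Parser code base.
final class ThreadSafetySketch {

    /** Stand-in for a tag parser: returns the href value of an anchor tag, or "" if absent. */
    static String extractHref(String tag) {
        // All scratch state is local, so concurrent calls cannot overwrite each other.
        // Keeping 'start' and 'end' in a static field instead would reintroduce the race.
        String marker = "href=\"";
        int start = tag.indexOf(marker);
        if (start < 0) {
            return "";
        }
        start += marker.length();
        int end = tag.indexOf('"', start);
        return end < 0 ? "" : tag.substring(start, end);
    }

    /** Calls extractHref from several threads at once and returns how many results were wrong. */
    static int countMismatches(int threads, final int iterations) {
        // Expected link values copied from the JUnit dumps in the message above.
        final String[] links = {
            "http://normallink.com/sometext.html",
            "/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White"
                + "&bg_clr=Red&url=http://localhost/Testing/Report1.html"
        };
        Callable<Integer> task = () -> {
            int bad = 0;
            for (int i = 0; i < iterations; i++) {
                String expected = links[i % links.length];
                String actual = extractHref("<a href=\"" + expected + "\">text</a>");
                if (!expected.equals(actual)) {
                    bad++;
                }
            }
            return bad;
        };
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every task before collecting any result so the workers overlap.
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ThreadSafetySketch.countMismatches(8, 1000)` should return 0 here because nothing is shared between calls. As this thread shows, a clean run of such a check on one machine does not prove thread safety, since a race may only surface on multi-CPU hardware.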
From: Somik R. <so...@ya...> - 2003-03-31 00:57:32
|
Hi Folks,

The Achilles heel of the parser has been the ScriptScanner, and never more evidently than in the last two weeks. The system has been screaming for redesign - if only I had the ears :). I am planning to take out the ScriptScanner - instead, a finite automaton that triggers on <script and finishes on </script> is in order. The scanner approach breaks down for JavaScript code, especially when the script tries to render HTML.

Just wanted to let you know that this change is coming up. It may not be with this integration release, but probably in the next one.

Regards,
Somik
|
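To make the proposal concrete, here is a rough sketch of what such an automaton could look like. It is only an illustration under stated assumptions, not the design that was eventually implemented: the class `ScriptAutomaton`, its three states, and the method `scriptBodies` are hypothetical names. The machine triggers on `<script`, skips the rest of the opening tag, and then treats everything up to `</script>` as opaque text, so HTML that the script writes out is never handed to the tag scanners.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of the HTML Parser code base.
final class ScriptAutomaton {

    private enum State { TEXT, OPEN_TAG, BODY }

    /** Returns the raw body of every complete script element in html, in document order. */
    static List<String> scriptBodies(String html) {
        List<String> bodies = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        State state = State.TEXT;
        int i = 0;
        while (i < html.length()) {
            switch (state) {
                case TEXT:
                    if (isScriptOpen(html, i)) {
                        state = State.OPEN_TAG;
                        i += "<script".length();
                    } else {
                        i++;
                    }
                    break;
                case OPEN_TAG:
                    // Skip attributes; the body starts after the closing '>'.
                    // Simplification: a '>' inside an attribute value is not handled.
                    if (html.charAt(i) == '>') {
                        state = State.BODY;
                    }
                    i++;
                    break;
                case BODY:
                    if (matchesAt(html, i, "</script>")) {
                        bodies.add(body.toString());
                        body.setLength(0);
                        state = State.TEXT;
                        i += "</script>".length();
                    } else {
                        // Anything else, including markup such as <b>, is plain script text.
                        body.append(html.charAt(i));
                        i++;
                    }
                    break;
            }
        }
        // A script that is never closed is dropped rather than guessed at.
        return bodies;
    }

    /** True if "<script" starts at offset and is followed by '>' or whitespace. */
    private static boolean isScriptOpen(String s, int offset) {
        int next = offset + "<script".length();
        if (!matchesAt(s, offset, "<script") || next >= s.length()) {
            return false;
        }
        char c = s.charAt(next);
        return c == '>' || Character.isWhitespace(c);
    }

    /** Case-insensitive match of token against s starting at offset. */
    private static boolean matchesAt(String s, int offset, String token) {
        return s.regionMatches(true, offset, token, 0, token.length());
    }
}
```

For an input such as `<script>document.write("<b>hi</b>");</script>`, the `<b>` inside the script stays part of the returned body instead of being scanned as a tag, which is the case the scanner approach is said to struggle with.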