htmlparser-developer Mailing List for HTML Parser (Page 13)
Brought to you by:
derrickoswald
From: <dha...@or...> - 2003-05-09 04:18:12
|
The assertStringValueMatches() method in ParserTestCase contains reference to String.replaceAll() method which I believe is introduced in JDK 1.4. Can you please remove the same? Regards, Dhaval Udani Senior Analyst M-Line, QPEG OrbiTech Solutions Ltd. +91-22-28290019 Extn. 1457 -----Original Message----- From: DerrickOswald [mailto:Der...@ro...] Sent: Friday, May 09, 2003 9:49 AM To: htmlparser-developer Cc: DerrickOswald Subject: [Htmlparser-developer] changes.txt OK, I'm going to try using the cvs2cl script to automatically create the change log for a release. That means you don't have to update changes.txt when you drop code. yeaahhh! But, you do have to make the commit messages as good or better than the changes.txt message was. boooo! For guidelines on what to put in commit messages see: http://www.red-bean.com/cvs2cl/changelogs.html Remember, the commit messages will now be visible to end users, so try to use whole sentences and valid grammar. Also, the script uses a time window and identical message text to unify separate file log messages into a chronological sequence of activity. So, the rule is, drop everything as close together in time as you can (one drop is best of course, but sometimes it doesn't work that way), and use the same log message for all files related to a particular change. Derrick ------------------------------------------------------- Enterprise Linux Forum Conference & Expo, June 4-6, 2003, Santa Clara The only event dedicated to issues related to Linux enterprise solutions www.enterpriselinuxforum.com _______________________________________________ Htmlparser-developer mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
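For code that still has to compile on JDK 1.3, a literal (non-regex) replacement can be written with indexOf and StringBuffer alone. The sketch below is only an illustration: the class and method names are hypothetical and not part of ParserTestCase, and it assumes the call in question replaces a fixed substring rather than a regular expression.

```java
/** Hypothetical helper, not part of htmlparser; uses only pre-1.4 String API. */
final class Jdk13Strings {
    private Jdk13Strings() {}

    /** Replaces every literal occurrence of target in text (no regex involved). */
    static String replaceEvery(String text, String target, String replacement) {
        if (target.length() == 0) {
            return text; // an empty target has nothing sensible to match
        }
        StringBuffer out = new StringBuffer(text.length());
        int from = 0;
        int hit;
        while ((hit = text.indexOf(target, from)) != -1) {
            // copy the untouched stretch, then the replacement
            out.append(text.substring(from, hit)).append(replacement);
            from = hit + target.length();
        }
        out.append(text.substring(from)); // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Normalise Windows line endings without String.replaceAll().
        System.out.println(replaceEvery("a\r\nb\r\nc", "\r\n", "\n"));
    }
}
```

If the original call really relies on a regular expression, this stand-in is not equivalent and the pattern would need to be handled another way.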
From: Derrick O. <Der...@ro...> - 2003-05-09 04:03:57
|
OK, I'm going to try using the cvs2cl script to automatically create the change log for a release. That means you don't have to update changes.txt when you drop code. yeaahhh! But, you do have to make the commit messages as good or better than the changes.txt message was. boooo! For guidelines on what to put in commit messages see: http://www.red-bean.com/cvs2cl/changelogs.html Remember, the commit messages will now be visible to end users, so try to use whole sentences and valid grammar. Also, the script uses a time window and identical message text to unify separate file log messages into a chronological sequence of activity. So, the rule is, drop everything as close together in time as you can (one drop is best of course, but sometimes it doesn't work that way), and use the same log message for all files related to a particular change. Derrick |
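The grouping rule described above (same message text, close together in time) can be sketched roughly as follows. This is not cvs2cl's actual code: the LogEntry class, the five-minute window and the requirement that the author also match are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one per-file CVS log message. */
final class LogEntry {
    final long timeSeconds;
    final String author;
    final String message;
    final String file;

    LogEntry(long timeSeconds, String author, String message, String file) {
        this.timeSeconds = timeSeconds;
        this.author = author;
        this.message = message;
        this.file = file;
    }
}

final class ChangeGrouper {
    /** Assumed window; the real script's value may differ. */
    static final long WINDOW_SECONDS = 300;

    /**
     * Groups entries (already sorted by time) so that entries with the same
     * author and identical message text, each within WINDOW_SECONDS of the
     * previous entry in the group, are reported as one change.
     */
    static List<List<LogEntry>> group(List<LogEntry> sortedByTime) {
        List<List<LogEntry>> groups = new ArrayList<List<LogEntry>>();
        for (LogEntry e : sortedByTime) {
            List<LogEntry> home = null;
            for (List<LogEntry> g : groups) {
                LogEntry last = g.get(g.size() - 1);
                if (last.author.equals(e.author)
                        && last.message.equals(e.message)
                        && e.timeSeconds - last.timeSeconds <= WINDOW_SECONDS) {
                    home = g;
                    break;
                }
            }
            if (home == null) { // no matching change yet: start a new one
                home = new ArrayList<LogEntry>();
                groups.add(home);
            }
            home.add(e);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<LogEntry> log = new ArrayList<LogEntry>();
        log.add(new LogEntry(0, "derrickoswald", "Fix NPE", "A.java"));
        log.add(new LogEntry(30, "derrickoswald", "Fix NPE", "B.java"));
        log.add(new LogEntry(1000, "derrickoswald", "Fix NPE", "C.java"));
        System.out.println(group(log).size() + " change(s)");
    }
}
```

Under these assumptions a commit message that differs by even one character, or a drop that arrives well after the others, starts a new changelog entry, which is why the advice is to reuse the exact message and commit everything together.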
From: Derrick O. <Der...@ro...> - 2003-05-09 03:47:45
|
I have set up the syncmail script to automatically send email about CVS commit operations to a newly created htmlparser-cvs mailing list. Thus, you can monitor the code repository for bug fixes and enhancements. A new list was chosen over sending to the htmlparser-developer list directly to provide an opt-out mechanism and to keep the traffic light on the htmlparser-developer list. If you want to subscribe to the CVS notification list, go to the lists area: http://sourceforge.net/mail/?group_id=24399 and choose Subscribe/Unsubscribe/Preferences for the htmlparser-cvs list. Derrick |
From: <dha...@or...> - 2003-05-09 03:37:51
|
Derrick,

The same problem then exists with the FormTag. Even if the INPUT tag is registered explicitly (as well as inside the FormScanner), tags of type InputTag are not retrieved. I'll file a bug report for the same.

Related to the string filters, I also have one issue. Ideally there should be a contract for all the constructors of children of the Tag class. Currently we see different types of constructors in the different tags. When I started out using the HTMLParser, this string filter totally confused me. I was giving some random input; I didn't know why, didn't know where it was used and didn't really see any benefit. Hence I believe the no-arg constructor is a must for easy usage in all Tags.

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457

-----Original Message-----
From: DerrickOswald [mailto:Der...@ro...]
Sent: Thursday, May 08, 2003 5:22 PM
To: htmlparser-developer
Cc: DerrickOswald
Subject: Re: [Htmlparser-developer] Registering of scanners

You should get back explicit tags (not generic org.htmlparser.tags.Tag objects) for each scanner you register. If not, it's a bug.

On another topic, I have noticed a problem with string filters. Scanners register with a filter string like:

    parser.addScanner(new InputTagScanner("-i"));

or

    addScanner(linkScanner.createImageScanner(ImageTag.IMAGE_TAG_FILTER));

By the way, both of these are "-i". Since the filters are compared with == rather than a string comparison, you would have to get the filter from the scanner itself for collectInto to work:

    x = new InputTagScanner("-i");
    parser.addScanner(x);
    ...
    div.collectInto(nodeList,x.getFilter());

It looks like collectInto(NodeList, String) needs fixing and collectInto(NodeList, Class) is the way to go. Maybe we should deprecate the former.

Derrick

dha...@or... wrote:

>INPUT tag is found as a tag. No problem with that. However if the
>InputTagScanner is also registered then a class of type
>org.htmlparser.tags.InputTag is not found.
>
>I am not too sure about the expected behaviour and hence do not know
>whether it's a bug or not. Do let me know what to do.
>
>dhaval
>
>-----Original Message-----
>From: DerrickOswald [mailto:Der...@ro...]
>Sent: Thursday, May 08, 2003 7:03 AM
>To: htmlparser-developer
>Cc: DerrickOswald
>Subject: Re: [Htmlparser-developer] Registering of scanners
>
>Dhaval,
>
>DIV is a CompositeTag.
>Are you saying the INPUT tag isn't one of the children of the DIV tag?
>If that's the case, file a bug report.
>
>Derrick
>
>dha...@or... wrote:
>
>>In the HTMLParser version of 27th April I believe registering of Div
>>Scanner and Table scanner was added to the automatic list i.e. via
>>Parser.registerScanners().
>>
>>Due to this I am unable to recognize any of the tags underneath the DIV
>>tag.
>>
>>For example I had INPUT tag underneath the DIV tag. I used
>>registerScanners and I also registered InputTagScanner. However I could
>>detect DIV tags but not INPUT tags.
>>
>>Regards,
>>
>>Dhaval Udani
>>Senior Analyst
>>M-Line, QPEG
>>OrbiTech Solutions Ltd.
>>+91-22-28290019 Extn. 1457 |
From: Derrick O. <Der...@ro...> - 2003-05-08 11:37:22
|
You should get back explicit tags (not generic org.htmlparser.tags.Tag objects) for each scanner you register. If not, it's a bug. On another topic, I have noticed a problem with string filters. Scanners register with a filter string like: parser.addScanner(new InputTagScanner("-i")); or addScanner(linkScanner.createImageScanner(ImageTag.IMAGE_TAG_FILTER)); By the way, both of these are "-i". Since the filters are compared with == rather than a string comparison, you would have to get the filter from the scanner itself for collectInto to work: x = new InputTagScanner("-i"); parser.addScanner(x); ... div.collectInto(nodeList,x.getFilter()); It looks like collectInto(NodeList, String) needs fixing and collectInto(NodeList, Class) is the way to go. Maybe we should deprecate the former. Derrick dha...@or... wrote: >INPUT tag is found as a tag. No problem with that. However if the >InputTagScanner is also registered then a class of type >org.htmlparser.tags.InputTag is not found. > >I am not too sure about the expected behaviour and hence do not know >whether its a bug or not. Do let me know what to do. > >dhaval > > >-----Original Message----- >From: DerrickOswald [mailto:Der...@ro...] >Sent: Thursday, May 08, 2003 7:03 AM >To: htmlparser-developer >Cc: DerrickOswald >Subject: Re: [Htmlparser-developer] Registering of scanners > > >Dahval, > >DIV is a CompositeTag. >Are you saying the INPUT tag isn't one of the children of the DIV tag. >If that's the case file a bug report. > >Derrick > >dha...@or... wrote: > > > >>In the HTMLParser version of 27th April I believe registering of Div >>Scanner and Table scanner was added to the automatic list i.e via >>Parser.registerScanners(). >> >>Due to this I am unable to recognize any of the tags underneath the DIV >>tag. >> >>For example I had INPUT tag underneath the DIV tag. I used >>registerScanners and I also registered InputTagScanner. However I could >>detect DIV tags but not INPUT tags. 
>> >>Regards, >> >>Dhaval Udani >>Senior Analyst >>M-Line, QPEG >>OrbiTech Solutions Ltd. >>+91-22-28290019 Extn. 1457 >> >> >> >> >> >> |
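A short illustration of why the == comparison is fragile. In Java, == on strings compares object identity, so it only succeeds when both sides are the very same object. Identical string literals are shared by the JVM and may happen to pass, but a filter string built at run time will not. The class and method names below are hypothetical and are not the parser's actual collectInto code.

```java
/** Hypothetical demo class, not part of htmlparser. */
final class FilterMatch {
    private FilterMatch() {}

    /** Identity comparison, the fragile form described in the message. */
    static boolean sameByIdentity(String a, String b) {
        return a == b;
    }

    /** Value comparison: true whenever the characters are the same. */
    static boolean sameByValue(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        String registered = "-i";
        String built = new String("-i"); // equal text, but a different object

        System.out.println(sameByIdentity(registered, built)); // false
        System.out.println(sameByValue(registered, built));    // true
    }
}
```

Matching on the tag class, as suggested with collectInto(NodeList, Class), sidesteps the question entirely because no filter string is involved.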
From: <dha...@or...> - 2003-05-08 04:44:51
|
Hi,

The OPTION tag currently extends the Tag class. Hence the following input:

    <OPTION><LABEL>Hello World</LABEL></OPTION>

gets translated (after a toHtml() call) into:

    <OPTION></OPTION> <LABEL> Hello World </LABEL> </OPTION>

As can be seen, the OPTION tag gets closed immediately after it is opened, even though the input already had an ending tag, so the output ends up with two </OPTION> tags.

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457 |
From: <dha...@or...> - 2003-05-08 03:55:57
|
INPUT tag is found as a tag. No problem with that. However if the InputTagScanner is also registered then a class of type org.htmlparser.tags.InputTag is not found.

I am not too sure about the expected behaviour and hence do not know whether it's a bug or not. Do let me know what to do.

dhaval

-----Original Message-----
From: DerrickOswald [mailto:Der...@ro...]
Sent: Thursday, May 08, 2003 7:03 AM
To: htmlparser-developer
Cc: DerrickOswald
Subject: Re: [Htmlparser-developer] Registering of scanners

Dhaval,

DIV is a CompositeTag. Are you saying the INPUT tag isn't one of the children of the DIV tag? If that's the case, file a bug report.

Derrick

dha...@or... wrote:

>In the HTMLParser version of 27th April I believe registering of Div
>Scanner and Table scanner was added to the automatic list i.e. via
>Parser.registerScanners().
>
>Due to this I am unable to recognize any of the tags underneath the DIV
>tag.
>
>For example I had INPUT tag underneath the DIV tag. I used
>registerScanners and I also registered InputTagScanner. However I could
>detect DIV tags but not INPUT tags.
>
>Regards,
>
>Dhaval Udani
>Senior Analyst
>M-Line, QPEG
>OrbiTech Solutions Ltd.
>+91-22-28290019 Extn. 1457 |
From: Marc N. <ma...@ke...> - 2003-05-08 01:20:36
|
Works for me. I usually just copy my commit message into the text file anyways! :)

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Wednesday, May 07, 2003 6:28 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] changes.txt and changes.html

The out of sync part is my fault. I didn't realize it was a manual process to update changes.html too. I agree, it should be dropped.

Maybe we shouldn't be manually updating changes.txt either. From "Introduction to SourceForge.net Project CVS Services for Developers" (https://sourceforge.net/docman/display_doc.php?docid=768&group_id=1):

    Commit Messages and Automatic ChangeLog Generation

    If your project team has been careful to provide meaningful commit messages when they have committed changes to your repository, you will have the option of using an automated tool to generate ChangeLog files for inclusion on your web site or in your file releases. The cvs2cl (http://cvsbook.red-bean.com/cvsbook.html#cvs2cl_--_Generate_GNU-Style_ChangeLogs) script may be used from your workstation in order to generate a ChangeLog file based on the commit messages for your repository.

I've tried this for the htmlparser project, getting the changes committed this week so far (./cvs2cl.pl -l "-d'>May 4, 2003'") and it works pretty well:

2003-05-07 18:04  derrickoswald

    * docs/: changes.txt, changes.html: update changelog

2003-05-07 18:00  derrickoswald

    * src/org/htmlparser/: tests/utilTests/NodeListTest.java, util/NodeList.java: added removeAll() to NodeList - Dhaval

2003-05-07 16:16  polarys

    * docs/changes.html, docs/changes.txt, src/org/htmlparser/scanners/ScriptScanner.java, src/org/htmlparser/tests/scannersTests/ScriptScannerTest.java: Fixed NPE in ScriptScanner when a script tag was not ended before the end of document

2003-05-06 07:44  derrickoswald

    * docs/changes.txt: update changelog

2003-05-06 07:37  derrickoswald

    * src/org/htmlparser/parserHelper/ParserHelper.java: Fix #732517 Paser(String) c'tor not handling relative path local file

2003-05-05 20:45  derrickoswald

    * docs/changes.txt: update changelog

2003-05-05 20:43  derrickoswald

    * src/org/htmlparser/: scanners/LabelScanner.java, scanners/SelectTagScanner.java, tags/SelectTag.java: nodelist modifications from Dhaval

2003-05-04 23:12  derrickoswald

    * src/org/htmlparser/: Node.java, NodeReader.java, Parser.java, RemarkNode.java, RemarkNodeParser.java, StringNode.java, parserHelper/AttributeParser.java, parserHelper/CompositeTagScannerHelper.java, parserHelper/StringParser.java, parserHelper/TagParser.java, parserapplications/LinkExtractor.java, <snip - it's every file in the source tree> visitors/StringFindingVisitor.java, visitors/TagFindingVisitor.java, visitors/TextExtractingVisitor.java, visitors/UrlModifyingVisitor.java: update version headers to 1.3-20030504

If this kind of thing is acceptable, then all we need to do is convince all developers that the commit messages are really important and need to be carefully constructed (http://www.red-bean.com/cvs2cl/changelogs.html), since they will then be automatically added to the changelog when a release happens. I'm willing to add that task to the list of things done at release time.

The only other thing the changes.txt is probably used for is notification of commits, and that should be handled by the new htmlparser-cvs mailing list. Generating the changelog from commit messages would get rid of all the 'update changelog' emails too.

There is also a cvs2html script (http://cvs.sslug.dk/cvs2html/) that provides not only the changelog entries but also provides a html diff view.

Derrick

Marc Novakowski wrote:

>I just checked in a fix for a NPE I was getting in ScriptScanner when I parsed a document that had a <script> tag that wasn't closed before the end of document. As usual, I also updated the changes.txt and changes.html files. I noticed these two files are a little out of sync with each other, which begs the question of why do we have two files that have (or are supposed to have) the exact same content, only with different file extensions?
>
>It seems silly that committers have to edit both of these files. Maybe we should get rid of "changes.html" since it's just a text file anyways (no markup).
>
>On another note, there are currently 6 tests failing (5 of which weren't failing as of a few weeks ago). Anyone who's checked in changes recently might want to take a look at the failures.
>
>Marc |
From: Derrick O. <Der...@ro...> - 2003-05-08 01:18:29
|
Dahval, DIV is a CompositeTag. Are you saying the INPUT tag isn't one of the children of the DIV tag. If that's the case file a bug report. Derrick dha...@or... wrote: >In the HTMLParser version of 27th April I believe registering of Div >Scanner and Table scanner was added to the automatic list i.e via >Parser.registerScanners(). > >Due to this I am unable to recognize any of the tags underneath the DIV >tag. > >For example I had INPUT tag underneath the DIV tag. I used >registerScanners and I also registered InputTagScanner. However I could >detect DIV tags but not INPUT tags. > >Regards, > >Dhaval Udani >Senior Analyst >M-Line, QPEG >OrbiTech Solutions Ltd. >+91-22-28290019 Extn. 1457 > > > > |
From: Derrick O. <Der...@ro...> - 2003-05-08 01:13:55
|
The out of sync part is my fault. I didn't realize it was a manual process to update changes.html too. I agree, it should be dropped. Maybe we shouldn't be manually updating changes.txt either. From "Introduction to SourceForge.net Project CVS Services for Developers" (https://sourceforge.net/docman/display_doc.php?docid=768&group_id=1) Commit Messages and Automatic ChangeLog Generation If your project team has been careful to provide meaningful commit messages when they have committed changes to your repository, you will have the option of using an automated tool to generate ChangeLog files for inclusion on your web site or in your file releases. The <http://cvsbook.red-bean.com/cvsbook.html#cvs2cl_--_Generate_GNU-Style_ChangeLogs>cvs2cl (http://cvsbook.red-bean.com/cvsbook.html#cvs2cl_--_Generate_GNU-Style_ChangeLogs) script may be used from your workstation in order to generate a ChangeLog file based on the commit messages for your repository. I've tried this for the htmlparser project, getting the changes commited this week so far (./cvs2cl.pl -l "-d'>May 4, 2003'") and it works pretty well: 2003-05-07 18:04 derrickoswald * docs/: changes.txt, changes.html: update changelog 2003-05-07 18:00 derrickoswald * src/org/htmlparser/: tests/utilTests/NodeListTest.java, util/NodeList.java: added removeAll() to NodeList - Dhaval 2003-05-07 16:16 polarys * docs/changes.html, docs/changes.txt, src/org/htmlparser/scanners/ScriptScanner.java, src/org/htmlparser/tests/scannersTests/ScriptScannerTest.java: Fixed NPE in ScriptScanner when a script tag was not ended before the end of document 2003-05-06 07:44 derrickoswald * docs/changes.txt: update changelog 2003-05-06 07:37 derrickoswald * src/org/htmlparser/parserHelper/ParserHelper.java: Fix #732517 Paser(String) c'tor not handling relative path local file 2003-05-05 20:45 derrickoswald * docs/changes.txt: update changelog 2003-05-05 20:43 derrickoswald * src/org/htmlparser/: scanners/LabelScanner.java, 
scanners/SelectTagScanner.java, tags/SelectTag.java: nodelist modifications from Dhaval 2003-05-04 23:12 derrickoswald * src/org/htmlparser/: Node.java, NodeReader.java, Parser.java, RemarkNode.java, RemarkNodeParser.java, StringNode.java, parserHelper/AttributeParser.java, parserHelper/CompositeTagScannerHelper.java, parserHelper/StringParser.java, parserHelper/TagParser.java, parserapplications/LinkExtractor.java, <snip - it's every file in the source tree> visitors/StringFindingVisitor.java, visitors/TagFindingVisitor.java, visitors/TextExtractingVisitor.java, visitors/UrlModifyingVisitor.java: update version headers to 1.3-20030504 If this kind of thing is acceptable, then all we need to do is convince all developers that the commit messages are really important and need to be carefully constructed (http://www.red-bean.com/cvs2cl/changelogs.html), since they will then be automatically added to the changelog when a release happens. I'm willing to add that task to the list of things done at release time. The only other thing the changes.txt is probably used for is notification of commits, and that should be handled by the new htmlparser-cvs mailing list. Generating the changelog from commit messages would get rid of all the 'update changelog' emails too. There is also a cvs2html script (http://cvs.sslug.dk/cvs2html/) that provides not only the changelog entries but also provides a html diff view. Derrick Marc Novakowski wrote: >I just checked in a fix for a NPE I was getting in ScriptScanner when I parsed a document that had a <script> tag that wasn't closed before the end of document. As usual, I also updated the changes.txt and changes.html files. I noticed these two files are a little out of sync with each other, which begs the question of why do we have two files that have (or are supposed to have) the exact same content, only with different file extensions? > >It seems silly that commiters have to edit both of these files. 
Maybe we should get rid of "changes.html" since it's just a text file anyways (no markup). > >On another note, there are currently 6 tests failing (5 of which weren't failing as of a few weeks ago). Anyone who's checked in changes recently might want to take a look at the failures. > >Marc > > > > |
From: Marc N. <ma...@ke...> - 2003-05-07 20:20:41
|
I just checked in a fix for a NPE I was getting in ScriptScanner when I parsed a document that had a <script> tag that wasn't closed before the end of document. As usual, I also updated the changes.txt and changes.html files. I noticed these two files are a little out of sync with each other, which begs the question of why do we have two files that have (or are supposed to have) the exact same content, only with different file extensions?

It seems silly that committers have to edit both of these files. Maybe we should get rid of "changes.html" since it's just a text file anyways (no markup).

On another note, there are currently 6 tests failing (5 of which weren't failing as of a few weeks ago). Anyone who's checked in changes recently might want to take a look at the failures.

Marc |
From: <dha...@or...> - 2003-05-07 14:36:00
|
In the HTMLParser version of 27th April I believe registering of Div Scanner and Table scanner was added to the automatic list i.e via Parser.registerScanners(). Due to this I am unable to recognize any of the tags underneath the DIV tag. For example I had INPUT tag underneath the DIV tag. I used registerScanners and I also registered InputTagScanner. However I could detect DIV tags but not INPUT tags. Regards, Dhaval Udani Senior Analyst M-Line, QPEG OrbiTech Solutions Ltd. +91-22-28290019 Extn. 1457 |
From: <dha...@or...> - 2003-05-07 14:34:09
|
Derrick,

On the HTMLParser documentation page, there is currently a link to the Javadocs of version 1.3. The link is an absolute URL pointing to the version on the Internet. Since the Javadoc is distributed along with the package, you could make it a relative URL.

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457 |
From: Derrick O. <Der...@ro...> - 2003-05-06 23:11:07
|
Setting up the list takes 6-24 hours. I'll post to all the other lists when it's ready. Marc Novakowski wrote: >Can someone either add me to the syncmail list, or set up syncmail to send to the developer list? > >Thanks, >Marc > > > |
From: Marc N. <ma...@ke...> - 2003-05-06 22:31:40
|
Can someone either add me to the syncmail list, or set up syncmail to send to the developer list?

Thanks,
Marc

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Saturday, May 03, 2003 10:15 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] cvs syncmail

Hi Derrick,

> I've set this up for myself, but if people think
> it's useful I can set
> up an email list.

I've set this up too for myself - but didn't think of setting up a list for it - that would be a neat way of doing it without having everyone setup watches.. I'd vote that the messages could go to the dev list directly.

Regards,
Somik

__________________________________
Do you Yahoo!? The New Yahoo! Search - Faster. Easier. Bingo. http://search.yahoo.com

-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek Welcome to geek heaven. http://thinkgeek.com/sf
_______________________________________________
Htmlparser-user mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
From: Derrick O. <Der...@ro...> - 2003-05-06 01:00:17
|
You're right, there aren't many places in the code making calls to the feedback subsystem. The reason to internalize the Apache logging classes is to avoid another jar. It's not changing rapidly anyway. Derrick dha...@or... wrote: >Its not that all our files use the Feedback and hence I really don't >think all files should require change. I've explained below. > > >How about making ParserFeedback extend Log interface of Logging Wrapper. >Then DefaultHTMLParserFeedback should declare feedback as follows : > >private static ParserFeedback feedback; > >Obviously all mode related code would have to be deprecated/deleted. We >could have a single argument constructor as follows > >public DefaultParserFeedback (Class clazz) >{ > feedback = LogFactory.getInstance(clazz) >} > >and a no-arg constructor as > >public DefaultParserFeedback () >{ > feedback = LogFactory.getInstance(this.class) >} > >Classes using the same will create feedbacks as : > >new DefaultParserFeedback(this.class); > >Hope I've not taken things too simply and have given a comprehensive >idea of how the Logging Wrapper should be incorporated with the Parser. > >Regards, >Dhaval > > > > |
From: <dha...@or...> - 2003-05-05 07:26:12
|
I agree with Derrick. Usage should decide the different library jars and, Derrick, a poll would definitely be in order. The groupings you have suggested are quite appropriate. I would also like to suggest incremental libraries, i.e. the entire parser_core.jar be incorporated in parser_edit.jar, and similarly for parser_applications.jar. Let the developer not have to keep 3 things in the CLASSPATH!!!

Dhaval

-----Original Message-----
From: DerrickOswald [mailto:Der...@ro...]
Sent: Saturday, May 03, 2003 5:47 PM
To: htmlparser-developer
Cc: DerrickOswald
Subject: [Htmlparser-developer] configuration items

Since it's a library incorporated within other applications, size is always an issue. There are two aspects though, disk footprint (jar size) and memory usage. Usually, there is a speed/memory usage trade-off to be made, which is only sometimes reflected in the disk footprint size. With current desktop hardware, people usually trade off memory for speed. It's only with embedded or mobile applications you concentrate on disk size and memory consumption.

Regarding your picture, the layers won't necessarily follow the current package structure. For example, logging is integral to the core parser to report problems, and the beans layer removes all HTML tags so it can't be used by upper layers.

In order to decide the breakdown in layers, a poll of users regarding typical use-cases might be in order. Let's say there are two major groupings:

1) extraction of all or part of the information on a page to be consumed by another application.
2) rewriting URLs, content, specific tags, clean-up, reformatting or pretty printing HTML text

This would suggest three configuration items (jars):

parser_applications.jar - Sample applications, GUI tools, beans, tests
parser_edit.jar - Rewriting tools, DOM type hierarchical editing, visitors, smart tags
parser_core.jar - Read-only core parser, stream of undifferentiated tags

If a program's parser usage involves extraction, it need only use the parser_core.jar and pass through the data in a stream-like fashion. But if rewriting is in order, they use both parser_core.jar and parser_edit.jar and the parser presents the full HTML document as a hierarchy of tag specific nodes. All else goes into parser_applications.

We could probably get parser_core.jar below 25KB, or in that range.

Derrick

Somik Raha wrote:

<snip>
> [1] I find the parser's differentiating factor is its size - time and
> time again the feedback I've received is that folks love its being
> below 100K. Size almost directly maps on to simplicity. And that
> impacts the other important area - performance.
>
> [2] I hate to pay for what I don't need - when folks get tons of stuff
> that they don't need, they are paying for the needs of a few.
>
> At the same time, I think it is a challenge to be able to accommodate
> new requests and still keep the parser light. I see a natural layer
> forming:
>
> ,----------------------------------------.
> | Sample Applications, GUI               |
> | ,'''''''''''''''''''''''''''''''`.     |
> | | Logging Mechanism              |     |
> | | ,''''''''''''''''''''''''''|   |     |
> | | | Beans                    |   |     |
> | | | +--------------------b   |   |     |
> | | | | Scanners           |   |   |     |
> | | | | ,---------------Y  |   |   |     |
> | | | | | Core Parser   |  |   |   |     |
> | | | | `.............../  |   |   |     |
> | | | L____________________|   |   |     |
> | | |                          |   |     |
> | | '`''''''''''''''''''''''''''   |     |
> | | default, log4j, jdk1.4         |     |
> | `................................/     |
> |________________________________________|
>
> If we can perform this separation in the design and the packaging, it
> might allow people to choose what they need. We don't have to follow
> the "one size fits all" policy.
>
> What are your thoughts? I am not sure how we'd achieve this separation
> or whether it really makes sense - so please jump in with your two cents..
>
> Regards,
> Somik
> <snip> |
From: <dha...@or...> - 2003-05-05 07:21:11
|
>> Now for the setLabel method of LabelTag. Considering all things (i.e. as
>> much as my mind can think of), how about using a NodeList as a
>> parameter? Developers can either use the original NodeList with some
>> modifications or create an entirely new one and pass it to the method
>> which in turn will effectively replace the childTags variable in
>> CompositeTag. An overloaded String parameter based one can also be given
>> with information related to its possible slow performance due to
>> internal parsing. All this is only to shield users from inner-level
>> code. Otherwise everything that is required to be done can get done but
>> only after getting some knowledge over these mailing lists......thanks to
>> Somik and Derrick.

> It is doubtful that users of the parser will normally need this. Advanced
> users would - who write their own scanners. We'd probably like them to know
> something about the internals - not too much of course. Striking the right
> balance is hard. It's always good to revisit a problem. Now that you know
> what you do, what do you think is the minimum and simplest, given that you
> don't want to lose performance? (You are in the advanced user category now)

Let's go in for the NodeList based approach. (However, I do believe that the String based approach will give a lot of flexibility even though it may impact performance.......which may not be crucial to all.)

> Hmm.. I seem to have misplaced them (apologies). Can you mail them to
> Derrick?
> Derrick-> Dhaval is inside a firewall, and does not have access to our CVS
> repository.

Will do. Derrick, I'll take the latest sources and give my changes over them. I will also add a removeAll method to NodeList.

Regards,
Dhaval |
From: <dha...@or...> - 2003-05-05 07:02:40
|
Hi Derrick,

> There didn't appear to be anything in the licence
> (http://jakarta.apache.org/commons/license.html) that precludes
> repackaging the code into the htmlparser tree. It isn't very large. We
> would have to say

Derrick, is there any real need to repackage the logging code with our code? Primarily, this would mean that we won't be able to take advantage of any new enhancements or bug fixes in the Logging Wrapper in the Apache tree. Why not just use their latest binary?

> There is however, the task of re-working every file in the source tree
> to use the logging wrapper mechanism, which is non-trivial (190 files,
> 83 of which are tests).

Not all of our files use the Feedback, so I really don't think all files should require change. I've explained below.

> Of course for backwards compatibility, we should just deprecate the
> ParserFeedback (et al) and provide an implementation for it in terms of
> the new logging code.

How about making ParserFeedback extend the Log interface of the Logging Wrapper? Then DefaultParserFeedback should declare feedback as follows:

    private static ParserFeedback feedback;

Obviously all mode-related code would have to be deprecated or deleted. We could have a single-argument constructor as follows:

    public DefaultParserFeedback (Class clazz) {
        feedback = LogFactory.getLog(clazz);
    }

and a no-arg constructor as:

    public DefaultParserFeedback () {
        feedback = LogFactory.getLog(getClass());
    }

Classes using it will create feedbacks as:

    new DefaultParserFeedback(getClass());

I hope I've not oversimplified things and have given a comprehensive idea of how the Logging Wrapper should be incorporated with the Parser.

Regards,
Dhaval |
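One way to picture the proposal is a feedback object that simply forwards each call to a logger. The sketch below is self-contained and makes several assumptions: `Log` is a hypothetical three-method stand-in for the Commons Logging `Log` interface, `FeedbackSketch` stands in for the parser's feedback interface, and `ListLog` is an in-memory logger invented for this example. In a real build the logger would presumably come from the static `LogFactory.getLog(Class)` lookup in Commons Logging instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a subset of the Commons Logging Log interface. */
interface Log {
    void info(Object message);
    void warn(Object message);
    void error(Object message, Throwable cause);
}

/** Hypothetical stand-in for the parser's feedback interface. */
interface FeedbackSketch {
    void info(String message);
    void warning(String message);
    void error(String message, Exception cause);
}

/** In-memory Log, used only so the example needs no real logging backend. */
class ListLog implements Log {
    final List<String> lines = new ArrayList<>();

    public void info(Object message) { lines.add("INFO " + message); }
    public void warn(Object message) { lines.add("WARN " + message); }
    public void error(Object message, Throwable cause) {
        lines.add("ERROR " + message + (cause == null ? "" : ": " + cause.getMessage()));
    }
}

/** Feedback implementation that forwards every call to a Log. */
class LoggingFeedback implements FeedbackSketch {
    // Held per instance (not static) so each owner can have its own logger.
    private final Log log;

    LoggingFeedback(Log log) {
        if (log == null) {
            throw new IllegalArgumentException("log must not be null");
        }
        this.log = log;
    }

    public void info(String message) { log.info(message); }
    public void warning(String message) { log.warn(message); }
    public void error(String message, Exception cause) { log.error(message, cause); }
}
```

Delegation rather than inheritance is a design choice of this sketch: it keeps the existing feedback methods intact for backwards compatibility while letting the application decide which logging backend sits behind them.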
From: Derrick O. <Der...@ro...> - 2003-05-05 03:43:16
|
This is the second candidate release for wrapping up version 1.3 and proceeding to version 1.4.

Integration Build 1.3 - 20030504
--------------------------------
[1] Fixed bug #728609 - Stack Overflow error. Caused composite tag scanner design to evolve further. Ul-Li scanner relationship can be interesting study model for designing other scanner relationships.
[2] Fixed bug #729334 - NodeReader::readElement() Null point dereferenced
[3] Fixed bug #729368 - Embedded quote and split tag
[4] Fixed bug #731684 - Body and title tags with attributes not parsed. This will lead to a change in behaviour: BODY tags with attributes will now have all the html body encapsulated within them as children. If you were handling BODY tags as composite tags before, this shouldn't make any difference. |
From: Derrick O. <Der...@ro...> - 2003-05-03 14:33:51
|
Sourceforge CVS services can be set up to send email notification when commits occur. I've set this up for myself, but if people think it's useful I can set up an email list. |
From: Derrick O. <Der...@ro...> - 2003-05-03 12:12:22
|
Oops, I should have said that the parser_core.jar outputs a stream of undifferentiated *nodes*.

Derrick Oswald wrote:

> <snip>
> parser_core.jar - Read-only core parser, stream of undifferentiated tags
> <snip> |
From: Derrick O. <Der...@ro...> - 2003-05-03 12:03:32
|
Since it's a library incorporated within other applications, size is always an issue. There are two aspects though: disk footprint (jar size) and memory usage. Usually, there is a speed/memory usage trade-off to be made, which is only sometimes reflected in the disk footprint size. With current desktop hardware, people usually trade off memory for speed. It's only with embedded or mobile applications that you concentrate on disk size and memory consumption.

Regarding your picture, the layers won't necessarily follow the current package structure. For example, logging is integral to the core parser to report problems, and the beans layer removes all HTML tags so it can't be used by upper layers. In order to decide the breakdown in layers, a poll of users regarding typical use-cases might be in order.

Let's say there are two major groupings:

1) extraction of all or part of the information on a page to be consumed by another application.
2) rewriting URLs, content, specific tags, clean-up, reformatting or pretty printing HTML text

This would suggest three configuration items (jars):

parser_applications.jar - Sample applications, GUI tools, beans, tests
parser_edit.jar - Rewriting tools, DOM type hierarchical editing, visitors, smart tags
parser_core.jar - Read-only core parser, stream of undifferentiated tags

If a program's parser usage involves extraction, it need only use parser_core.jar and pass through the data in a stream-like fashion. But if rewriting is in order, it uses both parser_core.jar and parser_edit.jar, and the parser presents the full HTML document as a hierarchy of tag specific nodes. All else goes into parser_applications.

We could probably get parser_core.jar below 25KB, or in that range.

Derrick

Somik Raha wrote:
<snip>

> [1] I find the parser's differentiating factor is its size - time and
> time again the feedback I've received is that folks love its being
> below 100K. Size almost directly maps on to simplicity. And that
> impacts the other important area - performance.
>
> [2] I hate to pay for what I don't need - when folks get tons of stuff
> that they don't need, they are paying for the needs of a few.
>
> At the same time, I think it is a challenge to be able to accommodate
> new requests and still keep the parser light. I see a natural layer
> forming:
>
> ,----------------------------------------.
> | Sample Applications, GUI               |
> | ,'''''''''''''''''''''''''''''''`.     |
> | | Logging Mechanism              |     |
> | | ,''''''''''''''''''''''''''|   |     |
> | | | Beans                    |   |     |
> | | | +--------------------b   |   |     |
> | | | | Scanners           |   |   |     |
> | | | | ,---------------Y  |   |   |     |
> | | | | | Core Parser   |  |   |   |     |
> | | | | `.............../  |   |   |     |
> | | | L____________________|   |   |     |
> | | |                          |   |     |
> | | '`''''''''''''''''''''''''''   |     |
> | | default, log4j, jdk1.4         |     |
> | `................................/     |
> |________________________________________|
>
> If we can perform this separation in the design and the packaging, it
> might allow people to choose what they need. We don't have to follow
> the "one size fits all" policy.
>
> What are your thoughts? I am not sure how we'd achieve this separation
> or whether it really makes sense - so please jump in with your two cents..
>
> Regards,
> Somik
>
<snip> |
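The extraction use case described in this message, where a consumer reads a flat stream from the core and never builds a tree, might be sketched as below. `FlatNode` and `TextExtractor` are hypothetical names made up for this illustration; nothing here reflects the actual contents of a parser_core.jar, which at this point in the thread is only a proposal.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical flat node, as a read-only core parser might emit it. */
class FlatNode {
    final boolean isTag;
    final String text;

    FlatNode(boolean isTag, String text) {
        this.isTag = isTag;
        this.text = text;
    }
}

/** Extraction use case: consume the stream once and keep only the text nodes. */
class TextExtractor {
    static List<String> extract(Iterator<FlatNode> stream) {
        List<String> out = new ArrayList<>();
        while (stream.hasNext()) {
            FlatNode node = stream.next();
            if (!node.isTag) {
                out.add(node.text);   // tags are skipped without being interpreted
            }
        }
        return out;
    }
}
```

Because the consumer holds one node at a time, memory use stays flat regardless of page size; a rewriting tool would instead need the hierarchy that the proposed parser_edit.jar layer provides.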
From: Somik R. <so...@ya...> - 2003-05-02 15:55:06
|
> Well, the Node[] returned by getChildrenAsNodeArray is a copy of the
> original children nodelist. I used the NodeList obtained from
> getChildren() and changed contents in that to get my work done. It
> worked!!!

Ah yes, I recall the copy now.

> How about a removeAll()? I felt the need for that since I was replacing
> the entire child list with a single node. It will be useful for others also
> who want to change the number of child nodes. At present I had to remove
> each one individually. The only advantage, of course, is cleaner (not to
> mention easier) developer code.

Sounds good. I didn't need it, so didn't put it in, but if you do - go ahead.

> Now for the setLabel method of LabelTag. Considering all things (i.e. as
> much as my mind can think of), how about using a NodeList as a
> parameter? Developers can either use the original NodeList with some
> modifications or create an entirely new one and pass it to the method,
> which in turn will effectively replace the childTags variable in
> CompositeTag. An overloaded String parameter based one can also be given,
> with information related to its possible slow performance due to
> internal parsing. All this is only to shield users from inner-level
> code. Otherwise everything that is required to be done can get done, but
> only after getting some knowledge over these mailing lists... thanks to
> Somik and Derrick.

It is doubtful that users of the parser will normally need this. Advanced users would - who write their own scanners. We'd probably like them to know something about the internals - not too much of course. Striking the right balance is hard. It's always good to revisit a problem. Now that you know what you do, what do you think is the minimum and simplest, given that you don't want to lose performance? (You are in the advanced user category now)

> Also, can we have a no-args constructor for LabelScanner? I think I had
> sent these files to Somik for updating in CVS (along with SelectTag to
> use NodeList instead of List)

Hmm.. I seem to have misplaced them (apologies). Can you mail them to Derrick?

Derrick -> Dhaval is inside a firewall, and does not have access to our CVS repository.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-05-02 14:46:01
|
Hi Derrick, Dhaval, Everyone,

I've been thinking about this for a while too, and I agree with you - we shouldn't decide before 1.3 is out. Some issues that have been on my mind -

[1] I find the parser's differentiating factor is its size - time and time again the feedback I've received is that folks love its being below 100K. Size almost directly maps on to simplicity. And that impacts the other important area - performance.

[2] I hate to pay for what I don't need - when folks get tons of stuff that they don't need, they are paying for the needs of a few.

At the same time, I think it is a challenge to be able to accommodate new requests and still keep the parser light. I see a natural layer forming:

,----------------------------------------.
| Sample Applications, GUI               |
| ,'''''''''''''''''''''''''''''''`.     |
| | Logging Mechanism              |     |
| | ,''''''''''''''''''''''''''|   |     |
| | | Beans                    |   |     |
| | | +--------------------b   |   |     |
| | | | Scanners           |   |   |     |
| | | | ,---------------Y  |   |   |     |
| | | | | Core Parser   |  |   |   |     |
| | | | `.............../  |   |   |     |
| | | L____________________|   |   |     |
| | |                          |   |     |
| | '`''''''''''''''''''''''''''   |     |
| | default, log4j, jdk1.4         |     |
| `................................/     |
|________________________________________|

If we can perform this separation in the design and the packaging, it might allow people to choose what they need. We don't have to follow the "one size fits all" policy.

What are your thoughts? I am not sure how we'd achieve this separation or whether it really makes sense - so please jump in with your two cents..

Regards,
Somik

----- Original Message -----
From: "Derrick Oswald" <Der...@ro...>
To: <htm...@li...>
Sent: Friday, May 02, 2003 4:13 AM
Subject: Re: [Htmlparser-developer] ParserFeedback mechanism

> I looked at this earlier too, regarding the org.apache.commons.logging
> package:
> http://jakarta.apache.org/commons/logging.html
> which provides a thin logging-system agnostic interface.
>
> There didn't appear to be anything in the licence
> (http://jakarta.apache.org/commons/license.html) that precludes
> repackaging the code into the htmlparser tree. It isn't very large. We
> would have to say
>
> "This product includes software developed by the Apache Software Foundation (http://www.apache.org/)."
>
> somewhere in the documentation.
>
> There is however, the task of re-working every file in the source tree
> to use the logging wrapper mechanism, which is non-trivial (190 files,
> 83 of which are tests).
> I would suggest this be undertaken when version 1.3 is finished. I think
> we can arbitrarily set a cut-off point for 1.3 next week, unless a major
> show-stopper is discovered.
> Of course for backwards compatibility, we should just deprecate the
> ParserFeedback (et al) and provide an implementation for it in terms of
> the new logging code.
>
> Derrick
>
> dha...@or... wrote:
>
> >Hi guys,
> >
> >I remember we had a discussion about the feedback mechanism earlier. I
> >just wanted to restart it by suggesting use of the Logging Wrapper from
> >Jakarta.
> >
> >I have noticed that if anyone wants to use the ParserFeedback to log,
> >then they will mostly need to extend the DefaultParserFeedback class and
> >override the methods appropriately. If we can map the ParserFeedback
> >class to the Logging Wrapper, applications can easily use the Feedback
> >mechanism to log to Log4j and JDK 1.4 without having to do a thing. Most
> >users, according to me, would be using one of these systems. I believe the
> >argument then was coupling with a third-party library. But I believe the
> >flexibility it offers outstrips the coupling drawback.
> >
> >Furthermore, imagine an application which is using some other logging
> >tool. They have coded their entire logging framework using the Logging
> >Wrapper and have used an adapter to log to their logging tool. If they
> >use the parser and want to log its output as well, they will have to
> >write one more adapter. Instead, if the parser provides a mechanism for
> >using the Logging Wrapper, they would not need to do anything.
> >
> >We have actually had requests wherein different clients have asked for
> >different logging tools to be used!!! Hence the request.
> >
> >We could simply extend from DefaultParserFeedback for LogWrapperFeedback
> >and make it implement the commons logging interface.
> >
> >Do let me know your thoughts/opinions/suggestions on the same.
> >
> >Regards,
> >Dhaval |