htmlparser-developer Mailing List for HTML Parser (Page 24)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-12-11 19:45:40

Hi Sam,

Can you post the debug output that you see? Also, if you can provide the URL you are parsing, we can attempt to duplicate this.

Unfortunately my IDE crashed yesterday after I did a Windows update, so I might take a day or two to get back to programming from home.

Regards,
Somik

--- Sam Joseph <ga...@yh...> wrote:
> However I still seem to be getting the same debug
> output. Can you see what I am doing wrong?
From: Sam J. <ga...@yh...> - 2002-12-11 03:03:00

Hi Somik

Somik wrote:

>>Most importantly I seem to get a lot of debug output
>>text that I would
>>prefer to avoid, see the examples below. Perhaps I'm
>>mistaken but this
>>seems to be output by default. Is there some way for
>>me to avoid getting
>>this debug output?
>
>Of course - HTMLParser now takes in a logging object -
>HTMLParserFeedback. All you have to do is to implement
>this interface and pass your object in to the parser.
>If you don't, a DefaultHTMLParserFeedback object is
>created - and its function is to send log data to
>System.out.

Well I wrote the following:

    private class BlankHTMLParserFeedback
        implements HTMLParserFeedback
    {
        public void info(String message)
        {
            //System.out.println("INFO: " + message);
        }

        public void warning(String message)
        {
            //System.out.println("WARNING: " + message);
        }

        public void error(String message, HTMLParserException e)
        {
            //System.out.println("ERROR: " + message);
            e.printStackTrace();
        }
    }

    /**
     * parse the page
     */
    public final void parse()
        throws Exception
    {
        if (o_parser==null)
            o_parser = new HTMLParser(o_url, new BlankHTMLParserFeedback());

        o_parser.addScanner(new HTMLMetaTagScanner("-t"));
        o_parser.addScanner(new HTMLLinkScanner("-l"));
        o_parser.addScanner(new HTMLTitleScanner("-a"));
        parseURLForData();
        o_summary = createSummary();
    }

However I still seem to be getting the same debug output. Can you see what I am doing wrong?

Have you considered using log4j? With log4j you have a log4j properties file and you can specify the debug level on a class by class basis within the properties file, and debug output can be formatted to give you useful info such as the line number of the code where the debug statement is.

Thanks in advance.

CHEERS> SAM

p.s. is there some operation for picking up HTML comments using the HTMLParser (<!-- a comment -->) or are they automatically ignored?
From: Somik R. <so...@ya...> - 2002-12-10 22:32:29

Hi Sam,

> Most importantly I seem to get a lot of debug output text that I would
> prefer to avoid, see the examples below. Perhaps I'm mistaken but this
> seems to be output by default. Is there some way for me to avoid getting
> this debug output?

Of course - HTMLParser now takes in a logging object - HTMLParserFeedback. All you have to do is to implement this interface and pass your object in to the parser. If you don't, a DefaultHTMLParserFeedback object is created - and its function is to send log data to System.out.

> Also as I was upgrading various parts of the software there were
> naturally a few method name changes such as:
>
> HTMLEndTag.getContent() --> HTMLEndTag.getTagName()
>
> since there were no notes in the javadoc I had to run the following:
>
> HTMLEndTag x_end_tag =(HTMLEndTag)p_node;
> o_cat.debug("x_end_tag.toString(): " + x_end_tag.toString());
> o_cat.debug("x_end_tag.toHTML(): " + x_end_tag.toHTML());
> o_cat.debug("x_end_tag.toPlainTextString(): " + x_end_tag.toPlainTextString());
> o_cat.debug("x_end_tag.getTagName(): " + x_end_tag.getTagName());
> o_cat.debug("x_end_tag.getText(): " + x_end_tag.getText());
>
> to determine which of the new method calls would give me which kind of
> data. Not such an important point, just a nudge about the importance of
> providing method descriptions (ideally with examples) for your javadoc :-)

You are right - the javadoc on HTMLEndTag needs to be updated. Thanks for pointing that out.

Regards,
Somik
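
A minimal sketch of the feedback approach Somik describes, not the library's actual code. The real HTMLParserFeedback interface is only known in this thread through the three methods in Sam's implementation, so the ParserFeedback interface below is a hypothetical stand-in with the same method names (its error method takes a plain Exception rather than HTMLParserException), and CollectingFeedback is an illustrative name. It keeps messages in a list rather than writing them to System.out:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class QuietFeedbackSketch {
    /** Hypothetical stand-in for the library's HTMLParserFeedback interface. */
    interface ParserFeedback {
        void info(String message);
        void warning(String message);
        void error(String message, Exception e); // assumption: plain Exception
    }

    /** Keeps log messages in memory instead of printing them. */
    static final class CollectingFeedback implements ParserFeedback {
        private final List<String> messages = new ArrayList<>();

        @Override public void info(String message) {
            messages.add("INFO: " + message);
        }

        @Override public void warning(String message) {
            messages.add("WARNING: " + message);
        }

        @Override public void error(String message, Exception e) {
            // Record the failure; no printStackTrace() here, since that
            // would still write to the console.
            messages.add("ERROR: " + message + " (" + e + ")");
        }

        /** Read-only view of everything recorded so far. */
        List<String> messages() {
            return Collections.unmodifiableList(messages);
        }
    }

    public static void main(String[] args) {
        CollectingFeedback feedback = new CollectingFeedback();
        feedback.info("parsing started");
        feedback.error("bad tag", new IllegalStateException("unclosed"));
        System.out.println(feedback.messages().size()); // prints 2
    }
}
```

With the real library, the equivalent object would presumably be handed to the parser constructor, as in the new HTMLParser(o_url, new BlankHTMLParserFeedback()) call Sam posted in this thread.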
From: Sam J. <ga...@yh...> - 2002-12-10 12:47:31

Hi Somik,

I finally got round to making the NeuroGridHTMLParser work with the latest version of the HTMLParser. A number of issues have arisen.

Most importantly I seem to get a lot of debug output text that I would prefer to avoid, see the examples below. Perhaps I'm mistaken but this seems to be output by default. Is there some way for me to avoid getting this debug output?

Also as I was upgrading various parts of the software there were naturally a few method name changes such as:

    HTMLEndTag.getContent() --> HTMLEndTag.getTagName()

since there were no notes in the javadoc I had to run the following:

    HTMLEndTag x_end_tag =(HTMLEndTag)p_node;
    o_cat.debug("x_end_tag.toString(): " + x_end_tag.toString());
    o_cat.debug("x_end_tag.toHTML(): " + x_end_tag.toHTML());
    o_cat.debug("x_end_tag.toPlainTextString(): " + x_end_tag.toPlainTextString());
    o_cat.debug("x_end_tag.getTagName(): " + x_end_tag.getTagName());
    o_cat.debug("x_end_tag.getText(): " + x_end_tag.getText());

to determine which of the new method calls would give me which kind of data. Not such an important point, just a nudge about the importance of providing method descriptions (ideally with examples) for your javadoc :-)

Keep up the good work on this project.

CHEERS> SAM

========auto-debug examples =============

    ; begins at : -1; ends at : -1
    Text = to provide a means to discover that the data set exists and how it might be obtained or accessed; and ; begins at : -1; ends at : -1
    Text = to document the content, quality, and features of a data set, indicating its fitness for use. ; begins at : -1; ends at : -1
    Text = NOTE: ; begins at : -1; ends at : -1

and

    LinkData
    --------
    0 Begin Tag : IMG ALIGN=left WIDTH=31 HEIGHT=31 BORDER=0 SRC="../pics/misc/nscape1.gif" ALT="nscape1.gif"; begins at : 0; ends at : 48
    *** END of LinkData ***
    Link to : http://cgi.netscape.com/cgi-bin/123.cgiurllist; titled : ; begins at : 0; ends at : 53, AccessKey=null
    LinkData
    --------
    0 Begin Tag : IMG ALIGN=right WIDTH=31 HEIGHT=31 BORDER=0 SRC="../pics/misc/nscape1.gif" ALT="nscape1.gif"; begins at : 0; ends at : 48
    *** END of LinkData ***

and

    META TAG
    --------
    Http-Equiv : null
    Name : Description
    Contents : Examples of HTML META tag codes
    META TAG
    --------
    Http-Equiv : null
    Name : KeyWords
    Contents : HTML, META
From: Somik R. <so...@ya...> - 2002-12-10 08:30:13

Oh, I thought you were using this all along... This is also on the htmlparser cvs page.

Regards,
Somik

--- Derrick Oswald <Der...@ro...> wrote:
> Just to finish up this thread... the answer is to
> use ext instead of pserver and follow the instructions here:
> https://sourceforge.net/docman/display_doc.php?docid=761&group_id=1
>
> Derrick
From: Derrick O. <Der...@ro...> - 2002-12-10 07:17:46

Just to finish up this thread... the answer is to use ext instead of pserver and follow the instructions here:
https://sourceforge.net/docman/display_doc.php?docid=761&group_id=1

Derrick

Kaarle Kaila wrote:

> I set up the environment variables first
> (in windows as I explained earlier. i have done it also in Linux)
>
> Then I use some of the commands (something like this)
>
> cvs checkout htmlparser
> cvs update htmlparser
> cvs add file.java (from correct directory or with full path)
> cvs release -d htmlparser
> cvs commit
>
> I have no other parameters in the command
>
> regards
> Kaarle
From: Leslie R. <le...@op...> - 2002-12-09 18:18:40

Stephen J. Harrington wrote:

> I already created a work around, so it doesn't kill me.
>
> I just hated to have to spend the time to make a new connection to the
> source I am scraping since the pipe I am using is small.

Don't make a new connection. Just do a mark(10000) right after the Reader is opened, call parse, and do a Reader.reset() before calling parse again. The connection will remain, and the BufferedReader will hold onto the html string between calls to parse. The # 10000 is an example only -- you'll have to provide a value large enough to accommodate whatever stream length you expect or the subsequent reset will fail.

> I would be fine with it the way it is, provided the docs are updated.
>
> Thanks for looking into this.
>
> --stephen
>
> Somik Raha wrote:
>
>> Hi Folks,
>>
>> We've come up with an interesting problem - there was a request by
>> Steve Harrington recently that we support multiple-sequential parsing,
>> i.e. use the same parser object multiple times to parse instead of
>> creating a new one each time.
>>
>> Unfortunately this has caused us to play around with the reader and
>> try to mark and reset streams. This is not such a good idea as for
>> large streams there is no guarantee that a reset will work. Leslie
>> suggests that we note this in the javadoc, and roll back this feature.
>>
>> Our complete bug report and discussion is at
>> https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399
>> The bug id is #649133. A discussion of this bug is in order, and it
>> would be good if developers can participate with their views.
>>
>> Steve --> It will be good to hear your views on this.
>>
>> Regards,
>> Somik

--
Leslie Rohde
mailto:le...@op...
http://www.optitext.com
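
The mark/reset approach Leslie describes can be sketched with the standard java.io classes. This is only an illustration under stated assumptions: a StringReader stands in for the network stream, the drain() helper stands in for a call to parse, and the class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class MarkResetDemo {
    /** Reads everything left in the reader (stand-in for one parse pass). */
    static String drain(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    /** Consumes the same source twice, rewinding with mark/reset in between. */
    static String[] readTwice(Reader source, int readAheadLimit) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        // Mark right after opening; the limit must cover the whole stream.
        reader.mark(readAheadLimit);
        String first = drain(reader);
        // May throw IOException if more than readAheadLimit chars were read.
        reader.reset();
        String second = drain(reader);
        return new String[] { first, second };
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>hello</body></html>";
        String[] passes = readTwice(new StringReader(html), 10000);
        System.out.println(passes[0].equals(passes[1])); // prints "true"
    }
}
```

The read-ahead limit is the weak point Somik raises elsewhere in the thread: BufferedReader only promises to preserve the mark while fewer characters than the limit have been read, so for a stream of unknown size there is no safe value to pass.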
From: Stephen J. H. <Ste...@tr...> - 2002-12-09 18:06:19

I already created a work around, so it doesn't kill me.

I just hated to have to spend the time to make a new connection to the source I am scraping since the pipe I am using is small.

I would be fine with it the way it is, provided the docs are updated.

Thanks for looking into this.

--stephen

Somik Raha wrote:

> Hi Folks,
>
> We've come up with an interesting problem - there was a request by
> Steve Harrington recently that we support multiple-sequential parsing,
> i.e. use the same parser object multiple times to parse instead of
> creating a new one each time.
>
> Unfortunately this has caused us to play around with the reader and
> try to mark and reset streams. This is not such a good idea as for
> large streams there is no guarantee that a reset will work. Leslie
> suggests that we note this in the javadoc, and roll back this feature.
>
> Our complete bug report and discussion is at
> https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399
> The bug id is #649133. A discussion of this bug is in order, and it
> would be good if developers can participate with their views.
>
> Steve --> It will be good to hear your views on this.
>
> Regards,
> Somik
From: Kaarle K. <kaa...@ik...> - 2002-12-09 15:08:32

At 08:27 9.12.2002 -0500, Derrick Oswald wrote:
>I can check out OK, do everything in fact, except commit.
>The messages I get indicate everything is working as it should, I need to
>login to access the repository as expected, but I suspect there is some
>cron job at sourceforge that needs to run to add me to the 'write access' list:

I set up the environment variables first
(in windows as I explained earlier. i have done it also in Linux)

Then I use some of the commands (something like this)

cvs checkout htmlparser
cvs update htmlparser
cvs add file.java (from correct directory or with full path)
cvs release -d htmlparser
cvs commit

I have no other parameters in the command

regards
Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Derrick O. <Der...@ro...> - 2002-12-09 13:21:45
|
I can check out OK, do everything in fact, except commit. The messages I get indicate everuthing is working as it should, I need to login to access the repository as expected, but I suspect there is some cron job at sourceforge that needs to run to add me to the 'write access' list: $echo $CVS_RSH /usr/bin/ssh -2 $ cvs -t -d :pserver:der...@cv...:/cvsroot/htmlparser commit -F /tmp/tempCommit62337output src/org/htmlparser/util/HTMLLinkProcessor.java cvs commit: notice: main loop with CVSROOT=:pserver:der...@cv...:/cvsroot/htmlparser cvs commit: authorization failed: server cvs.kacoma.sourceforge.net rejected access to /cvsroot/htmlparser for user derrickoswald cvs commit: used empty password; try "cvs login" with a real password $ cvs -t -d :pserver:der...@cv...:/cvsroot/htmlparser login cvs login: notice: main loop with CVSROOT=:pserver:der...@cv...:/cvsroot/htmlparser (Logging in to der...@cv...) CVS password: $ cvs -t -d :pserver:der...@cv...:/cvsroot/htmlparser commit -F /tmp/tempCommit62337output src/org/htmlparser/util/HTMLLinkProcessor.java cvs commit: notice: main loop with CVSROOT=:pserver:der...@cv...:/cvsroot/htmlparser -> Sending file `HTMLLinkProcessor.java' to server cvs [server aborted]: "commit" requires write access to the repository $ Kaarle Kaila wrote: > At 09:53 8.12.2002 -0800, Somik Raha wrote: > >> Oh yes, Claude has, and so has Kaarle. The CVSROOT/users file is only >> useful for adding watchers. I could do that, but as far as I >> understand, once you do ssh, you should be able to access the >> repository and check in code. >> >> Claude, Kaarle--> Can you shed more light on this ? > > > hi, > > I have installed SSH software from www.f-secure.com > > I think it was something like F-Secure SSH 5.2 for > Win95/98/ME/NT4.0/2000/XP Client > > It is a nice grapfical SSH client both for terminal use and filetransfer > and it also contains commandline ssh2 software that CVS needs. 
> > To access CVS I first set it up with these commands > > set CVS_RSH=ssh2 > set CVSROOT=use...@cv...:/cvsroot/htmlparser > > username = your sourceforge username > > In an empty directory I then can give CVS commands such as > > cvs checkout htmlparser > > It asks for your password to sourceforge > > This retrieves the latest file versions. > Check the CVS commands in some handbook you can find on the internet. > The manual I found is called Version Management with CVS by Per > Cederqvist et al. > perhaps from http://www.cvshome.org > > Kaarle > >> Regards, >> Somik >> >> ----- Original Message ----- >> From: <mailto:Der...@ro...>Derrick Oswald >> To: <mailto:so...@ya...>Somik Raha >> Cc: <mailto:so...@us...>so...@us... >> Sent: Sunday, December 08, 2002 7:13 AM >> Subject: Re: cvs access >> >> Somik, >> >> I can successfully ssh, which set up a ~/.ssh/known_hosts2 file on my >> system, but I think you need to add me to the >> htmlparser/CVSROOT/users file or something. >> >> As near as I can tell, you are the only one who has ever dropped >> something into the repository. >> How do all the other developers affect the code base? >> Did they ever successfully commit using cvs? >> >> Derrick >> >> Somik Raha wrote: >> >>> >>> Hi Derrick >>> >>> This is probably a sourceforge quirk - you might need to ssh >>> once, and >>> >>> then you should be able to write to the repository. By the way, I have >>> already >>> >>> committed the fix for HTMLFormTag.. 
>>> >>> >>> Regards, >>> >>> Somik >>> >>> ----- Original Message ----- >>> >>> From: "Derrick Oswald" >>> <mailto:Der...@ro...><Der...@ro...> >>> >>> To: <mailto:so...@ya...><so...@ya...> >>> >>> Sent: Saturday, December 07, 2002 2:06 PM >>> >>> Subject: cvs access >>> >>> >>> >>> >>>> >>>> Somik, >>>> >>>> >>>> I tried dropping a small change and I get: >>>> >>>> cvs [server aborted]: "commit" requires write access to the repository >>>> >>>> >>>> Does CVS write access lag behind being added to the htmlparser project >>>> >>>> as a developer or is there some step you've forgotten? >>>> >>>> >>>> Derrick >>>> >>>> >>> >>> >>> >>> >>> > > --------------------------------------------- > Kaarle Kaila > http://www.iki.fi/kaila > mailto:kaa...@ik... > tel: +358 50 3725844 > > > ------------------------------------------------------- > This sf.net email is sponsored by:ThinkGeek > Welcome to geek heaven. > http://thinkgeek.com/sf > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > |
From: Somik R. <so...@ya...> - 2002-12-09 01:38:31
|
Hi Leslie, I prefer the second, non-compat, approach on architectural grounds. If the creator of the reader knows the length of data, which is the most common case, then it (the creator) can do the mark and reset where needed with absolute certainty. On the other hand, if the creator of the reader does not know the data length, then it is in every bit as good a position to suggest a length as htmlparser is, and nothing can be gained by delegating to htmlparser. Just to add to the picture - the reset is done when a call is made to the elements() method, since we wish to position the parser back at the beginning of the stream. Now, it just might be that this is not possible - in which case we'd throw an exception. For the user to handle an exception and create a new parser object/move the mark himself in the catch code is an unnecessary complication - don't you think? The whole idea of putting it there was to make it simpler to parse through a given html page again and again using the same parser object. But if that is leading to other complications, it might just be better to take it out and expect that the parser object will need to be created every time. Of course if we can handle all of it in the parser, then I'd think it's worth it, but a middle approach might just benefit neither side. What are your thoughts? Regards, Somik |
From: Somik R. <so...@ya...> - 2002-12-09 01:28:26
|
Hi Folks, This week's release is Candidate 5. We've had talented developers joining us over the weekend, hence, you can expect improvements in quality in the coming weeks. Hopefully, we should have our production release ready by New Year's... From the change log :
Integration Build 1.2 - 20021208
---------------------------------
[1] Fixed bug in base href scanner - would always expect href
[2] Refactored HTMLFormScanner
[3] Refactored HTMLRenderer to use the Visitor pattern - enabling connections with links and images
[4] HTMLStringNode returns a blank string in toPlainTextString()
[5] HTMLFormTag returns string information in toPlainTextString()
#5 is an important fix as now we won't lose any meaningful string info contained inside forms when we issue calls like node.toPlainTextString(). Get the latest release from http://htmlparser.sourceforge.net The site update is continuing at an even pace. There is a new section on writing tests for HTMLParser. We're also trying to introduce a philosophy called "Communicate with TestCases". If you've found a bug, write a testcase for it, and submit that in your report. Of course, you don't have to do this, but if you do, we'd be able to make the fix much faster (and be motivated to make the fix). Writing a testcase for the parser is super simple - you can check the philosophy and an example on the documentation page. http://htmlparser.sourceforge.net/design/index.html Regards, Somik |
From: Kaarle K. <kaa...@ik...> - 2002-12-08 19:14:15
|
At 09:53 8.12.2002 -0800, Somik Raha wrote: >Oh yes, Claude has, and so has Kaarle. The CVSROOT/users file is only >useful for adding watchers. I could do that, but as far as I understand, >once you do ssh, you should be able to access the repository and check in code. > >Claude, Kaarle--> Can you shed more light on this ?
hi,
I have installed SSH software from www.f-secure.com. I think it was something like F-Secure SSH 5.2 for Win95/98/ME/NT4.0/2000/XP Client. It is a nice graphical SSH client both for terminal use and file transfer, and it also contains the command-line ssh2 software that CVS needs.
To access CVS I first set it up with these commands:
set CVS_RSH=ssh2
set CVSROOT=use...@cv...:/cvsroot/htmlparser
(username = your sourceforge username)
In an empty directory I can then give CVS commands such as
cvs checkout htmlparser
It asks for your sourceforge password. This retrieves the latest file versions. Check the CVS commands in some handbook you can find on the internet. The manual I found is called Version Management with CVS by Per Cederqvist et al., perhaps from http://www.cvshome.org
Kaarle
>Regards, >Somik > >----- Original Message ----- >From: <mailto:Der...@ro...>Derrick Oswald >To: <mailto:so...@ya...>Somik Raha >Cc: <mailto:so...@us...>so...@us... >Sent: Sunday, December 08, 2002 7:13 AM >Subject: Re: cvs access > >Somik, > >I can successfully ssh, which set up a ~/.ssh/known_hosts2 file on my >system, but I think you need to add me to the htmlparser/CVSROOT/users >file or something. > >As near as I can tell, you are the only one who has ever dropped something >into the repository. >How do all the other developers affect the code base? >Did they ever successfully commit using cvs? > >Derrick > >Somik Raha wrote: >> >>Hi Derrick >> >> This is probably a sourceforge quirk - you might need to ssh once, and >> >>then you should be able to write to the repository. By the way, I have already >> >>committed the fix for HTMLFormTag.. 
>> >> >>Regards, >> >>Somik >> >>----- Original Message ----- >> >>From: "Derrick Oswald" >><mailto:Der...@ro...><Der...@ro...> >> >>To: <mailto:so...@ya...><so...@ya...> >> >>Sent: Saturday, December 07, 2002 2:06 PM >> >>Subject: cvs access >> >> >> >> >>> >>>Somik, >>> >>> >>>I tried dropping a small change and I get: >>> >>>cvs [server aborted]: "commit" requires write access to the repository >>> >>> >>>Does CVS write access lag behind being added to the htmlparser project >>> >>>as a developer or is there some step you've forgotten? >>> >>> >>>Derrick >>> >>> >> >> >> >> >> --------------------------------------------- Kaarle Kaila http://www.iki.fi/kaila mailto:kaa...@ik... tel: +358 50 3725844 |
From: Somik R. <so...@ya...> - 2002-12-08 17:52:36
|
Oh yes, Claude has, and so has Kaarle. The CVSROOT/users file is only useful for adding watchers. I could do that, but as far as I understand, once you do ssh, you should be able to access the repository and check in code. Claude, Kaarle--> Can you shed more light on this ? Regards, Somik ----- Original Message ----- From: Derrick Oswald To: Somik Raha Cc: so...@us... Sent: Sunday, December 08, 2002 7:13 AM Subject: Re: cvs access Somik, I can successfully ssh, which set up a ~/.ssh/known_hosts2 file on my system, but I think you need to add me to the htmlparser/CVSROOT/users file or something. As near as I can tell, you are the only one who has ever dropped something into the repository. How do all the other developers affect the code base? Did they ever successfully commit using cvs? Derrick Somik Raha wrote: Hi Derrick This is probably a sourceforge quirk - you might need to ssh once, and then you should be able to write to the repository. By the way, I have already committed the fix for HTMLFormTag. Regards, Somik ----- Original Message ----- From: "Derrick Oswald" <Der...@ro...> To: <so...@ya...> Sent: Saturday, December 07, 2002 2:06 PM Subject: cvs access Somik, I tried dropping a small change and I get: cvs [server aborted]: "commit" requires write access to the repository Does CVS write access lag behind being added to the htmlparser project as a developer or is there some step you've forgotten? Derrick |
From: Leslie R. <le...@op...> - 2002-12-07 22:40:47
|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title></title> </head> <body> <br> <br> Somik Raha wrote:<br> <blockquote type="cite" cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra"> <meta http-equiv="Content-Type" content="text/html; "> <meta content="MSHTML 6.00.2800.1106" name="GENERATOR"> <style></style> <div><font face="Arial" size="2">Hi Folks, </font></div> <div><font face="Arial" size="2"> We've come up with an interesting problem - there was a request by Steve Harrington recently that we support multiple-sequential parsing, i.e. use the same parser object multiple times to parse instead of creating a new one each time.</font></div> <div> </div> <div><font face="Arial" size="2"> Unfortunately this has caused us to play around with the reader and try to mark and reset streams. This is not such a good idea as for large streams there is no guarantee that a reset will work. Leslie suggests that we note this in the javadoc, and roll back this feature.</font></div> </blockquote> My initial notion was to do that, which works for any size stream but does break backward compat.<br> I have also tested the length() method approach on pages as large as 60K bytes and all is well.<br> To decide this issue for myself, i went to the code in both htmlparser and the Sun sources. Here is what I found.<br> <br> The most typical use of the feature at hand would be the use of a StringReader wrapped by the HTMLReader, like:<br> String s = "<html>.....</html>";<br> StringReader sr = new StringReader(s);<br> HTMLReader hr = new HTMLReader(sr, s.length());<br> <br> Looking at the mark() and reset() implementations in StringReader we find that they do nothing at all.<br> Not surprising since the StringReader depends on the String (reference held in a member) for storage and there really is no "buffering" per se, since the String itself is obviously entirely in memory. 
The mark() and reset() are really just to keep the protocol consistent with the super-class chain.<br> <br> Looking at HTMLReader, it too does no buffering and likewise imposes the mark() and reset() limit only because of the super-class protocol. In neither class does the mark actually create any buffering or impose any overhead -- we could just as easily hardcode the mark to MaxInt with no space or time penalty at all.<br> <br> However, if HTMLReader were used to wrap another sort of BufferedReader, conditions could be different as BufferedReader will keep up to "readAheadLimit" characters buffered in a char[]. This is pretty good, but not good enough to use MaxInt! ;-) so just changing 5000 to MaxInt is not what we want.<br> <br> But the real problem is not time and space performance, it's error recovery. What we really need to avoid is throwing away a finished parse when the reset() throws an exception, which is precisely what happens in the current release version for all strings longer than 5000 characters.<br> <br> I have implemented both fixes [1. a length() method in the reader plus its use in the parser, and 2. just pull the mark/reset out to the caller] and they each function as expected, with the caveat that the second method is not backward compat with respect to reusing the parser on a user supplied reader.<br> <br> I prefer the second, non-compat, approach on architectural grounds. If the creator of the reader knows the length of data, which is the most common case, then it (the creator) can do the mark and reset where needed with absolute certainty. 
On the other hand, if the creator of the reader does not know the data length, then it is in every bit as good a position to suggest a length as htmlparser is, and nothing can be gained by delegating to htmlparser.<br> <br> <br> <br> <blockquote type="cite" cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra"> <div><font face="Arial" size="2"> </font></div> <div> </div> <div><font face="Arial" size="2"> Our complete bug report and discussion is at <a href="https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399">https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399</a></font></div> <div> </div> <div><font face="Arial" size="2"> The bug id is #649133. A discussion of this bug is in order, and it would be good if developers can participate with their views. </font></div> <div><font face="Arial" size="2"> Steve --> It will be good to hear your views on this.</font></div> <div> </div> <div><font face="Arial" size="2">Regards,</font></div> <div><font face="Arial" size="2">Somik</font></div> </blockquote> <br> <pre class="moz-signature" cols="$mailwrapcol">-- Leslie Rohde <a class="moz-txt-link-freetext" href="mailto:le...@op...">mailto:le...@op...</a> <a class="moz-txt-link-freetext" href="http://www.optitext.com">http://www.optitext.com</a> </pre> <br> </body> </html> |
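The mark/reset behaviour discussed in this message can be tried out with plain java.io classes. The following is only a minimal sketch, not htmlparser code: the class name MarkResetDemo and the helper canResetAfterReading are made up for illustration, and HTMLReader is not involved. It assumes a BufferedReader with a deliberately tiny 4-character buffer so that reading past the read-ahead limit forces a refill; the Java documentation only says that reset() "may fail" in that situation, so the exact outcome should be treated as implementation behaviour rather than a guarantee.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class MarkResetDemo {

    /**
     * Hypothetical helper: marks the reader with the given read-ahead limit,
     * reads up to charsToRead characters one at a time, then tries reset().
     * Returns true if reset() succeeded, false if it threw an IOException.
     */
    public static boolean canResetAfterReading(Reader reader, int readAheadLimit, int charsToRead)
            throws IOException {
        reader.mark(readAheadLimit);
        for (int i = 0; i < charsToRead; i++) {
            if (reader.read() == -1) {
                break; // end of input reached early
            }
        }
        try {
            reader.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>some page text</body></html>";

        // StringReader: the whole string is in memory, so the limit argument
        // is documented as ignored and reset() should work for any length.
        System.out.println("StringReader, limit 1: "
                + canResetAfterReading(new StringReader(page), 1, page.length()));

        // BufferedReader with a 4-char buffer and a limit of 2: reading the
        // whole page goes past the limit, so the mark is expected to be lost.
        System.out.println("BufferedReader, limit 2: "
                + canResetAfterReading(new BufferedReader(new StringReader(page), 4), 2, page.length()));

        // Same reader, but the creator knows the data length and passes a
        // limit that covers the whole input, as suggested in the message.
        System.out.println("BufferedReader, limit length+1: "
                + canResetAfterReading(new BufferedReader(new StringReader(page), 4),
                        page.length() + 1, page.length()));
    }
}
```

The third call is the case argued for above: whoever creates the reader and knows the data length can choose a limit under which reset() is safe, instead of relying on a fixed value such as 5000 inside the parser.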
From: Derrick O. <Der...@ro...> - 2002-12-07 20:10:24
|
I'm using htmlparser on the second version of a personal project, an email answering system. see http://members.rogers.com/robotz/ Somik Raha wrote: > Hi Folks, > Please welcome Derrick Oswald to the dev group. > Derrick is a java developer with a company called Autodesk (the > AutoCAD people). In his words, > "I've been programming for the last 17 years. I've been programming in > java since version 1.0 first came out. > I'm a Sun Certified Developer." > > Its great that we are getting highly experienced people on this > project. Derrick -> Please feel free to ask any questions and suggest > any improvements on this list. > > Just a reminder - as developers - you can access the CVS repository > and directly check in your modifications. But please do so after > you've written testcases and ensured that existing tests dont break. > > Looking forward to exciting times. > > Regards, > Somik > > (PS: I am a little curious, whats Autodesk doing with html parsing ? ) |
From: Somik R. <so...@ya...> - 2002-12-07 19:52:14
|
Hi Folks, Please welcome Derrick Oswald to the dev group. Derrick is a java developer with a company called Autodesk (the AutoCAD people). In his words, "I've been programming for the last 17 years. I've been programming in java since version 1.0 first came out. I'm a Sun Certified Developer." It's great that we are getting highly experienced people on this project. Derrick -> Please feel free to ask any questions and suggest any improvements on this list. Just a reminder - as developers - you can access the CVS repository and directly check in your modifications. But please do so after you've written testcases and ensured that existing tests don't break. Looking forward to exciting times. Regards, Somik (PS: I am a little curious, what's Autodesk doing with html parsing?) |
From: Somik R. <so...@ya...> - 2002-12-07 07:20:42
|
Hi Folks, We've come up with an interesting problem - there was a request by Steve Harrington recently that we support multiple-sequential parsing, i.e. use the same parser object multiple times to parse instead of creating a new one each time. Unfortunately this has caused us to play around with the reader and try to mark and reset streams. This is not such a good idea as for large streams there is no guarantee that a reset will work. Leslie suggests that we note this in the javadoc, and roll back this feature. Our complete bug report and discussion is at https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399 The bug id is #649133. A discussion of this bug is in order, and it would be good if developers can participate with their views. Steve --> It will be good to hear your views on this. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-12-07 07:03:39
|
Hi Folks, We've got a new developer in our midst - Leslie Rohde. Leslie is a highly experienced programmer, programming since 1974 (that was before I was born). Here's his intro : "I have been a programmer since 1974. I have been a tech biz owner, much of it indep contract dev, since 1988. I currently build tools for the search engine optimization (SEO) community. These tools are cross platform, pure java. See http://www.optitext.com for my current lead product (OptiLink). I am currently using the built-in html parsing support in J2SE but there are so many problems and they are so difficult to fix -- owing to the monolithic/intertwined nature of that package -- that I have elected to build my next release using an alternate parser. Having looked at several options, your project wins because it is lightweight (fast is not much of an issue in my application domain) and open source. Ultimately, I will want to use htmlparser to compile a DOM tree but my current needs can be met with just link spidering and text scanning. By embedding htmlparser in my toolset, you can be sure that it will (come to) work on a really wide range of real html. I'm sure to be a source of lots of email traffic at first as I figure out the ins-and-outs of your project, but since this _is_ my "day job" I have time and motivation to get things fixed when they don't work." On behalf of the rest of us - welcome!! The most exciting days on this project are when we get a dynamic new developer, for that's when we take on interesting, new directions. Feel free to ask as many questions as you like on this list. For intros about the rest-of-us, check http://htmlparser.sourceforge.net/design/contributors.html Cheers, Somik |
From: Somik R. <so...@ya...> - 2002-12-02 02:56:54
|
Hi Folks, Candidate Release 4 is out. This actually contains a few minor API changes which won't affect your application, but have been done to improve the OO design of the system. HTMLFormScanner has been improved. The major work in this release went into refactoring 201 testcases, so as to make them more readable and follow the Once-And-Only-Once paradigm. Well, the package size dropped about 12KB (after zipping), so you can estimate how much refactoring was done.. All tests are passing. From the Change Log,
Integration Build 1.2 - 20021201
--------------------------------
[1] Refactored HTMLNode, API improved, now HTMLNode stores nodeBegin and nodeEnd.
[2] Refactored Testing framework - to reduce the code size substantially.
[3] HTMLFormScanner improved to include Input, TextArea, Select and Option scanners within
You can get it from http://htmlparser.sourceforge.net There's an all-new Contributors Page (linked from the main site). Just in case I missed anybody, or you have info to add, please let me know. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-11-26 06:39:53
|
Hi Folks, Candidate 3 is out. You can get it from http://htmlparser.sourceforge.net The website is getting an overhaul, though this is in progress. You will find a new samples page. If anyone wishes to contribute a simple program to add to the catalog, please feel free to come forward. From the change log, in this release :
Integration Build 1.2 - 20021125
--------------------------------
[1] Incorporated Bug Fix for HTMLLinkProcessor to parse dynamic urls
[2] Refactored package names to org.htmlparser
[3] Added documentation
[4] Can handle url with spaces in it
[5] Fixed bug 643352 - going into infinite loop on bad img within link
[6] Refactored HTMLLinkTag - unnecessary boolean variables removed
Developers --> can you send me a brief bio, with a pic - I'd like to acknowledge everyone who has contributed to this project. Regards, Somik |
From: <dha...@or...> - 2002-11-26 04:23:28
|
Hi Somik, > Oh, in that case, we should not be registering the input, select, textareas directly. Instead, we could do it within form > tags, like its done in the frame scanner (enumerating thru frameset elements from within). Yeah that would make a lot of sense since obviously these tags can only be within the form tag. |
From: Somik R. <so...@ya...> - 2002-11-26 04:07:33
|
Hi Dhaval, > > No problem. What about the Form scanner issue ? > > The Form Scanner is registered by default now while it was not the case > earlier. Hence if I try to scan tags like INPUT, SELECT, TEXTAREA which > have to be within FORM tags then they are not picked up. After I remove > the form scanner from the registered list then it works fine. Oh, in that case, we should not be registering the input, select and textarea scanners directly. Instead, we could do it within form tags, like it's done in the frame scanner (enumerating through frameset elements from within). What do you think? Regards, Somik |
From: <dha...@or...> - 2002-11-26 04:03:03
|
Hi Somik, > No problem. What about the Form scanner issue ? The Form Scanner is registered by default now while it was not the case earlier. Hence if I try to scan tags like INPUT, SELECT, TEXTAREA which have to be within FORM tags then they are not picked up. After I remove the form scanner from the registered list then it works fine. > Hmm.. this is actually a reconstruction issue. Lets look into this together. > Can you also peep into the code (HTMLScriptScanner.scan())? Yeah. I'll do that. Bye, Dhaval |
From: Amit R. <ami...@ya...> - 2002-11-22 03:30:32
|
> > outHTML.append(((HTMLStringNode)node).getText() > > +"\n"); Instead, use the following: String string = new String(((HTMLStringNode)node).getText().getBytes("8859_4"),"newEncoding"); For reading Japanese pages I use newEncoding = Shift_JIS. For Chinese it can be (I have never tested, though) newEncoding = BIG5 or Big5_HKSCS or GB2312. Somik, that's why I suggested some time back to have a function/method to tell the programmer what encoding htmlparser is using internally. > We've got an expert on internationalization on this > list.. > Amit -- do you have time to check this ? I hope the first line was not for me :-) |
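The re-decoding trick in this message can be sketched with the standard library alone. This is an illustration under stated assumptions, not htmlparser code: the class RedecodeDemo and the method redecode are hypothetical names, and the sketch uses ISO-8859-1 and UTF-8 (both guaranteed on every Java platform, via StandardCharsets in Java 7 or later) in place of the "8859_4" and Shift_JIS pair from the message. It also assumes the text was first decoded with a single-byte charset that maps every byte to a character, since only then can the original bytes be recovered.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RedecodeDemo {

    /**
     * Hypothetical helper: turns a string that was decoded with the wrong
     * charset back into bytes using that same (assumed) charset, then
     * decodes those bytes with the charset the page was really written in.
     */
    public static String redecode(String garbled, Charset assumed, Charset actual) {
        byte[] originalBytes = garbled.getBytes(assumed);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        // Three Japanese characters plus ASCII, written with Unicode escapes
        // so the source file encoding does not matter.
        String original = "\u65e5\u672c\u8a9e text";

        // Simulate a page sent as UTF-8 but decoded as ISO-8859-1.
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);

        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("repaired equals original: " + repaired.equals(original));
    }
}
```

For the case described in the message one would pass Charset.forName("Shift_JIS") as the actual charset, provided the JRE ships that charset. The sketch also shows why a method reporting the parser's internal encoding would help: without it, the first argument to redecode is a guess.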