htmlparser-developer Mailing List for HTML Parser (Page 17)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2003-03-30 23:25:06
|
> It's most likely that users don't know the character set though, either
> from the HTTP or HTML header, so automatic handling by default is best.

I agree.

> I noted that four tests are failing after last week's integration:
> testStringBeanListener

This one has been failing for a while.

> I'm not sure about the others, but the bean listener test is failing
> because the handling of tables has changed. The test is to show that
> extracted text contains the link URLs when the links property is set to
> true. It would now have to dig into the table tags to find the link.
> I'm looking at collectInto() but can't see how to collect string and
> link tags so the links can be inserted in context into the text (where
> they are found). I'm also wrestling with the issue of handling
> <pre></pre>, since collectInto() doesn't seem to be able to give that
> kind of information. I guess collectInto() is too blunt a tool.

If you're trying to collect strings AND links and keep them in context, your best bet is to write your own visitor.

> testThreadSafety

Thanks for reporting this - on my end this one's passing. I had left one last variable in TagParser - and I thought it would affect thread safety. So I rigged up that test, but surprisingly it passed every time on my end. Can you send me the failure message? I might need to rework TagParser again.

> testScriptCodeExtraction
> testScriptCodeExtractionWithMultipleQuotes

You can ignore these two - they actually demonstrate a bug which I have no clue about, and I think there's little we can do about it. From my earlier integration release mail (last week):

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the script scanning mechanism. The parser can currently handle script tags like:
<script> <!-- code here --> </script>
But when the tags are like:
<script> code here </script>
the parser is unable to identify the code and treats it like regular tags. Such pages are quite widespread and ought to be supported. I was curious if anyone has ideas on solving this - given the existing design - fresh ideas often lead to a better perspective.

Regards,
Somik |
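A minimal sketch of the "write your own visitor" suggestion, for readers who want to see the shape of it. Everything here is hypothetical: SketchNode, SketchVisitor, TextNode, LinkNode and ContainerNode are stand-ins invented for illustration and are not the HTMLParser classes or their method signatures. The point is only that a visitor sees text and links in document order, so each URL can be emitted right where its link occurs, even when the link is nested inside a container such as a table.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for parser nodes; these are NOT the HTMLParser classes.
interface SketchVisitor {
    void visitText(String text);
    void visitLink(String url, List<SketchNode> children);
}

interface SketchNode {
    void accept(SketchVisitor visitor);
}

final class TextNode implements SketchNode {
    private final String text;
    TextNode(String text) { this.text = text; }
    public void accept(SketchVisitor visitor) { visitor.visitText(text); }
}

final class LinkNode implements SketchNode {
    private final String url;
    private final List<SketchNode> children;
    LinkNode(String url, SketchNode... children) {
        this.url = url;
        this.children = Arrays.asList(children);
    }
    public void accept(SketchVisitor visitor) { visitor.visitLink(url, children); }
}

// A container (for example a table cell) simply forwards the visitor to its children.
final class ContainerNode implements SketchNode {
    private final List<SketchNode> children;
    ContainerNode(SketchNode... children) { this.children = Arrays.asList(children); }
    public void accept(SketchVisitor visitor) {
        for (SketchNode child : children) child.accept(visitor);
    }
}

// Collects text in document order and appends each URL right after its link text.
final class TextWithLinksVisitor implements SketchVisitor {
    private final StringBuilder out = new StringBuilder();
    public void visitText(String text) { out.append(text); }
    public void visitLink(String url, List<SketchNode> children) {
        for (SketchNode child : children) child.accept(this);
        out.append(" <").append(url).append(">");
    }
    String result() { return out.toString(); }
}

final class VisitorDemo {
    static String extract(SketchNode root) {
        TextWithLinksVisitor visitor = new TextWithLinksVisitor();
        root.accept(visitor);
        return visitor.result();
    }

    public static void main(String[] args) {
        SketchNode page = new ContainerNode(
            new TextNode("See "),
            new ContainerNode(new LinkNode("http://example.com/", new TextNode("the docs"))),
            new TextNode(" for details."));
        System.out.println(extract(page));
    }
}
```

Because the traversal order is the document order, nothing has to "dig into" a container after the fact, which is the difficulty described with collectInto() above.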
From: Derrick O. <Der...@ro...> - 2003-03-30 21:19:53
|
Somik,

The capability to setEncoding() is already there, so API users who know the character set can set it before parsing (caveat: there is no test case for this, so it may not work). This can happen before or after the connection is opened, but in the latter case it will cause an input stream reset. In the former case the setting will be overwritten by the incoming HTTP and HTML header values if they are there and differ from what's set. One possible enhancement would be to not allow the headers to override the character set if it's been set via the API, which assumes the user knows what they are doing. It's most likely that users don't know the character set though, either from the HTTP or HTML header, so automatic handling by default is best.

I noted that four tests are failing after last week's integration:
testScriptCodeExtraction
testScriptCodeExtractionWithMultipleQuotes
testStringBeanListener
testThreadSafety

I'm not sure about the others, but the bean listener test is failing because the handling of tables has changed. The test is to show that extracted text contains the link URLs when the links property is set to true. It would now have to dig into the table tags to find the link. I'm looking at collectInto() but can't see how to collect string and link tags so the links can be inserted in context into the text (where they are found). I'm also wrestling with the issue of handling <pre></pre>, since collectInto() doesn't seem to be able to give that kind of information. I guess collectInto() is too blunt a tool.

Derrick

Somik Raha wrote:
> Hi Derrick,
> Continuing our earlier discussion, I've had an idea - instead of
> re-establishing an input stream, suppose we assume that the parser can
> be initialized with a character set - and we use that.
> We could have both strategies in there.
>
> By the way, quite a few steps are failing - I'm guessing that you're
> actively working on those - let me know if there are any issues if I make
> an integration release this week (in case you don't finish).
>
> Regards,
> Somik |
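The proposed enhancement (an API-supplied character set is not overridden by headers) can be sketched as a small precedence rule. This is an illustration under stated assumptions, not the parser's code: CharsetChoice, choose() and charsetOf() are hypothetical names, the ISO-8859-1 fallback is assumed, and the relative order of the HTTP header and the HTML header is also an assumption, since the thread does not specify it.

```java
final class CharsetChoice {
    // Assumed fallback when nothing is declared; the thread does not name the real default.
    static final String FALLBACK = "ISO-8859-1";

    /** Extracts the charset parameter from a Content-Type value, or null if absent. */
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive prefix check, e.g. "Charset=UTF-8".
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                String value = p.substring(key.length()).trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /**
     * Proposed rule: a charset set explicitly through the API wins;
     * otherwise use the HTTP header, then the HTML header, then the fallback.
     */
    static String choose(String apiCharset, String httpCharset, String htmlCharset) {
        if (apiCharset != null) return apiCharset;
        if (httpCharset != null) return httpCharset;
        if (htmlCharset != null) return htmlCharset;
        return FALLBACK;
    }

    public static void main(String[] args) {
        String http = charsetOf("text/html; charset=UTF-8");
        System.out.println(choose(null, http, "Shift_JIS"));    // automatic handling
        System.out.println(choose("EUC-JP", http, "Shift_JIS")); // explicit setting wins
    }
}
```

With no API setting the behaviour stays automatic, which matches the point above that most users will not know the character set in advance.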
From: Somik R. <so...@ya...> - 2003-03-30 06:25:29
|
Hi Derrick,

Continuing our earlier discussion, I've had an idea - instead of re-establishing an input stream, suppose we assume that the parser can be initialized with a character set - and we use that. We could have both strategies in there.

By the way, quite a few steps are failing - I'm guessing that you're actively working on those - let me know if there are any issues if I make an integration release this week (in case you don't finish).

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-03-27 06:30:52
|
Hi Marc,

I will look into it soon - I am in a conference right now, but should be on it this weekend. Meanwhile you are free to analyze it, and tell me anything that you find.

Regards,
Somik

----- Original Message -----
From: "Marc Novakowski" <ma...@ke...>
To: <htm...@li...>
Sent: Monday, March 24, 2003 5:42 PM
Subject: [Htmlparser-developer] RE: [Htmlparser-user] Integration Release 1.3-20030323 is out

By the way, I've entered the OOM exception as a bug (#709152), along with a simple program that reproduces it.

Marc |
From: Marc N. <ma...@ke...> - 2003-03-25 01:43:15
|
By the way, I've entered the OOM exception as a bug (#709152), along with a simple program that reproduces it.

Marc

-----Original Message-----
From: Marc Novakowski
Sent: Monday, March 24, 2003 3:23 PM
To: htm...@li...
Subject: RE: [Htmlparser-user] Integration Release 1.3-20030323 is out

Somik,

Thanks for fixing 702614! Unfortunately I can't seem to get the latest build to work. It's throwing an OOM exception in my own code when using the NodeIterator returned by parser.elements(). I'm looking into this to make sure I'm not doing something stupid in my code. However, the library seems to be acting differently than previous releases even out-of-the-box. For example, the following used to return a list of the links on Yahoo (in the 0302 release):

java -jar ./htmlparser.jar http://www.yahoo.com -l

In the 0323 release, however, it returns nothing.

Marc |
From: Somik R. <so...@ya...> - 2003-03-24 01:22:12
|
Hi Folks,

This week's integration release has two important fixes:

Integration build 1.3 - 20030323
--------------------------------
[1] Fixed bug 702547 - single quotes parsed more robustly now
[2] Fixed bug 702614 - empty tags handled correctly now. Tag now has a method isEmptyXmlTag().

#2 refers to tags like <tag/>.

Thanks to Joe Robbins for a fine bug report that helped in putting in the fix for #1 faster. Thanks also to Marc Novakowski for the other report.

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the script scanning mechanism. The parser can currently handle script tags like:
<script> <!-- code here --> </script>
But when the tags are like:
<script> code here </script>
the parser is unable to identify the code and treats it like regular tags. Such pages are quite widespread and ought to be supported. I was curious if anyone has ideas on solving this - given the existing design - fresh ideas often lead to a better perspective. If you have some ideas, feel free to join the developer list (http://lists.sourceforge.net/lists/listinfo/htmlparser-developer) and post.

Regards,
Somik |
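One possible direction for the script-scanning question, offered only as a sketch and not as the fix the project adopted: once a script open tag has been seen, stop tokenizing tags and take everything up to the next "</script" as raw code. ScriptBodySketch and its methods are hypothetical names; the sketch ignores edge cases such as a ">" inside an attribute value or tag names that merely begin with "script".

```java
final class ScriptBodySketch {
    /** Case-insensitive indexOf that never changes string length (unlike toLowerCase). */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    /**
     * Returns the raw text between the first script open tag and the next
     * script end tag, or null when there is no script open tag. The body is
     * never tokenized, so "<" characters inside the code are left alone.
     */
    static String firstScriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) return null;
        int tagEnd = html.indexOf('>', open);
        if (tagEnd < 0) return null;
        int bodyStart = tagEnd + 1;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) close = html.length(); // unterminated script: take the rest
        return html.substring(bodyStart, close);
    }

    public static void main(String[] args) {
        String html = "<html><SCRIPT type=\"text/javascript\">"
            + " if (a < b) { document.write('<b>hi</b>'); } </script></html>";
        System.out.println(firstScriptBody(html).trim());
    }
}
```

Treating the body as opaque text means both forms quoted above, with and without the surrounding comment markers, go through the same path.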
From: Somik R. <so...@ya...> - 2003-03-16 21:50:44
|
Hi Folks,

Thanks are due to Derrick Oswald, Dhaval Udani for their support on the mailing lists, and to Josh Kerievsky for having shown the refactoring direction which is making a world of difference to the project.

Derrick -> It will be really nice to have some docs about your contribution - could you add a section to the Wiki? Also, one test seems to be failing in testStringBeanListener(). I couldn't figure it out, so I was wondering if you could look into it?

Dave Knipp & others -> I have checked in a module called WikiCapturer. This project uses the parser and converts a standard wiki to static html. If you're interested, you could take over this module and make it a product in its own right - to handle php and modwiki (or any other). It would be a useful thing to have - perhaps with a GUI.

James Crowley -> Thanks for the offer of the J# version and C# version. We can make a release of the former as soon as you are ready. You could take over the J# section of the htmlparser project.

Regards,
Somik

(PS: James, Dave, I am not sure if you folks are on the developer mailing list, let me know if you are, and I won't cc you explicitly) |
From: Somik R. <so...@ya...> - 2003-03-16 21:36:46
|
Hi Folks,

This is a major milestone release. A massive refactoring has been completed (took two weeks) - which has brought all the robust error handling cases into CompositeTagScanner. This means, all tags that have children will be able to do error correction uniformly. Form tag (and table tags too) should be robust. Table tags are not yet in the standard set of scanners (you still need to add them manually). They should make the cut next week.

We have a new method - registerDomScanners() in Parser - that allows you to build html dom objects.

Interesting fact, as a result of the refactorings, the LOC of the scanners package has reduced from 1553 to 1355 (I was surprised at the digits).

Documentation has been updated - we've started putting up answers by our list members to common questions. Pls feel free to update the Wiki and improve it. No login is required.

From the change log:

Integration build 1.3 - 20030316
--------------------------------
[1] Added method finishedParsing() to NodeVisitor
[2] LinkScanner uses CompositeTagScanner.scan()
[3] BulletScanner added
[4] FormScanner uses CompositeTagScanner.scan()
[5] AppletScanner uses CompositeTagScanner.scan()

We highly recommend an upgrade to this version.

Regards,
Somik |
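To make "error correction for tags that have children" concrete, here is one plausible form it could take, sketched under loud assumptions: the thread does not describe what CompositeTagScanner actually does, and CompositeScanSketch, Result and scan() are hypothetical names. The sketch works on a flat list of already-split tokens and closes a composite tag implicitly when the input ends before its end tag, instead of failing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CompositeScanSketch {
    /** Outcome of scanning one composite tag (hypothetical type). */
    static final class Result {
        final List<String> children = new ArrayList<>();
        boolean endTagFound; // false: end tag was missing, tag closed implicitly
        int nextIndex;       // index of the first token after this composite tag
    }

    /**
     * Collects tokens after an open tag until the matching end tag.
     * If the input runs out first, the children gathered so far are kept
     * and endTagFound stays false, rather than raising an error.
     */
    static Result scan(List<String> tokens, int start, String tagName) {
        Result result = new Result();
        String endTag = "</" + tagName + ">";
        int i = start;
        while (i < tokens.size()) {
            String token = tokens.get(i++);
            if (token.equalsIgnoreCase(endTag)) {
                result.endTagFound = true;
                break;
            }
            result.children.add(token);
        }
        result.nextIndex = i;
        return result;
    }

    public static void main(String[] args) {
        // A form whose end tag is missing.
        List<String> tokens = Arrays.asList("<form>", "<input>", "some text");
        Result r = scan(tokens, 1, "form");
        System.out.println(r.children + " endTagFound=" + r.endTagFound);
    }
}
```

Putting this kind of logic in one shared scan routine is what lets every composite tag (link, form, applet) recover the same way, which is the benefit the release note describes.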
From: Mr L. MA <law...@ya...> - 2003-03-09 23:08:54
|
If you have an FTP site, I can upload exception pages to it daily.

Ling Ma

--- Somik Raha <so...@ya...> wrote:
> > One problem I had with FormTag.toString() method is
> > that form tag should be treated as body tag since any
> > other tags could be nested in it.
> >
> > The ultimate htmlparser test would be webase
> > collection from stanford.
>
> What you could really do to speed up our testing is to provide us with urls
> that cause breaks - and keep filing lots of bug reports. That would be a
> great help.
>
> > Is there a way even with readelements=null I can still
> > get the rest nodes?
>
> This usually means the parser has reached the end of the page without
> finding a matching end tag. It is usually a fatal error. But this week, I am
> planning to improve robustness - systemwide. It would be good to have some
> nice bug reports before I start, though.
>
> Regards,
> Somik |
From: Mr L. MA <law...@ya...> - 2003-03-09 23:07:32
|
Can someone look at these two HTML pages? The parser throws exceptions while parsing them.

Ling Ma |
From: Somik R. <so...@ya...> - 2003-03-08 05:12:12
|
Hi Richard,

> Could someone clarify the licensing situation / fulfilment requirements
> of HTMLParser with regard to its inclusion as part of an otherwise
> closed-source commercial app.

Thanks for bringing up this question. The parser is licensed under LGPL. This means applications that USE it don't have to be open-source. But here are two restrictions that apply:

[1] Any modifications made to the library itself must be kept open-source or made available.
[2] Your app source code does not live with the parser source code, but the object code does. That means people should either be able to reverse engineer your product so as to be able to remove the parser library and put a newer version in (gasp!), or you simply provide an external linkage to the parser, whereby folks can swap out the current version with a later version (the idea is to let them have the benefit of the open-source library).

That reverse engineering stuff is actually a cryptic interpretation of the clause - applicable only if you want to provide a single executable in your application (it can be bypassed, but I don't want to further complicate the interpretation for you - let me know if this is the case and I can advise you accordingly).

By the way, if you are not distributing your application, and only using it internally, none of the above applies.

Let me know if that answers your question.

Regards,
Somik

********************************************
Somik Raha
Extreme Programmer and Coach
Industrial Logic, Inc.
so...@in...
http://industriallogic.com
Voice : 510-540-8336
Fax : 510-540-8936
********************************************
Periodic reassessment means looking at things which are taken for granted, things which seem beyond doubt. Periodic reassessment means challenging all assumptions. It is not a matter of reassessing something because there is a need to reassess it; there may be no need at all. It is a matter of reassessing something simply because it is there and has not been assessed for a long time. It is a deliberate and quite unjustified attempt to look at things in a new way.
--- Edward De Bono in Lateral Thinking, Chapter 5, The Use of Lateral Thinking |
From: Richard W. <ri...@ri...> - 2003-03-07 10:18:17
|
Hi, Could someone clarify the licensing situation / fulfilment requirements of HTMLParser with regard to its inclusion as part of an otherwise closed-source commercial app. Richard. |
From: Somik R. <so...@ya...> - 2003-03-07 03:11:28
|
> One problem I had with FormTag.toString() method is
> that form tag should be treated as body tag since any
> other tags could be nested in it.
>
> The ultimate htmlparser test would be webase
> collection from stanford.

What you could really do to speed up our testing is to provide us with urls that cause breaks - and keep filing lots of bug reports. That would be a great help.

> Is there a way even with readelements=null I can still
> get the rest nodes?

This usually means the parser has reached the end of the page without finding a matching end tag. It is usually a fatal error. But this week, I am planning to improve robustness - systemwide. It would be good to have some nice bug reports before I start, though.

Regards,
Somik |
From: Mr L. MA <law...@ya...> - 2003-03-06 17:31:08
|
One problem I had with the FormTag.toString() method is that the form tag should be treated as a body tag, since any other tags could be nested in it.

The ultimate htmlparser test would be the webase collection from stanford. What I did was download a website with an offline browser (such as webstripper). Running StringExtractor on the local collection gives many ParserExceptions. Sometimes I get lucky on some pages by running JTidy before applying HTMLParser, sometimes not.

My focus is to use HTMLParser for text extraction, so I came across "dirty" pages on which HTMLParser gives errors. Is there a way, even with readelements=null, that I can still get the rest of the nodes?

Ling Ma

--- Somik Raha <so...@ya...> wrote:
> Thanks very much for the sample page. My to do list for this week:
> [1] Refactor correction logic in the link scanner to the composite scanner,
> so that it becomes available for all composite tags. That will solve the
> problem you mention.
>
> [2] Work on Dhaval's suggestion - I have some ideas about switching off
> testcases that require the internet.
>
> Regards,
> Somik |
From: Somik R. <so...@ya...> - 2003-03-06 15:24:56
|
I got it sometime back and did fill up the form. Can't say if it's authentic...

Regards,
Somik

----- Original Message -----
From: <dha...@or...>
To: <htm...@li...>
Sent: Thursday, March 06, 2003 4:34 AM
Subject: [Htmlparser-developer] FW: Open Source Research

> Has anyone else got a mail like this? Is it authentic?
>
> Regards,
>
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-28290019 Extn. 1457
>
> -----Original Message-----
> From: cantamessa [mailto:can...@us...]
> Sent: Monday, February 17, 2003 10:42 PM
> To: dhavaludani
> Cc: cantamessa
> Subject: Open Source Research
>
> Dear Sourceforge developer,
>
> The Department of Manufacturing and Economics of the
> Politecnico di Torino, Italy, is running a research project on
> Open Source Software. Within the project we aim to identify
> key success factors in the management of open source
> projects. The project HTML Parser you are cooperating with
> has been selected from www.sourceforge.net to be part of a
> sample of 100 successful projects to be analyzed, and we
> therefore kindly ask you to give us a few minutes of your time
> in order to fill in the attached questionnaire
> https://lepshare.aigest.it/quest/encuesta.asp
>
> We ensure that the data thus gathered will be kept with the
> utmost confidentiality, will be analyzed with statistical
> techniques and results will be presented only in aggregate
> form. If you wish, we will be happy to send you a copy of the
> report with the results of our project.
>
> For the sake of security we will ask you to fill in a secret
> code, yours is 3343
>
> For further information, feel free to contact us at our e-mail
> address os...@le... .
> Please, respond in the next few days.
> I thank you for your help and remain
>
> Yours Sincerely
>
> Prof. Ing. Marco Cantamessa
> mar...@po...
> Dipartimento di Sistemi di Produzione ed Economia dell'Azienda
> Politecnico di Torino
> Corso Duca degli Abruzzi 24 - I 10129 Torino (Italy)
> tel. +39-0115647223, fax +39-0115647299 |
From: Somik R. <so...@ya...> - 2003-03-06 15:05:15
|
Thanks very much for the sample page. My to do list for this week:

[1] Refactor correction logic in the link scanner to the composite scanner, so that it becomes available for all composite tags. That will solve the problem you mention.
[2] Work on Dhaval's suggestion - I have some ideas about switching off testcases that require the internet.

Regards,
Somik

----- Original Message -----
From: "Mr LING MA" <law...@ya...>
To: <htm...@li...>
Sent: Wednesday, March 05, 2003 10:34 PM
Subject: [Htmlparser-developer] Form tag should not be composite tag?

> Hi all:
> Do you guys think form tag should not be composite tag?
> or else it cannot process page like:
>
> http://money.cnn.com/services/glossary/a.html
>
> which misses one form end tag.
>
> Ling Ma |
From: <dha...@or...> - 2003-03-06 12:34:50
|
To: dha...@us...
Subject: Open Source Research
From: Marco Cantamessa <can...@us...>
Date: Mon, 17 Feb 2003 09:12:21 -0800 |
From: Mr L. MA <law...@ya...> - 2003-03-06 06:34:42
|
Hi all:

Do you guys think the form tag should not be a composite tag? Otherwise the parser cannot process a page like:

http://money.cnn.com/services/glossary/a.html

which is missing one form end tag.

Ling Ma |
From: Somik R. <so...@ya...> - 2003-03-04 14:45:13
|
Let me know the version, and I'll make it available for you. Regards, Somik ----- Original Message ----- From: <dha...@or...> To: <htm...@li...> Sent: Tuesday, March 04, 2003 3:36 AM Subject: [Htmlparser-developer] Previous integration releases > Hi, > > My product is using a very old version of HTMLParser. I am not allowed > to distribute its jar file hence I ask people to come to the website and > download the appropriate version which in my case is some particular > integration build of 1.2. However the problem is that I can't find it. > Can someone tell me how to locate a particular integration build > release? > > Regards, > > Dhaval Udani > Senior Analyst > M-Line, QPEG > OrbiTech Solutions Ltd. > +91-22-28290019 Extn. 1457 > > > |
From: <dha...@or...> - 2003-03-04 11:41:04
|
Hi, My product is using a very old version of HTMLParser. I am not allowed to distribute its jar file, hence I ask people to come to the website and download the appropriate version, which in my case is some particular integration build of 1.2. However the problem is that I can't find it. Can someone tell me how to locate a particular integration build release? Regards, Dhaval Udani Senior Analyst M-Line, QPEG OrbiTech Solutions Ltd. +91-22-28290019 Extn. 1457 |
From: Somik R. <so...@ya...> - 2003-03-03 03:52:39
|
Hi Folks, In this week's release, the change log is : Integration build 1.3 - 20030302 -------------------------------- [1] Fixed bug in LinkScanner [2] Cleaned up StringNode interface [3] Cleaned up RemarkNode interface [4] Refactored Parser, created ParserHelper Regards, Somik |
From: Somik R. <so...@ya...> - 2003-03-03 00:09:27
|
Joe Lin wrote: > Another question regarding the collectInto(NodeList > collectionList, java.lang.String filter) method, I > could not seem to find the filter constants for > different Node type. Can anyone point me where these > are? After moving to the class parameters, this method has become redundant. We're planning to take it out. You're better off using the other techniques (the other collectInto or TagFindingVisitor). > BTW, I think HTMLParser is a great software. I have > been looking for Java html parser high and low. > HTMLParser represent a best architecture and user API > to me. I especially like that it is in a sense a > streaming parser. This means performance and optimal > memory usage for me. Thanks for the kind words. We've got a diverse and talented set of people who've been making contributions over a period of time. Kind words always help inspire us to serve the community better. Regards, Somik |
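As a rough picture of what collecting by class parameter means, the sketch below walks a small node tree and gathers every node that is an instance of a given class, so no string filter constants are needed. The Node, TextNode and LinkNode classes here are hypothetical stand-ins written for this example; they are not the library's classes, and the library's own collectInto signature may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; not the HTML Parser's own node classes.
class CollectByClassSketch {

    static class Node {
        final List<Node> children = new ArrayList<>();

        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** Adds this node and all descendants that are instances of type, in document order. */
        void collectInto(List<Node> collected, Class<? extends Node> type) {
            if (type.isInstance(this)) {
                collected.add(this);
            }
            for (Node child : children) {
                child.collectInto(collected, type);
            }
        }
    }

    static class TextNode extends Node {
        final String text;
        TextNode(String text) { this.text = text; }
    }

    static class LinkNode extends Node {
        final String url;
        LinkNode(String url) { this.url = url; }
    }

    /** Builds a tiny tree with a link nested inside a container node, as inside a table. */
    static Node sampleTree() {
        Node table = new Node().add(new LinkNode("http://example.com/")).add(new TextNode("cell"));
        return new Node().add(new TextNode("intro")).add(table);
    }

    public static void main(String[] args) {
        List<Node> links = new ArrayList<>();
        sampleTree().collectInto(links, LinkNode.class);
        for (Node link : links) {
            System.out.println(((LinkNode) link).url);
        }
    }
}
```

Passing a class rather than a string lets the compiler catch a misspelled node type, which is one plausible reason the string-filter variant became redundant.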
From: Somik R. <so...@ya...> - 2003-02-24 18:12:00
|
I was trying to integrate the changes of the latest parser with some existing projects at work - and of course, I had to modify the code to use the new API. I had some suggestions - as I know many of you will be facing the same issue. I use Eclipse, and I hope most of you use a decent IDE that supports refactoring. Get the parser into your IDE, and let all your other project code refer to it (that's how it is set up in my IDE). Then, rename Parser to HTMLParser using your refactoring tool, so that your old references resolve again. Rename it back to Parser, and the tool will fix all your existing code automatically. Do this for some other classes like HTMLNode/Node, etc. and within minutes it should be done. Regards, Somik --- Somik Raha <so...@ya...> wrote: > Hi Folks, > This week's release is out. I've finally taken > heed of all the feedback > I had been receiving about the terrible naming > convention, and have removed > "HTML" from all class names. In addition, > HTMLEnumeration is now > NodeIterator and SimpleEnumeration is > SimpleNodeIterator. HTMLParser is just > Parser. > > This is a big step, so to make it easy for > everyone, there have been no > major bug fixes that will require you to upgrade > right away. I apologize in > advance for inconvenience caused - I hope you don't > curse me too much for > having to modify your programs. I had the option of > doing it in stages, and > forcing you to modify some small thing in every > release, or get it over with > in one sweep. I chose the latter because there were too > many changes and > suffering over a long period of time didn't make > sense. Hopefully, once you > have migrated to the new names, you will appreciate > not having to type > "HTML" each time. > > The BodyScanner contributed by Dhaval Udani is > finally in (Dhaval - > sorry for the delay). > The interesting part is that the documentation > accompanying the package > is now the latest one on the site - it has been > ripped off a Php Wiki. 
I am > thinking that the ripping program might be useful > for those who wish to > provide wiki content as offline documentation (any > feedback on this is > welcome). > > From the change log : > Integration build 1.3 - 20030223 > -------------------------------- > [1] Modification of documentation packaging > - the new documentation is actually produced > by a tiny program that converts wiki pages > into documentation (works with PhpWiki) > [2] Inclusion of BodyScanner, BodyTag > [3] HTMLVisitor is now NodeVisitor - and has an > extra param to > visit itself > [4] HTMLParser is now Parser. No class has HTML > prefix anymore. > [5] HTMLEnumeration is now NodeIterator, > SimpleEnumeration is > SimpleNodeIterator > > Regards, > Somik |
From: Somik R. <so...@ya...> - 2003-02-24 06:15:43
|
Hi Folks, This week's release is out. I've finally taken heed of all the feedback I had been receiving about the terrible naming convention, and have removed "HTML" from all class names. In addition, HTMLEnumeration is now NodeIterator and SimpleEnumeration is SimpleNodeIterator. HTMLParser is just Parser. This is a big step, so to make it easy for everyone, there have been no major bug fixes that will require you to upgrade right away. I apologize in advance for inconvenience caused - I hope you don't curse me too much for having to modify your programs. I had the option of doing it in stages, and forcing you to modify some small thing in every release, or get it over with in one sweep. I chose the latter because there were too many changes and suffering over a long period of time didn't make sense. Hopefully, once you have migrated to the new names, you will appreciate not having to type "HTML" each time. The BodyScanner contributed by Dhaval Udani is finally in (Dhaval - sorry for the delay). The interesting part is that the documentation accompanying the package is now the latest one on the site - it has been ripped off a Php Wiki. I am thinking that the ripping program might be useful for those who wish to provide wiki content as offline documentation (any feedback on this is welcome). From the change log : Integration build 1.3 - 20030223 -------------------------------- [1] Modification of documentation packaging - the new documentation is actually produced by a tiny program that converts wiki pages into documentation (works with PhpWiki) [2] Inclusion of BodyScanner, BodyTag [3] HTMLVisitor is now NodeVisitor - and has an extra param to visit itself [4] HTMLParser is now Parser. No class has HTML prefix anymore. [5] HTMLEnumeration is now NodeIterator, SimpleEnumeration is SimpleNodeIterator Regards, Somik |
From: Derrick O. <Der...@ro...> - 2003-02-16 17:29:40
|
JJ, Somik, I looked at it briefly, and saw that the fetch is returning 403 - access prohibited. In the past, when I've experienced this, there is usually some header field on the connection that needs to be set, like Accept-Language, Referer or User-Agent. I don't think this can be solved in a general way. I believe it needs to be specified differently for different servers and queries. See testPost() in HTMLParserTest.java for how to set header fields on the connection. Some experimentation will be required. Derrick Somik Raha wrote: >>Question: How I can extract links of a page as: >> >>http://www.google.com/search?q=universe >> >> > >Don't know why this is happening- Derrick ? > >Regards, >Somik > > > > |
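For anyone who wants to experiment with the header fields mentioned above, here is a minimal sketch using only the standard java.net API. It does not reproduce testPost(); the class name, the method name and all three header values are assumptions chosen for illustration, and which headers a given server actually requires has to be found by trial.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Illustrative sketch; the header values are guesses, not known-good settings.
class HeaderFieldsSketch {

    /** Prepares a connection with browser-like request headers. Nothing is sent until it is used. */
    static URLConnection openWithHeaders(String url) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            // Request properties must be set before the connection is established.
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleFetcher/0.1)");
            connection.setRequestProperty("Accept-Language", "en");
            connection.setRequestProperty("Referer", "http://www.google.com/");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = openWithHeaders("http://www.google.com/search?q=universe");
        // Reading the property back does not touch the network.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

The prepared connection can then be read with getInputStream(), or handed to whatever part of your code does the fetching; whether the server still answers 403 depends on the server and on its usage terms.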