htmlparser-developer Mailing List for HTML Parser (Page 23)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-12-17 06:40:17
|
Great to have a discussion going! I'd like to branch off all the issues into separate threads so that we can deal with them separately.

> > Logging
> > The use of a feedback object is adequate, but JDK version 1.4 has a
> > rich API, java.util.logging, that we might want to emulate (presuming
> > we don't want to force JDK 1.4 usage).
>
> I would be against forcing JDK 1.4 usage. I would recommend log4j
> http://jakarta.apache.org/log4j/docs/index.html

Using either JDK 1.4 or log4j ties you down to a specific logging API. The latter will also add to the weight of the parser. (I was actually considering log4j some time back, but Claude Duguay convinced me otherwise.) If, however, more logging support is needed, I guess it could be added externally, using a facade (or adapter) over JDK 1.4 logging (or log4j). This is of course open to discussion.

Regards,
Somik
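The facade idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: ParserLog and JulParserLog are hypothetical names, not part of HTMLParser, and the adapter targets java.util.logging, so only the adapter (not the parser core) would need JDK 1.4 or later.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical facade: the only logging type the parser core would see. */
interface ParserLog {
    void info(String message);
    void error(String message, Throwable cause);
}

/** Hypothetical adapter, living outside the parser, that forwards to java.util.logging. */
class JulParserLog implements ParserLog {
    private final Logger logger;

    JulParserLog(Logger logger) {
        this.logger = logger;
    }

    public void info(String message) {
        logger.log(Level.INFO, message);
    }

    public void error(String message, Throwable cause) {
        // SEVERE is the closest java.util.logging level to "error".
        logger.log(Level.SEVERE, message, cause);
    }
}
```

A log4j adapter would implement the same interface, so the choice of logging library stays with the application rather than with the parser.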
From: Sam J. <sa...@ne...> - 2002-12-17 00:21:44
|
Hi Derrick,

Some responses of my own.

Derrick Oswald wrote:

> POST constructor.
> The two basic constructors that HTMLParser has take either a
> string URL or an HTMLReader. This shifts the onus of performing HTTP
> to the API user for POST operations. It might be good to have a
> HttpURLConnection or URLConnection argument constructor, where a
> primed and loaded connection is passed to the parser.

I very much agree (thanks for your previous suggestions on this topic BTW)

> Tables
> The current version flattens tables, pushing the onus on the API user
> to syntactically walk through the table data to get to a certain table
> entry. It may be useful to nest table entries, similar to what the
> FORM tag does now, but have it correctly generate rows and columns.

Have you looked at HTTPUnit? http://httpunit.sourceforge.net/
They have to deal with a lot of similar problems and there may be synergies.

> Logging
> The use of a feedback object is adequate, but JDK version 1.4 has a
> rich API, java.util.logging, that we might want to emulate (presuming
> we don't want to force JDK 1.4 usage).

I would be against forcing JDK 1.4 usage. I would recommend log4j
http://jakarta.apache.org/log4j/docs/index.html

> charset
> Currently the charset directive within the HTML page is ignored. There
> may be a need to honour this parameter on the Content-Type field.

Agreed

CHEERS> SAM
From: <dha...@or...> - 2002-12-16 13:35:22
|
Hi,

Derrick has opened a lovely thread of discussion and I would like to add some of my own thoughts.

Currently the parser does not store any tabs or newlines that may be present on the HTML page. However, if one wants to parse the page and reproduce it, it is imperative that the formatting remains the same, i.e. the parsed page and the unparsed page should look identical (unless, obviously, something was added during the parsing routine). I think it is worthwhile giving this a thought. I may be very selfish in suggesting it, since my usage requires reproducing the HTML page after parsing it and adding some information depending on the tags.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457

-----Original Message-----
From: DerrickOswald [mailto:Der...@ro...]
Sent: Monday, December 16, 2002 7:02 PM
To: htmlparser-developer
Cc: DerrickOswald
Subject: [Htmlparser-developer] version 1.3

This message is just to open discussion. Here are some enhancements that might best be left till the next version.
From: Derrick O. <Der...@ro...> - 2002-12-16 13:24:02
|
This message is just to open discussion. Here are some enhancements that might best be left till the next version.

POST constructor.
The two basic constructors that HTMLParser has take either a string URL or an HTMLReader. This shifts the onus of performing HTTP to the API user for POST operations. It might be good to have a HttpURLConnection or URLConnection argument constructor, where a primed and loaded connection is passed to the parser.

Tables
The current version flattens tables, pushing the onus on the API user to syntactically walk through the table data to get to a certain table entry. It may be useful to nest table entries, similar to what the FORM tag does now, but have it correctly generate rows and columns.

Logging
The use of a feedback object is adequate, but JDK version 1.4 has a rich API, java.util.logging, that we might want to emulate (presuming we don't want to force JDK 1.4 usage).

charset
Currently the charset directive within the HTML page is ignored. There may be a need to honour this parameter on the Content-Type field.

beans
It might be nice to create one or more Java beans that can be used within GUI IDEs. The predefined behavior might be what the parserapplications do now, but exposing some accessors on HTMLParser and providing a zero-arg constructor may also prove useful.

executable jar
There is no default application for the htmlparser.jar, i.e. java -jar htmlparser.jar doesn't do anything at the moment. A little GUI application might be nice. I'm not talking about a browser, but rather a demo of the applications (e.g. a tree view of the links a la robot, a text view a la StringExtractor, a list of mail addresses a la ripper, etc.). This would utilize the beans mentioned above.
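To make the "primed and loaded connection" idea concrete, here is a minimal sketch of what the API user would do before handing the connection over. It uses only standard java.net calls; the class and method names (PostConnectionSketch, encodeForm, primedPost) are made up for illustration, and the parser constructor that would accept the result is still only a proposal.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;

class PostConnectionSketch {
    /** Encodes form fields as application/x-www-form-urlencoded. */
    static String encodeForm(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    /** Opens an HTTP connection and writes the POST body; the caller (or parser) reads the response. */
    static HttpURLConnection primedPost(URL url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes("US-ASCII");
        // Assumes an http(s) URL; other schemes would fail this cast.
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        OutputStream out = connection.getOutputStream();
        try {
            out.write(body);
        } finally {
            out.close();
        }
        return connection;
    }
}
```

With a constructor like the one proposed, the connection returned by primedPost would be passed straight to the parser, which would then read connection.getInputStream() itself.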
From: Derrick O. <Der...@ro...> - 2002-12-16 12:51:06
|
Juan,

I don't see that particular string anymore at http://www.elmundo.es, but another instance is:

    El miércoles tendrá lugar el estreno mundial de la segunda entrega de 'El señor de los anillos'.

which has both the \u00e1 and \u00f1 characters printing correctly. I also tried the jar file directly from the release candidate /htmlparser/htmlparser1_2_20021215.zip and it also prints those characters correctly. You may be using an old jar file.

Derrick

agente007 wrote:

> Do not work! The parser and the example write "El mßs famoso y caro
> vino de pago de Espa?a" Should be "El más famoso y caro vino de pago
> de España". See the character á and the character ñ. (\u00e1 and
> \u00f1") Regards Juan J. Samper
From: agente007 <e-a...@ex...> - 2002-12-16 12:16:22
|
--- On Fri 12/13, Derrick Oswald wrote:
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> So I have changed the default to "8859_1".
>
> I would also make sure that you are able to see the correct glyphs by
> running this:
>
> public class Test
> {
>     public static void main (String[] args)
>     {
>         System.out.println ("El m\u00e1s famoso y caro vino de pago de Espa\u00f1a");
>     }
> }

It does not work! Both the parser and the example print "El mßs famoso y caro vino de pago de Espa?a". It should be "El más famoso y caro vino de pago de España". See the characters á and ñ (\u00e1 and \u00f1).

Regards,
Juan J. Samper
From: Somik R. <so...@ya...> - 2002-12-15 09:29:43
|
Hi Folks,

Candidate 6 is out, and there are some goodies in this one. Thanks to Derrick Oswald and Leslie Rohde (our two new developers), who have put in their time.

From the Change Log:

Integration Build 1.2 - 20021215
--------------------------------
[1] Modified API of HTMLImageTag (refactored name of image loc), HTMLLinkTag (added getters)
[2] Fixed bug 650457 - removeEscapeCharacters() incorrect
[3] Fixed bug 652263 - HTMLParser and null feedback
[4] Changed encoding used from 8859_4 to 8859_1
[5] HTMLRemarkNode returns string data in toPlainTextString() (This is a rollback)
[6] Fixed bug 652746 - HTMLFormTag gets links correctly now
[7] Fixed bug 653720 - HTMLNode uses sun specific class
[8] Improved StringExtractor parser application
[9] Major design improvement, implemented Collection-Parameter pattern - in HTMLNode.collectInto()
[10] Fixed reset crash bug. Reader providers have to explicitly call mark and reset now. This is now documented in HTMLParser.java.
[11] Fixed bug 649269 in HTMLLinkTag.isHttpLink(), now correctly identifies relative links as Http links.

A major API improvement has occurred - HTMLNode now has a new method, collectInto(), which uses a collection parameter to collect nodes. A sample program demonstrating this feature is at:
http://htmlparser.sourceforge.net/samples/linksEmbedded.html

Thanks to everyone who participated in the discussions and architecture changes.

There has been a rollback as well: we've taken out the mark and reset mechanism, and this is now the responsibility of the reader supplier.

Cheers,
Somik
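For readers unfamiliar with the collecting-parameter pattern mentioned in item [9], here is a small self-contained sketch. The classes below (SketchNode, SketchLink, SketchForm) are hypothetical stand-ins, and the real HTMLNode.collectInto() signature may well differ; the point is only that the caller supplies the collection and each node adds itself and recurses into its children.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for HTMLNode; the real class has a richer API. */
abstract class SketchNode {
    private final List<SketchNode> children = new ArrayList<SketchNode>();

    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }

    /** Collecting parameter: matches are appended to the caller's list, depth first. */
    void collectInto(List<SketchNode> collected, Class<? extends SketchNode> wanted) {
        if (wanted.isInstance(this)) {
            collected.add(this);
        }
        for (SketchNode child : children) {
            child.collectInto(collected, wanted);
        }
    }
}

/** Hypothetical link node. */
class SketchLink extends SketchNode {
    final String href;

    SketchLink(String href) {
        this.href = href;
    }
}

/** Hypothetical container node, e.g. a form that embeds links. */
class SketchForm extends SketchNode {
}
```

Because the list is passed in, a caller can gather links embedded at any depth (for example inside a form) in one call, without each node type having to return and merge its own collection.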
From: agente007 <e-a...@ex...> - 2002-12-14 13:12:48
|
--- On Fri 12/13, Derrick Oswald wrote:
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> Yes, there seems to be a problem.
>
> The openURLConnection() method of HTMLParser uses "8859_4" encoding
> which presumably maps to iso-8859-4.
>
> So I have changed the default to "8859_1".

Thanks! I will try it.

Juan J
From: Somik R. <so...@ya...> - 2002-12-14 07:10:20
|
Hi Sam, Dhaval,

To set the record straight,

> > 1. return the original HTML
>
> I think as per the rest of the parser this activity should be done using the
> toHTML() method.
>
> > 2. return the text appearing within it that is not a default part of the tag
>
> This should be done with the toPlainTextString() method.

This is our intention. toPlainTextString() will be fixed for the next release.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-12-14 05:41:12
|
Hi Sam,

> I'm not sure what the downside is to having a toPlainTextString() call
> in the HTMLRemarkNode. Remember I don't have such a wonderful
> understanding of the HTMLParser itself. For example I'm not sure what
> you mean when you say that the remark text data would appear in your
> string filter. I'm not sure what a string filter is ... At the moment
> it seems I have to explicitly check for HTMLRemarkNodes and then process
> them if I want to ....

A string filter is a program that filters a page and gives you the string. To create a string filter, I'd write a loop that calls node.toPlainTextString().

There is no downside to having toPlainTextString() implemented in HTMLRemarkNode. It was there till last week. I took it out on an incorrect notion: very often, people comment out HTML tags, that isn't "plain text", and it shows up in the output of HTMLRemarkNode's toPlainTextString(). Once we put this functionality back in, you won't need to check for HTMLRemarkNodes explicitly.

What might have sounded confusing in my earlier mail was this: if tags are present inside the HTMLRemarkNode, as in

    <!-- <sometag> <someothertag> ... Text <blah></blah> -->

we could recursively parse it to get the actual text. But that is a developer's flight of fancy; you can ignore that.

Regards,
Somik
From: Derrick O. <Der...@ro...> - 2002-12-14 04:27:21
|
Yes, there seems to be a problem.

The openURLConnection() method of HTMLParser uses "8859_4" encoding, which presumably maps to iso-8859-4.

According to RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) the default charset should be iso-8859-1 (section 3.7.1). The content of http://www.elmundo.es is indeed content="text/html; charset=iso-8859-1" and should be interpreted that way.

From what I can find, 8859-4 is an extension of 8859-1 for Lithuanian and Latvian characters, and is superseded by 8859-10. E.g. see the Linux man pages for charsets (http://nodevice.com/sections/ManIndex/man0132.html):

    8859-4 (Latin-4)
    Latin-4 introduced letters for Estonian, Latvian, and
    Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6).

So I have changed the default to "8859_1".

I would also make sure that you are able to see the correct glyphs by running this:

    public class Test
    {
        public static void main (String[] args)
        {
            System.out.println ("El m\u00e1s famoso y caro vino de pago de Espa\u00f1a");
        }
    }

Going forward, it would be good for HTMLParser to honour the charset property on the "Content-Type" field in the HTML header. But at that point the InputStream from the URLConnection is already partially consumed by the parser, and a switch of character set may be problematic. It's not clear when the character set is supposed to take effect (within <BODY> </BODY> ?) but this may be a good reason to use mark() and reset() on the Reader.

Derrick

agente007 wrote:

> --- On Fri 12/13, Derrick Oswald < Der...@ro... > wrote:
> From: Derrick Oswald [mailto: Der...@ro...]
> To: htm...@li...
> Date: Fri, 13 Dec 2002 08:22:00 -0500
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> Can you be more specific about what isn't being extracted correctly?
> The best way would be to make a test case that shows the problem and
> submit it as a bug.
>
> For example, when I try a URL as : "http://www.elmundo.es" then
> appears the text: ... Text = El mßs famoso y caro vino de pago de
> Espa?a, el Pingus, no podrß acceder a una denominaci?n de origen
> propia ... The correct text would be: ... Text = El más famoso y caro
> vino de pago de España, el Pingus, no podrá acceder a una denominación
> de origen propia ... what happend? Juan J
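As a rough illustration of honouring the charset parameter, the sketch below pulls the charset out of a Content-Type value and falls back to ISO-8859-1 when none is given, following the RFC 2616 default cited above. The class and method names (ContentTypeCharset, charsetOf) are hypothetical and this is not how HTMLParser currently behaves; it only handles the HTTP header value, not a META tag found part-way through the page.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ContentTypeCharset {
    /** Default for text media types when no charset is given (RFC 2616, section 3.7.1). */
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset named in a Content-Type value, or the default if absent or unsupported. */
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String trimmed = part.trim();
                if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = trimmed.substring("charset=".length()).trim();
                    // The parameter value may be quoted, e.g. charset="utf-8".
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        if (Charset.isSupported(name)) {
                            return name;
                        }
                    } catch (IllegalArgumentException e) {
                        // Illegal charset name: fall through to the default.
                    }
                }
            }
        }
        return DEFAULT_CHARSET;
    }
}
```

A caller opening the stream itself could then write new InputStreamReader(uc.getInputStream(), ContentTypeCharset.charsetOf(uc.getContentType())), where uc is a java.net.URLConnection.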
From: agente007 <e-a...@ex...> - 2002-12-13 22:20:10
|
--- On Fri 12/13, Derrick Oswald wrote:
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> Can you be more specific about what isn't being extracted correctly?
> The best way would be to make a test case that shows the problem and
> submit it as a bug.

For example, when I try a URL such as "http://www.elmundo.es", the following text appears:

    ... Text = El mßs famoso y caro vino de pago de Espa?a, el Pingus, no podrß acceder a una denominaci?n de origen propia ...

The correct text would be:

    ... Text = El más famoso y caro vino de pago de España, el Pingus, no podrá acceder a una denominación de origen propia ...

What happened?

Juan J
From: Derrick O. <Der...@ro...> - 2002-12-13 13:15:12
|
Can you be more specific about what isn't being extracted correctly? The best way would be to make a test case that shows the problem and submit it as a bug.

agente007 wrote:

> Hello.
>
> I would like to parse HTML with characters in spanish but htmlparser
> do not extract the text correctly.
>
> How I can to modify the code for doing it?
>
> Juan J.
From: Derrick O. <Der...@ro...> - 2002-12-13 13:05:18
|
Sam,

I've had some success in passing in an HTMLReader object I construct from the contents of a URL (from which you can get your own header info).

There is an outstanding issue, though: [ 649133 ] reader.reset crash in HTMLParser
https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399

For pages longer than 5000 characters, a reset() on the reader object throws an exception at the end of parsing. Until that gets fixed, you have to perform a workaround like this:

    reader = new HTMLReader (some_reader, some_url);
    parser = new HTMLParser (reader);
    // reset/remark to end of stream
    reader.reset ();
    reader.mark (real_number_of_characters_available_in_reader);
    // proceed with parse

Derrick

Sam Joseph wrote:

>My first thought is that I would like to get access the HTTP headers
>that would tell me the content length of the incoming HTML page, and
>looking through the HTMLParser as is, it looks like I can't really
>access those headers directly.
>
>Is there some other mechanism to determine a document's length before
>parsing starts, or could we put one in?
>
>I'm just thinking out loud, ..., maybe giving the user the ability to
>pass in a URLConnection or File object would be the best, as then the
>user could get all the info they need.
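The mark()/reset() behaviour behind this workaround can be seen with plain java.io classes. The sketch below is only an illustration (MarkResetSketch and peekThenRewind are made-up names, and it uses BufferedReader rather than HTMLReader): reset() is only safe when the read-ahead limit given to mark() is larger than the number of characters read in between.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class MarkResetSketch {
    /**
     * Marks the stream, reads up to `count` characters, rewinds,
     * and returns the first character seen after the rewind.
     */
    static int peekThenRewind(Reader source, int count) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        // The read-ahead limit must exceed everything read before reset();
        // with a smaller limit, reset() may fail with an IOException.
        reader.mark(count + 1);
        for (int i = 0; i < count; i++) {
            if (reader.read() == -1) {
                break;
            }
        }
        reader.reset();
        return reader.read();
    }
}
```

This is why a fixed limit of a few thousand characters is fragile for longer pages, and why the reader supplier is in the best position to choose the limit.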
From: agente007 <e-a...@ex...> - 2002-12-13 08:40:07
|
Hello.

I would like to parse HTML containing Spanish characters, but htmlparser does not extract the text correctly.

How can I modify the code to do this?

Juan J.
From: Sam J. <ga...@yh...> - 2002-12-13 07:32:35
|
Hi Somik,

Once again apologies for barraging you with the questions this week, but I guess that's what open source is all about eh?

When you wrote the NeuroGridHTMLParser for me a while back you added functionality to support getting the full plain text of a page. I've been busy modifying that and one of the things I'd really like to do is initialize a StringBuffer to an appropriate size, so that the buffer doesn't have to get resized while parsing the page.

My first thought is that I would like to get access to the HTTP headers that would tell me the content length of the incoming HTML page, and looking through the HTMLParser as is, it looks like I can't really access those headers directly.

Is there some other mechanism to determine a document's length before parsing starts, or could we put one in? Naturally when reading from a file as opposed to a url one would call a different underlying method, but it would seem plausible to have a getDocumentSize() method. Or how about access to the underlying File or URLConnection objects?

I'm just thinking out loud, ..., maybe giving the user the ability to pass in a URLConnection or File object would be the best, as then the user could get all the info they need. I guess some changes would be required to support this given that currently the HTMLParser opens connections using this private method:

private HTMLReader openURLConnection() throws HTMLParserException {
    try {
        // It's a web address
        resourceLocn=HTMLLinkProcessor.removeEscapeCharacters(resourceLocn);
        resourceLocn=checkEnding(resourceLocn);
        resourceLocn=HTMLLinkProcessor.fixSpaces(resourceLocn);
        URL url = new URL(resourceLocn);
        URLConnection uc = url.openConnection();
        return new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"8859_4")),resourceLocn);
    }
    catch (Exception e) {
        String msg="HTMLParser.openURLConnection() : Error in opening a URL connection to "+resourceLocn;
        HTMLParserException ex = new HTMLParserException(msg,e);
        feedback.error(msg,ex);
        throw ex;
    }
}

CHEERS> SAM |
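A minimal sketch of the buffer-sizing idea above, assuming the caller opens the URLConnection themselves and can therefore ask it for the content length. BufferSizingSketch, initialCapacity, bufferFor and DEFAULT_CAPACITY are hypothetical names, not HTMLParser API; only java.net.URLConnection.getContentLength() is a real call, and it returns -1 when the server does not report a length.

```java
import java.net.URLConnection;

// Hypothetical helper, not part of HTMLParser.
class BufferSizingSketch {

    // Fallback used when the server does not report a length.
    static final int DEFAULT_CAPACITY = 8 * 1024;

    // Turn a reported content length into a usable initial capacity.
    // URLConnection.getContentLength() returns -1 when the length is unknown.
    static int initialCapacity(int contentLength) {
        return contentLength > 0 ? contentLength : DEFAULT_CAPACITY;
    }

    // Size the buffer from a connection the caller opened themselves.
    static StringBuffer bufferFor(URLConnection connection) {
        return new StringBuffer(initialCapacity(connection.getContentLength()));
    }
}
```

The content length counts bytes of HTML, markup included, so for plain-text extraction it is an upper estimate rather than an exact size. That is usually fine here, since the aim is only to avoid repeated resizing.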
From: <dha...@or...> - 2002-12-12 13:52:30
|
Hi,

I agree with the views expressed here, and with the two basic requirements as pointed out:

> 1. return the original HTML

I think that, as in the rest of the parser, this should be done using the toHTML() method.

> 2. return the text appearing within it that is not a default part of the tag

This should be done with the toPlainTextString() method.

Apart from this I haven't understood much of this thread.

Bye,
Dhaval |
From: Sam J. <ga...@yh...> - 2002-12-12 13:40:21
|
Hi Somik,

Somik Raha wrote:

>>Thanks for the help. I think I would like to see the
>>toPlainTextString() method remain. Although I'm not quite sure of the
>>difference between HTMLRemarkNode.toString and
>>HTMLRemarkNode.toPlainTextString.
>
>This is actually based on your suggestion (eons back..) -
>toPlainTextString() is the uniform way of getting string representation of a
>page - meaningful and hopefully semantic data. I think you'd probably want
>to use toPlainTextString() instead of toString() - as toString() always
>gives some output for all the tags, while toPlainTextString() works only for
>specific ones like string nodes, link text and strings inside forms. It was
>also enabled earlier for comments, but was taken out last week. I am
>thinking of putting it back in. What this will mean is that if folks have
>commented tags - you will get that sort of data in your string filter. I
>think you can live with that (?)
>
>Also - I am thinking of a better approach - wherein, should one require pure
>strings within a comment, one could create a new parser, that operates on
>the contents of the string node (it would be an interesting approach to
>try..)

I'm not sure that I'm following you. But then it's late here ....

It would seem that whatever other considerations there might be one would want to have some method on HTMLRemarkNode that allows you to grab the pure unadulterated text of the remark without anything else. The HTMLRemarkNode.toString() method I'm using now seems to be prepending the string "Comment Tag :" to the front of the string that is returned.

It's nice to have convenience methods to pretty print things. But shouldn't the two default methods on any node be to:

1. return the original HTML
2. return the text appearing within it that is not a default part of the tag

Naturally there will be variation depending on the node, but it seems odd to have prettified print responses as the default (maybe they're not and I'm just getting confused) - ideally they would be called with a parameter or special method like prettyPrint().

I'm not sure what the downside is to having a toPlainTextString() call in the HTMLRemarkNode. Remember I don't have such a wonderful understanding of the HTMLParser itself. For example I'm not sure what you mean when you say that the remark text data would appear in your string filter. I'm not sure what a string filter is ... At the moment it seems I have to explicitly check for HTMLRemarkNodes and then process them if I want to ....

CHEERS> SAM |
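The two default methods asked for above, plus a separate debug rendering, could look roughly like the following. This is a stand-in written for illustration, not the real HTMLRemarkNode: the class name RemarkNodeSketch and its constructor are assumptions, and only the method names toHTML(), toPlainTextString() and toString(), along with the "Comment Tag :" prefix, come from this thread.

```java
// Stand-in for illustration only; the real HTMLRemarkNode may differ.
class RemarkNodeSketch {

    // Text found between "<!--" and "-->", kept exactly as parsed.
    private final String text;

    RemarkNodeSketch(String text) {
        this.text = text;
    }

    // 1. The original HTML, reproduced verbatim.
    String toHTML() {
        return "<!--" + text + "-->";
    }

    // 2. Only the text inside the remark, with nothing added.
    String toPlainTextString() {
        return text;
    }

    // Debug-oriented rendering, kept apart from the two methods above.
    @Override
    public String toString() {
        return "Comment Tag : " + text;
    }
}
```

Keeping the labelled output in toString() only means callers who want the remark verbatim never have to strip a prefix.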
From: Somik R. <so...@ya...> - 2002-12-12 05:21:26
|
Hi Sam,

> Also, I solved my problem with the debugging output. The problem was
> with the code I was using to output the final data. The print() command ...

Oops..

> Thanks for the help. I think I would like to see the
> toPlainTextString() method remain. Although I'm not quite sure of the
> difference between HTMLRemarkNode.toString and
> HTMLRemarkNode.toPlainTextString.

This is actually based on your suggestion (eons back..) - toPlainTextString() is the uniform way of getting string representation of a page - meaningful and hopefully semantic data. I think you'd probably want to use toPlainTextString() instead of toString() - as toString() always gives some output for all the tags, while toPlainTextString() works only for specific ones like string nodes, link text and strings inside forms. It was also enabled earlier for comments, but was taken out last week. I am thinking of putting it back in. What this will mean is that if folks have commented tags - you will get that sort of data in your string filter. I think you can live with that (?)

Also - I am thinking of a better approach - wherein, should one require pure strings within a comment, one could create a new parser, that operates on the contents of the string node (it would be an interesting approach to try..)

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-12-12 05:15:28
|
Hi Sam,

The parse() is not being called, but the print() method is. From three places:

NeuroGridHTMLParserTest.printLinks()
NeuroGridHTMLParserTest.printMetaTags()
NeuroGridHTMLParser.searchForSummaryContents()

If you mask this, output will be as you desire.

Regards,
Somik

----- Original Message -----
From: "Sam Joseph" <ga...@yh...>
To: <htm...@li...>
Sent: Wednesday, December 11, 2002 3:41 PM
Subject: [Htmlparser-developer] Re: Htmlparser-developer digest, Vol 1 #136 - 3 msgs

> Hi Somik,
>
> Sorry that my mails are not attaching to the thread properly. I'm on
> digest, so when I reply to the digest message I think a new thread gets
> automatically started, and the sourceforge mail interface doesn't let me
> reply directly to your messages.
>
> Thanks for your suggestion below. As far as I can see from the code the
> parse method on HTMLParser is not being called. In fact it uses exactly
> the thing you describe in your mail. I didn't really write this code.
> It's still basically the NeuroGridHTMLParser that you wrote a while
> back, modified into my coding format.
>
> Please find the code appended to this email. Both the links I have been
> parsing are specified in the NeuroGridHTMLParserTest.java file.
>
> Thanks in advance.
>
> CHEERS> SAM
>
> Somik wrote:
>
> >Sorry, I just saw your other mail again with the
> >output. I see the problem -
> >
> >You must be calling the parse method in
> >HTMLParser.java. That is only a demo. As mentioned in
> >the docs, you should be doing something like :
> >
> >for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
> >    HTMLNode node = e.nextHTMLNode();
> >    // create summary here
> >}
> >
> >The call to parse has the printing stuff which prints
> >all the details of the nodes (calling node.print()).
> >
> >If this does not help, can you post your complete
> >parsing program ?
> >
>
> --------------------------------------------------------------------------------
> /*
>  * (c) Copyright 2001 MyCorporation.
>  * All Rights Reserved.
>  */
> package com.neurogrid.parser;
> /**
>  * @version 1.0
>  * @author
>  */
> public class Summary {
>     private String heading;
>     private String contents;
>     /**
>      * Constructor for Summary.
>      */
>     public Summary(String heading, String contents) {
>         this.heading = heading;
>         this.contents = contents;
>     }
>
>     /**
>      * Gets the heading.
>      * @return Returns a String
>      */
>     public String getHeading() {
>         return heading;
>     }
>
>     /**
>      * Sets the heading.
>      * @param heading The heading to set
>      */
>     public void setHeading(String heading) {
>         this.heading = heading;
>     }
>
>     /**
>      * Gets the contents.
>      * @return Returns a String
>      */
>     public String getContents() {
>         return contents;
>     }
>
>     /**
>      * Sets the contents.
>      * @param contents The contents to set
>      */
>     public void setContents(String contents) {
>         this.contents = contents;
>     }
>
>     public String toString() {
>         String retString;
>         if (heading.length()>0) retString = heading+"\n"+contents;
>         else retString = contents;
>         return retString;
>     }
> }
> --------------------------------------------------------------------------------
> package com.neurogrid.parser;
>
> /*
>  * Copyright (C) 2000 NeuroGrid <sa...@ne...>
>  *
>  * This program is free software; you can redistribute it and/or
>  * modify it under the terms of the GNU General Public License
>  * as published by the Free Software Foundation; either version 2
>  * of the License, or (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, write to the Free Software
>  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>  *
>  * You may find further details about this software at
>  * http://www.neurogrid.net/
>  */
>
> import junit.framework.*;
>
> // Import log4j classes.
> import org.apache.log4j.Category;
> import org.apache.log4j.BasicConfigurator;
> import org.apache.log4j.PropertyConfigurator;
>
> import org.htmlparser.*;
> import org.htmlparser.tags.*;
> import org.htmlparser.scanners.*;
> import org.htmlparser.util.*;
> import java.util.Enumeration;
> import java.util.Vector;
>
> /**
>  * @version 1.0
>  * @author
>  */
> public class NeuroGridHTMLParser
> {
>     private static final String cvsInfo = "$Id:$";
>     public static String getCvsInfo()
>     {
>         return cvsInfo;
>     }
>
>     private static Category o_cat = Category.getInstance(NeuroGridHTMLParser.class.getName());
>
>     /**
>      * initialize the logging system
>      *
>      * @param p_conf configuration filename
>      */
>     public static void init(String p_conf)
>     {
>         BasicConfigurator.configure();
>         PropertyConfigurator.configure(p_conf);
>         o_cat.info("NeuroGridHTMLParser logging Initialized");
>     }
>
>     private String o_url;
>     private String o_full_text;
>     private Vector o_meta_tags;
>     private Vector o_link_tags;
>     private Summary o_summary;
>     private StringBuffer o_summary_heading;
>     private StringBuffer o_summary_contents;
>     private HTMLParser o_parser = null;
>     private boolean o_h1_tag_found = false;
>     private boolean o_start_summary_search = false;
>     private int o_summary_count = 0;
>
>
>     /**
>      * This constructor is only to enable test cases.
>      * For clients, pls use NeuroGridHTMLParser(String)
>      * or NeuroGridHTMLParser(String,boolean)
>      *
>      * @param p_parser
>      */
>     public NeuroGridHTMLParser(HTMLParser p_parser)
>         throws Exception
>     {
>         this("",false);
>         o_parser = p_parser;
>     }
>
>     /**
>      *
>      * @param p_url
>      */
>     public NeuroGridHTMLParser(String p_url)
>         throws Exception
>     {
>         this(p_url,true);
>     }
>
>     /**
>      *
>      * @param p_url
>      * @param p_start_parsing
>      */
>     public NeuroGridHTMLParser(String p_url, boolean p_start_parsing)
>         throws Exception
>     {
>         o_url = p_url;
>         o_meta_tags = new Vector();
>         o_link_tags = new Vector();
>         o_summary_heading = new StringBuffer();
>         o_summary_contents = new StringBuffer();
>         if (p_start_parsing) parse();
>     }
>
>     private class BlankHTMLParserFeedback
>         implements HTMLParserFeedback
>     {
>         public void info(String message)
>         {
>             //System.out.println("INFO: " + message);
>         }
>
>         public void warning(String message)
>         {
>             //System.out.println("WARNING: " + message);
>         }
>
>         public void error(String message, HTMLParserException e)
>         {
>             //System.out.println("ERROR: " + message);
>             e.printStackTrace();
>         }
>     }
>
>
>
>     /**
>      * parse the page
>      */
>     public final void parse()
>         throws Exception
>     {
>         if (o_parser==null)
>             o_parser = new HTMLParser(o_url, new BlankHTMLParserFeedback());
>
>         o_parser.addScanner(new HTMLMetaTagScanner("-t"));
>         o_parser.addScanner(new HTMLLinkScanner("-l"));
>         o_parser.addScanner(new HTMLTitleScanner("-a"));
>         parseURLForData();
>         o_summary = createSummary();
>     }
>
>     /**
>      * parse the URL for data
>      */
>     private void parseURLForData()
>         throws Exception
>     {
>         HTMLNode x_node;
>         for (HTMLEnumeration e = o_parser.elements();e.hasMoreNodes();)
>         {
>             x_node = e.nextHTMLNode();
>             checkForTitle(x_node);
>             checkForMetaTag(x_node);
>             checkForLinkTag(x_node);
>             checkForTag(x_node);
>             if(o_h1_tag_found == true)
>             {
>                 o_h1_tag_found = processH1Tag(x_node);
>             }
>             else
>             {
>                 if (o_start_summary_search)
>                 {
>                     searchForSummaryContents(x_node);
>                 }
>                 addToFullText(x_node);
>             }
>
>         }
>     }
>
>     /**
>      * parse the URL for data
>      *
>      * @param HTMLNode
>      */
>     protected void checkForTitle(HTMLNode p_node)
>     {
>         if(p_node instanceof HTMLTitleTag)
>         {
>             String x_title = ((HTMLTitleTag)p_node).getTitle();
>             o_cat.debug("appending title: " + x_title);
>             // I think it would be better to do one or the other of H1 and title.
>             //FIXXXXXXXXX
>             o_summary_heading.append(x_title+"\n");
>         }
>     }
>
>     /**
>      * add this nodes text to the full text
>      *
>      * @param HTMLNode
>      */
>     private void addToFullText(HTMLNode p_node)
>     {
>         if(p_node instanceof HTMLStringNode)
>         {
>             o_full_text += ((HTMLStringNode)p_node).getText();
>         }
>     }
>
>     /**
>      * search for summary contents
>      *
>      * @param HTMLNode
>      */
>     private void searchForSummaryContents(HTMLNode p_node)
>     {
>         if(p_node instanceof HTMLStringNode)
>         {
>             //o_cat.debug("*** SEARCHING FOR SUMMARY ***");
>             p_node.print();
>             String x_contents = ((HTMLStringNode)p_node).getText();
>             if(x_contents.length()>0 && isAlphabetical(x_contents) && !isEmpty(x_contents))
>             {
>                 //o_cat.debug("x_contents = "+x_contents);
>                 o_summary_count++;
>                 o_summary_contents.append(x_contents+"\n");
>                 if(o_summary_count==2)
>                 {
>                     o_start_summary_search=false;
>                 }
>             }
>         }
>     }
>
>     /**
>      * check if this string is just spaces
>      *
>      * @param p_text
>      *
>      * @return boolean
>      */
>     private boolean isEmpty(String p_text)
>     {
>         boolean x_empty = true;
>         for (int i=0;i<p_text.length();i++)
>         {
>             if (p_text.charAt(i) != ' ')
>             {
>                 x_empty = false;
>             }
>         }
>         return x_empty;
>     }
>
>     /**
>      * check if this string is alphabetical
>      *
>      * @param p_text
>      *
>      * @return boolean
>      */
>     private boolean isAlphabetical(String p_text)
>     {
>         char x_ch;
>         p_text = p_text.toUpperCase();
>         boolean x_return = true;
>         for(int i=0;i<p_text.length();i++)
>         {
>             x_ch = p_text.charAt(i);
>             if (!((x_ch>='A' && x_ch <='Z')|| (x_ch==' ' || x_ch=='.' || x_ch==',')))
>             {
>                 x_return =false;
>             }
>         }
>         return x_return;
>     }
>
>     /**
>      * check for a tag
>      *
>      * @param p_node
>      */
>     private void checkForTag(HTMLNode p_node)
>     {
>         if(p_node instanceof HTMLTag)
>         {
>             HTMLTag x_tag = (HTMLTag)p_node;
>             checkForH1Tag(x_tag);
>             checkForBodyTag(x_tag);
>         }
>     }
>
>     /**
>      * check for a body tag
>      *
>      * @param p_node
>      */
>     private void checkForBodyTag(HTMLTag p_tag)
>     {
>         if(p_tag.getText().toUpperCase().indexOf("BODY")!=-1)
>         {
>             o_start_summary_search = true;
>         }
>     }
>
>     /**
>      * check for an H1 tag
>      *
>      * @param p_node
>      */
>     private void checkForH1Tag(HTMLTag tag)
>     {
>         if (tag.getText().toUpperCase().equals("H1"))
>         {
>             o_h1_tag_found = true;
>         }
>     }
>
>     /**
>      * check for a meta tag
>      *
>      * @param p_node
>      */
>     private void checkForMetaTag(HTMLNode p_node)
>     {
>         HTMLMetaTag x_meta_tag;
>         if(p_node instanceof HTMLMetaTag)
>         {
>             x_meta_tag = (HTMLMetaTag) p_node;
>             o_meta_tags.addElement(x_meta_tag);
>         }
>     }
>
>     /**
>      * check for a link tag
>      *
>      * @param p_node
>      */
>     private void checkForLinkTag(HTMLNode p_node)
>     {
>         HTMLLinkTag x_link_tag;
>         if(p_node instanceof HTMLLinkTag)
>         {
>             x_link_tag = (HTMLLinkTag)p_node;
>             o_link_tags.addElement(x_link_tag);
>         }
>     }
>
>     /**
>      * process an H1 tag
>      *
>      * @param p_node
>      *
>      * @return boolean
>      */
>     private boolean processH1Tag(HTMLNode p_node)
>     {
>         boolean x_h1_tag_found = true;
>         if(p_node instanceof HTMLStringNode)
>         {
>             o_summary_heading.append(((HTMLStringNode)p_node).getText());
>             o_cat.debug("appending title: " + ((HTMLStringNode)p_node).getText());
>             // I think it would be better to do one or the other of H1 and title.
>             //FIXXXXXXXXX
>         }
>         if(p_node instanceof HTMLEndTag)
>         {
>             HTMLEndTag x_end_tag =(HTMLEndTag)p_node;
>             //o_cat.debug("x_end_tag.toString(): " + x_end_tag.toString());
>             //o_cat.debug("x_end_tag.toHTML(): " + x_end_tag.toHTML());
>             //o_cat.debug("x_end_tag.toPlainTextString(): " + x_end_tag.toPlainTextString());
>             //o_cat.debug("x_end_tag.getTagName(): " + x_end_tag.getTagName());
>             //o_cat.debug("x_end_tag.getText(): " + x_end_tag.getText());
>             if(x_end_tag.getTagName().toUpperCase().equals("H1"))
>             {
>                 x_h1_tag_found = false;
>             }
>         }
>         return x_h1_tag_found;
>     }
>
>
>
>     /**
>      * get the Summary
>      *
>      * @return Summary
>      */
>     public Summary getSummary()
>     {
>         return o_summary;
>     }
>
>     /**
>      * get the Full text
>      *
>      * @return String
>      */
>     public String getFullText()
>     {
>         return o_full_text;
>     }
>
>     /**
>      * get a vector of the links
>      *
>      * @return Vector
>      */
>     public Vector links()
>     {
>         return o_link_tags;
>     }
>
>     /**
>      * get a vector of meta tags
>      *
>      * @return Vector
>      */
>     public Vector metaTags()
>     {
>         return o_meta_tags;
>     }
>
>     /**
>      * create a summary
>      *
>      * @return Summary
>      */
>     private Summary createSummary()
>     {
>         return new Summary(o_summary_heading.toString(),o_summary_contents.toString());
>     }
>
>
>     /**
>      * main
>      *
>      * @param args
>      */
>     public static void main(String[] args)
>     {
>         try
>         {
>             if (args.length==0)
>             {
>                 o_cat.debug("Syntax:");
>                 o_cat.debug("java -jar neuroparser.jar URL");
>                 System.exit(-1);
>             }
>             o_cat.debug("Parsing "+args[0]+"..");
>             o_cat.debug("");
>             NeuroGridHTMLParser parser = new NeuroGridHTMLParser(args[0]);
>             o_cat.debug("Printing links from "+args[0]);
>             o_cat.debug("");
>
>             printLinks(parser);
>             printMetaTags(args, parser);
>             printSummary(parser);
>             printFullText(parser);
>         }
>         catch(Exception e)
>         {e.printStackTrace();}
>     }
>
>     public static void printSummary(NeuroGridHTMLParser parser)
>     {
>         o_cat.debug("");
>         o_cat.debug("Summary");
>         o_cat.debug("-------");
>         o_cat.debug(parser.getSummary());
>         o_cat.debug("");
>     }
>
>     public static void printFullText(NeuroGridHTMLParser parser)
>     {
>         o_cat.debug("");
>         o_cat.debug("Full Text");
>         o_cat.debug("-------");
>         o_cat.debug(parser.getFullText());
>         o_cat.debug("");
>     }
>
>     public static void printMetaTags(String[] args, NeuroGridHTMLParser parser)
>     {
>         HTMLMetaTag metaTag;
>         o_cat.debug("");
>         o_cat.debug("Printing metaTags from "+args[0]);
>         o_cat.debug("");
>         for(Enumeration e = parser.metaTags().elements();e.hasMoreElements();)
>         {
>             metaTag = (HTMLMetaTag)e.nextElement();
>             metaTag.print();
>         }
>     }
>
>     public static void printLinks(NeuroGridHTMLParser parser)
>     {
>         HTMLLinkTag link;
>         for(Enumeration e =parser.links().elements();e.hasMoreElements();)
>         {
>             link = (HTMLLinkTag)e.nextElement();
>             link.print();
>         }
>     }
> }
> --------------------------------------------------------------------------------
> package com.neurogrid.parser;
>
> /*
>  * Copyright (C) 2000 NeuroGrid <sa...@ne...>
>  *
>  * This program is free software; you can redistribute it and/or
>  * modify it under the terms of the GNU General Public License
>  * as published by the Free Software Foundation; either version 2
>  * of the License, or (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, write to the Free Software
>  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>  *
>  * You may find further details about this software at
>  * http://www.neurogrid.net/
>  */
>
> import junit.framework.*;
>
> // Import log4j classes.
> import org.apache.log4j.Category;
> import org.apache.log4j.BasicConfigurator;
> import org.apache.log4j.PropertyConfigurator;
>
> import org.htmlparser.*;
> import org.htmlparser.tags.*;
> import org.htmlparser.scanners.*;
> import java.util.Enumeration;
> import java.util.Vector;
>
>
> /**
>  * @version 1.0
>  * @author
>  */
> public class NeuroGridHTMLParserTest
>     extends TestCase
> {
>     private static final String cvsInfo = "$Id:$";
>     public static String getCvsInfo()
>     {
>         return cvsInfo;
>     }
>
>     private static Category o_cat = Category.getInstance(NeuroGridHTMLParserTest.class.getName());
>
>     /**
>      * initialize the logging system
>      *
>      * @param p_conf configuration filename
>      */
>     public static void init(String p_conf)
>     {
>         BasicConfigurator.configure();
>         PropertyConfigurator.configure(p_conf);
>         o_cat.info("NeuroGridHTMLParserTest logging Initialized");
>     }
>
>     public static void main(String[] args)
>     {
>         NeuroGridHTMLParserTest.start();
>         NeuroGridHTMLParserTest.init(args[0]);
>         NeuroGridHTMLParserTest.testStuff();
>     }
>
>     /**
>      * Subclasses must invoke this from their constructor.
>      */
>     public NeuroGridHTMLParserTest(String p_name)
>     {
>         super(p_name);
>     }
>
>     protected void setUp()
>     {
>         start();
>     }
>
>     protected static void start()
>     {
>         try
>         {
>             NeuroGridHTMLParserTest.init("conf/log4j.properties");
>             NeuroGridHTMLParser.init("conf/log4j.properties");
>         }
>         catch(Exception e){e.printStackTrace();}
>     }
>
>     /**
>      * test some stuff
>      */
>     public static void testStuff()
>     {
>         try
>         {
>             // String x_url = "http://belle.designwest.com/examples/test04b.html";
>             String x_url = "http://home.att.ne.jp/red/gaijin/tribal-hardware/index.htm";
>
>             o_cat.debug("Parsing "+x_url+"..");
>             o_cat.debug("");
>             NeuroGridHTMLParser parser = new NeuroGridHTMLParser(x_url);
>             o_cat.debug("Printing links from "+x_url);
>             o_cat.debug("");
>
>             printLinks(parser);
>             printMetaTags(x_url, parser);
>             printSummary(parser);
>             printFullText(parser);
>         }
>         catch(Exception e)
>         {e.printStackTrace();}
>     }
>
>
>     public static void printSummary(NeuroGridHTMLParser parser)
>     {
>         o_cat.debug("");
>         o_cat.debug("Summary");
>         o_cat.debug("-------");
>         o_cat.debug(parser.getSummary().getHeading());
>         o_cat.debug("-------");
>         o_cat.debug(parser.getSummary().getContents());
>         o_cat.debug("-------");
>         o_cat.debug("");
>     }
>
>     public static void printFullText(NeuroGridHTMLParser parser)
>     {
>         o_cat.debug("");
>         o_cat.debug("Full Text");
>         o_cat.debug("-------");
>         o_cat.debug(parser.getFullText());
>         o_cat.debug("");
>     }
>
>     public static void printMetaTags(String p_url, NeuroGridHTMLParser parser)
>     {
>         HTMLMetaTag metaTag;
>         o_cat.debug("");
>         o_cat.debug("Printing metaTags from "+p_url);
>         o_cat.debug("");
>         for(Enumeration e = parser.metaTags().elements();e.hasMoreElements();)
>         {
>             metaTag = (HTMLMetaTag)e.nextElement();
>             metaTag.print();
>         }
>     }
>
>     public static void printLinks(NeuroGridHTMLParser parser)
>     {
>         HTMLLinkTag link;
>         for(Enumeration e =parser.links().elements();e.hasMoreElements();)
>         {
>             link = (HTMLLinkTag)e.nextElement();
>             link.print();
>         }
>     }
> }
> |
From: Sam J. <ga...@yh...> - 2002-12-12 03:51:43
|
Hi Somik,

Thanks for the help. I think I would like to see the toPlainTextString() method remain. Although I'm not quite sure of the difference between HTMLRemarkNode.toString and HTMLRemarkNode.toPlainTextString.

Trying out both in my code I see that toPlainTextString() seems to generate a blank while toString() gives me the contents of the remark/comment. To be specific about my objectives, I'm trying to handle meta-data by the creative commons group, which currently involves placing a big chunk of rdf/xml in a remark within the page. I'm very much hoping to be able to extract that comment verbatim and then pass it over to my rdf/xml parser.

I'll be happy as long as I can achieve that.

Also, I solved my problem with the debugging output. The problem was with the code I was using to output the final data. The print() command was being called on links and meta-tags, and the way that ant formatted things it made it look like the associated System.out calls were being made during the parsing process rather than at the end. Sorry about that, all fixed now, so don't worry about looking at the code that I sent you in my previous email.

Thanks again for all your help. I'm looking forward to fully integrating HTMLParser with NeuroGrid over the next two days.

CHEERS> SAM

Somik Raha wrote:

>Hi Sam,
> HTMLRemarkNode is a special class - it is not a
>scanner.
> It is registered by default - so you don't have to do
>anything - just check if the node object is a remark
>node.
>
> However, last week, I removed the
>toPlainTextString() implementation as often a lot
>of HTML code is commented out, and I thought it might
>interfere with a simple string representation of a
>page. If that is not the case and you need to use
>toPlainTextString(), pls let us know, and we should
>put that functionality back in.
>
>Regards,
>Somik
>--- Sam Joseph <ga...@yh...> wrote:
>
>>Hi Somik
>>
>>Sorry to ask so much this week, but I was wondering
>>if there is some operation for picking up HTML comments
>>using the HTMLParser (<!-- a comment -->) or are
>>they automatically ignored?
>>
>>I can see from the API that there is HTMLRemarkNode,
>>but I can't see any similar tag or scanner. Must a
>>special tag/scanner be created to handle
>>comments/remarks?
>>
>>Thanks in advance.
>>
>>CHEERS> SAM |
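For the goal described above (pulling an rdf/xml block out of a remark verbatim), here is a minimal sketch that deliberately swaps in a plain java.util.regex scan instead of the HTMLParser API, so it can run on its own. RemarkExtractSketch and remarkBodies are hypothetical names, and a regex will not cope with every malformed page the way a real parser would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: plain regex, not the HTMLParser API.
class RemarkExtractSketch {

    // Non-greedy match of everything between "<!--" and the next "-->";
    // DOTALL lets the remark body span several lines.
    private static final Pattern REMARK =
            Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    // Return the body of every remark, unmodified and in document order.
    static List<String> remarkBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = REMARK.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }
}
```

Each returned string could then be handed to an rdf/xml parser as-is. With HTMLParser itself, the equivalent step would be whichever HTMLRemarkNode method returns the remark text without added labels.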
From: Somik R. <so...@ya...> - 2002-12-12 00:04:16
|
Hi Sam,

HTMLRemarkNode is a special class - it is not a scanner. It is registered by default - so you don't have to do anything - just check if the node object is a remark node.

However, last week, I removed the toPlainTextString() implementation as often a lot of HTML code is commented out, and I thought it might interfere with a simple string representation of a page. If that is not the case and you need to use toPlainTextString(), pls let us know, and we should put that functionality back in.

Regards,
Somik

--- Sam Joseph <ga...@yh...> wrote:
> Hi Somik
>
> Sorry to ask so much this week, but I was wondering
> if there is some operation for picking up HTML comments
> using the HTMLParser (<!-- a comment -->) or are
> they automatically ignored?
>
> I can see from the API that there is HTMLRemarkNode,
> but I can't see any similar tag or scanner. Must a
> special tag/scanner be created to handle
> comments/remarks?
>
> Thanks in advance.
>
> CHEERS> SAM |
From: Sam J. <ga...@yh...> - 2002-12-11 23:33:04
|
Hi Somik

Sorry to ask so much this week, but I was wondering is there some operation for picking up HTML comments using the HTMLParser (<!-- a comment -->) or are they automatically ignored?

I can see from the API that there is HTMLRemarkNode, but I can't see any similar tag or scanner. Must a special tag/scanner be created to handle comments/remarks?

Thanks in advance.

CHEERS> SAM |
From: Sam J. <ga...@yh...> - 2002-12-11 23:27:23
|
Hi Somik,

Sorry that my mails are not attaching to the thread properly. I'm on digest, so when I reply to the digest message I think a new thread gets started automatically, and the SourceForge mail interface doesn't let me reply directly to your messages.

Thanks for your suggestion below. As far as I can see from the code, the parse method on HTMLParser is not being called. In fact it uses exactly the thing you describe in your mail. I didn't really write this code. It's still basically the NeuroGridHTMLParser that you wrote a while back, modified into my coding format.

Please find the code appended to this email. Both the links I have been parsing are specified in the NeuroGridHTMLParserTest.java file.

Thanks in advance.

CHEERS> SAM

Somik wrote:

>Sorry, I just saw your other mail again with the
>output. I see the problem -
>
>You must be calling the parse method in
>HTMLParser.java. That is only a demo. As mentioned in
>the docs, you should be doing something like :
>
>for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
>    HTMLNode node = e.nextHTMLNode();
>    // create summary here
>}
>
>The call to parse has the printing stuff which prints
>all the details of the nodes (calling node.print()).
>
>If this does not help, can you post your complete
>parsing program ? |
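The loop quoted above leaves the "create summary here" step open. A minimal sketch of how such a loop could build a summary is given below. It does not use the real library: `Node` and `NodeEnumeration` are hypothetical stand-ins that only mirror the calls named in the thread (`hasMoreNodes()`, `nextHTMLNode()`, `toPlainTextString()`), and the summary rule (join the text, cut at a maximum length) is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class NodeLoopSketch {
    /** Hypothetical stand-in for the library's HTMLNode. */
    static final class Node {
        private final String text;
        Node(String text) { this.text = text; }
        String toPlainTextString() { return text; }
    }

    /** Hypothetical stand-in for HTMLEnumeration, mirroring the two calls in the mail. */
    static final class NodeEnumeration {
        private final Iterator<Node> it;
        NodeEnumeration(List<Node> nodes) { this.it = nodes.iterator(); }
        boolean hasMoreNodes() { return it.hasNext(); }
        Node nextHTMLNode() { return it.next(); }
    }

    /** Joins the non-blank plain text of all nodes with spaces, cut to maxLength characters. */
    static String createSummary(List<Node> nodes, int maxLength) {
        StringBuilder summary = new StringBuilder();
        // Same loop shape as in the mail: iterate the nodes directly instead of
        // calling a demo method that prints every node.
        for (NodeEnumeration e = new NodeEnumeration(nodes); e.hasMoreNodes();) {
            Node node = e.nextHTMLNode();
            String text = node.toPlainTextString().trim();
            if (text.isEmpty()) {
                continue; // skip whitespace-only nodes
            }
            if (summary.length() > 0) {
                summary.append(' ');
            }
            summary.append(text);
        }
        return summary.length() > maxLength
                ? summary.substring(0, maxLength)
                : summary.toString();
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("Welcome"), new Node("   "), new Node("to NeuroGrid"));
        System.out.println(createSummary(nodes, 40)); // prints "Welcome to NeuroGrid"
    }
}
```

Nothing in this loop writes to System.out except the final `println`, which is the point of iterating the nodes yourself.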
From: Somik R. <so...@ya...> - 2002-12-11 19:46:03
|
Sorry, I just saw your other mail again with the output. I see the problem: you must be calling the parse method in HTMLParser.java. That is only a demo. As mentioned in the docs, you should be doing something like:

for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
    HTMLNode node = e.nextHTMLNode();
    // create summary here
}

The call to parse contains the printing code, which prints all the details of the nodes (calling node.print()).

If this does not help, can you post your complete parsing program?

Regards,
Somik

--- Sam Joseph <ga...@yh...> wrote:
> Hi Somik
>
> Somik wrote:
>
> >>Most importantly I seem to get a lot of debug output
> >>text that I would prefer to avoid, see the examples
> >>below. Perhaps I'm mistaken but this seems to be
> >>output by default. Is there some way for me to avoid
> >>getting this debug output?
> >
> >Of course - HTMLParser now takes in a logging object -
> >HTMLParserFeedback. All you have to do is to implement
> >this interface and pass your object in to the parser.
> >If you don't, a DefaultHTMLParserFeedback object is
> >created - and its function is to send log data to
> >System.out.
>
> Well I wrote the following:
>
> private class BlankHTMLParserFeedback
>     implements HTMLParserFeedback
> {
>     public void info(String message)
>     {
>         //System.out.println("INFO: " + message);
>     }
>
>     public void warning(String message)
>     {
>         //System.out.println("WARNING: " + message);
>     }
>
>     public void error(String message, HTMLParserException e)
>     {
>         //System.out.println("ERROR: " + message);
>         e.printStackTrace();
>     }
> }
>
> /**
>  * parse the page
>  */
> public final void parse()
>     throws Exception
> {
>     if (o_parser==null)
>         o_parser = new HTMLParser(o_url, new BlankHTMLParserFeedback());
>
>     o_parser.addScanner(new HTMLMetaTagScanner("-t"));
>     o_parser.addScanner(new HTMLLinkScanner("-l"));
>     o_parser.addScanner(new HTMLTitleScanner("-a"));
>     parseURLForData();
>     o_summary = createSummary();
> }
>
> However I still seem to be getting the same debug output. Can you see
> what I am doing wrong?
>
> Have you considered using log4j? With log4j you have a log4j properties
> file and you can specify the debug level on a class by class basis
> within the properties file, and debug output can be formatted to give
> you useful info such as the line number of the code where the debug
> statement is.
>
> Thanks in advance.
>
> CHEERS> SAM
>
> p.s. is there some operation for picking up HTML comments using the
> HTMLParser (<!-- a comment -->) or are they automatically ignored? |
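The feedback mechanism discussed in this message (implement the interface, pass your object to the parser, otherwise a default object logs to System.out) can be illustrated with the sketch below. All types here are hypothetical stand-ins written for this example: `Feedback` mirrors the three methods shown in Sam's class, `PrintingFeedback` plays the role of the default implementation, and `Worker` is a placeholder for any component that reports through the interface. It shows only the pattern, not the library's actual behavior.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class FeedbackSketch {
    /** Hypothetical stand-in for the HTMLParserFeedback interface. */
    interface Feedback {
        void info(String message);
        void warning(String message);
        void error(String message, Exception e);
    }

    /** Plays the role of the default feedback: writes every message to a stream. */
    static final class PrintingFeedback implements Feedback {
        private final PrintStream out;
        PrintingFeedback(PrintStream out) { this.out = out; }
        public void info(String message) { out.println("INFO: " + message); }
        public void warning(String message) { out.println("WARNING: " + message); }
        public void error(String message, Exception e) { out.println("ERROR: " + message); }
    }

    /** Discards everything, including errors (no stack trace is printed). */
    static final class SilentFeedback implements Feedback {
        public void info(String message) { }
        public void warning(String message) { }
        public void error(String message, Exception e) { }
    }

    /** Placeholder for a component that reports progress only through Feedback. */
    static final class Worker {
        private final Feedback feedback;
        Worker(Feedback feedback) { this.feedback = feedback; }
        void run() {
            feedback.info("Opening page");
            feedback.warning("Unclosed tag");
        }
    }

    /** Runs the worker and returns whatever text the chosen feedback produced. */
    static String capture(boolean silent) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true);
        Feedback feedback = silent ? new SilentFeedback() : new PrintingFeedback(out);
        new Worker(feedback).run();
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println("printing feedback produced: " + capture(false).trim());
        System.out.println("silent feedback produced " + capture(true).length() + " characters");
    }
}
```

A silent implementation only suppresses output that is routed through the feedback object. Code that prints directly, such as a demo method calling node.print() on each node, is unaffected by it.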