htmlparser-user Mailing List for HTML Parser (Page 20)
Brought to you by:
derrickoswald
From: answers s. <fas...@gm...> - 2008-05-22 12:36:02
|
Hi, I would like to extract a table only if it does not have a nested table inside it. My filter is: NodeFilter filterTable = new AndFilter(new HasParentFilter(new TagNameFilter("table")), new NotFilter(new HasChildFilter(new TagNameFilter("table")))); Still, in the output I see a table with a nested table in it. |
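One thing worth checking for the question above is whether the test looks only at direct children or at all descendants. In typical markup a nested table sits at table > tr > td > table, so it is never a direct child of the outer table. The sketch below uses a hypothetical `Tag` class (not an HTML Parser class) purely to show the difference. As far as I recall, HTML Parser's HasChildFilter also takes an optional boolean to make the check recursive, but please verify that against the Javadoc of your version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parsed tag; not an HTML Parser class. */
final class Tag {
    final String name;
    final List<Tag> children = new ArrayList<>();

    Tag(String name, Tag... kids) {
        this.name = name;
        for (Tag kid : kids) {
            children.add(kid);
        }
    }

    /** True if a direct child has the given name. */
    boolean hasChild(String wanted) {
        for (Tag child : children) {
            if (child.name.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** True if any descendant (child, grandchild, ...) has the given name. */
    boolean hasDescendant(String wanted) {
        for (Tag child : children) {
            if (child.name.equals(wanted) || child.hasDescendant(wanted)) {
                return true;
            }
        }
        return false;
    }
}

final class NestedTableDemo {
    /** Builds table > tr > td > table: the inner table is a great-grandchild. */
    static Tag sample() {
        return new Tag("table", new Tag("tr", new Tag("td", new Tag("table"))));
    }

    public static void main(String[] args) {
        Tag outer = sample();
        System.out.println("direct child table: " + outer.hasChild("table"));
        System.out.println("descendant table:   " + outer.hasDescendant("table"));
    }
}
```

A "table without a nested table" test built on the direct-child check would accept the outer table here, which matches the symptom described.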
From: abdullah <abd...@id...> - 2008-05-20 12:37:28
|
You don't need a LinkExtractor, you need a list extractor. If all the links are inside lists, you should get each list and navigate to its children, which are the links. For this case I suggest you parse the page with a filter, as follows:

    Parser parser = new Parser(url); // url: address of the page to parse
    NodeList lists = parser.parse(new NodeClassFilter(BulletList.class));
    for (int i = 0; i < lists.size(); i++) {
        BulletList list = (BulletList) lists.elementAt(i);
        NodeList links = list.getChildren(); // this gives you another NodeList with the child tags
        // do whatever you want with the links
    }

Note that you need to cast each child from Node to LinkTag. I didn't test this code, but hopefully it will work. If you give me a specific example of the HTML page you want to parse, I may be able to help more.

Good luck :) |
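The "tree path" output asked for in this thread comes down to a depth-first walk that carries the parent's path along. Below is a minimal sketch of that idea only; `Item` and `LinkPaths` are hypothetical names and not part of HTML Parser, and in a real program the items would be filled in from the parsed list and link tags.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical list item: a label plus optional sub-items (not an HTML Parser class). */
final class Item {
    final String label;
    final List<Item> subItems = new ArrayList<>();

    Item(String label, Item... subs) {
        this.label = label;
        for (Item sub : subs) {
            subItems.add(sub);
        }
    }
}

final class LinkPaths {
    /** Collects "parent > child" style paths for every item, depth first. */
    static List<String> paths(List<Item> items) {
        List<String> out = new ArrayList<>();
        for (Item item : items) {
            walk(item, "", out);
        }
        return out;
    }

    private static void walk(Item item, String prefix, List<String> out) {
        String path = prefix.isEmpty() ? item.label : prefix + " > " + item.label;
        out.add(path);
        for (Item sub : item.subItems) {
            walk(sub, path, out);
        }
    }

    /** Mirrors the <ul> example from the question. */
    static List<Item> sample() {
        List<Item> top = new ArrayList<>();
        top.add(new Item("Item 1", new Item("Sub-Item 1"), new Item("Sub-Item 2")));
        top.add(new Item("Item 2"));
        return top;
    }

    public static void main(String[] args) {
        for (String path : paths(sample())) {
            System.out.println(path);
        }
    }
}
```

Printing the link target next to each path would give the text-file format described in the question.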
From: 長弘 大樹 <nag...@by...> - 2008-05-20 08:34:13
|
Dear All, I am new to HTML Parser, and I don't understand well how to handle the !DOCTYPE tag. In short, I'd like to replace a tag like this: <!DOCTYPE html PUBLIC "XXXX" "AAAA"> with this: <!DOCTYPE html PUBLIC "YYYY" "BBBB"> I went through a lot of trial and error, but it didn't work. I'd appreciate it if you could give me advice. (My e-mail address has changed.) |
From: <Sri...@ba...> - 2008-05-20 07:13:52
|
Hi everyone,

I am a new user of the HTMLParser API. I have found the link extraction features to be very useful even in this short space of time.

I would like to seek help with a program that I have to write. It involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to supply an HTML document as input and output the links in that document to either the command prompt or a text file (with suitable modifications where required, of course). I have an HTML document in which there is a hierarchy of links in the form of lists. I would like the output of the link information given by LinkExtractor to reflect this hierarchy in some way.

For example, I have a list of items in a <ul> tag. Each of these items may or may not contain its own sub-items with their own links, so that the HTML looks something like:

    <ul>
      <li> <a href="...."> Item 1 </a>
        <ul>
          <li> <a href="...."> Sub-Item 1 </a> </li>
          <li> <a href="...."> Sub-Item 2 </a> </li>
        </ul>
      <li> Item 2 </li>
    </ul>

I would like to know how I can parse a document full of lists like these and extract the links while having some indication of the hierarchy: either the "tree path" of the link (i.e. if I extract the link underlying Sub-Item 1 in my example, my text file should contain something along the lines of "Item 1 > Sub-Item 1" before printing the actual link path), or a page identical to the one I am parsing but with the full path of the link printed beside each of those list items.

Thanks for all your help in this regard.

Warm Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard, Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba... |
From: <bo...@ti...> - 2008-05-14 08:23:47
|
All the pages which don't work come from the same source, and they all have these meta tags. I believe there is an option to force decoding with a different character set, but the way I retrieve the pages I don't seem to have the opportunity to do so. If someone can give me a few lines of sample code showing how to do that, I would appreciate it. What I do at the moment is:

    parser = new Parser(URL);
    ThePage = parser.parse(null);
    MyPage = ThePage.toHtml();

That doesn't give me the opportunity to change the decoding. I believe you can read the page and then "force" decoding with a different character set, but I can't figure out how to do that. Is there an example somewhere of how to do this?

Thanks again
Brian

----- Original Message ----
There might be an issue between ISO-8859-1 and UTF-8. Here's a random explanation, out of many on the net: http://www.stanford.edu/~laurik/fsmbook/faq/utf8.html
You'll have to determine if the character you want has an encoding in ISO-8859-1. The parser should switch to interpreting in UTF-8 when it encounters the meta tag. Do all pages have the meta tag? Or just the ones that are OK? |
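One plausible explanation for the "?10 Free" symptom, given a header that says ISO-8859-1 and a meta tag that says utf-8, is that the bytes really are ISO-8859-1 but get decoded as UTF-8. The sketch below uses only the JDK (no HTML Parser calls) to show what that mismatch does to a pound sign; `CharsetMismatchDemo` is a made-up name, and whether this is what happens on the pages in question is an assumption, not something the thread confirms.

```java
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    /** Encodes text as ISO-8859-1 bytes, then decodes those bytes as UTF-8. */
    static String latin1ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Same bytes, decoded with the charset they were written in. */
    static String latin1ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00A310 Free"; // pound sign followed by "10 Free"
        // The lone byte 0xA3 is not valid UTF-8, so the decoder substitutes U+FFFD,
        // which many consoles and fonts then show as '?'.
        System.out.println((int) latin1ReadAsUtf8(original).charAt(0));
        System.out.println(latin1ReadAsLatin1(original).equals(original));
    }
}
```

If this is the cause, the fix is to decode with the charset the server actually uses rather than the one the meta tag claims.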
From: <bo...@ti...> - 2008-05-13 07:34:11
|
Thanks Derrick,

The relevant section of the ConnectionMonitor output is:

    INFO: HTTP/1.1 200 OK
    Cache-Control: private
    Content-Type: text/html; charset=ISO-8859-1
    Transfer-Encoding: chunked

Does that help?

Thanks
Brian |
From: Derrick O. <der...@ro...> - 2008-05-13 01:17:42
|
That <meta> tag doesn't look like the problem. If you use the built in ConnectionMonitor on the parser, you can see the header:

    C:>java -classpath parser\target\htmlparser.jar;lexer\target\htmllexer.jar org.htmlparser.Parser http://cbc.ca
    INFO: GET http://cbc.ca HTTP/1.1
    Accept-Encoding: gzip, deflate
    User-Agent: HTMLParser/2.0
    INFO: HTTP/1.1 301 Moved Permanently
    Date: Tue, 13 May 2008 01:12:31 GMT
    Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
    Location: http://www.cbc.ca/
    Cache-Control: max-age=120
    Expires: Tue, 13 May 2008 01:14:31 GMT
    Content-Length: 226
    Keep-Alive: timeout=15, max=150
    Connection: Keep-Alive
    Content-Type: text/html; charset=iso-8859-1
    INFO: GET http://www.cbc.ca/ HTTP/1.1
    Accept-Encoding: gzip, deflate
    User-Agent: HTMLParser/2.0
    INFO: HTTP/1.1 200 OK
    Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
    Accept-Ranges: bytes
    Content-Type: text/html
    Cache-Control: max-age=61
    Expires: Tue, 13 May 2008 01:13:32 GMT
    Date: Tue, 13 May 2008 01:12:31 GMT
    Content-Length: 28625
    Connection: keep-alive |
From: <bo...@ti...> - 2008-05-12 11:56:08
|
Thanks Derrick,

The page in question includes the following tags:

    <META http-equiv=Content-Type content="text/html; charset=utf-8">
    <META http-equiv=content-type>

I don't understand why the second one is there, but it really is. With that information, can you suggest a resolution? I am not entirely sure how to verify your point (1).

Best Regards
Brian

-----------------------------------------------------------------------------
There are two possibilities.
1) The HTTP server is/is not serving up content type meta information in the HTTP header like so: text/html; charset=utf-8
2) The source HTML does/does not contain a meta tag like so: <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
You need to determine which one so the appropriate 'fix' can be applied. |
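To tell the two possibilities apart, it can help to pull the charset out of each Content-Type value (the one from the HTTP header and the one from the meta tag's content attribute) and compare them. This is a small JDK-only sketch of that extraction; `ContentTypeCharset.charsetOf` is a hypothetical helper, not an HTML Parser method, and it handles only the simple `type; charset=value` form.

```java
import java.util.Locale;

final class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String header = charsetOf("text/html; charset=ISO-8859-1");
        String meta = charsetOf("text/html; charset=utf-8");
        // A disagreement between the two is a hint that one of them is wrong.
        System.out.println(header + " vs " + meta + ", same: " + header.equalsIgnoreCase(meta));
    }
}
```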
From: <bo...@ti...> - 2008-05-12 11:31:57
|
Hi,

I have a strange problem and I can't get my head around it. Hopefully someone can point me in the right direction. I'm using the following code with HTMLParser 1.6 to retrieve web pages:

    parser = new Parser (URL);
    ThePage = parser.parse (null);
    MyPage = ThePage.toHtml();

On some pages (not all), if the HTML page contains "£10 Free", MyPage contains "?10 Free"; on other pages it works fine. I guess it has something to do with character encoding? Can someone suggest what I should add, and where, to get this to work correctly? (I would like to keep the "£10 Free".)

Thanks in advance
Brian |
From: Derrick O. <der...@ro...> - 2008-05-11 13:03:18
|
A brute force approach would be to generate the parse tree in a NodeList with Parser.parse(null), then recursively traverse the tree, converting each sublist into text, until a plain text match occurs. In pseudo code the method would look something like this:

    findString (string, node_list)
        make a new StringBean
        apply visitAllNodesWith to the node list using the string_bean
        get the plain_text from the string_bean
        if string matches plain_text
            you are done, return the node_list
        else
            for each child in node_list
                try recursing into findString with the string and child |
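The pseudo code above can be sketched in plain Java as follows. To keep it runnable without the library, it uses a hypothetical `HtmlNode` class in place of HTML Parser's NodeList and StringBean, and it reads "matches" as "the subtree's whitespace-normalized text starts with the excerpt", which is my interpretation rather than something stated in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node: either a text node or an element with children. */
final class HtmlNode {
    final String text; // non-null only for text nodes
    final List<HtmlNode> children = new ArrayList<>();

    private HtmlNode(String text) {
        this.text = text;
    }

    static HtmlNode ofText(String t) {
        return new HtmlNode(t);
    }

    static HtmlNode ofChildren(HtmlNode... kids) {
        HtmlNode node = new HtmlNode(null);
        for (HtmlNode kid : kids) {
            node.children.add(kid);
        }
        return node;
    }

    /** Concatenated text of this subtree with runs of whitespace collapsed. */
    String plainText() {
        StringBuilder sb = new StringBuilder();
        collect(sb);
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    private void collect(StringBuilder sb) {
        if (text != null) {
            sb.append(text);
        }
        for (HtmlNode child : children) {
            child.collect(sb);
        }
    }
}

final class FindString {
    /** Returns the first subtree, searching top-down, whose text starts with the excerpt; null if none. */
    static HtmlNode find(String excerpt, HtmlNode node) {
        if (node.plainText().startsWith(excerpt)) {
            return node;
        }
        for (HtmlNode child : node.children) {
            HtmlNode hit = find(excerpt, child);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

Because the search is top-down, the node returned is the outermost container that begins with the excerpt, which is the "container tag" the question is after. It recomputes text for every subtree, so it is brute force in the same sense as the pseudo code.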
From: Davide T. <da...@ta...> - 2008-05-11 08:10:08
|
Unfortunately, I think I need to remember the container tag. Let me try to explain my problem better.

My aim is to extract all the text included in a tag that contains a substring. I have a list of excerpts from an RSS feed, and I need to extract the whole content of a web post knowing only the excerpt (the first sentence of the post). For example, I have this excerpt:

"Davide Taibi, Luigi Lavazza, and Sandro Morasca Università dell'Insubria People and organizations that are considering the adoption of OSS..."

and I have to extract the content of this post: http://www.taibi.it/?p=39

The first part of the excerpt is in a <strong> tag while the second is not. My idea is to find the container tag and then extract all of its content. Which strategy should I use?

Thanks
Davide |
From: Derrick O. <der...@ro...> - 2008-05-11 03:29:16
|
Do you want to keep the tags? If not, just use the StringBean to extract all the text and then look for the string to get its position.

If you need to keep the tags, it is more difficult. Someone else modified the StringBean to remember the node or offset of each piece of text added to the buffer. After a straight string comparison on the text, this list of nodes or offsets can be used to figure out the start and end nodes or offsets. From there you can extract the complete HTML.
|
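The offset-list idea described above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use HTML Parser or the modified StringBean mentioned in the reply, the class and method names (OffsetTextSearch, find) are made up for the example, and it naively treats everything between '<' and '>' as a tag without decoding entities.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive sketch, not HTML Parser code: anything between '<' and '>' counts as a tag. */
class OffsetTextSearch {

    /**
     * Looks for phrase in the tag-free text of html.
     * Returns {start, end} offsets into html (end exclusive), or null if not found.
     */
    static int[] find(String html, String phrase) {
        StringBuilder text = new StringBuilder();   // visible text, whitespace collapsed
        List<Integer> offsets = new ArrayList<>();  // offsets.get(i) = index in html of text.charAt(i)
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') inTag = false;
            } else if (c == '<') {
                inTag = true;
            } else if (Character.isWhitespace(c)) {
                // Collapse runs of whitespace, including runs split by a tag.
                if (text.length() > 0 && text.charAt(text.length() - 1) != ' ') {
                    text.append(' ');
                    offsets.add(i);
                }
            } else {
                text.append(c);
                offsets.add(i);
            }
        }
        String needle = phrase.trim().replaceAll("\\s+", " ");
        if (needle.isEmpty()) return null;
        int pos = text.indexOf(needle);             // straight string comparison on the text
        if (pos < 0) return null;
        int start = offsets.get(pos);
        int end = offsets.get(pos + needle.length() - 1) + 1;
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String html = "Dear All, <br/>After <i>hours of trying</i> to sort the <strong> problem with "
                + "<a href=\"xxxxxx.html\" >uploading pictures</a> </strong>to this thing I decided...";
        int[] hit = find(html, "After hours of trying to sort the problem with uploading");
        System.out.println(hit[0]);                          // 15
        System.out.println(html.substring(hit[0], hit[1]));  // matched span, tags included
    }
}
```

The returned span can cut through elements (here it ends inside the <a> element), so finding the enclosing container would still be a separate step, for example by walking up from the node that holds the start offset.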
From: Davide T. <da...@ta...> - 2008-05-10 21:30:42
|
Dear all,

I have a problem with regular expressions. I'd like to extract a block of text from an HTML page. I know how the text starts (the first 10 words), but I don't know whether there are any tags inside it.

In other words, I have to find out whether a sentence "A" occurs in an HTML page "B". My problem is that sentence "A" is plain text while "B" is HTML, so the sentence may be spread across several nested nodes, with tags or extra spaces between the words.

Example:

sentence a: "After hours of trying to sort the problem with uploading..."
sentence b: "Dear All, <br/>After <i>hours of trying</i> to sort the <strong> problem with <a href="xxxxxx.html" >uploading pictures</a> </strong>to this thing I decided..."

Sentence a should match sentence b at position 15.

I've tried the following, but it doesn't work:

protected static String extractContent(String html, String searchText)
        throws ParserException {
    Page page = new Page(html);
    Lexer lex = new Lexer(page);
    Parser parser = new Parser(lex);
    NodeList list = new NodeList();

    NodeFilter filter = new RegexFilter(searchText);
    for (NodeIterator it = parser.elements(); it.hasMoreNodes();) {
        it.nextNode().collectInto(list, filter);
    }
    if (list.size() > 0) {
        System.out.println("text found n." + list.size() + "times");
        return Translate.decode(list.toHtml());
    }
    else
        System.out.println("text not found");
    return null;
}

Thanks in advance,

Davide Taibi
http://www.taibi.it
|
From: Derrick O. <der...@ro...> - 2008-04-30 13:53:28
|
The parser would make a number of TextNodes out of that, separated by TagNodes with BR names. You'll need to handle this sort of partial extraction of mixed text and HTML yourself, possibly by just defining and registering a BrTag that prints <BR> even for the toText() method, and then calling toText() on the whole section.
|
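The effect described above (text output in which only the line breaks survive as markup) can be modelled in plain Java. This is a stand-in sketch, not HTML Parser code: it swaps the suggested BrTag registration for a simple character scan, and the names KeepBreaks and textWithBreaks are invented for the illustration.

```java
import java.util.Locale;

/** Plain-Java model: drop every tag except <br>, which is kept in the text output. */
class KeepBreaks {

    static String textWithBreaks(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) break;                 // unterminated tag: ignore the rest
                if (tagName(html.substring(i + 1, close)).equals("br")) {
                    out.append("<br>");               // the one tag that "prints itself"
                }
                i = close + 1;                        // every other tag is skipped
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** Leading letters/digits of the tag body, lower-cased; closing tags yield "". */
    private static String tagName(String inside) {
        int end = 0;
        while (end < inside.length() && Character.isLetterOrDigit(inside.charAt(end))) end++;
        return inside.substring(0, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(textWithBreaks("Lorem ipsum,<br> <b>consectetur</b> elit,<br/>sed do"));
        // Lorem ipsum,<br> consectetur elit,<br>sed do
    }
}
```

With HTML Parser itself, the same outcome would come from the custom tag class the reply describes; the scan above only shows the intended result on a section of markup that has already been isolated.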
From: Nagahiro D. <e27...@gm...> - 2008-04-30 03:00:16
|
Hello. I'm new to HTML Parser.

For example,

-----
<html>
<body>
<object id="aaa">
...
</object>
Lorem ipsum dolor sit amet,<br>
consectetur adipisicing elit,<br>
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
<object id="xxx">
...
</object>
</body>
</html>
-----

My question: how can I extract

-----
Lorem ipsum dolor sit amet,<br>
consectetur adipisicing elit,<br>
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
-----

?

I tried the toHtml method of TextNode, but it seems to ignore the <br> tag.

Thanks for help!
Anonymous Otaku
|
From: Derrick O. <der...@ro...> - 2008-04-15 01:06:41
|
Try this.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Text;
import org.htmlparser.lexer.Page;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.*;
import org.htmlparser.nodes.TextNode;

class MyText extends TextNode
{
    public MyText ()
    {
        super (null, 0, 0);
        mText = null;
    }

    public String toPlainTextString()
    {
        return (org.htmlparser.util.Translate.decode (super.toPlainTextString ()));
    }
}

public class Test
{
    public static void main (String[] args) throws ParserException
    {
        String html = "<html><body>\n<script>alert('hi');</script><select id=\"da\"></select>" +
            "<p><a href=\"http://googlelink.com\">123456</a><br/>" +
            "<h1>hello</h1><a href=cnn.com></a>\n" +
            "http://google.com</br>" +
            "<b>https://cnn.com/?test=3&2=d</b></p>\n" +
            "<table><tr><td>http://table.com</td></tr></table>" +
            "<a href=\"a.html\">123</a>\n<a href=\"http://www.alreadylinkified.com/\">http://www.alreadylinkified.com</a>\n</body></html>";

        PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
        factory.setTextPrototype (new MyText ());

        Parser parser = new Parser(html);
        parser.setNodeFactory(factory);
        NodeList all = parser.parse(null);
        System.out.println( all.toHtml());
    }
}
|
From: Jumbo P. <jum...@gm...> - 2008-04-14 02:11:54
|
Hello all,

I recently started getting the following error on

Parser p = new Parser(" http://sub.mydomain.com/somepage.jsp");

java.security.cert.CertificateException: No name matching sub.mydomain.com found

The page clearly exists but can't be opened for some reason. Has anyone ever seen this before, or does anyone know what could be causing it?

Thank you for your help.
|
From: Jay P. <jp...@se...> - 2008-04-09 15:35:35
|
Derrick,

When I do this, all the text in the text nodes is removed. How can I subclass TextNode?

Thanks
Jay

On 4/2/08 9:38 PM, "Derrick Oswald" <der...@ro...> wrote:
> Just add an empty string in the constructor call:
> factory.setTextPrototype (new TextNode ("") {
|
From: Derrick O. <der...@ro...> - 2008-04-03 01:38:16
|
Just add an empty string in the constructor call:

factory.setTextPrototype (new TextNode ("") {
|
From: Jay P. <jp...@se...> - 2008-04-02 19:48:27
|
// This code doesn't compile. It complains "The constructor TextNode() is undefined". I got this from the documentation and thought it was the way to override text nodes.
// My goal is to override TextNode so that I can process text and turn http://link.com into a real link <a href="link.com">link.com</a>
// Any ideas?

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Text;
import org.htmlparser.lexer.Page;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.*;
import org.htmlparser.nodes.TextNode;

public class Test
{
    public static void main (String[] args) throws ParserException
    {
        String html = "<html><body>\n<script>alert('hi');</script><select id=\"da\"></select>" +
            "<p><a href=\"http://googlelink.com\">123456</a><br/>" +
            "<h1>hello</h1><a href=cnn.com></a>\n" +
            "http://google.com</br>" +
            "<b>https://cnn.com/?test=3&2=d</b></p>\n" +
            "<table><tr><td>http://table.com</td></tr></table>" +
            "<a href=\"a.html\">123</a>\n<a href=\"http://www.alreadylinkified.com/\">http://www.alreadylinkified.com</a>\n</body></html>";

        PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
        factory.setTextPrototype (new TextNode () {
            public String toPlainTextString()
            {
                return (org.htmlparser.util.Translate.decode (super.toPlainTextString ()));
            }
        });

        Parser parser = new Parser(html);
        parser.setNodeFactory(factory);
        NodeList all = parser.parse(null);
        System.out.println( all.toHtml());
    }
}
|
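Independently of how the text node is overridden, the text-processing step itself (turning a bare URL into a link) could look roughly like the sketch below. It is plain Java with no HTML Parser calls; the class name Linkifier, the method linkify and the simplified URL pattern are assumptions made for the example, and it keeps the full URL in the href rather than dropping the scheme.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the text-processing step only: wrap bare http(s) URLs in anchor tags. */
class Linkifier {

    // Simplified pattern (an assumption): scheme, then characters up to whitespace, '<', '>' or '"'.
    private static final Pattern URL = Pattern.compile("https?://[^\\s<>\"]+");

    static String linkify(String text) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group();
            String link = "<a href=\"" + url + "\">" + url + "</a>";
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(link));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify("see http://google.com now"));
        // see <a href="http://google.com">http://google.com</a> now
    }
}
```

A function like this would only be safe to call on text that is not already inside an <a> element (the sample data above includes such a case), and trailing punctuation such as a full stop right after the URL would be swallowed by this pattern.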
From: Jumbo P. <jum...@gm...> - 2008-04-02 19:31:49
|
I figured it out. It's actually pretty simple. Here is the code. Thanks anyway.

Parser p = new Parser(url);
NodeList list = p.extractAllNodesThatMatch (
    new AndFilter (new TagNameFilter ("div"), new HasAttributeFilter("class", "body")));
StringBean sb = new StringBean();
list.visitAllNodesWith(sb);
System.out.println(sb.getStrings());
|
From: Joshua K. <jo...@in...> - 2008-04-01 23:38:51
|
You'll want to write your very own Visitor.

Something like this (I'm using an older version of htmlparser for this example):

public class DivVisitor extends NodeVisitor {

    public void visitTag(Tag tag) {
        // see if the tag is a div tag here and then check its attributes
        // if it matches what you want, collect it into something that this
        // visitor can return via some getter method
    }
}

Send your DivVisitor into the parser as you were doing with the ObjectFindingVisitor.

Hope that helps,
jk
|
From: Jumbo P. <jum...@gm...> - 2008-04-01 22:06:27
|
Thanks for the reply, Joshua. I think that's what I'm trying to do. The part I'm stuck on is how to specify that I only want the div tag that has the attribute class="body". Here is my code:

String contents = null;

Parser parser = new Parser(url);
ObjectFindingVisitor visitor = new ObjectFindingVisitor(Div.class);
parser.visitAllNodesWith(visitor);

Node[] nodes = visitor.getTags(); // do I really want to use getTags() here?
for (int i = 0; i < nodes.length; i++)
{
    // if nodes[i] has attribute class="body", then get the page text enclosed in the div tags
    // what to do here?
}

return contents;

Obviously I am new to htmlparser, so many thanks in advance.
|
From: Joshua K. <jo...@in...> - 2008-04-01 21:39:10
|
You could write your own NodeVisitor for this.

--jk
|
From: Jumbo P. <jum...@gm...> - 2008-04-01 18:54:09
|
Hello, I'm trying to extract only the page text inside div tags with the attribute class="body". Inside the div-body tags are other tags, e.g. h1, h2, p, etc., which themselves should be ignored but their enclosed text should be included with the rest of the body text. I'm using extractAllNodesThatMatch but I don't see where I can limit it only to the div tag with the attribute class="body". Can anyone figure this out? |