htmlparser-user Mailing List for HTML Parser (Page 31)
Brought to you by: derrickoswald
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 |  |  |  |  |  |  |  |  |  |  | 1 |  |
| 2002 | 7 |  | 9 | 50 | 20 | 47 | 37 | 32 | 30 | 11 | 37 | 47 |
| 2003 | 31 | 70 | 67 | 34 | 66 | 25 | 48 | 43 | 58 | 25 | 10 | 25 |
| 2004 | 38 | 17 | 24 | 25 | 11 | 6 | 24 | 42 | 13 | 17 | 13 | 44 |
| 2005 | 10 | 16 | 16 | 23 | 6 | 19 | 39 | 15 | 40 | 49 | 29 | 41 |
| 2006 | 28 | 24 | 52 | 41 | 31 | 34 | 22 | 12 | 11 | 11 | 11 | 4 |
| 2007 | 39 | 13 | 16 | 24 | 13 | 12 | 21 | 61 | 31 | 13 | 32 | 15 |
| 2008 | 7 | 8 | 14 | 12 | 23 | 20 | 9 | 6 | 2 | 7 | 3 | 2 |
| 2009 | 5 | 8 | 10 | 22 | 85 | 82 | 45 | 28 | 26 | 50 | 8 | 16 |
| 2010 | 3 | 11 | 39 | 56 | 80 | 64 | 49 | 48 | 16 | 3 | 5 | 5 |
| 2011 | 13 |  | 1 | 7 | 7 | 7 | 7 | 8 |  | 6 | 2 |  |
| 2012 | 5 |  | 3 | 3 | 4 | 8 | 1 | 5 | 10 | 3 | 2 | 4 |
| 2013 | 4 | 2 | 7 | 7 | 6 | 7 | 3 |  | 1 |  |  |  |
| 2014 |  | 2 | 1 |  | 3 | 1 |  |  | 1 | 4 | 2 | 4 |
| 2015 | 4 | 2 | 8 | 7 | 6 | 7 | 3 | 1 | 1 | 4 | 3 | 4 |
| 2016 | 4 | 6 | 9 | 9 | 6 | 1 | 1 |  |  | 1 | 1 | 1 |
| 2017 |  | 1 | 3 | 1 |  | 1 | 2 | 3 | 6 | 3 | 2 | 5 |
| 2018 | 3 | 13 | 28 | 5 | 4 | 2 | 2 | 8 | 2 | 1 | 5 | 1 |
| 2019 | 8 | 1 |  | 1 | 4 |  | 1 |  |  |  | 2 | 2 |
| 2020 |  |  | 1 | 1 | 1 | 2 | 1 | 1 | 1 |  | 1 | 1 |
| 2021 | 3 | 2 | 1 | 1 | 2 | 1 | 2 | 1 |  |  |  |  |
| 2022 |  |  |  | 1 | 1 | 1 |  | 1 |  |  |  |  |
| 2023 | 2 |  |  |  |  |  |  | 1 |  |  |  |  |
| 2024 | 2 |  |  |  |  |  |  |  |  |  |  |  |
| 2025 |  |  |  |  |  | 1 |  |  |  |  |  |  |
From: sebb <se...@gm...> - 2007-02-11 21:15:11
|
I'm hoping to release an updated version of JMeter before too long, and would like to include the updated htmlparser 2.0 (with the new license, for which many thanks). The current build of 2.0 is listed as an "Integration" build - is this OK to use, or is there going to be a formal release of 2.0? S.
From: Gavin G. <ga...@br...> - 2007-02-01 10:49:44
|
Ah, I see what you mean. Looking at the last url it seems even more redirects are taking place until it's arriving at the final url - I've counted around 7 so far. Regardless though, I'm still at a loss as to why the redirects aren't being processed and the cookie stuff handled in the end. Gavin. On Wed, Jan 31, 2007 at 06:00:17PM -0800, Derrick Oswald wrote: > > That should be handled OK. > But it doesn't look like what you want from that URL. > > ----- Original Message ---- > From: Gavin Gilmour <ga...@br...> > To: htmlparser user list <htm...@li...> > Sent: Tuesday, January 30, 2007 8:49:25 AM > Subject: Re: [Htmlparser-user] Cookie handling > > Wow, good point. The URL seems to load fine here in my browser but on further > inspection: > > sokar:~/junk% curl http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT\?CRETRY\=1\&SRETRY\=0 > <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> > <html><head> > <title>302 Found</title> > </head><body> > <h1>Found</h1> > <p>The document has moved <a > href="http://www3.interscience.wiley.com/cookie_setting_error.html">here</a>.</p> > </body></html> > > Seems to have already decided it's dud, weird. > > After a bit of investigating, the full story is a bit worse than I thought and > is involving multiple redirects. The first link I need comes back from this service: > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks > - which just offers up a 302 or whatever and then issues a redirect. > > Fair enough, so: > > --- > sokar:~/junk% curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks' > <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> > <html><head> > <title>302 Found</title> > </head><body> > <h1>Found</h1> > <p>The document has moved <a > href="http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M">here</a>.</p> > </body></html> > --- > > Is giving 'http://dx...' which is what the parser is trying next I'd imagine. So then: > > sokar:~/junk% curl 'http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M' > <HTML><HEAD><TITLE>Handle Redirect</TITLE></HEAD> > <BODY><A > HREF="http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M">http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M</A></BODY></HTML> > > Looking at this URL, it seems to be the 'final one' which is leading to the > (desired) destination in a browser. (Does that output even look like something > the parser would handle though?) > > What a mess :( > > Gavin. > > P.S. Sorry about the horribly formatted mail due to the unsightly urls > involved.
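The last hop in the chain above is the sticking point: the dx.doi.org "Handle Redirect" page comes back as a 200 with no Location header, so its only forward pointer is the anchor tag in the body, which HTTP redirection processing cannot follow. A client has to pull the HREF out of the page itself. Here is a minimal, hypothetical sketch of that extraction; the class and method names are illustrative and not part of the htmlparser API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HandleRedirectExtractor {

    // Matches the HREF attribute of the first anchor tag in the body.
    private static final Pattern HREF =
        Pattern.compile("<A\\s+HREF=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the target URL of a "Handle Redirect" style page, or null. */
    public static String extractTarget(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shape of the dx.doi.org response quoted above (URL shortened here).
        String page = "<HTML><HEAD><TITLE>Handle Redirect</TITLE></HEAD>"
            + "<BODY><A HREF=\"http://doi.wiley.com/10.1002/example\">"
            + "http://doi.wiley.com/10.1002/example</A></BODY></HTML>";
        System.out.println(extractTarget(page));
    }
}
```

A regex is enough for this one fixed page shape; for arbitrary HTML the parser's own LinkTag extraction would be the more robust route.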
From: Derrick O. <der...@ro...> - 2007-02-01 02:00:37
|
That should be handled OK.
But it doesn't look like what you want from that URL.

----- Original Message ----
From: Gavin Gilmour <gavin@brokentrain.net>
To: htmlparser user list <htm...@li...urceforge.net>
Sent: Tuesday, January 30, 2007 8:49:25 AM
Subject: Re: [Htmlparser-user] Cookie handling

Wow, good point. The URL seems to load fine here in my browser but on further inspection:

sokar:~/junk% curl http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT\?CRETRY\=1\&SRETRY\=0
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www3.interscience.wiley.com/cookie_setting_error.html">here</a>.</p>
</body></html>

Seems to have already decided it's dud, weird.

After a bit of investigating, the full story is a bit worse than I thought and is involving multiple redirects. The first link I need comes back from this service:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks
- which just offers up a 302 or whatever and then issues a redirect.

Fair enough, so:

---
sokar:~/junk% curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks'
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M">here</a>.</p>
</body></html>
---

Is giving 'http://dx...' which is what the parser is trying next I'd imagine. So then:

sokar:~/junk% curl 'http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M'
<HTML><HEAD><TITLE>Handle Redirect</TITLE></HEAD>
<BODY><A HREF="http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M">http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M</A></BODY></HTML>

Looking at this URL, it seems to be the 'final one' which is leading to the (desired) destination in a browser. (Does that output even look like something the parser would handle though?)

What a mess :(

Gavin.

P.S. Sorry about the horribly formatted mail due to the unsightly urls involved.
From: Gavin G. <ga...@br...> - 2007-01-30 13:49:39
|
Wow, good point. The URL seems to load fine here in my browser but on further inspection: sokar:~/junk% curl http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT\?CRETRY\=1\&SRETRY\=0 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>302 Found</title> </head><body> <h1>Found</h1> <p>The document has moved <a href="http://www3.interscience.wiley.com/cookie_setting_error.html">here</a>.</p> </body></html> Seems to have already decided it's dud, weird. After a bit of investigating, the full story is a bit worse than I thought and is involving multiple redirects. The first link I need comes back from this service: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks - which just offers up a 302 or whatever and then issues a redirect. Fair enough, so: --- sokar:~/junk% curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks' <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>302 Found</title> </head><body> <h1>Found</h1> <p>The document has moved <a href="http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M">here</a>.</p> </body></html> --- Is giving 'http://dx...' which is what the parser is trying next I'd imagine. So then: sokar:~/junk% curl 'http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M' <HTML><HEAD><TITLE>Handle Redirect</TITLE></HEAD> <BODY><A HREF="http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M">http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M</A></BODY></HTML> Looking at this URL, it seems to be the 'final one' which is leading to the (desired) destination in a browser. (Does that output even look like something the parser would handle though?) What a mess :( Gavin. P.S.
Sorry about the horribly formatted mail due to the unsightly urls involved. On Tue, Jan 30, 2007 at 01:22:33PM +0000, sebb wrote: > Have you tried pasting the URL in a browser? > > If it works for you - it did not for me - perhaps the browser has > secretly fed the site with a tasty cookie it kept around for just such > a purpose ... > > On 30/01/07, Gavin Gilmour <ga...@br...> wrote: > > Hi Derrick, cheers for the quick reply. > > > > I knew I'd regret posting the code I did because the actual code I'm using does > > in fact have redirection processing enabled. (I had managed to mix up similar > > code elsewhere that actually needs the redirection stuff disabled.) > > > > Here's a quick example of the sort of problems I'm having (including a url of a > > page causing issues): > > > > --- > > import org.htmlparser.*; > > import org.htmlparser.http.*; > > import org.htmlparser.util.*; > > > > public class TestParser { > > > > public static void main(String[] argv) { > > > > try { > > ConnectionManager manager = Parser.getConnectionManager(); > > manager.setRedirectionProcessingEnabled(true); > > manager.setCookieProcessingEnabled(true); > > Parser parser = new Parser("http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT?CRETRY=1&SRETRY=0"); > > System.out.print(((NodeList)parser.parse(null)).toHtml()); > > } catch (ParserException pe) { > > pe.printStackTrace(); > > } > > } > > } > > > > sokar:~/junk% javac TestParser.java > > sokar:~/junk% java TestParser | grep -i --color Cookie > > > > <title>Wiley InterScience :: Session Cookies</title> > > <h2>Session Cookie Error</h2> > > <strong>An error has occurred because we were unable to send a cookie to your > > web browser.</strong> Session cookies are commonly used to facilitate improved > > site navigation. > > In order to use Wiley InterScience you must have your browser set to accept > > cookies. Once you have logged in to Wiley InterScience, our Web server uses a > > temporary cookie to help us manage your visit. > > This <strong>Session Cookie</strong> is deleted when you logoff Wiley > > InterScience, or when you quit your browser. The cookie allows us to quickly > > determine your access control rights and your personal preferences during your > > online session. The Session Cookie is set out of necessity and not out of > > convenience. > > -- > > > > > > I'm getting the feeling that maybe the sites are using some sort of strange > > technique for restricting access, but I've similar examples of other sites > > doing the same kind of things. > > > > Cheers again, > > > > Gavin. > > > > On Tue, Jan 30, 2007 at 04:36:49AM -0800, Derrick Oswald wrote: > > > Hello, > > > > > > In my experience, you will need both redirection processing and cookie processing. > > > The general scenario is a page is returned with cookies and a redirect. > > > If the redirect is taken, the server looks for the cookie it tried to set with the first page. > > > You can monitor the traffic in the header by implementing ConnectionMonitor (like the Parser). > > > > > > Derrick > > > > > > ----- Original Message ---- > > > From: Gavin Gilmour <ga...@br...> > > > To: htm...@li... > > > Sent: Tuesday, January 30, 2007 5:59:41 AM > > > Subject: [Htmlparser-user] Cookie handling > > > > > > Hi, I hope this is the correct place to ask - the mailing list is looking pretty > > > quiet according to sourceforge's mail archives :) > > > > > > I've run into difficulty parsing certain sites that attempt to set cookies in > > > an effort to read them later. The sites in question generally spit back pages > > > saying that your browser or whatever is misconfigured and deny access. I assume > > > this is simply because they can't read the cookies that should've been set and > > > I wondered if there was an option to do so somewhere.
> > > > > > I'm not trying to set *actual* cookies which most of the examples I've seen > > > indicate how to do, but just allow pages to dump cookies and read them later. > > > Is this possible or (possibly) unrealistic and difficult to implement? > > > > > > The current code I've got is: > > > > > > ... > > > ConnectionManager manager = Parser.getConnectionManager(); > > > manager.setRedirectionProcessingEnabled(false); > > > manager.setCookieProcessingEnabled(true); > > > Parser parser = new Parser(...); > > > ---- > > > > > > etc. > > > > > > Cheers for any responses! > > > > > > Gavin.
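Derrick's point in this thread is that the first response sets a cookie and redirects, and the follow-up request must replay that cookie or the server bounces you to an error page. The same round trip can be illustrated with the JDK's own cookie jar; this is a sketch using java.net.CookieManager rather than htmlparser's ConnectionManager, and the URLs and cookie values are made up:

```java
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CookieRoundTrip {

    /**
     * Stores a Set-Cookie header from one response and returns the Cookie
     * header value that would be replayed on a follow-up (redirected) request.
     */
    public static String replayedCookie() throws Exception {
        CookieManager jar = new CookieManager();

        // Simulate the server's first response setting a session cookie.
        URI first = new URI("http://example.com/login");
        Map<String, List<String>> response = Collections.singletonMap(
                "Set-Cookie", Collections.singletonList("SESSION=abc123; Path=/"));
        jar.put(first, response);

        // The redirected request to the same host gets the cookie back.
        Map<String, List<String>> request = jar.get(
                new URI("http://example.com/abstract"),
                Collections.<String, List<String>>emptyMap());
        return request.get("Cookie").get(0);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(replayedCookie());
    }
}
```

When the jar is wired into real traffic (CookieHandler.setDefault) the replay happens automatically across redirects, which is exactly the behavior the Wiley site is checking for.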
From: Gavin G. <ga...@br...> - 2007-01-30 13:22:48
|
Ach, I thought I was on to something there when I saw the method: 'static void setConnectionManager(ConnectionManager manager)' sitting in the API over at http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#main(java.lang.String[]) - However, adding Parser.setConnectionManager(manager) to the code below hasn't seemed to help :/ Gavin. On Tue, Jan 30, 2007 at 01:16:20PM +0000, Gavin Gilmour wrote: > Hi Derrick, cheers for the quick reply. > > I knew I'd regret posting the code I did because the actual code I'm using does > in fact have redirection processing enabled. (I had managed to mix up similar > code elsewhere that actually needs the redirection stuff disabled.) > > Here's a quick example of the sort of problems I'm having (including a url of a > page causing issues): > > --- > import org.htmlparser.*; > import org.htmlparser.http.*; > import org.htmlparser.util.*; > > public class TestParser { > > public static void main(String[] argv) { > > try { > ConnectionManager manager = Parser.getConnectionManager(); > manager.setRedirectionProcessingEnabled(true); > manager.setCookieProcessingEnabled(true); > Parser parser = new Parser("http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT?CRETRY=1&SRETRY=0"); > System.out.print(((NodeList)parser.parse(null)).toHtml()); > } catch (ParserException pe) { > pe.printStackTrace(); > } > } > } > > sokar:~/junk% javac TestParser.java > sokar:~/junk% java TestParser | grep -i --color Cookie > > <title>Wiley InterScience :: Session Cookies</title> > <h2>Session Cookie Error</h2> > <strong>An error has occurred because we were unable to send a cookie to your > web browser.</strong> Session cookies are commonly used to facilitate improved > site navigation. > In order to use Wiley InterScience you must have your browser set to accept > cookies. Once you have logged in to Wiley InterScience, our Web server uses a > temporary cookie to help us manage your visit. > This <strong>Session Cookie</strong> is deleted when you logoff Wiley > InterScience, or when you quit your browser. The cookie allows us to quickly > determine your access control rights and your personal preferences during your > online session. The Session Cookie is set out of necessity and not out of > convenience. > -- > > > I'm getting the feeling that maybe the sites are using some sort of strange > technique for restricting access, but I've similar examples of other sites > doing the same kind of things. > > Cheers again, > > Gavin. > > On Tue, Jan 30, 2007 at 04:36:49AM -0800, Derrick Oswald wrote: > > Hello, > > > > In my experience, you will need both redirection processing and cookie processing. > > The general scenario is a page is returned with cookies and a redirect. > > If the redirect is taken, the server looks for the cookie it tried to set with the first page. > > You can monitor the traffic in the header by implementing ConnectionMonitor (like the Parser). > > > > Derrick > > > > ----- Original Message ---- > > From: Gavin Gilmour <ga...@br...> > > To: htm...@li... > > Sent: Tuesday, January 30, 2007 5:59:41 AM > > Subject: [Htmlparser-user] Cookie handling > > > > Hi, I hope this is the correct place to ask - the mailing list is looking pretty > > quiet according to sourceforge's mail archives :) > > > > I've run into difficulty parsing certain sites that attempt to set cookies in > > an effort to read them later. The sites in question generally spit back pages > > saying that your browser or whatever is misconfigured and deny access. I assume > > this is simply because they can't read the cookies that should've been set and > > I wondered if there was an option to do so somewhere. > > > > I'm not trying to set *actual* cookies which most of the examples I've seen > > indicate how to do, but just allow pages to dump cookies and read them later. > > Is this possible or (possibly) unrealistic and difficult to implement? > > > > The current code I've got is: > > > > ... > > ConnectionManager manager = Parser.getConnectionManager(); > > manager.setRedirectionProcessingEnabled(false); > > manager.setCookieProcessingEnabled(true); > > Parser parser = new Parser(...); > > ---- > > > > etc. > > > > Cheers for any responses! > > > > Gavin.
From: sebb <se...@gm...> - 2007-01-30 13:22:38
|
Have you tried pasting the URL in a browser? If it works for you - it did not for me - perhaps the browser has secretly fed the site with a tasty cookie it kept around for just such a purpose ... On 30/01/07, Gavin Gilmour <ga...@br...> wrote: > Hi Derrick, cheers for the quick reply. > > I knew I'd regret posting the code I did because the actual code I'm using does > in fact have redirection processing enabled. (I had managed to mix up similar > code elsewhere that actually needs the redirection stuff disabled.) > > Here's a quick example of the sort of problems I'm having (including a url of a > page causing issues): > > --- > import org.htmlparser.*; > import org.htmlparser.http.*; > import org.htmlparser.util.*; > > public class TestParser { > > public static void main(String[] argv) { > > try { > ConnectionManager manager = Parser.getConnectionManager(); > manager.setRedirectionProcessingEnabled(true); > manager.setCookieProcessingEnabled(true); > Parser parser = new Parser("http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT?CRETRY=1&SRETRY=0"); > System.out.print(((NodeList)parser.parse(null)).toHtml()); > } catch (ParserException pe) { > pe.printStackTrace(); > } > } > } > > sokar:~/junk% javac TestParser.java > sokar:~/junk% java TestParser | grep -i --color Cookie > > <title>Wiley InterScience :: Session Cookies</title> > <h2>Session Cookie Error</h2> > <strong>An error has occurred because we were unable to send a cookie to your > web browser.</strong> Session cookies are commonly used to facilitate improved > site navigation. > In order to use Wiley InterScience you must have your browser set to accept > cookies. Once you have logged in to Wiley InterScience, our Web server uses a > temporary cookie to help us manage your visit. > This <strong>Session Cookie</strong> is deleted when you logoff Wiley > InterScience, or when you quit your browser. The cookie allows us to quickly > determine your access control rights and your personal preferences during your > online session. The Session Cookie is set out of necessity and not out of > convenience. > -- > > > I'm getting the feeling that maybe the sites are using some sort of strange > technique for restricting access, but I've similar examples of other sites > doing the same kind of things. > > Cheers again, > > Gavin. > > On Tue, Jan 30, 2007 at 04:36:49AM -0800, Derrick Oswald wrote: > > Hello, > > > > In my experience, you will need both redirection processing and cookie processing. > > The general scenario is a page is returned with cookies and a redirect. > > If the redirect is taken, the server looks for the cookie it tried to set with the first page. > > You can monitor the traffic in the header by implementing ConnectionMonitor (like the Parser). > > > > Derrick > > > > ----- Original Message ---- > > From: Gavin Gilmour <ga...@br...> > > To: htm...@li... > > Sent: Tuesday, January 30, 2007 5:59:41 AM > > Subject: [Htmlparser-user] Cookie handling > > > > Hi, I hope this is the correct place to ask - the mailing list is looking pretty > > quiet according to sourceforge's mail archives :) > > > > I've run into difficulty parsing certain sites that attempt to set cookies in > > an effort to read them later. The sites in question generally spit back pages > > saying that your browser or whatever is misconfigured and deny access. I assume > > this is simply because they can't read the cookies that should've been set and > > I wondered if there was an option to do so somewhere. > > > > I'm not trying to set *actual* cookies which most of the examples I've seen > > indicate how to do, but just allow pages to dump cookies and read them later. > > Is this possible or (possibly) unrealistic and difficult to implement? > > > > The current code I've got is: > > > > ... > > ConnectionManager manager = Parser.getConnectionManager(); > > manager.setRedirectionProcessingEnabled(false); > > manager.setCookieProcessingEnabled(true); > > Parser parser = new Parser(...); > > ---- > > > > etc. > > > > Cheers for any responses! > > > > Gavin.
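Gavin counted around seven hops in the chain under discussion. When redirection processing is done by hand rather than left to the library, the usual shape is a loop with a hop limit, so a long chain or an outright cycle cannot spin forever. A toy sketch of that logic, with the chain modeled as a plain map and entirely made-up URLs (no network involved):

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectChain {

    /**
     * Follows a chain of redirects (represented here as a simple
     * url -> redirect-target map) up to maxHops, returning the final URL.
     * Throws if the chain exceeds maxHops, acting as a loop guard.
     */
    public static String resolve(Map<String, String> redirects,
                                 String start, int maxHops) {
        String current = start;
        for (int hop = 0; hop < maxHops; hop++) {
            String next = redirects.get(current);
            if (next == null) {
                return current; // no further redirect: final destination
            }
            current = next;
        }
        throw new IllegalStateException("too many redirects from " + start);
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<String, String>();
        redirects.put("http://eutils.example/elink", "http://dx.example/doi");
        redirects.put("http://dx.example/doi", "http://doi.example/final");
        System.out.println(resolve(redirects, "http://eutils.example/elink", 10));
    }
}
```

In real code each hop would be an HttpURLConnection whose Location header supplies the next URL; the map here just isolates the bounded-follow logic.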
From: Gavin G. <ga...@br...> - 2007-01-30 13:16:38
|
Hi Derrick, cheers for the quick reply. I knew I'd regret posting the code I did because the actual code I'm using does in fact have redirection processing enabled. (I had managed to mix up similar code elsewhere that actually needs the redirection stuff disabled.) Here's a quick example of the sort of problems I'm having (including a url of a page causing issues): --- import org.htmlparser.*; import org.htmlparser.http.*; import org.htmlparser.util.*; public class TestParser { public static void main(String[] argv) { try { ConnectionManager manager = Parser.getConnectionManager(); manager.setRedirectionProcessingEnabled(true); manager.setCookieProcessingEnabled(true); Parser parser = new Parser("http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT?CRETRY=1&SRETRY=0"); System.out.print(((NodeList)parser.parse(null)).toHtml()); } catch (ParserException pe) { pe.printStackTrace(); } } } sokar:~/junk% javac TestParser.java sokar:~/junk% java TestParser | grep -i --color Cookie <title>Wiley InterScience :: Session Cookies</title> <h2>Session Cookie Error</h2> <strong>An error has occurred because we were unable to send a cookie to your web browser.</strong> Session cookies are commonly used to facilitate improved site navigation. In order to use Wiley InterScience you must have your browser set to accept cookies. Once you have logged in to Wiley InterScience, our Web server uses a temporary cookie to help us manage your visit. This <strong>Session Cookie</strong> is deleted when you logoff Wiley InterScience, or when you quit your browser. The cookie allows us to quickly determine your access control rights and your personal preferences during your online session. The Session Cookie is set out of necessity and not out of convenience. -- I'm getting the feeling that maybe the sites are using some sort of strange technique for restricting access, but I've similar examples of other sites doing the same kind of things. Cheers again, Gavin.
On Tue, Jan 30, 2007 at 04:36:49AM -0800, Derrick Oswald wrote: > Hello, > > In my experience, you will need both redirection processing and cookie processing. > The general scenario is a page is returned with cookie's and a redirect. > If the redirect is taken, the server looks for the cookie it tried to set with the first page. > You can monitor the traffic in the header by implementing ConnectionMonitor (like the Parser). > > Derrick > > ----- Original Message ---- > From: Gavin Gilmour <ga...@br...> > To: htm...@li... > Sent: Tuesday, January 30, 2007 5:59:41 AM > Subject: [Htmlparser-user] Cookie handling > > Hi, I hope this is the correct place to ask the mailing list is looking pretty > quiet according to sourceforge's mail archives :) > > I've ran into difficulty parsing certain sites that attempt to set cookies in > an effort to read them later. The sites in question generally spit back pages > saying that your browser or whatever is misconfigured and deny access. I assume > this is simply because they can't read the cookies that should've been set and > I wondered if there was an option to do so somewhere. > > I'm not trying to set *actual* cookies which most of the examples I've seen > indicate how to do, but just allow pages to dump cookies and read them later. > Is this possible or (possibly) unrealistic and difficult to implement? > > The current code I've got is: > > ... > ConnectionManager manager = Parser.getConnectionManager(); > manager.setRedirectionProcessingEnabled(false); > manager.setCookieProcessingEnabled(true); > Parser parser = new Parser(...); > ---- > > etc. > > > Cheers for any responses! > > Gavin. > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. 
Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
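The redirect-plus-cookie handshake Derrick describes above — collect the Set-Cookie headers from the first response, then replay them on the redirected request — can be sketched without the library. The class below is a minimal, hypothetical helper (the name `CookieJar` and its methods are mine, not part of HTMLParser); it only keeps the name=value pair of each cookie and ignores attributes and expiry.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CookieJar {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    // Record each "Set-Cookie: name=value; attr..." header, keeping only name=value.
    public void store(List<String> setCookieHeaders) {
        if (setCookieHeaders == null) return;
        for (String header : setCookieHeaders) {
            String pair = header.split(";", 2)[0];   // drop Path=, Expires=, ...
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
    }

    // Build the value for a "Cookie:" request header on the follow-up (redirected) request.
    public String toHeader() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (sb.length() > 0) sb.append("; ");
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

With cookie processing enabled, HTMLParser's ConnectionManager is doing essentially this bookkeeping internally; the sketch is only meant to show what to look for when monitoring headers.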
From: Derrick O. <der...@ro...> - 2007-01-30 12:36:59
|
Hello,

In my experience, you will need both redirection processing and cookie processing.
The general scenario is a page is returned with cookies and a redirect.
If the redirect is taken, the server looks for the cookie it tried to set with the first page.
You can monitor the traffic in the header by implementing ConnectionMonitor (like the Parser).

Derrick

----- Original Message ----
From: Gavin Gilmour <gavin@brokentrain.net>
To: htm...@li...
Sent: Tuesday, January 30, 2007 5:59:41 AM
Subject: [Htmlparser-user] Cookie handling

Hi, I hope this is the correct place to ask; the mailing list is looking pretty quiet according to sourceforge's mail archives :)

I've run into difficulty parsing certain sites that attempt to set cookies in an effort to read them later. The sites in question generally spit back pages saying that your browser or whatever is misconfigured and deny access. I assume this is simply because they can't read the cookies that should've been set, and I wondered if there was an option to do so somewhere.

I'm not trying to set *actual* cookies, which most of the examples I've seen indicate how to do, but just allow pages to dump cookies and read them later. Is this possible or (possibly) unrealistic and difficult to implement?

The current code I've got is:

...
ConnectionManager manager = Parser.getConnectionManager();
manager.setRedirectionProcessingEnabled(false);
manager.setCookieProcessingEnabled(true);
Parser parser = new Parser(...);
----

etc.

Cheers for any responses!

Gavin. |
From: Gavin G. <ga...@br...> - 2007-01-30 10:59:55
|
Hi, I hope this is the correct place to ask; the mailing list is looking pretty quiet according to sourceforge's mail archives :)

I've run into difficulty parsing certain sites that attempt to set cookies in an effort to read them later. The sites in question generally spit back pages saying that your browser or whatever is misconfigured and deny access. I assume this is simply because they can't read the cookies that should've been set, and I wondered if there was an option to do so somewhere.

I'm not trying to set *actual* cookies, which most of the examples I've seen indicate how to do, but just allow pages to dump cookies and read them later. Is this possible or (possibly) unrealistic and difficult to implement?

The current code I've got is:

...
ConnectionManager manager = Parser.getConnectionManager();
manager.setRedirectionProcessingEnabled(false);
manager.setCookieProcessingEnabled(true);
Parser parser = new Parser(...);
----

etc.

Cheers for any responses!

Gavin. |
From: Martin S. <mst...@gm...> - 2007-01-25 15:16:34
|
2007/1/25, Martin Sturm <mst...@gm...>:
>
> I looked into the source code of HTMLParser 2.0 and the current
> behaviour of HTMLParser is:
> - use the charset defined by the "Content-Type" field in the HTTP header
> - change to the charset defined using a META declaration with
> "http-equiv" if it differs from the charset defined by the HTTP header.
>
> This last step is causing the error in the Microsoft.com example. The
> HTTP headers define the charset utf-8, the first META declaration
> changes this to UTF-16 and the second META declaration (however, this
> declaration is after the TITLE tag) changes this back to UTF-8.
> I think the correct behaviour should be: use the charset defined by
> the HTTP header if it differs from the default charset (which is
> ISO-8859-1, aka Latin-1), and only use the charset defined by a META
> declaration if the HTTP headers define no charset or the default
> (ISO-8859-1).

I've created a small patch which includes this behaviour. See http://sourceforge.net/support/tracker.php?aid=1644504

This solves my problem and closes (as far as I can see) bug http://sourceforge.net/tracker/index.php?func=detail&aid=1592517&group_id=24399&atid=381399

-- Martin Sturm |
From: Martin S. <mst...@gm...> - 2007-01-25 13:22:24
|
2007/1/25, Martin Sturm <mst...@gm...>:
> I'm not sure if it is allowed by the html-specifications to define the
> charset multiple times, but I guess not. I don't think it is really a
> bug in HTMLParser, but if it took the last defined charset (utf-8) it
> would parse the site correctly. Why doesn't HTMLParser do this?

I did some more research on this issue. The W3C specification for HTML 4.01 (which applies to this document, because it is an HTML 4 document according to the first line) says:

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.

I looked into the source code of HTMLParser 2.0 and the current behaviour of HTMLParser is:
- use the charset defined by the "Content-Type" field in the HTTP header
- change to the charset defined using a META declaration with "http-equiv" if it differs from the charset defined by the HTTP header.

This last step is causing the error in the Microsoft.com example. The HTTP headers define the charset utf-8, the first META declaration changes this to UTF-16 and the second META declaration (however, this declaration is after the TITLE tag) changes this back to UTF-8.

I think the correct behaviour should be: use the charset defined by the HTTP header if it differs from the default charset (which is ISO-8859-1, aka Latin-1), and only use the charset defined by a META declaration if the HTTP headers define no charset or the default (ISO-8859-1).

-- Martin Sturm |
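Martin's proposed rule — trust the HTTP header unless it is missing or merely the ISO-8859-1 default, and only then let a META declaration override — can be written as a small pure function. This is a sketch of the proposed behaviour, not the actual HTMLParser code (the class name `CharsetResolver` is mine):

```java
public class CharsetResolver {
    static final String DEFAULT = "ISO-8859-1";

    // Return the charset to decode with, given the HTTP "Content-Type" charset
    // parameter (may be null) and a META http-equiv charset (may be null).
    // An explicit, non-default HTTP charset always wins; the META declaration
    // is consulted only when the header is absent or just the default.
    public static String resolve(String httpCharset, String metaCharset) {
        if (httpCharset != null && !httpCharset.equalsIgnoreCase(DEFAULT)) {
            return httpCharset;                        // explicit header wins
        }
        if (metaCharset != null) {
            return metaCharset;                        // fall back to the document
        }
        return (httpCharset != null) ? httpCharset : DEFAULT;
    }
}
```

On the Microsoft.com case this returns utf-8 regardless of the stray utf-16 META declaration, because the server sent an explicit utf-8 header.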
From: Martin S. <mst...@gm...> - 2007-01-25 11:37:52
|
2007/1/16, Martin Sturm <mst...@gm...>:
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is UTF-8, but in HTML META tags this is changed to UTF-16.

Today, I decided I wanted to know exactly what was going wrong. It turned out that in the HTML code of www.microsoft.com, the charset is defined twice using meta-tags (http-equiv="Content-Type"), first as utf-16 and after that as utf-8. The actual encoding is apparently utf-8, because if I remove the meta-tag for utf-16 from the html, the page is parsed correctly. Below is the offending HTML code:

<html lang="en" dir="ltr">
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="SearchTitle" content="Microsoft.com">
<meta name="SearchDescription" content="Microsoft.com Homepage">

I'm not sure if it is allowed by the html-specifications to define the charset multiple times, but I guess not. I don't think it is really a bug in HTMLParser, but if it took the last defined charset (utf-8) it would parse the site correctly. Why doesn't HTMLParser do this? |
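To see both conflicting declarations programmatically, a simple regex scan is enough — no parser required. This is an illustrative sketch (the class name `MetaCharsets` is mine, and the regex deliberately only handles the common `charset=xxx` form, not every legal variation of the attribute):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsets {
    // Matches charset=xxx inside a <meta ...> tag, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "<meta[^>]*charset=([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Collect every charset declared via a META tag, in document order.
    public static List<String> declared(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = CHARSET.matcher(html);
        while (m.find()) {
            out.add(m.group(1).toLowerCase());
        }
        return out;
    }
}
```

Running this over the snippet Martin quotes yields [utf-16, utf-8] — the contradiction that trips up the parser's mid-stream encoding switch.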
From: Soumya <sou...@ya...> - 2007-01-18 13:41:12
|
Hi,

I have spent a couple of days getting htmlparser to work the way I need it to. The problem I have now is that my page has a nested table (tables inside a table) and I want the individual table contents. Through NodeList I am able to loop and get the individual tables. Now, I want to get only the text and not the html, so I thought of going by the remark by Nick Burch in the StringBean class as follows:

 * You can also use the StringBean as a NodeVisitor on your own parser,
 * in which case you have to refetch your page if you change one of the
 * properties because it resets the Strings property:</p>
 * <pre>
 * StringBean sb = new StringBean ();
 * Parser parser = new Parser ("http://cbc.ca");
 * parser.visitAllNodesWith (sb);
 * String s = sb.getStrings ();
 * sb.setLinks (true);
 * parser.reset ();
 * parser.visitAllNodesWith (sb);
 * String sl = sb.getStrings ();

My Parser.getStringsForMe() code looks like the following:

public String getStringsForMe() {
    StringBuffer resultPage = new StringBuffer();
    this.setResource(pageURL);
    NodeList nl = this.parse(null);
    TagNameFilter tableFilter = new TagNameFilter("table");
    tableFilter.setName("totalTable");
    NodeList tableNodes = nl.extractAllNodesThatMatch(tableFilter, true);
    SimpleNodeIterator it = tableNodes.elements();
    StringBean sb = new StringBean();
    while (it.hasMoreNodes()) {
        Node node = it.nextNode();
        if (node.getText().contains("id=\"totalTable\"")) {
            String inputHTML = node.toHtml();
            this.setInputHTML(inputHTML); // set the content to parse
            NodeList totalTableChildren = this.parse(new TagNameFilter("table"));
            SimpleNodeIterator it1 = totalTableChildren.elements();
            while (it1.hasMoreNodes()) {
                Node childTableNode = it1.nextNode();
                String tableContent = childTableNode.toString();
                this.setInputHTML(tableContent);
                this.visitAllNodesWith(sb);
                resultPage.append(sb.getStrings());
                this.reset();
                sb.setLinks(false);
            }
        }
    }
    return resultPage.toString();
}

The sb.getStrings() returns the same text every time, since StringBean.getStrings() checks if mStrings is null and then proceeds to set / update strings. In a case like the above, where the StringBean is repeatedly used to getStrings(), after the first visit StringBean.mStrings remains non-null, and hence the newly extracted contents in StringBean.mBuffer are not appended to mStrings. I have made it work for me. I needed to know if there is anything else I am missing, or if this is a bug for cases like mine.

thanks in advance,
Soumya |
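The caching pattern Soumya describes — a lazily built result that goes stale once populated — can be reduced to a few lines. This is a sketch that mimics the reported behaviour, not the actual StringBean source (the class name `CachedStrings` is mine):

```java
public class CachedStrings {
    private StringBuilder buffer = new StringBuilder();
    private String strings;   // lazily built cache, cleared only by reset()

    public void append(String text) {
        buffer.append(text);
    }

    // Mirrors the reported pattern: once the cache is non-null, later visits
    // keep growing the buffer, but this keeps returning the stale cached
    // value until reset() is called.
    public String getStrings() {
        if (strings == null) {
            strings = buffer.toString();
        }
        return strings;
    }

    public void reset() {
        buffer = new StringBuilder();
        strings = null;
    }
}
```

This is why reusing one StringBean across several visits requires resetting it (or reading getStrings() before the next visit): the second visit's text never reaches the cached string.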
From: Ian M. <ian...@gm...> - 2007-01-16 13:32:50
|
To make font tags hold children:
- make your own font tag (copying the existing one is easiest) but make it extend CompositeTag instead of just Tag.
- unregister the old font tag from the PrototypicalNodeFactory, then register your new one with it.

This isn't done by default because HTML documents can be horribly malformed, especially where tags like FONT are concerned!

Ian

On 1/16/07, Kalyan K Kumar <kal...@gm...> wrote:
> I'm trying to implement these two xpath methods in htmlparser.
>
> public String getElementXPath(Node element);
> public Node getElementByXpath(String xpath);
>
> xpath will look something like
> html/body/div/div[2]/div2/ul/li/a[3]
>
> The problem I'm facing is, some tags like font, center are not treated as
> parents, which is clearly mentioned in this thread
> http://sourceforge.net/mailarchive/message.php?msg_id=15467503
>
> is there any way to include every tag?
> Is some implementation of xpath already existing?
>
> Thanks
> Kalyan |
From: Martin S. <mst...@gm...> - 2007-01-16 11:34:15
|
Hello,

I'm using HTMLParser for extracting text from a HTML page in order to index it using a full text search engine.
During the testing phase, I discovered that some web pages are not parsed correctly by HTMLParser. One of these webpages is for example http://www.microsoft.com.
I think the problem is that according to the HTTP headers, the encoding is in UTF-8, but in HTML META tags this is changed to UTF-16. This can be handled by catching the EncodingChangeException, but this doesn't prevent the textual content of the site interpreted incorrectly.

A concrete example to see the problem:

        StringBean sb = new StringBean();
        sb.setURL("http://www.microsoft.com");
        System.out.println (sb.getStrings());

The output of the above code snippet is:
䑏䍔奐䔠桴浬⁐啂䱉䌠∭⼯圳䌯⽄呄⁈呍䰠㐮〠呲慮獩瑩浮 ....

Not really what I was expecting.

Am I missing something, or is this a bug in the HTMLParser?

Martin |
From: Kalyan K K. <kal...@gm...> - 2007-01-16 08:26:18
|
I'm trying to implement these two xpath methods in htmlparser.

public String getElementXPath(Node element);
public Node getElementByXpath(String xpath);

xpath will look something like
html/body/div/div[2]/div2/ul/li/a[3]

The problem I'm facing is, some tags like font, center are not treated as parents, which is clearly mentioned in this thread http://sourceforge.net/mailarchive/message.php?msg_id=15467503

is there any way to include every tag?
Is some implementation of xpath already existing?

Thanks
Kalyan |
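The getElementXPath half of the problem is a parent walk with sibling counting, and it can be prototyped against the JDK's own org.w3c.dom tree before wiring it to HTMLParser's node model. A sketch under that assumption (the class name and the exception-wrapping `parse` helper are mine):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class XPathBuilder {

    // Walk from the element up to the root, emitting tag names with 1-based
    // positions among same-named siblings, e.g. html/body/div[2]/a[3].
    // The index suffix is only added when the element is not the first of its
    // name, matching the style of the path in the message above.
    public static String getElementXPath(Element element) {
        StringBuilder path = new StringBuilder();
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            int index = 1;
            for (Node sib = n.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
                if (sib instanceof Element && sib.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            String step = n.getNodeName() + (index > 1 ? "[" + index + "]" : "");
            path.insert(0, path.length() == 0 ? step : step + "/");
        }
        return path.toString();
    }

    // Convenience wrapper so callers need not juggle checked exceptions.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Porting this to HTMLParser means swapping getParentNode/getPreviousSibling for the library's parent and children accessors — which is exactly where non-composite tags like FONT break the walk, since they never appear as parents.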
From: Derrick O. <der...@ro...> - 2007-01-14 19:13:27
|
The message indicates that your version of Java is too old to use the jar file directly.
You can either upgrade or build the htmlparser.jar from scratch with your compiler.

----- Original Message ----
From: negar farhang <far...@ya...>
To: htm...@li...
Sent: Sunday, January 14, 2007 1:50:55 PM
Subject: [Htmlparser-user] prj

hi
I want to use your Htmlparser. I installed JDK 1.2.2 from java. I set CLASSPATH with

set CLASSPATH=[htmlp_dir]\lib\htmlparser.jar;%CLASSPATH%

and then ran javac Test.java (Test.java is your program), but there is an error and this message appears:

error: Invalid class file format in c:\htmlparser1_6\bin\htmlparser.jar(org/htmlparser/parser.class). The major.minor version '48.0' is too recent for this tool to understand.
Test.java:1: Class org.htmlparser.Parser not found in import.
import org.htmlparser.Parser;
2 errors

please help me? what can I do?
thanks. |
From: negar f. <far...@ya...> - 2007-01-14 18:51:07
|
hi
I want to use your Htmlparser. I installed JDK 1.2.2 from java. I set CLASSPATH with

set CLASSPATH=[htmlp_dir]\lib\htmlparser.jar;%CLASSPATH%

and then ran javac Test.java (Test.java is your program), but there is an error and this message appears:

error: Invalid class file format in c:\htmlparser1_6\bin\htmlparser.jar(org/htmlparser/parser.class). The major.minor version '48.0' is too recent for this tool to understand.
Test.java:1: Class org.htmlparser.Parser not found in import.
import org.htmlparser.Parser;
2 errors

please help me? what can I do?
thanks. |
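The '48.0' in that error is the class-file major version stamped into every compiled .class file: major 48 means the jar was compiled for J2SE 1.4, which a JDK 1.2 compiler cannot read — hence Derrick's advice to upgrade or rebuild. A small lookup makes the mapping concrete (the class name is mine, purely illustrative):

```java
public class ClassFileVersion {
    // Class-file major version → the earliest Java release that understands it.
    // 45 covers JDK 1.0/1.1 (45.x); from 51 onward the major simply
    // increments by one per release (51 = Java 7, 52 = Java 8, ...).
    public static String requiredJdk(int major) {
        switch (major) {
            case 45: return "1.1";
            case 46: return "1.2";
            case 47: return "1.3";
            case 48: return "1.4";
            case 49: return "5.0";
            case 50: return "6";
            default: return major > 50 ? String.valueOf(major - 44) : "unknown";
        }
    }
}
```

So requiredJdk(48) reports "1.4": any JDK from 1.4 onward can use the prebuilt htmlparser.jar directly.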
From: Ian M. <ian...@gm...> - 2007-01-13 21:55:12
|
Unfortunately, the A tag is looked for by a class called LinkTag, which we can't just change for compatibility reasons. This means that we need to come up with a name for a tag that handles LINK elements that won't get confused with this tag. Suggestions are welcome :)

Should you wish to parse this tag, you can always create such a tag yourself by extending CompositeTag (find another such tag to copy) and register it with the PrototypicalNodeFactory that you use.

Ian

On 1/13/07, li shuguang <lis...@gm...> wrote:
> I don't think it can deal with that because the link tag does not appear in
> the body.
>
> On 1/13/07, Pandian Annamalai <pan...@ya...> wrote:
> >
> > could someone throw some light on how to handle the <link> tag
> >
> > I want to extract the href value from the following html,
> >
> > <link rel="stylesheet" href="dummy11.css" type="text/css"/>
> >
> > I tried to use the LinkTag class,
> >
> > NodeList linkTags = nodeList.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class), true);
> > NodeIterator nI = linkTags.elements();
> >
> > but the filter recognizes only <a href="...."> tags...
> >
> > <link href=".." > are ignored. Any help here? Appreciate your response.
> >
> > Regards,
> > Pands |
From: li s. <lis...@gm...> - 2007-01-13 20:26:12
|
I don't think it can deal with that because the link tag does not appear in the body.

On 1/13/07, Pandian Annamalai <pan...@ya...> wrote:
>
> could someone throw some light on how to handle the <link> tag
>
> I want to extract the href value from the following html,
>
> <link rel="stylesheet" href="dummy11.css" type="text/css"/>
>
> I tried to use the LinkTag class,
>
> NodeList linkTags = nodeList.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class), true);
> NodeIterator nI = linkTags.elements();
>
> but the filter recognizes only <a href="...."> tags...
>
> <link href=".." > are ignored. Any help here? Appreciate your response.
>
> Regards,
> Pands |
From: Pandian A. <pan...@ya...> - 2007-01-13 10:36:26
|
could someone throw some light on how to handle the <link> tag

I want to extract the href value from the following html,

<link rel="stylesheet" href="dummy11.css" type="text/css"/>

I tried to use the LinkTag class,

NodeList linkTags = nodeList.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class), true);
NodeIterator nI = linkTags.elements();

but the filter recognizes only <a href="...."> tags...

<link href=".." > are ignored. Any help here? Appreciate your response.

Regards,
Pands |
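Until a dedicated LINK tag class exists (per Ian's reply elsewhere in this thread), the href values can be pulled with a plain regex — a sketch outside HTMLParser entirely (class name mine; the pattern assumes quoted href values, which covers the snippet above but not every malformed page):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkHrefs {
    // Matches <link ... href="..."> (or single quotes), case-insensitively.
    private static final Pattern LINK_HREF = Pattern.compile(
            "<link\\b[^>]*\\bhref=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Pull the href value out of every <link ...> element in the markup.
    public static List<String> extract(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK_HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }
}
```

The \b after "link" keeps the pattern from also matching <linkTag> or similar element names that merely start with "link".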
From: <oys...@me...> - 2007-01-11 17:20:54
|
Martin Sturm wrote:
> Hi,
>
> It is indeed strange and I think it should be filed as a bug if you
> are sure it isn't a problem with your code.

The problem (for me) is that when I issue sb.setURL(url), the method itself catches the ParserException and appends the exception message as part of the string I receive in the end. I don't get to handle the real exception myself... I think. If I use the Parser class and parser.setURL(url), I can handle the exception as I want. So that problem is now solved!

But I also need the parser to figure out the encoding of the page by itself, because the encoding sometimes varies. So this parser might not be what I'm looking for. Btw, it seems to stick to ISO-8859-1 no matter what encoding I set with parser.setEncoding("UTF-8");. (I have not tested this very much).

> What version of the HTMLParser are you using? (I forgot to ask that
> question earlier).

I'm using HTMLParser Version 1.6 (Release Build Jun 10, 2006).

-Øystein |
From: Martin S. <mst...@gm...> - 2007-01-11 14:26:55
|
Hi,

It is indeed strange, and I think it should be filed as a bug if you are sure it isn't a problem with your code. When I was reading the FAQ of the project this morning, my eye caught the following question: http://htmlparser.sourceforge.net/faq.html#quiet
It seems a bit related to your problem.

What version of the HTMLParser are you using? (I forgot to ask that question earlier).

-- Martin

2007/1/11, Øystein Lervik Larsen <oys...@me...>:
>
> Hi,
>
> > The only thing I can think of is that your logger is printing the
> > exception to standard out or something (you are passing the exception to
> > the error method). Does the problem occur when you comment out the
> > logger.error ("Could not parse url", e); ?
>
> The exception is written to the console even though I'm not using any
> try/catch.
>
> StringBean sb;
> sb = new StringBean ();
> sb.setLinks(false);
> sb.setURL(url);
> content = sb.getStrings();
> return content;
>
> Weird... thanks for your time though!
>
> -Øystein |
From: Joel <jo...@ha...> - 2007-01-11 14:23:26
|
I just didn't realize the need for the end tag. That works.

Thanks,
Joel

Martin Sturm wrote:
> Hi,
>
> I did a quick test, and I think you forgot to add the end tag to the
> toCreate object. You should add:
>
> toCreate.setEndTag(new Span());
>
> before your final println statement.
>
> The following code snippet works for me:
>
> TagNode toCreate = new Span();
> toCreate.setAttribute("key", "Test", '"');
> NodeList nl = new NodeList();
> nl.add(new TextNode("Test2"));
> toCreate.setChildren(nl);
> toCreate.setEndTag(new Span());
> System.out.println(toCreate.toHtml());
>
> Result: <SPAN key="Test">Test2</SPAN>
>
> Hope this will help you.
>
> -- Martin
>
> 2007/1/10, Joel <jo...@ha...>:
>
> I want to wrap a text string with a span tag. I've tried the following, but
> I'm running into a problem: the tag's children aren't being displayed.
>
> //New <span key="x">some text here</span>
> TagNode toCreate = new Span();
> toCreate.setAttribute("key", getKey(str), '"');
> NodeList nl = new NodeList();
> nl.add(new TextNode(str));
> toCreate.setChildren(nl);
> System.out.println(toCreate.toHtml());
>
> This ends up showing <span key="x"> without the text node and end tag;
> what am I doing wrong?
>
> Joel |
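When the goal is only to emit the wrapped markup (rather than keep a mutable tag tree), plain string building avoids the end-tag pitfall entirely. A minimal sketch, not the HTMLParser API (class name mine; only the attribute value is escaped):

```java
public class SpanBuilder {
    // Build <span key="...">text</span>; the attribute value gets minimal
    // escaping so quotes and ampersands cannot break the markup.
    public static String wrap(String key, String text) {
        String escaped = key.replace("&", "&amp;").replace("\"", "&quot;");
        return "<span key=\"" + escaped + "\">" + text + "</span>";
    }
}
```

Building via TagNode/setEndTag is still the right choice when the node must participate in further parsing or visiting; the string route is just for one-shot output.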
From: <oys...@me...> - 2007-01-11 14:03:53
|
Hi, > The only thing I can think of is that your logger is printing the > exception to standard out or something (you are passing the exception to > the error method). Does the problem occur when you comment out the > logger.error ("Could not parse url", e); ? The exception is written to the console even though I'm not using any try/catch. StringBean sb; sb = new StringBean (); sb.setLinks(false); sb.setURL(url); content = sb.getStrings(); return content; Weird... thanks for your time though! -Øystein |