Thread: Re: [Htmlparser-user] Cookie handling
From: Derrick O. <der...@ro...> - 2007-01-30 12:36:59
Hello,

In my experience, you will need both redirection processing and cookie processing.
The general scenario is that a page is returned with cookies and a redirect.
If the redirect is taken, the server looks for the cookie it tried to set with the first page.
You can monitor the traffic in the headers by implementing ConnectionMonitor (like the Parser does).

Derrick

----- Original Message ----
From: Gavin Gilmour <gavin@brokentrain.net>
To: htm...@li...
Sent: Tuesday, January 30, 2007 5:59:41 AM
Subject: [Htmlparser-user] Cookie handling

Hi, I hope this is the correct place to ask; the mailing list is looking pretty
quiet according to SourceForge's mail archives :)

I've run into difficulty parsing certain sites that attempt to set cookies in
an effort to read them later. The sites in question generally spit back pages
saying that your browser or whatever is misconfigured and deny access. I assume
this is simply because they can't read the cookies that should've been set, and
I wondered if there was an option to allow that somewhere.

I'm not trying to set *actual* cookies, which is what most of the examples I've
seen show how to do, but just to allow pages to set cookies and read them later.
Is this possible or (possibly) unrealistic and difficult to implement?

The current code I've got is:

...
ConnectionManager manager = Parser.getConnectionManager();
manager.setRedirectionProcessingEnabled(false);
manager.setCookieProcessingEnabled(true);
Parser parser = new Parser(...);
----

etc.


Cheers for any responses!

Gavin.

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Htmlparser-user mailing list
Htmlpa...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
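The scenario Derrick describes (a response that sets a cookie and redirects, followed by a request on which the server expects that cookie back) comes down to turning Set-Cookie response values into a Cookie request header. A minimal JDK-only sketch of that step follows. It is not HTMLParser's own implementation; the class name CookieCarrier and the cookie names are made up for illustration, and attributes such as Path, Domain and Expires are deliberately ignored.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: remembers cookies from one response and replays them on the next request. */
public class CookieCarrier {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    /** Record the name=value part of each Set-Cookie header value, ignoring attributes such as Path. */
    public void store(List<String> setCookieValues) {
        for (String value : setCookieValues) {
            String pair = value.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
    }

    /** Build the value of the Cookie request header for the follow-up request. */
    public String header() {
        StringJoiner joined = new StringJoiner("; ");
        cookies.forEach((name, value) -> joined.add(name + "=" + value));
        return joined.toString();
    }

    public static void main(String[] args) {
        CookieCarrier carrier = new CookieCarrier();
        // Example Set-Cookie values as a server might send them with a 302 response (invented).
        carrier.store(Arrays.asList("JSESSIONID=abc123; Path=/", "test=1; HttpOnly"));
        System.out.println("Cookie: " + carrier.header()); // prints "Cookie: JSESSIONID=abc123; test=1"
    }
}
```

If the request to the redirect target goes out without that Cookie header, a server that checks for its own cookie will typically answer with an error page like the one described in this thread.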
From: Derrick O. <der...@ro...> - 2007-02-01 02:00:37
That should be handled OK.
But it doesn't look like what you want from that URL.

----- Original Message ----
From: Gavin Gilmour <gavin@brokentrain.net>
To: htmlparser user list <htm...@li...urceforge.net>
Sent: Tuesday, January 30, 2007 8:49:25 AM
Subject: Re: [Htmlparser-user] Cookie handling

Wow, good point. The URL seems to load fine here in my browser, but on further
inspection:

sokar:~/junk% curl http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT\?CRETRY\=1\&SRETRY\=0
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a
href="http://www3.interscience.wiley.com/cookie_setting_error.html">here</a>.</p>
</body></html>

Seems to have already decided it's a dud, weird.

After a bit of investigating, the full story is a bit worse than I thought and
involves multiple redirects. The first link I need comes back from this service:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks
- which just offers up a 302 or whatever and then issues a redirect.

Fair enough, so:

---
sokar:~/junk% curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks'
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a
href="http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M">here</a>.</p>
</body></html>
---

This gives 'http://dx...', which is what the parser is trying next, I'd imagine. So then:

sokar:~/junk% curl 'http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M'
<HTML><HEAD><TITLE>Handle Redirect</TITLE></HEAD>
<BODY><A
HREF="http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M">http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M</A></BODY></HTML>

Looking at this URL, it seems to be the 'final one' which leads to the
(desired) destination in a browser. (Does that output even look like something
the parser would handle though?)

What a mess :(

Gavin.

P.S. Sorry about the horribly formatted mail due to the unsightly URLs
involved.
From: Gavin G. <ga...@br...> - 2007-02-01 10:49:44
Ah, I see what you mean. Looking at the last URL, it seems even more redirects
are taking place before it arrives at the final URL - I've counted around 7 so far.

Regardless, I'm still at a loss as to why the redirects aren't being processed
and the cookie stuff handled in the end.

Gavin.
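With a chain of several redirects, it can help to trace each hop and see where the cookie check fails. The sketch below is a simulation in plain Java, not HTMLParser code: RedirectTrace, its Response class, the fake server and the paths /start, /check and /cookie_error are all hypothetical stand-ins. It follows 3xx responses up to a hop limit, sends back any cookie it has received, and records every URL visited.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class RedirectTrace {

    /** Simplified response: status code, Location target (or null), one name=value cookie (or null). */
    public static final class Response {
        final int status;
        final String location;
        final String setCookie;

        public Response(int status, String location, String setCookie) {
            this.status = status;
            this.location = location;
            this.setCookie = setCookie;
        }
    }

    /**
     * Follows 3xx responses starting at start and returns every URL visited.
     * fetch receives (url, cookieHeader); at most maxHops redirects are followed.
     */
    public static List<String> trace(String start,
                                     BiFunction<String, String, Response> fetch,
                                     int maxHops) {
        List<String> visited = new ArrayList<>();
        Map<String, String> cookies = new LinkedHashMap<>();
        String url = start;
        for (int hop = 0; hop <= maxHops; hop++) {
            visited.add(url);
            Response response = fetch.apply(url, cookieHeader(cookies));
            if (response.setCookie != null) {
                int eq = response.setCookie.indexOf('=');
                if (eq > 0) {
                    cookies.put(response.setCookie.substring(0, eq),
                                response.setCookie.substring(eq + 1));
                }
            }
            boolean redirect = response.status >= 300 && response.status < 400
                    && response.location != null;
            if (!redirect) {
                return visited;
            }
            url = response.location;
        }
        throw new IllegalStateException("Gave up after " + maxHops + " redirects");
    }

    /** Joins the stored cookies into a Cookie header value; empty string when there are none. */
    private static String cookieHeader(Map<String, String> cookies) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            pairs.add(e.getKey() + "=" + e.getValue());
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Fake server: sets a cookie while redirecting, then checks for it on the next hop.
        BiFunction<String, String, Response> fakeServer = (url, cookie) -> {
            switch (url) {
                case "/start":
                    return new Response(302, "/check", "session=1");
                case "/check":
                    return cookie.contains("session=1")
                            ? new Response(200, null, null)
                            : new Response(302, "/cookie_error", null);
                default:
                    return new Response(200, null, null);
            }
        };
        System.out.println(trace("/start", fakeServer, 10));
        // A client that never sends cookies back lands on the error page instead.
        System.out.println(trace("/start", (url, cookie) -> fakeServer.apply(url, ""), 10));
    }
}
```

Run as written, the first trace prints [/start, /check] and the second prints [/start, /check, /cookie_error], which mirrors the redirect to cookie_setting_error.html seen with curl. Against a real site the fetch function would wrap an HTTP client with automatic redirect following turned off; that part is left out here.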
From: Gavin G. <ga...@br...> - 2007-01-30 13:16:38
Hi Derrick, cheers for the quick reply.

I knew I'd regret posting the code I did, because the actual code I'm using does
in fact have redirection processing enabled. (I had managed to mix up similar
code elsewhere that actually needs the redirection stuff disabled.)

Here's a quick example of the sort of problems I'm having (including a URL of a
page causing issues):

---
import org.htmlparser.*;
import org.htmlparser.http.*;
import org.htmlparser.util.*;

public class TestParser {

    public static void main(String[] argv) {

        try {
            // Enable both redirect following and cookie handling on the shared manager.
            ConnectionManager manager = Parser.getConnectionManager();
            manager.setRedirectionProcessingEnabled(true);
            manager.setCookieProcessingEnabled(true);
            Parser parser = new Parser("http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT?CRETRY=1&SRETRY=0");
            // Parse with no filter and print the whole page as HTML.
            System.out.print(((NodeList)parser.parse(null)).toHtml());
        } catch (ParserException pe) {
            pe.printStackTrace();
        }
    }
}

sokar:~/junk% javac TestParser.java
sokar:~/junk% java TestParser | grep -i --color Cookie

<title>Wiley InterScience :: Session Cookies</title>
<h2>Session Cookie Error</h2>
<strong>An error has occured because we were unable to send a cookie to your
web browser.</strong> Session cookies are commonly used to facilitate improved
site navigation.
In order to use Wiley InterScience you must have your browser set to accept
cookies. Once you have logged in to Wiley InterScience, our Web server uses a
temporary cookie to help us manage your visit.
This <strong>Session Cookie</strong> is deleted when you logoff Wiley
InterScience, or when you quit your browser. The cookie allows us to quickly
determine your access control rights and your personal preferences during your
online session. The Session Cookie is set out of necessity and not out of
convenience.
--


I'm getting the feeling that maybe the site's using some sort of strange
technique for restricting access, but I've got similar examples of other sites
doing the same kind of thing.

Cheers again,

Gavin.
From: Gavin G. <ga...@br...> - 2007-01-30 13:22:48
Ach, I thought I was on to something there when I saw the method:

'static void setConnectionManager(ConnectionManager manager)'

sitting in the API over at
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#main(java.lang.String[])

However, adding Parser.setConnectionManager(manager) to the code from my
previous message hasn't seemed to help :/

Gavin.
From: sebb <se...@gm...> - 2007-01-30 13:22:38
Have you tried pasting the URL in a browser?

If it works for you - it did not for me - perhaps the browser has
secretly fed the site with a tasty cookie it kept around for just such
a purpose ...
From: Gavin G. <ga...@br...> - 2007-01-30 13:49:39
Wow, good point. The URL seems to load fine here in my browser, but on further
inspection:

sokar:~/junk% curl http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT\?CRETRY\=1\&SRETRY\=0
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a
href="http://www3.interscience.wiley.com/cookie_setting_error.html">here</a>.</p>
</body></html>

Seems to have already decided it's a dud, weird.

After a bit of investigating, the full story is a bit worse than I thought and
involves multiple redirects. The first link I need comes back from this service:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks
- which just offers up a 302 or whatever and then issues a redirect.

Fair enough, so:

---
sokar:~/junk% curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks'
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a
href="http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M">here</a>.</p>
</body></html>
---

This gives 'http://dx...', which is what the parser is trying next, I'd imagine. So then:

sokar:~/junk% curl 'http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M'
<HTML><HEAD><TITLE>Handle Redirect</TITLE></HEAD>
<BODY><A
HREF="http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M">http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M</A></BODY></HTML>

Looking at this URL, it seems to be the 'final one' which leads to the
(desired) destination in a browser. (Does that output even look like something
the parser would handle though?)

What a mess :(

Gavin.

P.S. Sorry about the horribly formatted mail due to the unsightly URLs
involved.
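The percent-encoded doi.wiley.com link in the message above looks different from the dx.doi.org one, but it appears to carry the same DOI. A quick way to check is to decode it with the JDK's URLDecoder, as in this small sketch (the class name DoiDecode is made up for illustration). Note that URLDecoder implements form decoding and would turn a literal '+' into a space; that does not matter here because the plus sign arrives encoded as %2B.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DoiDecode {

    /** Decodes %XX escapes using UTF-8 (form-decoding rules, so a bare '+' would become a space). */
    public static String decode(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Path portion of the doi.wiley.com link quoted in the thread.
        String encoded = "10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M";
        System.out.println(decode(encoded));
        // prints 10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M
    }
}
```

The decoded value matches the DOI in the dx.doi.org URL, so that hop changes the host and the encoding rather than the document being requested.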