Re: [Htmlparser-user] Cookie handling
From: Gavin G. <ga...@br...> - 2007-01-30 13:49:39
Wow, good point. The URL seems to load fine here in my browser, but on further
inspection:

sokar:~/junk% curl http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT\?CRETRY\=1\&SRETRY\=0
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www3.interscience.wiley.com/cookie_setting_error.html">here</a>.</p>
</body></html>

It seems to have already decided the request is a dud, which is weird.

After a bit of investigating, the full story is a bit worse than I thought and
involves multiple redirects. The first link I need comes back from this
service:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks

which just offers up a 302 or whatever and then issues a redirect. Fair enough,
so:

---
sokar:~/junk% curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10629107&retmode=ref&cmd=prlinks'
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M">here</a>.</p>
</body></html>
---

That gives 'http://dx...', which is what the parser tries next, I'd imagine. So
then:

sokar:~/junk% curl 'http://dx.doi.org/10.1002/(SICI)1097-4644(1999)75:32+<84::AID-JCB11>3.0.CO;2-M'
<HTML><HEAD><TITLE>Handle Redirect</TITLE></HEAD>
<BODY><A HREF="http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M">http://doi.wiley.com/10.1002/%28SICI%291097-4644%281999%2975%3A32%2B%3C84%3A%3AAID-JCB11%3E3.0.CO%3B2-M</A></BODY></HTML>

Looking at this URL, it seems to be the 'final one', which leads to the
(desired) destination in a browser. (Does that output even look like something
the parser would handle, though?)

What a mess :(

Gavin.

P.S.
Sorry about the horribly formatted mail due to the unsightly urls involved.

On Tue, Jan 30, 2007 at 01:22:33PM +0000, sebb wrote:
> Have you tried pasting the URL in a browser?
>
> If it works for you - it did not for me - perhaps the browser has
> secretly fed the site with a tasty cookie it kept around for just such
> a purpose ...
>
> On 30/01/07, Gavin Gilmour <ga...@br...> wrote:
> > Hi Derrick, cheers for the quick reply.
> >
> > I knew I'd regret posting the code I did because the actual code I'm using does
> > in fact have redirection processing enabled. (I had managed to mix up similar
> > code elsewhere that actually needs the redirection stuff disabled.)
> >
> > Here's a quick example of the sort of problems I'm having (including a url of a
> > page causing issues):
> >
> > ---
> > import org.htmlparser.*;
> > import org.htmlparser.http.*;
> > import org.htmlparser.util.*;
> >
> > public class TestParser {
> >
> >     public static void main(String[] argv) {
> >
> >         try {
> >             ConnectionManager manager = Parser.getConnectionManager();
> >             manager.setRedirectionProcessingEnabled(true);
> >             manager.setCookieProcessingEnabled(true);
> >             Parser parser = new Parser("http://www3.interscience.wiley.com/cgi-bin/abstract/68504762/ABSTRACT?CRETRY=1&SRETRY=0");
> >             System.out.print(((NodeList)parser.parse(null)).toHtml());
> >         } catch (ParserException pe) {
> >             pe.printStackTrace();
> >         }
> >     }
> > }
> >
> > sokar:~/junk% javac TestParser.java
> > sokar:~/junk% java TestParser | grep -i --color Cookie
> >
> > <title>Wiley InterScience :: Session Cookies</title>
> > <h2>Session Cookie Error</h2>
> > <strong>An error has occured because we were unable to send a cookie to your
> > web browser.</strong> Session cookies are commonly used to facilitate improved
> > site navigation.
> > In order to use Wiley InterScience you must have your browser set to accept
> > cookies. Once you have logged in to Wiley InterScience, our Web server uses a
> > temporary cookie to help us manage your visit.
> > This <strong>Session Cookie</strong> is deleted when you logoff Wiley
> > InterScience, or when you quit your browser. The cookie allows us to quickly
> > determine your access control rights and your personal preferences during your
> > online session. The Session Cookie is set out of necessity and not out of
> > convenience.
> > --
> >
> >
> > I'm getting the feeling that maybe the site's using some sort of strange
> > technique for restricting access, but I've similar examples of other sites
> > doing the same kind of things.
> >
> > Cheers again,
> >
> > Gavin.
> >
> > On Tue, Jan 30, 2007 at 04:36:49AM -0800, Derrick Oswald wrote:
> > > Hello,
> > >
> > > In my experience, you will need both redirection processing and cookie processing.
> > > The general scenario is that a page is returned with cookies and a redirect.
> > > If the redirect is taken, the server looks for the cookie it tried to set with the first page.
> > > You can monitor the traffic in the header by implementing ConnectionMonitor (like the Parser).
> > >
> > > Derrick
> > >
> > > ----- Original Message ----
> > > From: Gavin Gilmour <ga...@br...>
> > > To: htm...@li...
> > > Sent: Tuesday, January 30, 2007 5:59:41 AM
> > > Subject: [Htmlparser-user] Cookie handling
> > >
> > > Hi, I hope this is the correct place to ask; the mailing list is looking pretty
> > > quiet according to sourceforge's mail archives :)
> > >
> > > I've run into difficulty parsing certain sites that attempt to set cookies in
> > > an effort to read them later. The sites in question generally spit back pages
> > > saying that your browser or whatever is misconfigured and deny access. I assume
> > > this is simply because they can't read the cookies that should've been set and
> > > I wondered if there was an option to do so somewhere.
> > >
> > > I'm not trying to set *actual* cookies which most of the examples I've seen
> > > indicate how to do, but just allow pages to dump cookies and read them later.
> > > Is this possible or (possibly) unrealistic and difficult to implement?
> > >
> > > The current code I've got is:
> > >
> > > ...
> > > ConnectionManager manager = Parser.getConnectionManager();
> > > manager.setRedirectionProcessingEnabled(false);
> > > manager.setCookieProcessingEnabled(true);
> > > Parser parser = new Parser(...);
> > > ----
> > >
> > > etc.
> > >
> > >
> > > Cheers for any responses!
> > >
> > > Gavin.
> > >
> > > -------------------------------------------------------------------------
> > > Take Surveys. Earn Cash. Influence the Future of IT
> > > Join SourceForge.net's Techsay panel and you'll get the chance to share your
> > > opinions on IT & business topics through brief surveys - and earn cash
> > > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > _______________________________________________
> > > Htmlparser-user mailing list
> > > Htm...@li...
> > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
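
The scenario Derrick describes (a first response that sets a cookie and
redirects, then a second request that has to present that cookie) can be
reproduced without HTMLParser at all. What follows is only a rough JDK-only
sketch of that mechanism, not how HTMLParser's ConnectionManager is
implemented. The class name CookieRedirectDemo, the throwaway local server, the
/start and /page paths, and the session=abc123 cookie are all made up for
illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; nothing here comes from HTMLParser itself.
class CookieRedirectDemo {

    // Local stand-in for a site that sets a cookie, redirects, then checks the cookie.
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/start", exchange -> {
            exchange.getResponseHeaders().add("Set-Cookie", "session=abc123; Path=/");
            exchange.getResponseHeaders().add("Location", "/page");
            exchange.sendResponseHeaders(302, -1); // -1 means no response body
            exchange.close();
        });
        server.createContext("/page", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            boolean ok = cookie != null && cookie.contains("session=abc123");
            byte[] body = (ok ? "welcome" : "cookie error").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Follows redirects by hand; every hop goes through the shared default cookie store.
    static String fetch(String start) throws IOException {
        URI uri = URI.create(start);
        for (int hop = 0; hop < 5; hop++) {
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (status / 100 == 3 && location != null) {
                uri = uri.resolve(location); // handles relative Location values
                conn.disconnect();
                continue;
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
        throw new IOException("too many redirects");
    }

    // Installs an in-memory cookie store, then requests /start on the local server.
    static String runDemo() throws IOException {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        HttpServer server = startServer();
        try {
            return fetch("http://127.0.0.1:" + server.getAddress().getPort() + "/start");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(runDemo());
    }
}
```

If the CookieHandler.setDefault(...) line is removed, the second request should
arrive without a Cookie header and the local server answers with its "cookie
error" page instead, which is roughly the behaviour the Wiley site shows in the
output quoted above.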