From: Pranay P. <ssp...@ya...> - 2012-08-01 14:55:39
I am revisiting this question after a while. How is wayback expected to display pages that are password protected? As Noah pointed out as well, is it the case that wayback replays only the 401 (the very first fetch attempted) and doesn't consider the second fetch, which is 200 OK? I have crawled a couple of password-protected and HTML-login-form-based sites; however, the wayback display seems to pick up the 401 responses only.

Thanks,
Pranay

----- Forwarded Message -----
From: Pranay Pandey <ssp...@ya...>
To: "arc...@ya..." <arc...@ya...>
Sent: Wednesday, April 25, 2012 10:04 AM
Subject: Re: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.

That makes sense. I do see the timestamp for the 200 being a little ahead of the 401. I also suspect that wayback could be the culprit, handling 401s improperly. I will inquire about it on the access listserv.

Thanks Noah!
Pranay.

________________________________
From: Noah Levitt <nl...@ar...>
To: arc...@ya...
Cc: Pranay Pandey <ssp...@ya...>
Sent: Tuesday, April 24, 2012 9:33 PM
Subject: Re: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.

Hello Pranay,

The way it works, heritrix requests the url, gets the 401, then requeues the url, noting that next time it should try with the credential. Next time around the url is crawled again with auth.

The first time through, on the 401, almost all the normal processing completes, short of logging to crawl.log. Full warc records are written. Normally this includes a request and a metadata record. So that's the only thing that's puzzling about your scenario. There should also be 2 request records and 2 metadata records, and the timestamps on the 401 and 200 should be different.

As far as wayback goes, it's quite possible that it doesn't handle 401s well. (Maybe the ideal behavior would be to replicate what happens on the live web, that is, replay the 401, look for basic auth matching the request record of the 200 and only serve the content in that case.)
Noah

On 03/12/2012 07:58 AM, Pranay Pandey wrote:
> Group,
>
> I am using 3.1.0 and I find it very strange to see just one response code (200) in crawl.log but two response codes (200 and 401), in the form of two response records, in the WARC that was written.
> I am trying to crawl a password-protected site with the config settings I had outlined in my older email below.
>
> My question:
> Are we ever supposed to get two "response" records in the WARC for the same object, fetched at the same time?
> While the first response gives 401, the second one has 200 OK.
>
> And while I see 200 OK written in both the WARC and crawl.log, wayback displays the authentication failed page (401) for all the protected pages.
>
> These are the two responses I see in the WARC, in the order they were written.
>
> WARC/1.0
> WARC-Type: response
> WARC-Target-URI: http://www.xyz.com/index.html/
> WARC-Date: 2012-03-09T19:27:57Z
> WARC-Payload-Digest: sha1:XNMHLNSS4YF24HWXC52LDEETXXVHV47X
> WARC-IP-Address: 207.45.182.58
> WARC-Record-ID: <urn:uuid:513f8d8f-edcc-42f8-ac07-136d2dcfe4a4>
> Content-Type: application/http; msgtype=response
> Content-Length: 3403
>
> HTTP/1.1 401 Authorization Required
> Date: Fri, 09 Mar 2012 19:27:56 GMT
> Server: Apache
> WWW-Authenticate: Basic realm="Netpreserve"
> Accept-Ranges: bytes
> Connection: close
> Content-Type: text/html
>
> WARC/1.0
> WARC-Type: response
> WARC-Target-URI: http://www.xyz.com/index.html/
> WARC-Date: 2012-03-09T19:27:57Z
> WARC-Payload-Digest: sha1:FGAN7GIN5JNYCHGL6BCB3TDPKDCFUJER
> WARC-IP-Address: 207.45.182.58
> WARC-Record-ID: <urn:uuid:f7f4761d-5c3b-4460-8ace-fed4e3477cd3>
> Content-Type: application/http; msgtype=response
> Content-Length: 6052
>
> HTTP/1.1 200 OK
> Date: Fri, 09 Mar 2012 19:27:56 GMT
> Server: Apache
> X-Powered-By: PHP/5.2.17
> Expires: Thu, 19 Nov 1981 08:52:00 GMT
> Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
> Pragma: no-cache
> Set-Cookie: PHPSESSID=7289c28a8ac5ba1b87aead5f94bc5473; path=/
> Content-Length: 5687
> Connection: close
> Content-Type: text/html
>
> ________________________________
> From: Patrick <la...@ya...>
> To: arc...@ya...
> Sent: Monday, February 27, 2012 11:04 AM
> Subject: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.
>
> I have got the same problem. Have you worked out the solution?
>
> --- In arc...@ya..., Pranay Pandey <sspranay@...> wrote:
>>
>> OK, I took a shot at configuring the beans, but it doesn't seem to be helping.
>> The job gets built and runs but doesn't pass the authentication stage.
>> Any suggestion(s)?
>>
>> This is what I tried:
>> ------------
>> <bean id="HttpAuthenticationCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
>>     <property name="domain" value="www.mysite.com"/>
>>     <property name="login" value="xxxxx"/>
>>     <property name="password" value="xxxxxx"/>
>> </bean>
>>
>> <bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
>>     <property name="credentials">
>>         <map>
>>             <entry key="credentials" value-ref="HttpAuthenticationCredential" />
>>         </map>
>>     </property>
>> </bean>
>> --------------
>> And inside fetchHttp, I add the property "credentialStore" as below.
>>
>> <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
>>     <property name="credentialStore">
>>         <ref bean="credentialStore"/>
>>     </property>
>>
>> Thanks,
>> Pranay
>>
>> --- On Thu, 6/16/11, Pranay Pandey <sspranay@...> wrote:
>>
>> From: Pranay Pandey <sspranay@...>
>> Subject: [archive-crawler] H3.0 config settings to crawl password protected pages.
>> To: "arc...@ya..." <arc...@ya...>
>> Date: Thursday, June 16, 2011, 11:16 AM
>>
>> Hello,
>>
>> I have been looking around for the bean configuration settings needed to crawl password-protected sites/pages.
>> Does anyone have it handy for H-3.0?
>>
>> Thanks!
>> Pranay
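
To make the 401-versus-200 question in this thread concrete, here is a minimal Python sketch of one possible selection policy: when the same URI has both a 401 and a later non-401 capture, pick the non-401 one for display. This is only an illustration under stated assumptions. It is not how wayback builds or queries its index, and it is simpler than the behavior Noah suggests (replaying the 401 and serving the 200 only when matching basic auth is supplied). The function names `parse_captures` and `pick_for_replay` are hypothetical, and the sample text is an abbreviated copy of the two response records quoted above.

```python
from collections import defaultdict

# Abbreviated copies of the two response records quoted in the thread.
SAMPLE = """\
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.xyz.com/index.html/
WARC-Date: 2012-03-09T19:27:57Z
WARC-Record-ID: <urn:uuid:513f8d8f-edcc-42f8-ac07-136d2dcfe4a4>

HTTP/1.1 401 Authorization Required
WWW-Authenticate: Basic realm="Netpreserve"

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.xyz.com/index.html/
WARC-Date: 2012-03-09T19:27:57Z
WARC-Record-ID: <urn:uuid:f7f4761d-5c3b-4460-8ace-fed4e3477cd3>

HTTP/1.1 200 OK
Content-Type: text/html
"""


def parse_captures(text):
    """Hypothetical helper: collect uri, date and HTTP status per response record.

    Works on header text like the excerpt above; it is not a full WARC parser.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line == "WARC/1.0":
            records.append({})  # start of a new record
        elif not records:
            continue
        elif line.startswith("WARC-Type:"):
            records[-1]["type"] = line.split(":", 1)[1].strip()
        elif line.startswith("WARC-Target-URI:"):
            records[-1]["uri"] = line.split(":", 1)[1].strip()
        elif line.startswith("WARC-Date:"):
            records[-1]["date"] = line.split(":", 1)[1].strip()
        elif line.startswith("HTTP/") and "status" not in records[-1]:
            # First HTTP status line inside the record payload.
            records[-1]["status"] = int(line.split()[1])
    return [
        r for r in records
        if r.get("type") == "response" and "uri" in r and "status" in r
    ]


def pick_for_replay(captures):
    """Hypothetical policy: per URI, prefer a non-401 capture, then the latest date."""
    by_uri = defaultdict(list)
    for capture in captures:
        by_uri[capture["uri"]].append(capture)
    return {
        uri: max(group, key=lambda c: (c["status"] != 401, c.get("date", "")))
        for uri, group in by_uri.items()
    }


if __name__ == "__main__":
    for uri, capture in pick_for_replay(parse_captures(SAMPLE)).items():
        print(uri, capture["status"], capture["date"])
```

Because both records in the excerpt carry the same WARC-Date, a policy based on timestamps alone cannot tell them apart; the sketch therefore ranks on status first and uses the date only as a tie-breaker.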