Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 Logging in (HTTP POST, Basic Auth, etc.) - ID: 914301
Last Update: Comment added ( karl-ia )

There should be a way to preload Heritrix w/
credentials -- logins/passwords -- and have it volunteer
these credentials at the appropriate juncture so it can
get at content that resides behind authentication
barriers (e.g. login pages, basic auth, etc.).


Submitted: Nobody/Anonymous ( nobody ) - 2004-03-11 17:48
Priority: 7
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: Network/Protocols
Group: None
Visibility: Public


Comments ( 9 )

Date: 2007-03-14 01:27
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-742 -- please add further
comments at that location.


Date: 2004-05-05 19:27
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Closing as completed.


Date: 2004-04-28 01:50
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Did testing against SourceForge and my.yahoo.com. Found a
problem whereby http and https get the same CrawlServer
instance. Added doc. to manual. Added selftests that test
basic, digest, GET and POST.

Closing. Implemented.
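
The http/https problem mentioned above can be illustrated with a small sketch. It is not Heritrix's code, and the comment does not say how the problem was resolved; `ServerKeySketch` and `keyFor` are hypothetical names. The sketch only assumes that per-server state should be looked up by a key that includes the scheme and port, so that http and https on one host do not collapse into a single entry.

```java
import java.net.URI;

/** Hypothetical sketch; not Heritrix's actual CrawlServer lookup. */
class ServerKeySketch {
    /** Builds a lookup key of the form scheme://host:port for an absolute URI. */
    static String keyFor(URI uri) {
        String scheme = uri.getScheme().toLowerCase();
        int port = uri.getPort();
        if (port == -1) {
            // No explicit port in the URI: fall back to the scheme default.
            port = scheme.equals("https") ? 443 : 80;
        }
        return scheme + "://" + uri.getHost().toLowerCase() + ":" + port;
    }
}
```

With such a key, http://my.yahoo.com/ and https://my.yahoo.com/ map to different entries, so credentials recorded for one are not offered to the other.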


Date: 2004-04-19 23:18
Sender: nobody

Logged In: NO

Finished adding BASIC/DIGEST AUTH support. Added
documentation to user manual. Here are things I've learned.

+ Have to manage the adding of authentication headers myself
rather than let httpclient do it, for various reasons:
  - Can add credentials to an HttpState but there is no means
    of subsequently removing them if, for instance, they fail.
  - It's in the nature of HttpClient to preemptively offer
    credentials, whereas we want to mark our arc files w/ the
    fact that this page got 401s.
+ If more than one auth -- say a BASIC and a DIGEST on the
one server -- httpclient overwrites any header already present
in the request rather than adding a new one or compounding the
two. Will wait on a request by a user before spending time
trying to make more than one RFC2617 scheme work at a time
going up against the one CrawlServer.
+ Apache writes its DIGEST WWW-Authenticate header
distinguishing the pieces of the header using commas, whereas
Jetty writes the header using spaces. HttpClient likes the
way Apache writes the header and burps on the way Jetty
writes the header. Jetty seems to adhere to the spec
according to my reading of "3.2.1 The WWW-Authenticate
Response Header" in rfc2617. (Digest is probably rare anyways
going by the caveat on this page:
http://httpd.apache.org/docs/howto/auth.html.)
+ The proposal written up for this feature,
http://crawler.archive.org/articles/auth_proposal.html,
talks of supporting preemptive offering of credentials. This
is not implemented and won't be unless explicitly asked for
(would mean adding a precondition).
+ Of note, on successful RFC2617 authentication, the
credentials that succeeded are added to the CrawlServer and
volunteered for all subsequent requests going against this
server. If the CrawlServer is serialized, the credentials
are NOT serialized. This is probably ok. Means that when
the CrawlServer is revivified, we have to go through the
login again.
+ Would be good if we could do a site first w/o credentials
and then the same site with credentials so we could collect
both states.
+ Set org.archive.crawler.fetcher.FetchHTTP.level = FINE to get
authentication logging.
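
The non-preemptive behaviour described in the notes above might be sketched roughly as follows. This is an illustration only, not the FetchHTTP or CrawlServer code: `NonPreemptiveAuthSketch` and its methods are hypothetical names, the server key format is an assumption, and `java.util.Base64` (Java 8+) stands in for whatever encoder the crawler used. The idea is that a first request goes out with no Authorization header, so the 401 gets recorded, and only a header that has already succeeded is volunteered on later requests to the same server.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not Heritrix's actual FetchHTTP or CrawlServer code. */
class NonPreemptiveAuthSketch {
    // Authorization header values that have already succeeded, keyed by server.
    private final Map<String, String> provenByServer = new HashMap<>();

    /** RFC 2617 Basic: "Basic " followed by base64 of user ":" password. */
    static String basicHeader(String user, String password) {
        byte[] raw = (user + ":" + password).getBytes(StandardCharsets.ISO_8859_1);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    /** Header to send up front: only one that already worked here, else null. */
    String headerForRequest(String server) {
        return provenByServer.get(server);
    }

    /** Records the status of a request that carried the given header. */
    void recordResult(String server, String header, int status) {
        if (status == 401) {
            // Credentials failed: stop volunteering them (the removal step
            // the notes say HttpState does not offer).
            provenByServer.remove(server);
        } else if (status >= 200 && status < 400) {
            provenByServer.put(server, header);
        }
    }
}
```

Because the map lives only in memory, a freshly created instance starts empty and has to go through the 401 round trip again, which mirrors the note about credentials not being serialized.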


Moving now to work on HTML Form logins.
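
The comma-versus-space difference in DIGEST WWW-Authenticate headers noted in the list above could be absorbed by a tolerant parser along these lines. This is a sketch under assumptions, not what HttpClient or Heritrix does: `DigestChallengeSketch` and `parse` are hypothetical names, and backslash-escaped quotes inside quoted values are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a separator-tolerant Digest challenge parser. */
class DigestChallengeSketch {
    // name=value, where value is either a quoted string or a bare token.
    // Whatever sits between pairs (commas, spaces, or both) is skipped.
    private static final Pattern PARAM =
        Pattern.compile("([A-Za-z0-9_-]+)\\s*=\\s*(?:\"([^\"]*)\"|([^\\s,]+))");

    /** Parses the value of a WWW-Authenticate header carrying a Digest challenge. */
    static Map<String, String> parse(String headerValue) {
        String v = headerValue.trim();
        if (!v.regionMatches(true, 0, "Digest", 0, 6)) {
            throw new IllegalArgumentException("not a Digest challenge");
        }
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(v.substring(6));
        while (m.find()) {
            // Group 2 is the quoted form, group 3 the bare-token form.
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            params.put(m.group(1).toLowerCase(), value);
        }
        return params;
    }
}
```

Under this approach the made-up headers `Digest realm="test", nonce="abc", qop="auth"` and `Digest realm="test" nonce="abc" qop="auth"` yield the same map, so the separator style of the server no longer matters.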


Date: 2004-03-30 21:12
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Assigned to myself and upped the priority.


Attached File

No Files Currently Attached

Changes ( 4 )

Field        Old Value  Date              By
status_id    Open       2004-05-05 19:27  stack-sf
close_date   -          2004-05-05 19:27  stack-sf
priority     5          2004-03-30 21:12  stack-sf
assigned_to  nobody     2004-03-30 21:12  stack-sf