Heritrix: Internet Archive Web Crawler

Tracker: Bugs

8 ssl doesn't work - ID: 903910
Last Update: Comment added ( karl-ia )

A seed of https://weather.iwz.usmc.mil/ produces the arc
file contents below, with no obvious errors. The crawl
fails to progress past the 404 we get when we ask for
robots.txt (the 404 result is not written to the arc; see
the arc contents below).

Checking to see if the addition of the new protocol
factory which sets our trustmanager is responsible for
the breakage.
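
For readers unfamiliar with what "a protocol factory which sets our trustmanager" involves, here is a minimal sketch using only the standard JSSE classes. It is not the Heritrix code: the class names OpenTrustSketch and TrustEverything are hypothetical, and the assumption that the crawler wants to accept every server certificate is ours, not something stated in this ticket.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical sketch: builds an SSLSocketFactory backed by a custom trust manager.
final class OpenTrustSketch {

    // Assumption: a crawler archiving pages may choose to accept any certificate.
    // This disables certificate validation and is unsafe for ordinary clients.
    static final class TrustEverything implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Accept without checking.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Accept without checking.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    }

    // Returns a socket factory whose handshakes consult TrustEverything.
    static SSLSocketFactory socketFactory() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            // null key managers and null SecureRandom select the platform defaults.
            context.init(null, new TrustManager[] { new TrustEverything() }, null);
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not set up SSL context", e);
        }
    }
}
```

An HTTP client library would then be told to use this factory for the https scheme; how that registration is done depends on the client and is not shown here.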

Here are arc file contents.

filedesc://IAH20040224154443-0.arc.gz 0.0.0.0 20040224154443 text/plain 77
1 0 InternetArchive
URL IP-address Archive-date Content-type Archive-length


dns:weather.iwz.usmc.mil 63.203.238.114 20040224154442 text/dns 62
20040224154442
weather.iwz.usmc.mil. 86400 IN A 205.110.15.10
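
Each record in the dump above starts with a header whose fields follow the order named in the file description (URL IP-address Archive-date Content-type Archive-length), read as one space-separated line. A rough sketch of pulling those fields apart follows. The class ArcHeaderLine is a hypothetical helper written for this note, not part of Heritrix, and it assumes exactly five whitespace-separated fields.

```java
// Hypothetical helper: splits one ARC record header line into its five fields.
final class ArcHeaderLine {
    final String url;
    final String ip;
    final String archiveDate;   // 14-digit timestamp, e.g. 20040224154442
    final String contentType;
    final long length;          // number of bytes in the record body

    private ArcHeaderLine(String url, String ip, String archiveDate,
            String contentType, long length) {
        this.url = url;
        this.ip = ip;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    // Assumption: fields never contain whitespace, so a plain split is enough.
    static ArcHeaderLine parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 fields but found " + fields.length + ": " + line);
        }
        return new ArcHeaderLine(fields[0], fields[1], fields[2], fields[3],
            Long.parseLong(fields[4]));
    }

    public static void main(String[] args) {
        ArcHeaderLine header = parse(
            "dns:weather.iwz.usmc.mil 63.203.238.114 20040224154442 text/dns 62");
        System.out.println(header.url + " " + header.contentType + " "
            + header.length);
    }
}
```

Run against this arc, such a check would show only the filedesc and dns records, with no record for robots.txt or for the seed itself, which is the symptom reported here.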


Submitted: Nobody/Anonymous ( nobody ) - 2004-02-25 03:44
Priority: 8
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: Extraction
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:08
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-81 -- please add further
comments at that location.


Date: 2004-02-25 13:30
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Fixed. Commit message below.

Fixes that make crawling SSL work (again?). In a couple of
places, the assumption that we only do 'http' had crept in.
* src/java/org/archive/crawler/basic/ARCWriterProcessor.java
Allow for a scheme of https.
* src/java/org/archive/crawler/basic/CrawlStateUpdater.java
Formatting (80 chars).
(innerProcess): Allow for a scheme of https.
* src/java/org/archive/crawler/basic/PreconditionEnforcer.java
Formatting (80 chars).
(innerProcess): Allow for a scheme of https.
* src/java/org/archive/crawler/datamodel/UURI.java
Formatting and tolowercase on the scheme string (this class was
set to do http and https).
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Minor formatting.
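
The kind of check the commit message describes ("allow for a scheme of https", plus lowercasing the scheme string) might look roughly like the sketch below. This is an illustration only, not the committed code: the class SchemeCheck and its methods are hypothetical names, and the actual changes in ARCWriterProcessor, CrawlStateUpdater, PreconditionEnforcer, and UURI are not reproduced in this ticket.

```java
import java.util.Locale;

// Hypothetical sketch of a scheme test that no longer assumes plain 'http'.
final class SchemeCheck {

    // Returns the lowercased scheme, or null if the string has no scheme part.
    static String schemeOf(String uri) {
        int colon = uri.indexOf(':');
        if (colon <= 0) {
            return null;
        }
        // Schemes are case-insensitive, so normalize before comparing.
        return uri.substring(0, colon).toLowerCase(Locale.ROOT);
    }

    // True for both http and https; a test for "http" alone would skip SSL URIs.
    static boolean isHttpOrHttps(String uri) {
        String scheme = schemeOf(uri);
        return "http".equals(scheme) || "https".equals(scheme);
    }

    public static void main(String[] args) {
        System.out.println(isHttpOrHttps("https://weather.iwz.usmc.mil/"));
        System.out.println(isHttpOrHttps("dns:weather.iwz.usmc.mil"));
    }
}
```

A processor that compared the scheme against "http" only would silently pass over https URIs, which would be consistent with the arc above holding nothing beyond the dns record.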


Date: 2004-02-25 07:07
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Ok. Tried going to http://www.verisign.com with the recent
trustfactory and specialized ssl socket factory commented
out, and it fails in the same way. Either the new httpclient
broke things or SSL just never worked.

Upping the priority since we're so close to getting it working.


Date: 2004-02-25 03:46
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Assigned to myself (I created the issue). Upping priority.


Changes ( 6 )

Field          Old Value  Date              By
status_id      Open       2004-02-25 13:30  stack-sf
resolution_id  None       2004-02-25 13:30  stack-sf
close_date     -          2004-02-25 13:30  stack-sf
priority       7          2004-02-25 07:07  stack-sf
priority       5          2004-02-25 03:46  stack-sf
assigned_to    nobody     2004-02-25 03:46  stack-sf