Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 https SSLHandshakeException: unknown certificate - ID: 896788
Last Update: Comment added ( karl-ia )

We get this exception for various reasons: an expired
certificate, a certificate signed by an authority we don't
recognize, or just a bogus cert.

We might be able to fix a percentage of these by supplying a
keystore loaded w/ more than the default set of trusted
Certificate Authorities (copy from Mozilla?).
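
A minimal sketch of the keystore idea, using only standard JSSE classes. The class name CustomTrustStore and the path and password arguments are hypothetical and for illustration only; this is not the code that went into Heritrix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper: builds an SSLContext that trusts only the CAs
// found in the given keystore file instead of the JVM default set.
public class CustomTrustStore {
    public static SSLContext contextFor(Path trustStorePath, char[] password)
            throws GeneralSecurityException, IOException {
        // Load the CA certificates from the supplied keystore file.
        KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = Files.newInputStream(trustStorePath)) {
            trustStore.load(in, password);
        }
        // Derive trust managers from that keystore.
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        // No client keys (null) and the default SecureRandom (null).
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, tmf.getTrustManagers(), null);
        return context;
    }
}
```

Sockets created from context.getSocketFactory() would then validate server certs against the larger CA set. Setting the javax.net.ssl.trustStore system property is the JVM-wide alternative and needs no code.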

We should also look at forcing the crawler to trust
certificates whose CA it doesn't recognize, or to accept
certs that are expired anyways (I don't think this can
be done, but we should take a looksee anyways).

Here are some URLs to play with:

https://foia.aphis.usda.gov (self-signed)
https://www1.lmi.org/USDAIT (Unrecognized CA)
https://cert.myforms.sc.egov.usda.gov/myforms/loginservlet (Expired)

Look in local-errors.log from a broad-crawl for others.


Submitted: Michael Stack ( stack-sf ) - 2004-02-13 22:10
Priority: 5
Status: Closed
Resolution: Fixed
Assigned To: Michael Stack
Category: Protocols
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:07
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-68 -- please add further
comments at that location.


Date: 2004-02-21 01:22
Sender: stack-sf (Project Admin)

Fixed. Here is commit message:

Added a heritrix truststore. It's currently a copy of the SUN JVM cacerts.
This has at least twice the certs that ship w/ the IBM JVM (IBM JVM fails
going against verisign for instance). Also added a new heritrix
SSLSocketFactory which uses a new configurable HeritrixX509TrustManager.
Default setting is to trust anything given to us. Testing shows us getting
into sites w/ selfsigned and expired certs as well as into sites for which
we do not have the cert's CA signer.
* src/conf/heritrix.properties
(javax.net.ssl.trustStore): Added. Points at a heritrix trustStore
added as part of this commit.
* src/java/org/archive/crawler/Heritrix.java
(getProperty): Made methods publicly available so
HeritrixX509TrustManager can get at the truststore property.
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Made lines 80 characters and added javadoc.
Registered our new HeritrixSSLSocketFactory w/ the httpclient. Added
new ATTR_TRUST attribute so you can specify trust level.
HttpClient http = null;
* src/conf/heritrix.cacerts
* src/java/org/archive/httpclient/HeritrixSSLProtocolSocketFactory.java
* src/java/org/archive/httpclient/HeritrixSSLProtocolSocketFactoryTest.java
* src/java/org/archive/httpclient/HeritrixX509TrustManager.java
* src/java/org/archive/httpclient/package.html
Added.
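
For readers without the source at hand, here is a rough sketch of what a trust manager with a configurable trust level can look like. It is not the committed HeritrixX509TrustManager: the class name LeveledTrustManager and the two levels OPEN and NORMAL are assumptions made for illustration. Only the javax.net.ssl interfaces are real.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical trust manager: OPEN accepts any server cert chain,
// NORMAL defers to the JVM's standard validation.
public class LeveledTrustManager implements X509TrustManager {
    public enum TrustLevel { OPEN, NORMAL }

    private final TrustLevel level;
    private final X509TrustManager delegate;

    public LeveledTrustManager(TrustLevel level) throws GeneralSecurityException {
        this.level = level;
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                TrustManagerFactory.getDefaultAlgorithm());
        // A null keystore selects the JVM's default truststore
        // (or the one named by javax.net.ssl.trustStore, if set).
        tmf.init((KeyStore) null);
        X509TrustManager found = null;
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                found = (X509TrustManager) tm;
                break;
            }
        }
        if (found == null) {
            throw new GeneralSecurityException("no X509TrustManager available");
        }
        this.delegate = found;
    }

    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
        delegate.checkClientTrusted(chain, authType);
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
        if (level == TrustLevel.OPEN) {
            return; // accept self-signed, expired, and unknown-CA certs
        }
        delegate.checkServerTrusted(chain, authType);
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        return delegate.getAcceptedIssuers();
    }
}
```

An instance would be handed to SSLContext.init(null, new TrustManager[] { manager }, null). At OPEN the connection is still encrypted but the server is not authenticated, which may be acceptable for an archival crawl and little else.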



Date: 2004-02-17 20:06
Sender: stack-sf (Project Admin)

More URLs from Andy's log:

LR http://www.mci.usmc.mil/
javax.net.ssl.SSLHandshakeException: unknown certificate

LRLP https://kuwait.manpower.usmc.mil/manpower/mm/mmma/awards.nsf?open
javax.net.ssl.SSLHandshakeException: unknown certificate

LXP https://osprey.manpower.usmc.mil/manpower/mi/mra_ofct.nsf/RA/Reserve+Affairs+Division+Home
java.net.SocketException: SSL handshake failure

LXP https://www.marineforlife.com/
java.net.SocketException: SSL handshake failure

ELLXP https://mypay.dfas.mil/
java.io.IOException: SSL failure




Date: 2004-02-17 19:15
Sender: gojomo (Project Admin)

Andy Boyko's workshop crawl at:

/0/home/aboy/heritrix-0.4.1/jobs/workshop-20040213001126080/disk

...contains a lot of these errors.

Is there any deeper info we can extract and log from the
exception?
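
One possible source of deeper info is the exception's cause chain, since JSSE often wraps the underlying certificate error inside the handshake exception. The helper below is a hypothetical sketch (the class name ExceptionChain is made up here); whether a given JVM actually attaches a cause to these failures would need checking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical logging helper: flattens an exception and its causes
// into one "class: message" line per level.
public class ExceptionChain {
    private static final int MAX_DEPTH = 20; // guards against cyclic cause chains

    public static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        for (Throwable cur = t;
                cur != null && lines.size() < MAX_DEPTH;
                cur = cur.getCause()) {
            lines.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return lines;
    }
}
```

Writing each returned line next to the URI in local-errors.log would distinguish, say, an expired cert from an unknown CA when the JVM reports that detail.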

There should be a way to set the HTTPClient library to
'blindly trust' -- after all, browsers give their users that
option, on a case-by-case basis. It's not as if when we
encounter a cert problem we're going to try another way of
getting the same URI -- we should note the problem and
continue.


Attached File

No Files Currently Attached

Changes ( 4 )

Field           Old Value   Date               By
close_date      -           2004-02-21 01:22   stack-sf
status_id       Open        2004-02-21 01:22   stack-sf
resolution_id   None        2004-02-21 01:22   stack-sf
assigned_to     nobody      2004-02-17 19:15   gojomo