Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 recovery fills heritrix_out with "Relative URI but no base.." - ID: 1216633
Last Update: Comment added ( karl-ia )

Recovering a crawl puts the path to the recovery log
into the 'via' of many URIs, resulting in a lot of
these errors in heritrix_out:

org.apache.commons.httpclient.URIException: Relative URI but no base: /1/webcrawl-test/results/heritrix-1.5.0-200506061400-20050606T203234/tAU/begin/logs/recover.gz
    at org.archive.crawler.datamodel.UURIFactory.fixup(UURIFactory.java:438)
    at org.archive.crawler.datamodel.UURIFactory.create(UURIFactory.java:296)
    at org.archive.crawler.datamodel.UURIFactory.create(UURIFactory.java:285)
    at org.archive.crawler.datamodel.UURIFactory.getInstance(UURIFactory.java:240)
    at org.archive.crawler.frontier.RecoveryJournal.importRecoverLog(RecoveryJournal.java:221)
    at org.archive.crawler.frontier.AbstractFrontier.importRecoverLog(AbstractFrontier.java:799)
    at org.archive.crawler.framework.CrawlController.setupCrawlModules(CrawlController.java:582)
    at org.archive.crawler.framework.CrawlController.initialize(CrawlController.java:336)
    at org.archive.crawler.admin.CrawlJobHandler.startNextJobInternal(CrawlJobHandler.java:1066)
    at org.archive.crawler.admin.CrawlJobHandler$2.run(CrawlJobHandler.java:1032)
    at java.lang.Thread.run(Thread.java:595)

This is harmless, but pollutes the heritrix_out with
expected, uninteresting output. Some adjustment to the
current practice should prevent this.
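
To illustrate why a bare filesystem path triggers this message, here is a minimal sketch. It is not Heritrix's code: it uses the JDK's java.net.URI in place of UURIFactory, and the class ViaCheck and its create method are hypothetical names chosen for the example. The assumption is simply that a strict factory rejects any relative reference when it is given no base to resolve against.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for a strict URI factory (not the Heritrix UURIFactory).
class ViaCheck {

    // Returns an absolute URI, resolving against base when needed.
    // Throws if the input is relative and no base is supplied.
    static URI create(String s, URI base) throws URISyntaxException {
        URI u = new URI(s);
        if (u.isAbsolute()) {
            return u;
        }
        if (base == null) {
            throw new URISyntaxException(s, "Relative URI but no base");
        }
        return base.resolve(u);
    }

    public static void main(String[] args) {
        // A recovery-log path has no scheme, so it parses as a relative reference.
        String recoverLogPath = "/logs/recover.gz"; // illustrative path only
        try {
            create(recoverLogPath, null);
        } catch (URISyntaxException e) {
            System.out.println("rejected: " + e.getReason());
        }
    }
}
```

Under that assumption, any code that stores a log path where a URI is expected would fail this way once per imported entry, which matches the volume of output described above.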


Submitted: Gordon Mohr ( gojomo ) - 2005-06-07 18:56
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:53
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-427 -- please add further
comments at that location.


Date: 2005-07-22 02:53
Sender: gojomo (Project Admin)

Fixed (2005-06-08) by no longer placing the recovery-log path in
the 'via' as a pseudo-Via. Commit comment:

Fix for bug [recovery generates spurious URI relative-but-no-base output]
Implementation of RFE [timestamps in recovery log]
* RecoveryJournal.java
    don't insert recovery-path as pseudo-Via now that via is a UURI not a String
    dump a timestamp line to log every 10000 lines
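
The periodic-timestamp part of that change can be pictured with a small sketch. This is an assumption-laden illustration, not the RecoveryJournal implementation: the class TimestampedJournal, the writeLine method, and the "T " prefix on timestamp lines are all hypothetical, and the real log format may differ.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.time.Instant;

// Hypothetical journal writer that interleaves a timestamp line
// after every `interval` ordinary lines.
class TimestampedJournal {
    private final Writer out;
    private final int interval;
    private long lines = 0; // counts ordinary lines only, not timestamp lines

    TimestampedJournal(Writer out, int interval) {
        this.out = out;
        this.interval = interval;
    }

    void writeLine(String line) throws IOException {
        out.write(line);
        out.write('\n');
        lines++;
        if (lines % interval == 0) {
            // "T " prefix is an invented marker for this example.
            out.write("T " + Instant.now() + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        TimestampedJournal journal = new TimestampedJournal(sw, 10000);
        for (int i = 0; i < 25000; i++) {
            journal.writeLine("entry " + i);
        }
        System.out.print(sw.toString().length() > 0 ? "journal written\n" : "");
    }
}
```

With an interval of 10000, a reader of the log can estimate when any entry was written from the nearest timestamp lines without paying for a timestamp on every line.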


Changes ( 5 )

Field              Old Value  Date              By
artifact_group_id  None       2005-09-23 18:02  gojomo
status_id          Open       2005-07-22 02:53  gojomo
resolution_id      None       2005-07-22 02:53  gojomo
close_date         -          2005-07-22 02:53  gojomo
assigned_to        nobody     2005-06-22 19:24  gojomo