Heritrix: Internet Archive Web Crawler

Tracker: Bugs

seeds listed without scheme, but with path, being ignored - ID: 1207378
Last Update: Comment added ( karl-ia )

If a seed does not match the regex
^\s*(\w\S+)\s*(#.*)?$ it is not crawled. In other
words, a seed is crawled only if it starts with a scheme
(e.g. http://a.com) or looks like a bare host (a.com).

For example:

http://a.com -- seed in this format is recognized
a.com -- seed in this format is recognized
a.com #a.com -- seed in this format is recognized

a.com/ -- seed in this format is NOT recognized; it is
silently dropped from the seed list and from all other
reports.


Submitted: Igor Ranitovic ( ia_igor ) - 2005-05-23 22:01
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: General
Group: 1.6.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:52
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-415 -- please add further
comments at that location.


Date: 2005-07-01 23:19
Sender: karl-ia

Tested in harness: verified that the bug exists and verified the fix. Closing.


Date: 2005-06-02 23:59
Sender: gojomo (Project Admin)

Added rough feature to make any future inadvertently ignored
seed-file items (non-comment whitespace-trimmed lines)
prominent in the UI's seeds report, so that problems can be
corrected. Commit comment:

feature suggested by [ 1207378 ] seeds listed without
scheme, but with path, being ignored
  * webapps/admin/reports/seeds.jsp
      If any 'ignored seed items' were recorded the last time
      seeds were imported to the frontier, report them on the
      seeds report for operator review
  * CrawlJob.java
      add convenience method for getting the ignored seed
      items (if any) as a string
  * CrawlScope.java
      adjust seedsIterator to allow optional passed-in writer
      for reporting ignored seed items
  * AbstractFrontier.java
      extend loadSeeds to either save any ignored seeds, or
      delete older saved ignored seeds if none on most recent scan
  * AdaptiveRevisitFrontier.java, HostQueuesFrontier.java
      mimic AbstractFrontier's saving of any ignored seeds
  * SeedFileIterator.java
      allow ignoredWriter to be any Writer (not just
      BufferedWriter)

==
Bug believed fixed and extra insurance added against further
problems. Assigning to Karl for final bug disposition.
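
For readers without the source at hand, the save-or-delete behaviour described for loadSeeds in the commit comment above might look roughly like the sketch below. It is not the actual AbstractFrontier or CrawlJob code: the class name IgnoredSeedsSketch, the methods saveOrClear and ignoredAsString, and the idea of a single report file are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration only; not the real Heritrix frontier code.
class IgnoredSeedsSketch {

    /**
     * Saves the ignored seed-file items from the most recent scan, or
     * deletes a stale report when the most recent scan ignored nothing.
     * Returns true if a report file exists afterwards.
     */
    static boolean saveOrClear(Path reportFile, List<String> ignoredItems) {
        try {
            if (ignoredItems.isEmpty()) {
                // Nothing ignored this time: remove any older saved report.
                Files.deleteIfExists(reportFile);
                return false;
            }
            // One ignored item per line, replacing any older report.
            Files.write(reportFile, ignoredItems, StandardCharsets.UTF_8);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the saved ignored items as one string, or "" if there are none. */
    static String ignoredAsString(Path reportFile) {
        try {
            if (!Files.exists(reportFile)) {
                return "";
            }
            return new String(Files.readAllBytes(reportFile), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Deleting the old report when nothing was ignored keeps a seeds report from showing problems that the operator has already corrected.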



Date: 2005-06-02 02:00
Sender: gojomo (Project Admin)

Immediate fix committed. Commit comment:

Fix for [ 1207378 ] seeds listed without scheme, but with
path, being ignored
  * SeedFileIterator.java
      be more tolerant of abbreviated URI formats: try adding
      "http://" to the front of any entry-line that doesn't
      already have a URI scheme
      also: new facility to report ignored entry lines to
      optionally supplied writer (not yet used elsewhere)
      also: close reader/writer when done
  * SeedFileIteratorTest.java
      unit test for SeedFileIterator, including pattern which
      triggers referenced bug in unfixed code
  * TransformingIteratorWrapper.java
      new hook for cleanup (such as IO closing) when iterator
      exhausted
  * DomainScope.java, HostScope.java, PathScope.java, CrawlScope
      add hooks for 'just-in-case' closing of SeedFileIterator
      instances (if used)

===
Leaving open to add new section to seeds-report,
highlighting any non-comment entries in a seeds file that
were ignored during seed parsing (in progress).
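
As a rough illustration of the approach described in the commit comment above (prepend "http://" when an entry has no scheme, and report ignored lines to an optional writer), a minimal sketch could look like the following. This is not the actual SeedFileIterator code: the class SeedLineSketch, the method readSeeds, the scheme-detection pattern, and the rule that an entry containing inner whitespace is ignored are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the real Heritrix SeedFileIterator.
class SeedLineSketch {
    // Assumption: an entry "has a scheme" if it starts with scheme
    // characters followed by "://".
    private static final Pattern HAS_SCHEME =
            Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*://.*");

    /** Reads seed lines; ignoredWriter may be null if no report is wanted. */
    static List<String> readSeeds(Reader in, Writer ignoredWriter) {
        List<String> seeds = new ArrayList<>();
        // try-with-resources closes the reader when done.
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                if (entry.isEmpty() || entry.startsWith("#")) {
                    continue; // blank lines and whole-line comments are not seeds
                }
                // Drop a trailing comment such as "a.com #a.com".
                entry = entry.replaceFirst("\\s+#.*$", "");
                if (entry.matches(".*\\s.*")) {
                    // Assumption: an entry with inner whitespace is not a URI,
                    // so it is reported rather than silently dropped.
                    if (ignoredWriter != null) {
                        ignoredWriter.write(line.trim() + "\n");
                    }
                    continue;
                }
                if (!HAS_SCHEME.matcher(entry).matches()) {
                    // Tolerate abbreviated forms such as "a.com/".
                    entry = "http://" + entry;
                }
                seeds.add(entry);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seeds;
    }

    public static void main(String[] args) {
        StringWriter ignored = new StringWriter();
        String file = "http://a.com\na.com\na.com #a.com\na.com/\n# note\ntwo words\n";
        System.out.println(readSeeds(new StringReader(file), ignored));
        System.out.print("ignored: " + ignored);
    }
}
```

With this shape, the "a.com/" form from the original report ends up as "http://a.com/" instead of being dropped, and any line that still cannot be used is written to the supplied writer for later review.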


Changes ( 6 )

Field              Old Value                Date              By
artifact_group_id  None                     2005-09-23 18:02  gojomo
status_id          Open                     2005-07-01 23:19  karl-ia
close_date         -                        2005-07-01 23:19  karl-ia
assigned_to        gojomo                   2005-06-02 23:59  gojomo
summary            seeds not being crawled  2005-06-01 02:44  gojomo
assigned_to        nobody                   2005-05-24 21:54  gojomo