Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 seeds text area truncates seeds; big seed lists break config - ID: 1102755
Last Update: Comment added ( karl-ia )

Two related problems, both stemming from the current
textarea-based seed submission in the config screens
when combined with giant seed lists:

(1) The text area has a maximum effective contents
size; a giant seed list can exceed this. As a result,
the seeds displayed for a running crawl (for example,
one started from the command line) can be truncated.
Upon any configuration change, the seeds are reread
from the textarea, so any seeds cut off by the
truncation are lost. If the scope is derived from the
seeds, URLs that should be in scope are now ruled out
of scope, and thus discarded (with -5000 status) upon
recheck.

(2) Alternatively, attempting to submit a too-long seed
list sometimes generates an
ArrayIndexOutOfBoundsException (often with a '200000'
index) inside Jetty code. This seems somewhat random:
sometimes the exception is thrown, sometimes it isn't
(perhaps because the content is silently truncated
instead).

In general, the textarea-based way of entering seeds,
and having that textarea input clobber anything in a
preexisting file, is seriously flawed and needs to be
fixed.


Submitted: Gordon Mohr ( gojomo ) - 2005-01-15 02:27

Priority: 7

Status: Closed

Resolution: Fixed

Assigned: Gordon Mohr

Category: None

Group: None

Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:20
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-336 -- please add further
comments at that location.


Date: 2005-03-02 19:51
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Immediate problem is fixed (awkwardly). Deferring larger
issues to a separate RFE (to be created if it doesn't
already exist) to better handle giant seed lists, avoid
problems with seed files.


Date: 2005-01-17 13:22
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Added protection against too-long seeds files being edited;
the textarea simply won't be shown for them. Commit comment:

improvement for [ 1102755 ] seeds text area truncates seeds;
big seed lists break config
* JobConfigureUtils.java
add test of whether seed file is of plausibly editable size
* configure.jsp
if seeds file too long, don't offer editor
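
For illustration only, the kind of size test described in the commit
comment might look like the sketch below. This is not the actual
JobConfigureUtils code: the class name SeedEditGuard, both method
names, and the 32 KB limit are assumptions made for this example; the
real threshold is not stated in this report.

```java
import java.io.File;

/** Hypothetical helper; not the actual JobConfigureUtils implementation. */
public final class SeedEditGuard {
    /** Assumed limit, for illustration only; the real threshold is not given here. */
    public static final long MAX_EDITABLE_BYTES = 32L * 1024L;

    private SeedEditGuard() {}

    /** Pure size check, kept separate so it can be exercised without touching disk. */
    public static boolean isEditableSize(long sizeBytes, long maxBytes) {
        return sizeBytes >= 0 && sizeBytes <= maxBytes;
    }

    /** True if the seeds file exists and is small enough to round-trip through a textarea. */
    public static boolean isSeedFileEditable(File seedsFile) {
        return seedsFile.isFile()
                && isEditableSize(seedsFile.length(), MAX_EDITABLE_BYTES);
    }

    public static void main(String[] args) {
        File seeds = new File(args.length > 0 ? args[0] : "seeds.txt");
        // A page like configure.jsp would branch on this result to decide
        // whether to render the seeds textarea at all.
        System.out.println(isSeedFileEditable(seeds)
                ? "offer textarea editor"
                : "do not offer editor; edit the seeds file on disk");
    }
}
```

Refusing to show the editor for oversized files means a truncated
textarea can never be written back over the full seeds file, which is
the failure described in (1).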


Attached File

No Files Currently Attached

Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2005-03-02 19:51  gojomo
resolution_id  None       2005-03-02 19:51  gojomo
close_date     -          2005-03-02 19:51  gojomo
priority       9          2005-01-17 13:22  gojomo