A user (Miles Crawford @ UW) wants to use Heritrix to
crawl only certain (text-oriented) documents. For
many URIs, of course, the fetch has to begin before
the document's type is known. If the document turns out
not to be of the desired type, it would be beneficial to
abort the fetch early, rather than complete it (or
wait for it to hit other length/time limits) and then
ignore the result.
This is probably best achieved with
operator-specified filters hooked into FetchHTTP
between the header fetch and the readFully...
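
To make the idea concrete, here is a minimal sketch of a filter that runs after the response headers arrive and before the body is read. It is not Heritrix code: the class `MidFetchFilterSketch`, its methods, and the prefix-matching rule are all hypothetical, and the real hook would live inside FetchHTTP rather than in a standalone class. Treating a missing Content-Type as a rejection is also just an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a mid-fetch filter; not part of Heritrix. */
class MidFetchFilterSketch {

    /** Operator-specified, lowercase media-type prefixes worth fetching in full. */
    private final List<String> acceptedPrefixes;

    MidFetchFilterSketch(List<String> acceptedPrefixes) {
        this.acceptedPrefixes = acceptedPrefixes;
    }

    /** Decide from the Content-Type header alone whether the body is wanted. */
    boolean accepts(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false; // assumption: responses with no declared type are skipped
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String mediaType = contentTypeHeader.split(";", 2)[0]
                .trim()
                .toLowerCase(Locale.ROOT);
        for (String prefix : acceptedPrefixes) {
            if (mediaType.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Read the body only if the filter accepts the declared type.
     * Returns null when the fetch is abandoned after the headers.
     */
    byte[] fetchBody(String contentTypeHeader, InputStream body) {
        try (InputStream in = body) {
            if (!accepts(contentTypeHeader)) {
                return null; // the stream is closed without reading the body
            }
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a filter built as `new MidFetchFilterSketch(List.of("text/"))`, a response declared as `text/html; charset=UTF-8` would be read in full, while one declared as `image/png` would be dropped as soon as its headers are seen.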
Michael Stack
Date: 2007-03-14 01:29
Date: 2004-10-05 18:04
Date: 2004-10-05 18:04
Date: 2004-10-05 18:03
Date: 2004-09-18 01:07
Date: 2004-09-02 14:41
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2004-10-05 18:04 | stack-sf |
| close_date | - | 2004-10-05 18:04 | stack-sf |
| assigned_to | nobody | 2004-09-01 23:19 | gojomo |
| priority | 6 | 2004-09-01 21:49 | gojomo |
| priority | 5 | 2004-07-29 00:55 | gojomo |