Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 Extractors should not extract if links already extracted - ID: 1111656
Last Update: Comment added ( karl-ia )

Feature requested by Dave Skinner. Here is the note from
the list asking for the feature. We chatted in-house and
decided it makes sense:

Dave Skinner wrote:

> This was not in my list of three but it is easier to send....
>
> ExtractorUniversal.java contains the following check
>
>     protected void innerProcess(CrawlURI curi) {
>         if (curi.hasBeenLinkExtracted()) {
>             // Some other extractor already handled this one. We'll pass on it.
>             return;
>         }
>
> I think all the extractors should have the same or similar code.
> Right now it is not easy to prevent a curi from having its links
> followed. I can't find anywhere in the standard code where this is
> checked other than the one place in ExtractorUniversal.

We had a chat here and it makes sense that extractors
should, by default, not run if links have already been
extracted.
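
To illustrate the requested behaviour, here is a minimal, self-contained sketch. It is not the Heritrix code: SketchUri stands in for CrawlURI, only hasBeenLinkExtracted() comes from the note above, and markLinkExtracted(), GuardedExtractor and GuardDemo are hypothetical names. Putting the guard in a shared base class is just one way to give every extractor "the same or similar code".

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Heritrix's CrawlURI; only the flag discussed here is modelled.
class SketchUri {
    private boolean linkExtracted = false;
    final List<String> outLinks = new ArrayList<>();

    boolean hasBeenLinkExtracted() { return linkExtracted; }

    // Hypothetical setter name, not taken from the Heritrix source.
    void markLinkExtracted() { linkExtracted = true; }
}

// Hypothetical base class: the guard lives in one place, so every
// subclass skips URIs whose links were already extracted.
abstract class GuardedExtractor {
    final void process(SketchUri curi) {
        if (curi.hasBeenLinkExtracted()) {
            // Some other extractor already handled this one. Pass on it.
            return;
        }
        extract(curi);
    }

    protected abstract void extract(SketchUri curi);
}

class GuardDemo {
    // Runs two extractors in sequence against the same URI.
    static SketchUri runTwoExtractors() {
        GuardedExtractor first = new GuardedExtractor() {
            @Override protected void extract(SketchUri curi) {
                curi.outLinks.add("http://example.com/from-first");
                curi.markLinkExtracted();
            }
        };
        GuardedExtractor second = new GuardedExtractor() {
            @Override protected void extract(SketchUri curi) {
                curi.outLinks.add("http://example.com/from-second");
                curi.markLinkExtracted();
            }
        };
        SketchUri curi = new SketchUri();
        first.process(curi);
        second.process(curi); // skipped: the flag is already set
        return curi;
    }

    public static void main(String[] args) {
        System.out.println(runTwoExtractors().outLinks.size()); // prints 1
    }
}
```

With this shape, preventing a curi from having its links followed only requires setting the flag before the extractors run.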




Submitted: Michael Stack ( stack-sf ) - 2005-01-28 18:56
Priority: 5
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: API
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:38
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-890 -- please add further
comments at that location.


Date: 2005-01-28 19:13
Sender: stack-sf (Project Admin)

Fixed. Closing. Below is the commit message. Of note, the only
extractor that does not by default respect the flag that
says links have already been extracted is ExtractorHTTP,
which runs against HTTP headers. This one should probably
always run, whatever the flag says. That's how it currently is.


[debord 342] heritrix > more /tmp/diff.txt
Fix for '[ 1111656 ] Extractors should not extract if already done'
* src/java/org/archive/crawler/extractor/ExtractorCSS.java
* src/java/org/archive/crawler/extractor/ExtractorDOC.java
* src/java/org/archive/crawler/extractor/ExtractorHTML.java
* src/java/org/archive/crawler/extractor/ExtractorJS.java
* src/java/org/archive/crawler/extractor/ExtractorPDF.java
* src/java/org/archive/crawler/extractor/ExtractorSWF.java
  Add check if link extraction has already been done. If so, don't run.
* src/java/org/archive/crawler/extractor/ExtractorHTTP.java
  Refactoring.
* src/java/org/archive/crawler/extractor/ExtractorUniversal.java
  Set links extracted flag at end of processing to be consistent
  with other extractors.
* src/java/org/archive/io/arc/ARCReaderFactory.java
  Javadoc warning fix.
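
A rough sketch of how the behaviour described in this commit message plays out across a processing chain, under stated assumptions: ChainUri, ChainStep, HeaderStep, BodyStep and ChainDemo are hypothetical stand-ins, not the Heritrix classes, and markLinkExtracted() is an invented name. The header step is deliberately placed after a body step here only to show that it ignores the flag; whether the real ExtractorHTTP sets the flag is not stated above, so the sketch leaves it unset.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for CrawlURI, recording which steps actually did work.
class ChainUri {
    private boolean linkExtracted = false;
    final List<String> log = new ArrayList<>();

    boolean hasBeenLinkExtracted() { return linkExtracted; }

    // Hypothetical setter name, not taken from the Heritrix source.
    void markLinkExtracted() { linkExtracted = true; }
}

interface ChainStep {
    void process(ChainUri curi);
}

// Models an extractor that reads HTTP headers: it always runs,
// whatever the flag says.
class HeaderStep implements ChainStep {
    public void process(ChainUri curi) {
        curi.log.add("headers");
    }
}

// Models a content extractor (HTML, CSS, JS, universal, ...): it skips
// already-extracted URIs and sets the flag at the end of processing.
class BodyStep implements ChainStep {
    private final String name;

    BodyStep(String name) { this.name = name; }

    public void process(ChainUri curi) {
        if (curi.hasBeenLinkExtracted()) {
            return; // links already extracted, don't run
        }
        curi.log.add(name);
        curi.markLinkExtracted();
    }
}

class ChainDemo {
    static List<String> run(List<ChainStep> chain) {
        ChainUri curi = new ChainUri();
        for (ChainStep step : chain) {
            step.process(curi);
        }
        return curi.log;
    }

    public static void main(String[] args) {
        List<ChainStep> chain = new ArrayList<>();
        chain.add(new BodyStep("html"));
        chain.add(new HeaderStep());
        chain.add(new BodyStep("universal"));
        System.out.println(run(chain)); // prints [html, headers]
    }
}
```

Because every content step sets the flag when it finishes, a catch-all step at the end of the chain only does work on URIs that no earlier step handled.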




Attached File

No Files Currently Attached

Changes ( 3 )

Field       Old Value                                      Date              By
status_id   Open                                           2005-01-28 19:13  stack-sf
summary     Extractors should not extract if already done  2005-01-28 19:13  stack-sf
close_date  -                                              2005-01-28 19:13  stack-sf