Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 Improved out-of-the-box defaults - ID: 983109
Last Update: Comment added ( karl-ia )

Andy Boyko says:

...I've got a few questions about the default configuration values,
towards making the crawler most useful right out of the box.

- Invocation: it seems pretty clear that crawls are safest with some
memory headroom. Can the bin/heritrix script default JAVA_OPTS to
"-Xmx256m" if it's not otherwise set?

- Extractors: how experimental are the non-enabled variants? I recall
that in February, we were encouraged to use HTML2 and CSS, but that some
leaked memory (PDF? DOC? SWF?). Is it safe to change the defaults to,
at the least, include CSS? Of the potentially leaky ones, SWF seems the
most compelling for enabling by default.

- Filters: If I understand the "recheck-scope" setting on the
preselector, it needs to be set for scope changes during the crawl to be
detected and honored. Assuming it doesn't affect performance too much,
can it default to on?

Is it worthwhile to enable by default either PathDepth or
PathologicalPath filters, presumably with generous but non-infinite
values?

- Politeness: I know per-host bandwidth usage got moved to the "expert"
section, but it might be good to default the per-host cap to something
maybe T1-like (thus, perhaps 150KBps or so) to avoid pummelling sites
with large files, when crawling from a large pipe. (Is the per-host cap
like the total-bandwidth cap, in that it doesn't actually constrain
instantaneous traffic, only the average?)
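
To make the average-versus-instantaneous distinction in that question concrete, here is a minimal sketch of a cap that only constrains the average rate. It is not a description of how Heritrix enforces its caps; the function name is invented for illustration, and reading "150KBps" as 150,000 bytes per second is an assumption.

```python
def pause_to_honor_average_cap(bytes_fetched, elapsed_seconds, cap_bytes_per_second):
    """Return how many seconds to wait so the average rate falls back to the cap.

    Hypothetical helper: the transfer itself is never throttled, only the
    idle time that follows it, so instantaneous traffic can exceed the cap.
    """
    if cap_bytes_per_second <= 0:
        raise ValueError("cap must be positive")
    # Earliest moment at which this many bytes is allowed under the cap.
    earliest_allowed_elapsed = bytes_fetched / cap_bytes_per_second
    return max(0.0, earliest_allowed_elapsed - elapsed_seconds)


CAP = 150 * 1000  # assumption: 150 KBps read as 150,000 bytes per second

# A 1.5 MB file pulled in one second over a large pipe: the transfer ran at
# ten times the cap, and only the pause afterwards restores the average.
burst_pause = pause_to_honor_average_cap(1_500_000, 1.0, CAP)

# The same file arriving over twelve seconds never exceeded the cap.
slow_pause = pause_to_honor_average_cap(1_500_000, 12.0, CAP)
```

Under this model a large file still arrives as fast as the pipe allows; the cap shows up only as a longer wait before the next request to that host.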

At any rate, it's clear there's not a lot between here and 1.0 (but
please see the bug I submitted today :) Thanks for the great work and
breakneck pace, Heritrix-meisters!

Regards,
Andy Boyko aboy@loc.gov
Library of Congress


Igor responds....

>> - Invocation: it seems pretty clear that crawls are safest with some
>> memory headroom. Can the bin/heritrix script default JAVA_OPTS to
>> "-Xmx256m" if it's not otherwise set?


Sure. Maybe we should adopt IBM's JVM approach, where the default
heap size is half of the system's physical memory.
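
A minimal sketch of the two defaulting policies under discussion: fall back to "-Xmx256m" when JAVA_OPTS is not set, or derive the heap from half of physical memory. This is not the actual bin/heritrix launcher logic; the function name and the `physical_memory_bytes` parameter are hypothetical, and treating an empty JAVA_OPTS as unset is an assumption.

```python
import os

DEFAULT_JAVA_OPTS = "-Xmx256m"  # the fallback proposed in the request


def resolve_java_opts(env, physical_memory_bytes=None):
    """Pick JVM options for a launch (hypothetical helper).

    An explicit JAVA_OPTS always wins. Otherwise use half of physical
    memory when the caller supplies it, else the fixed 256 MB default.
    """
    configured = env.get("JAVA_OPTS")
    if configured:  # assumption: an empty value counts as not set
        return configured
    if physical_memory_bytes is not None:
        half_in_mb = physical_memory_bytes // 2 // (1024 * 1024)
        return f"-Xmx{half_in_mb}m"
    return DEFAULT_JAVA_OPTS


if __name__ == "__main__":
    print(resolve_java_opts(os.environ))
```

Keeping the user's own JAVA_OPTS untouched matches the "if it's not otherwise set" condition in the request.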


>> - Extractors: how experimental are the non-enabled variants? I recall
>> that in February, we were encouraged to use HTML2 and CSS, but that
>> some leaked memory (PDF? DOC? SWF?). Is it safe to change the
>> defaults to, at the least, include CSS? Of the potentially leaky
>> ones, SWF seems the most compelling for enabling by default.


We should have HTTP, HTML, CSS, JS and SWF be the default
extractors.
We changed the SWF extractor to use memory more efficiently,
and I have not had memory problems with it since.
The PDF and DOC extractors are still problematic. I have
been working on a new, more memory-efficient PDF extractor,
but it is not ready yet. DOC parsing is a problem since DOCs
cannot be parsed by treating them as randomly accessible
streams. Because of this, it is necessary to load entire
DOCs into memory in order to parse them.
The HTML2 extractor (horrible name, btw) makes two passes
over JavaScript code. One pass examines all strings in the
JavaScript code, and the second pass parses the JavaScript
code as HTML. The HTML extractor makes only the first pass.
I believe that the HTML extractor has gotten better and that
there is no need for HTML2 anymore.
I will have to do some comparison to confirm this.
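
The two passes described above can be pictured roughly as follows. This is only a sketch of the idea, not the Heritrix extractor code: the function names, the string-literal regex, and the "looks like a URI" heuristic are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Quoted JavaScript string literals, single- or double-quoted.
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""")
# Crude heuristic: no whitespace or angle brackets, and a '.' or '/' inside.
URI_LIKE = re.compile(r"[^\s<>]*\w[./]\w[^\s<>]*")


def uris_from_string_literals(js_source):
    """Pass 1: examine every string literal and keep the URI-like ones."""
    found = []
    for match in STRING_LITERAL.finditer(js_source):
        literal = match.group(2)
        if URI_LIKE.fullmatch(literal):
            found.append(literal)
    return found


class _LinkCollector(HTMLParser):
    """Collects href/src attribute values from any tags it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def uris_from_js_parsed_as_html(js_source):
    """Pass 2: feed the raw JavaScript to an HTML parser and collect links."""
    collector = _LinkCollector()
    collector.feed(js_source)
    collector.close()
    return collector.links


SAMPLE_JS = '''var logo = "img/logo.png";
document.write("<a href='next.html'>next</a>");'''

first_pass = uris_from_string_literals(SAMPLE_JS)
second_pass = uris_from_js_parsed_as_html(SAMPLE_JS)
```

In this toy sample the first pass finds the bare path in a string, while the link embedded in markup written by `document.write` is only found by the second pass. Whether the real HTML extractor now covers such cases is exactly the comparison Igor says is still needed.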


>> - Filters: If I understand the "recheck-scope" setting on the
>> preselector, it needs to be set for scope changes during the crawl
>> to be detected and honored. Assuming it doesn't affect performance
>> too much, can it default to on?


I have no preference on this one, though it seems right as is.


>> Is it worthwhile to enable by default either PathDepth or
>> PathologicalPath filters, presumably with generous but non-infinite
>> values?


I agree. I usually set PathDepth to 20 and PathologicalPath to a
maximum of 3 repetitions of a pattern.
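
As a rough illustration of what these two limits reject, here is a minimal sketch using the values mentioned above (depth 20, at most 3 repetitions). It is not the Heritrix filter implementation: the function names, the definition of depth as the number of non-empty path segments, and the regular expression are assumptions.

```python
import re
from urllib.parse import urlsplit

MAX_PATH_DEPTH = 20   # value suggested in the reply above
MAX_REPETITIONS = 3   # value suggested in the reply above


def path_depth(url):
    """Number of non-empty path segments (assumed definition of depth)."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def exceeds_path_depth(url, max_depth=MAX_PATH_DEPTH):
    """True when the URL is deeper than the allowed number of segments."""
    return path_depth(url) > max_depth


def has_pathological_path(url, max_repetitions=MAX_REPETITIONS):
    """True when some run of segments repeats back to back too many times.

    The pattern matches one occurrence of a segment group followed by at
    least max_repetitions further copies, i.e. more repeats than allowed.
    """
    pattern = re.compile(r"/(.+?/)\1{%d,}" % max_repetitions)
    return pattern.search(urlsplit(url).path) is not None


deep_url = "http://example.com/" + "/".join(f"d{i}" for i in range(21))
ok_url = "http://example.com/" + "/".join(f"d{i}" for i in range(20))
looping_url = "http://example.com/a/b/a/b/a/b/a/b/x"
```

Both checks look only at the path, so a crawler trap that varies the query string instead would need a different filter.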


>> - Politeness: I know per-host bandwidth usage got moved to the
>> "expert" section, but it might be good to default the per-host cap
>> to something maybe T1-like (thus, perhaps 150KBps or so) to avoid
>> pummelling sites with large files, when crawling from a large pipe.
>> (Is the per-host cap like the total-bandwidth cap, in that it
>> doesn't actually constrain instantaneous traffic, only the average?)


I am not sure if we need to do this. It seems that the default
values of dynamic politeness will significantly delay requests
to sites with large files when crawling from a large pipe.
During fetching we might be OK just relying on TCP's
congestion control, without worrying about single-handedly
saturating the sites' bandwidth.
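
The dynamic politeness referred to here can be pictured as a wait that grows with how long the previous fetch from the same host took, clamped to a range. A minimal sketch under that reading; the function name and the factor and bounds below are hypothetical illustration values, not the Heritrix defaults.

```python
def politeness_delay_ms(fetch_duration_ms, delay_factor=5.0,
                        min_delay_ms=500, max_delay_ms=5000):
    """Wait before the next request to a host (hypothetical helper).

    The wait is the last fetch duration times a factor, so slow or large
    downloads push the next request further out, within fixed bounds.
    """
    proposed = fetch_duration_ms * delay_factor
    return max(min_delay_ms, min(max_delay_ms, proposed))


quick_page = politeness_delay_ms(100)       # small page, short wait
medium_file = politeness_delay_ms(400)      # wait scales with fetch time
large_file = politeness_delay_ms(10_000)    # long download hits the ceiling
```

With numbers like these, a host that just served a long download is left alone for the maximum wait, which is the effect the reply relies on instead of a bandwidth cap.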

Take care.
i.



Andy supplies attached patch.


Michael Stack ( stack-sf ) - 2004-07-01 00:18

5

Closed

None

Michael Stack

Configuration

None

Public


Comments ( 5 )

Date: 2007-03-14 01:31
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-794 -- please add further
comments at that location.


Date: 2004-07-01 00:50
Sender: stack-sf (Project Admin)

Here is the note that andy sent with his patch.

Andy Boyko wrote:

>I'm belatedly responding to Igor's thoughtful response about what the
>out-of-the-box configuration should be like. Based on his comments,
>I've got a patch to the initial profile's order.xml (as of Heritrix
>0.10.0) which makes the following changes:
>
> - enables the CSS and SWF extractors
> - adds the PathDepth filter and PathologicalPath filters to the
>   exclude filter of the scope, with the default depth of 20 and
>   default path repetitions of 3.
> - enables recheck-scope in the Preselector, which was necessary
>   to support the above filters correctly
>
>Should this patch be applied to the initial profile? I'd like the
>extractors turned on, and am agnostic about the filters (but note that
>it's tricky to get those filters configured so they actually work, so
>if it's not on in the default, it needs documenting).
>
>Note that there are a few other changes in there, as the profile was
>edited by the Web UI, and so it reflects the values written by the
>code. Nothing significant except for the addition of what I guess are
>a few new configurations.
>
>Regards,
>Andy Boyko aboy@loc.gov
>Library of Congress
>



Date: 2004-07-01 00:46
Sender: stack-sf (Project Admin)

Actually close.


Date: 2004-07-01 00:46
Sender: stack-sf (Project Admin)

I put in place andy's patch and added default JAVA_OPTS of
'-Xmx256m'. Closing.

Below is commit message:

Fix for [ 983109 ] Improved out-of-the-box defaults.
Bulk of the below was submitted by Andy and reviewed by Igor
and I.
* src/conf/profiles/Simple/order.xml
* src/conf/selftest/order.xml
Add pathdepth, pathologicalpathfilter, extractorswf, and
extractorcss.
* src/scripts/heritrix
Make default max heap size be 256megs.



Attached File ( 1 )

Filename         Description
order.xml.patch  order file patch

Changes ( 3 )

Field       Old Value               Date              By
status_id   Open                    2004-07-01 00:46  stack-sf
close_date  -                       2004-07-01 00:46  stack-sf
File Added  92470: order.xml.patch  2004-07-01 00:18  stack-sf