Andy Boyko says:
...I've got a few questions about the default configuration values,
towards making the crawler most useful right out of the box.
- Invocation: it seems pretty clear that crawls are safest with some
memory headroom. Can the bin/heritrix script default JAVA_OPTS to
"-Xmx256m" if it's not otherwise set?
- Extractors: how experimental are the non-enabled variants? I recall
that in February, we were encouraged to use HTML2 and CSS, but that some
leaked memory (PDF? DOC? SWF?). Is it safe to change the defaults to,
at the least, include CSS? Of the potentially leaky ones, SWF seems the
most compelling for enabling by default.
- Filters: If I understand the "recheck-scope" setting on the
preselector, it needs to be set for scope changes during the crawl to be
detected and honored. Assuming it doesn't affect performance too much,
can it default to on?
Is it worthwhile to enable by default either PathDepth or
PathologicalPath filters, presumably with generous but non-infinite
values?
- Politeness: I know per-host bandwidth usage got moved to the "expert"
section, but it might be good to default the per-host cap to something
maybe T1-like (thus, perhaps 150KBps or so) to avoid pummelling sites
with large files, when crawling from a large pipe. (Is the per-host cap
like the total-bandwidth cap, in that it doesn't actually constrain
instantaneous traffic, only the average?)
At any rate, it's clear there's not a lot between here and 1.0 (but
please see the bug I submitted today :) Thanks for the great work and
breakneck pace, Heritrix-meisters!
Regards,
Andy Boyko aboy@loc.gov
Library of Congress
Igor responds....
>> - Invocation: it seems pretty clear that crawls are safest with some
>> memory headroom. Can the bin/heritrix script default JAVA_OPTS to
>> "-Xmx256m" if it's not otherwise set?
Sure. Maybe we should adopt the approach of IBM's JVM, where the default
heap size is half of the system's physical memory.
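The two options discussed above (a fixed "-Xmx256m" fallback, or half of
physical memory) can be sketched as follows. This is only an illustration
in Python, not the actual bin/heritrix script; the function name
resolve_java_opts and its parameters are hypothetical.

```python
DEFAULT_HEAP_FLAG = "-Xmx256m"  # the fallback value proposed in the thread


def resolve_java_opts(env, physical_memory_mb=None):
    """Return the JAVA_OPTS value a launcher might use (hypothetical helper).

    env: a mapping such as os.environ.
    physical_memory_mb: if given, size the heap at half of physical memory
    instead of using the fixed 256 MB fallback.
    """
    current = env.get("JAVA_OPTS", "").strip()
    if current:
        # The operator already set JAVA_OPTS, so leave it untouched.
        return current
    if physical_memory_mb is not None:
        return "-Xmx%dm" % (physical_memory_mb // 2)
    return DEFAULT_HEAP_FLAG
```

With an empty environment this returns "-Xmx256m"; an existing JAVA_OPTS
value always wins over either default.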
>> - Extractors: how experimental are the non-enabled variants? I recall
>> that in February, we were encouraged to use HTML2 and CSS, but that some
>> leaked memory (PDF? DOC? SWF?). Is it safe to change the defaults to,
>> at the least, include CSS? Of the potentially leaky ones, SWF seems the
>> most compelling for enabling by default.
We should have HTTP, HTML, CSS, JS and SWF be the default extractors.
We changed the SWF extractor to use memory more efficiently and I have
not had memory problems with it since.
PDF and DOC extractors are still problematic. I have been working on a
new, more memory-efficient PDF extractor, but it is not ready yet. DOC
parsing is a problem since DOCs cannot be parsed by treating them as
randomly accessible streams. Because of this it is necessary to load
entire DOCs into memory in order to parse them.
The HTML2 extractor (horrible name, btw) makes two passes over
JavaScript code. One pass examines all strings in the JavaScript code,
and the second pass parses the JavaScript code as HTML. The HTML
extractor makes only the first pass. I believe that the HTML extractor
has gotten better and that there is no need for HTML2 anymore.
I will have to do some comparisons to confirm this.
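To make the two passes concrete, here is a rough Python sketch. It is not
the Heritrix extractor code; the regular expressions and the function
names strings_pass and html_pass are assumptions chosen for illustration
only.

```python
import re

# Quoted string literals in JavaScript (escaped quotes are not handled).
JS_STRING = re.compile(r"""(["'])(.*?)\1""")
# Crude "looks like a URI" test: an http(s) URL or a path ending in an extension.
LIKELY_URI = re.compile(r"^(?:https?://\S+|[\w./-]+\.\w{2,4})$")
# href/src attribute values, as an HTML-style scan would find them.
HTML_ATTR = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.I)


def strings_pass(js_source):
    """First pass: keep string literals that look like URIs."""
    return [m.group(2) for m in JS_STRING.finditer(js_source)
            if LIKELY_URI.match(m.group(2))]


def html_pass(js_source):
    """Second pass: scan the JavaScript text as if it were HTML."""
    return HTML_ATTR.findall(js_source)
```

For the input
document.write("<a href='page2.html'>next</a>"); var img = "pics/logo.gif";
the first pass finds pics/logo.gif and the second finds page2.html, which
is the kind of link only the second pass would contribute.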
>> - Filters: If I understand the "recheck-scope" setting on the
>> preselector, it needs to be set for scope changes during the crawl to be
>> detected and honored. Assuming it doesn't affect performance too much,
>> can it default to on?
I have no preference on this one, though it seems right as is.
>> Is it worthwhile to enable by default either PathDepth or
>> PathologicalPath filters, presumably with generous but non-infinite
>> values?
I agree. I usually set PathDepth to 20 and PathologicalPath to a maximum
of 3 repetitions of a pattern.
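As a rough illustration of what these two limits check, here is a minimal
Python sketch. It is not the Heritrix filter code; the function names are
hypothetical, and treating "3 repetitions" as "reject when a segment
occurs more than 3 times in a row" is an assumption.

```python
import re
from urllib.parse import urlsplit


def path_depth(url):
    """Count the non-empty path segments of a URL."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def exceeds_path_depth(url, max_depth=20):
    """True if the URL path is deeper than max_depth segments."""
    return path_depth(url) > max_depth


def has_pathological_path(url, max_repetitions=3):
    """True if a path segment occurs more than max_repetitions times in a row."""
    # One segment followed by at least max_repetitions further copies of itself.
    pattern = re.compile(r"/([^/]+/)\1{%d,}" % max_repetitions)
    return pattern.search(urlsplit(url).path) is not None
```

With these values, http://example.com/a/a/a/x.html passes while
http://example.com/a/a/a/a/x.html is flagged as pathological.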
>> - Politeness: I know per-host bandwidth usage got moved to the "expert"
>> section, but it might be good to default the per-host cap to something
>> maybe T1-like (thus, perhaps 150KBps or so) to avoid pummelling sites
>> with large files, when crawling from a large pipe. (Is the per-host cap
>> like the total-bandwidth cap, in that it doesn't actually constrain
>> instantaneous traffic, only the average?)
I am not sure we need to do this. It seems that the default values of
dynamic politeness will significantly delay requests to sites with large
files when crawling from a large pipe.
During fetching we might be OK just relying on TCP's congestion control
and not worrying about saturating a site's bandwidth all by ourselves.
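The reasoning here is that a politeness delay scaled by the duration of
the previous fetch automatically spaces out requests to hosts that serve
large, slow downloads. A minimal Python sketch of that idea follows; the
function name and the numeric values are illustrative assumptions, not
the Heritrix defaults.

```python
def politeness_delay_ms(fetch_duration_ms, delay_factor=5.0,
                        min_delay_ms=2000, max_delay_ms=30000):
    """Wait time before the next request to the same host.

    The delay is the last fetch duration multiplied by delay_factor,
    clamped to the range [min_delay_ms, max_delay_ms]. All default
    values here are illustrative only.
    """
    delay = fetch_duration_ms * delay_factor
    return int(max(min_delay_ms, min(delay, max_delay_ms)))
```

Under these assumed values a 100 ms fetch leads to the 2000 ms minimum
wait, a 4 second fetch to a 20 second wait, and anything slower than
6 seconds hits the 30 second ceiling.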
Take care.
i.
Andy supplies attached patch.
Michael Stack
Configuration
None
Public
Date: 2007-03-14 01:31
Date: 2004-07-01 00:50 Logged In: YES
Date: 2004-07-01 00:46 Logged In: YES
Date: 2004-07-01 00:46 Logged In: YES
Date: 2004-07-01 00:45 Logged In: YES
| Filename | Description | Download |
|---|---|---|
| order.xml.patch | order file patch | Download |

| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2004-07-01 00:46 | stack-sf |
| close_date | - | 2004-07-01 00:46 | stack-sf |
| File Added | 92470: order.xml.patch | 2004-07-01 00:18 | stack-sf |