Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

[contribution] New fixed number of queues policy - ID: 1122692
Last Update: Comment added ( karl-ia )

Christian Kohlschuetter wrote:

> Hi,
>
> here's another feature which I would like to contribute.
>
> Currently, I am performing broad crawls using BroadScope/BdbFrontier. However,
> due to the number of host- or IP-keyed queues, an OutOfMemoryError occurs
> very quickly after starting the crawl. One reason for this is the RAM-based
> bookkeeping of subqueues -- the more queues, the more heap.
>
> I have worked around this by writing a BucketQueueAssignmentPolicy class, which
> produces a _fixed_ number of subqueues ("buckets"), not one per host or per
> IP. The queue key is computed by hashing the hostname (or the IP, if
> available) modulo N (a fixed number, such as 1000).
>
> This way, I was able to increase the number of fetched pages from ca. 400,000
> to 1,000,000. For some other reason, I still get OOMEs, but I think that is
> caused by a different problem -- the number of queues did not grow over the
> specified limit.
>
> Furthermore, I have modified AbstractFrontier to be able to choose arbitrary
> queue assignment policies and replaced the current "ip-politeness" option by a
> selectbox.
>
> The patch against CVS HEAD is attached.
>
> Greetings,
> --
> Christian Kohlschütter
> mailto: ck -at- NewsClub.de
>
>Index: AbstractFrontier.java
>===================================================================
>RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/frontier/AbstractFrontier.java,v
>retrieving revision 1.25
>diff -u -r1.25 AbstractFrontier.java
>--- AbstractFrontier.java 1 Feb 2005 18:05:42 -0000 1.25
>+++ AbstractFrontier.java 14 Feb 2005 12:57:29 -0000
>@@ -111,10 +111,13 @@
>     public final static String ATTR_MAX_RETRIES = "max-retries";
>     protected final static Integer DEFAULT_MAX_RETRIES = new Integer(30);
>
>-    /** whether to reassign URIs to IP-address based queues when IP known */
>-    public final static String ATTR_IP_POLITENESS = "ip-politeness";
>-    // TODO: change default to true once well-tested
>-    protected final static Boolean DEFAULT_IP_POLITENESS = new Boolean(false);
>+    public final static String ATTR_QUEUE_ASSIGNMENT_POLICY = "queue-assignment-policy";
>+    private final static String[] AVAILABLE_QUEUE_ASSIGNMENT_POLICIES = new String[] {
>+        HostnameQueueAssignmentPolicy.class.getName(),
>+        IPQueueAssignmentPolicy.class.getName(),
>+        BucketQueueAssignmentPolicy.class.getName()
>+    };
>+    private final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = AVAILABLE_QUEUE_ASSIGNMENT_POLICIES[0];
>
>     /** queue assignment to force onto CrawlURIs; intended to be overridden */
>     public final static String ATTR_FORCE_QUEUE = "force-queue-assignment";
>@@ -202,11 +205,9 @@
>                 "limitation.",
>                 DEFAULT_MAX_HOST_BANDWIDTH_USAGE));
>         t.setExpertSetting(true);
>-        t = addElementToDefinition(new SimpleType(ATTR_IP_POLITENESS,
>-                "Whether to assign URIs to IP-address based queues "+
>-                "when possible, to remain polite on a per-IP-address "+
>-                "basis.",
>-                DEFAULT_IP_POLITENESS));
>+        t = addElementToDefinition(new SimpleType(ATTR_QUEUE_ASSIGNMENT_POLICY,
>+                "Defines how to assign URIs to queues.", DEFAULT_QUEUE_ASSIGNMENT_POLICY,
>+                AVAILABLE_QUEUE_ASSIGNMENT_POLICIES));
>         t.setExpertSetting(true);
>         t.setOverrideable(false);
>         t = addElementToDefinition(
>@@ -259,10 +260,16 @@
>             String logsPath = logsDisk.getAbsolutePath() + File.separatorChar;
>             this.recover = new RecoveryJournal(logsPath, LOGNAME_RECOVER);
>         }
>-        if(((Boolean)getUncheckedAttribute(null,ATTR_IP_POLITENESS)).booleanValue()) {
>-            queueAssignmentPolicy = new IPQueueAssignmentPolicy();
>-        } else {
>-            queueAssignmentPolicy = new HostnameQueueAssignmentPolicy();
>+        try {
>+            final Class qapClass = Class
>+                    .forName((String) getUncheckedAttribute(null,
>+                            ATTR_QUEUE_ASSIGNMENT_POLICY));
>+
>+            queueAssignmentPolicy = (QueueAssignmentPolicy) qapClass
>+                    .newInstance();
>+        } catch (Exception e) {
>+            logger.log(Level.SEVERE, "Bad queue assignment policy class", e);
>+            throw new FatalConfigurationException(e.getMessage());
>         }
>     }
>
>Index: BucketQueueAssignmentPolicy.java
>===================================================================
>RCS file: BucketQueueAssignmentPolicy.java
>diff -N BucketQueueAssignmentPolicy.java
>--- /dev/null 1 Jan 1970 00:00:00 -0000
>+++ BucketQueueAssignmentPolicy.java 1 Jan 1970 00:00:00 -0000
>@@ -0,0 +1,30 @@
>+package org.archive.crawler.frontier;
>+
>+import org.archive.crawler.datamodel.CrawlHost;
>+import org.archive.crawler.datamodel.CrawlURI;
>+import org.archive.crawler.framework.CrawlController;
>+
>+/**
>+* Uses the target IPs as basis for queue-assignment,
>+* distributing them over a fixed number of sub-queues.
>+*
>+* @author Christian Kohlschuetter
>+*/
>+public class BucketQueueAssignmentPolicy extends HostnameQueueAssignmentPolicy {
>+    private static final int DEFAULT_QUEUES_NOIP = 1000;
>+    private static final int DEFAULT_QUEUES_HOSTS = 1000;
>+
>+    public String getClassKey(CrawlController controller, CrawlURI curi) {
>+        CrawlHost host = controller.getServerCache().getHostFor(curi);
>+        if(host == null) {
>+            return "NO-HOST";
>+        } else if(host.getIP() == null) {
>+            return "NO-IP-".concat(Integer.toString(Math.abs(host.getHostName()
>+                    .hashCode())
>+                    % DEFAULT_QUEUES_NOIP));
>+        } else {
>+            return Integer.toString(Math.abs(host.getIP().hashCode())
>+                    % DEFAULT_QUEUES_HOSTS);
>+        }
>+    }
>+}
>
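
For readers who want to try the bucketing idea outside Heritrix, here is a minimal standalone sketch. The class BucketKeys and its methods are hypothetical names, not part of the patch or of Heritrix. The patch reduces the hash with Math.abs(...) % N; Math.abs returns a negative value for Integer.MIN_VALUE, so this sketch assumes Java 8 or later and uses Math.floorMod instead. On an older JVM, (hash & Integer.MAX_VALUE) % N should give a non-negative index as well.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Heritrix. */
final class BucketKeys {
    private BucketKeys() {}

    /** Reduces a hash code to an index in [0, bucketCount). */
    static int bucketIndex(int hash, int bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        // floorMod never returns a negative value for a positive modulus,
        // even for Integer.MIN_VALUE, where Math.abs would stay negative.
        return Math.floorMod(hash, bucketCount);
    }

    /** Maps a hostname or IP string to one of bucketCount queue keys. */
    static String bucketKey(String hostOrIp, int bucketCount) {
        if (hostOrIp == null) {
            return "NO-HOST";
        }
        return Integer.toString(bucketIndex(hostOrIp.hashCode(), bucketCount));
    }

    public static void main(String[] args) {
        // Many distinct hosts still yield at most 1000 distinct queue keys.
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add(bucketKey("host" + i + ".example.org", 1000));
        }
        System.out.println("distinct queue keys: " + keys.size());
    }
}
```

The trade-off is that unrelated hosts can land in the same bucket, so anything keyed on the queue applies to the whole bucket rather than to a single host.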

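The AbstractFrontier change instantiates the queue assignment policy from a configured class name. A rough standalone sketch of that pattern follows. QueuePolicy, PerHostPolicy and PolicyLoader are hypothetical stand-ins, not Heritrix classes, and the generic reflection calls assume a newer Java than the 2005 patch targeted.

```java
/** Hypothetical stand-in for a queue assignment policy interface. */
interface QueuePolicy {
    String classKey(String host);
}

/** Simplest possible policy: one queue per hostname. */
final class PerHostPolicy implements QueuePolicy {
    public String classKey(String host) {
        return host;
    }
}

/** Hypothetical loader; not part of Heritrix. */
final class PolicyLoader {
    private PolicyLoader() {}

    /** Instantiates a policy by class name via its no-argument constructor. */
    static QueuePolicy load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(QueuePolicy.class) // rejects classes of the wrong type
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Fail at configuration time with the offending class name.
            throw new IllegalArgumentException(
                    "Bad queue assignment policy class: " + className, e);
        }
    }

    public static void main(String[] args) {
        QueuePolicy policy = load(PerHostPolicy.class.getName());
        System.out.println(policy.classKey("example.org")); // prints "example.org"
    }
}
```

Checking the type with asSubclass before instantiation means a misconfigured name fails before any object of the wrong class is created.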

Submitted: Michael Stack ( stack-sf ) - 2005-02-14 21:25
Priority: 6
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:39
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-901 -- please add further
comments at that location.


Date: 2005-05-07 01:17
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Patch applied as part of '[ 1176934 ] [contrib] Generalize/Refactor
BDB Frontier'. See there for the commit message.


Date: 2005-03-02 20:17
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Deferring, pending consideration of how to make it play
nicely with retries/errors/politeness/per-host-settings.


Attached File

No Files Currently Attached

Changes ( 5 )

Field              Old Value  Date              By
artifact_group_id  None       2005-09-23 21:08  gojomo
status_id          Open       2005-05-07 01:17  stack-sf
assigned_to        nobody     2005-05-07 01:17  stack-sf
close_date         -          2005-05-07 01:17  stack-sf
priority           7          2005-03-02 20:17  gojomo