Christian Kohlschuetter wrote:
> Hi,
>
> here's another feature which I would like to contribute.
>
> Currently, I am performing broad crawls using BroadScope/BdbFrontier.
> However, due to the number of host- or IP-keyed queues, an
> OutOfMemoryError occurs very quickly after starting the crawl. One
> reason for this is the RAM-based bookkeeping of subqueues -- the more
> queues, the more heap.
>
> I have worked around this by writing a BucketQueueAssignmentPolicy
> class, which produces a _fixed_ number of subqueues ("buckets"), not
> one per host or per IP. The queue key is computed by hashing the
> hostname (or the IP, if available) modulo N (a fixed number, such as
> 1000).
>
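The bucketing described above can be sketched in a few lines. This is a minimal illustration and not the attached patch: the class `BucketKeySketch`, the methods `bucketKey` and `distinctKeys`, and the generated `hostN.example.org` names are all made up for the example, and `Math.floorMod` (Java 8 and later) is my own choice of reduction.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: map any hostname or IP string to one of N bucket keys.
class BucketKeySketch {

    /** Returns a queue key in the range "0" .. (buckets - 1) as a string. */
    static String bucketKey(String hostOrIp, int buckets) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Integer.toString(Math.floorMod(hostOrIp.hashCode(), buckets));
    }

    /** Counts how many distinct keys a run of synthetic hostnames produces. */
    static int distinctKeys(int hosts, int buckets) {
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < hosts; i++) {
            keys.add(bucketKey("host" + i + ".example.org", buckets));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        // 100,000 synthetic hosts, but never more than 1000 queue keys.
        System.out.println(distinctKeys(100_000, 1000));
    }
}
```

However many hostnames are fed in, the number of keys is bounded by the bucket count, which is what keeps the per-queue bookkeeping fixed. The trade-off, presumably, is that unrelated hosts end up sharing a queue.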
> This way, I was able to increase the number of fetched pages from
> ca. 400,000 to 1,000,000. I still get OOMEs, but I think they are
> caused by a different problem -- the number of queues did not grow
> beyond the specified limit.
>
> Furthermore, I have modified AbstractFrontier so that an arbitrary
> queue assignment policy can be chosen, and replaced the current
> "ip-politeness" option with a select box.
>
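Choosing a policy by class name, as described above, comes down to a reflective lookup plus a type check. The following is a hedged sketch under my own assumptions and not Heritrix code: `PolicyLoaderSketch`, the `AssignmentPolicy` interface, and the two toy policies are hypothetical stand-ins for the crawler's real types, and it uses `getDeclaredConstructor().newInstance()` where the patch calls `Class.newInstance()`.

```java
// Hypothetical sketch: instantiate a policy from a configured class name.
class PolicyLoaderSketch {

    /** Stand-in for a queue assignment policy (not the Heritrix type). */
    interface AssignmentPolicy {
        String classKey(String host);
    }

    /** One queue per hostname. */
    static class HostPolicy implements AssignmentPolicy {
        public String classKey(String host) {
            return host;
        }
    }

    /** A fixed number of queues, keyed by hash bucket. */
    static class BucketPolicy implements AssignmentPolicy {
        public String classKey(String host) {
            return Integer.toString(Math.floorMod(host.hashCode(), 1000));
        }
    }

    /** Loads the named class and rejects anything that is not a policy. */
    static AssignmentPolicy load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(AssignmentPolicy.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Fail at configuration time rather than in the middle of a crawl.
            throw new IllegalArgumentException(
                    "Bad queue assignment policy class: " + className, e);
        }
    }

    public static void main(String[] args) {
        AssignmentPolicy policy = load(BucketPolicy.class.getName());
        System.out.println(policy.classKey("example.org"));
    }
}
```

Offering the class names as a fixed list of choices, as the select box does, means a misspelled name should rarely reach the loader, but the catch block still covers a class that is missing at runtime.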
> The patch against CVS HEAD is attached.
>
> Greetings,
> --
> Christian Kohlschütter
> mailto: ck -at- NewsClub.de
>
>Index: AbstractFrontier.java
>===================================================================
>RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/frontier/AbstractFrontier.java,v
>retrieving revision 1.25
>diff -u -r1.25 AbstractFrontier.java
>--- AbstractFrontier.java 1 Feb 2005 18:05:42 -0000 1.25
>+++ AbstractFrontier.java 14 Feb 2005 12:57:29 -0000
>@@ -111,10 +111,13 @@
>     public final static String ATTR_MAX_RETRIES = "max-retries";
>     protected final static Integer DEFAULT_MAX_RETRIES = new Integer(30);
>
>-    /** whether to reassign URIs to IP-address based queues when IP known */
>-    public final static String ATTR_IP_POLITENESS = "ip-politeness";
>-    // TODO: change default to true once well-tested
>-    protected final static Boolean DEFAULT_IP_POLITENESS = new Boolean(false);
>+    public final static String ATTR_QUEUE_ASSIGNMENT_POLICY = "queue-assignment-policy";
>+    private final static String[] AVAILABLE_QUEUE_ASSIGNMENT_POLICIES = new String[] {
>+        HostnameQueueAssignmentPolicy.class.getName(),
>+        IPQueueAssignmentPolicy.class.getName(),
>+        BucketQueueAssignmentPolicy.class.getName()
>+    };
>+    private final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = AVAILABLE_QUEUE_ASSIGNMENT_POLICIES[0];
>
>     /** queue assignment to force onto CrawlURIs; intended to be overridden */
>     public final static String ATTR_FORCE_QUEUE = "force-queue-assignment";
>@@ -202,11 +205,9 @@
>                 "limitation.",
>                 DEFAULT_MAX_HOST_BANDWIDTH_USAGE));
>         t.setExpertSetting(true);
>-        t = addElementToDefinition(new SimpleType(ATTR_IP_POLITENESS,
>-                "Whether to assign URIs to IP-address based queues "+
>-                "when possible, to remain polite on a per-IP-address "+
>-                "basis.",
>-                DEFAULT_IP_POLITENESS));
>+        addElementToDefinition(new SimpleType(ATTR_QUEUE_ASSIGNMENT_POLICY,
>+                "Defines how to assign URIs to queues.", DEFAULT_QUEUE_ASSIGNMENT_POLICY,
>+                AVAILABLE_QUEUE_ASSIGNMENT_POLICIES));
>         t.setExpertSetting(true);
>         t.setOverrideable(false);
>         t = addElementToDefinition(
>@@ -259,10 +260,16 @@
>             String logsPath = logsDisk.getAbsolutePath() + File.separatorChar;
>             this.recover = new RecoveryJournal(logsPath, LOGNAME_RECOVER);
>         }
>-        if(((Boolean)getUncheckedAttribute(null,ATTR_IP_POLITENESS)).booleanValue()) {
>-            queueAssignmentPolicy = new IPQueueAssignmentPolicy();
>-        } else {
>-            queueAssignmentPolicy = new HostnameQueueAssignmentPolicy();
>+        try {
>+            final Class qapClass = Class
>+                    .forName((String) getUncheckedAttribute(null,
>+                            ATTR_QUEUE_ASSIGNMENT_POLICY));
>+
>+            queueAssignmentPolicy = (QueueAssignmentPolicy) qapClass
>+                    .newInstance();
>+        } catch (Exception e) {
>+            logger.log(Level.SEVERE, "Bad queue assignment policy class", e);
>+            throw new FatalConfigurationException(e.getMessage());
>         }
>     }
>
>Index: BucketQueueAssignmentPolicy.java
>===================================================================
>RCS file: BucketQueueAssignmentPolicy.java
>diff -N BucketQueueAssignmentPolicy.java
>--- /dev/null 1 Jan 1970 00:00:00 -0000
>+++ BucketQueueAssignmentPolicy.java 1 Jan 1970 00:00:00 -0000
>@@ -0,0 +1,30 @@
>+package org.archive.crawler.frontier;
>+
>+import org.archive.crawler.datamodel.CrawlHost;
>+import org.archive.crawler.datamodel.CrawlURI;
>+import org.archive.crawler.framework.CrawlController;
>+
>+/**
>+* Uses the target IPs as basis for queue-assignment,
>+* distributing them over a fixed number of sub-queues.
>+*
>+* @author Christian Kohlschuetter
>+*/
>+public class BucketQueueAssignmentPolicy extends HostnameQueueAssignmentPolicy {
>+    private static final int DEFAULT_QUEUES_NOIP = 1000;
>+    private static final int DEFAULT_QUEUES_HOSTS = 1000;
>+
>+    public String getClassKey(CrawlController controller, CrawlURI curi) {
>+        CrawlHost host = controller.getServerCache().getHostFor(curi);
>+        if(host == null) {
>+            return "NO-HOST";
>+        } else if(host.getIP() == null) {
>+            return "NO-IP-".concat(Integer.toString(Math.abs(host.getHostName()
>+                    .hashCode())
>+                    % DEFAULT_QUEUES_NOIP));
>+        } else {
>+            return Integer.toString(Math.abs(host.getIP().hashCode())
>+                    % DEFAULT_QUEUES_HOSTS);
>+        }
>+    }
>+}
>
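One small observation on the `Math.abs(...hashCode()) % N` expression in the new class: `Math.abs(Integer.MIN_VALUE)` is still negative in Java, so a hash code of exactly that value would yield a negative key outside the intended range. It is a rare case and unlikely to explain the remaining OOMEs, but it is cheap to avoid. A minimal sketch follows; the class and method names are hypothetical, and `Math.floorMod` requires Java 8 or later, so it is an alternative for current JVMs and not something the patch could have used at the time.

```java
// Hypothetical sketch contrasting two ways to reduce a hash to a bucket index.
class BucketIndexSketch {

    /** Mirrors the expression used in the patch. */
    static int absIndex(int hash, int buckets) {
        return Math.abs(hash) % buckets;
    }

    /** Stays within 0 .. buckets - 1 for every int hash value. */
    static int floorIndex(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        int edgeCase = Integer.MIN_VALUE;
        System.out.println(absIndex(edgeCase, 1000));   // negative
        System.out.println(floorIndex(edgeCase, 1000)); // non-negative
    }
}
```

On an older JVM, masking the sign bit with `(hash & Integer.MAX_VALUE) % buckets` gives a non-negative index as well.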
Michael Stack
None
1.6.0
Public
Date: 2007-03-14 01:39
Date: 2005-05-07 01:17
Date: 2005-03-02 20:17

| Field | Old Value | Date | By |
|---|---|---|---|
| artifact_group_id | None | 2005-09-23 21:08 | gojomo |
| status_id | Open | 2005-05-07 01:17 | stack-sf |
| assigned_to | nobody | 2005-05-07 01:17 | stack-sf |
| close_date | - | 2005-05-07 01:17 | stack-sf |
| priority | 7 | 2005-03-02 20:17 | gojomo |