The DomainSensitiveFrontier adds an OrFilter to disable
any further fetching from a particular site after it has
pulled down the site quota. The add is throwing
ConcurrentModificationExceptions (CMEs). The issue was reported
by Rob Eger. Below is our dialog and a patch that
fixed the CME issue for him.
This is a low-priority issue, filed to hold a description
of the problem should we ever come across it again.
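For anyone landing here later, the failure mode and the copy-on-write remedy discussed in the thread can be reproduced in isolation. This is only a minimal sketch, not Heritrix code: the class name CmeSketch, its methods, and the "site-a"/"or-filter" strings are made up for illustration, it uses the JDK's java.util.concurrent.CopyOnWriteArrayList as a stand-in for the EDU.oswego.cs.dl.util.concurrent class named in the patch, and it triggers the exception from a single thread, whereas the crawler presumably hits it when one thread adds a filter while another is iterating.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CmeSketch {
    /**
     * Walks the list and, part-way through, appends an element, imitating a
     * filter being added while the attribute list is being iterated.
     * Returns true if the iteration was aborted by a CME.
     */
    public static boolean addWhileIterating(List<String> attributes) {
        try {
            for (String name : attributes) {
                if (name.equals("site-a")) {
                    // Hypothetical stand-in for adding the per-site OrFilter.
                    attributes.add("or-filter-for-site-a");
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Fills the given list with two illustrative entries and returns it. */
    public static List<String> seed(List<String> list) {
        list.add("site-a");
        list.add("site-b");
        return list;
    }

    public static void main(String[] args) {
        // ArrayList iterators are fail-fast, so the add breaks the loop.
        System.out.println("ArrayList threw CME: "
                + addWhileIterating(seed(new ArrayList<>())));
        // Copy-on-write iterators keep reading the snapshot they started with.
        System.out.println("CopyOnWriteArrayList threw CME: "
                + addWhileIterating(seed(new CopyOnWriteArrayList<>())));
    }
}
```

The trade-off is that every add copies the backing array, which is why it suits a list that is read far more often than it is written.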
From Rob:
FYI, ran through the entire 1000 URL crawl and no CME
alerts.
Now I need to figure out a way to keep the
RobegerFrontier counter from counting redirects as
actual documents, and only count the actual doc it gets
redirected to.
Thanks again for the fix,
Rob.
stack wrote:
> Rob Eger wrote:
>
>> Okay, now that I'm using the jar that actually contains the code
>> change, it seems to be working. I'll keep testing it and let you
>> know if anything else comes up, but with the test crawl I just
>> started it was hitting a lot of docs-per-site limits and no alerts
>> were coming up.
>>
>> Glad I could help, I appreciate you looking at it.
>
>
>
> Good news.
> St.Ack
>
>>
>> Rob.
>>
>> stack wrote:
>>
>>> Rob Eger wrote:
>>>
>>>> Made the change below, but still having the CME problem. If it
>>>> will requeue the URI that the CME occurred on, it might be okay
>>>> to just ignore for now, right?
>>>
>>>
>>>
>>>
>>>
>>> Dang. It's the same stack trace? (It should be slightly
>>> different -- at least the line numbers should disagree. Send one
>>> over and I'll take a look at it.)
>>>
>>> It doesn't look like it gets requeued. It looks like the thread
>>> might actually get killed. That's bad.
>>>
>>> Thanks for helping out with this.
>>> St.Ack
>>>
>>>
>>>>
>>>> Thanks,
>>>> Rob.
>>>>
>>>> stack wrote:
>>>>
>>>>> Rob Eger wrote:
>>>>>
>>>>>> They occur pretty consistently. What exactly would the effect be on the
>>>>>> crawl? I'm trying to think of a workaround (other than me watching page
>>>>>> counts in the crawl report and pausing/adding filters as they reach the
>>>>>> limit I want, or crawling one site at a time).
>>>>>>
>>>>>>
>>>>> Any URI that gets a CME will fail. They might get retried after
>>>>> an interval but I'd have to check.
>>>>>
>>>>>> Thanks for looking into it.
>>>>>>
>>>>>>
>>>>> Try this:
>>>>>
>>>>> Index: src/java/org/archive/crawler/settings/DataContainer.java
>>>>> ===================================================================
>>>>> RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/settings/DataContainer.java,v
>>>>> retrieving revision 1.2
>>>>> diff -u -r1.2 DataContainer.java
>>>>> --- src/java/org/archive/crawler/settings/DataContainer.java 28 May 2004 22:33:05 -0000 1.2
>>>>> +++ src/java/org/archive/crawler/settings/DataContainer.java 4 Jan 2005 06:10:17 -0000
>>>>> @@ -67,7 +67,8 @@
>>>>> super();
>>>>> this.settings = new WeakReference(settings);
>>>>> this.complexType = module;
>>>>> - attributes = new ArrayList();
>>>>> + attributes =
>>>>> + new EDU.oswego.cs.dl.util.concurrent.CopyOnWriteArrayList();
>>>>> attributeNames = new HashMap();
>>>>> }
>>>>>
>>>>>
>>>>> It swaps the list that is throwing the CME for a CopyOnWrite
>>>>> version -- one that keeps all outstanding iterators working on an
>>>>> unchanging copy of the list, at the cost of some extra memory
>>>>> (minor, I'd say, in this case unless you're crawling thousands of
>>>>> sites, and even then it's only at the end of a site crawl, at the
>>>>> time when the new filter is being added).
>>>>>
>>>>> How many sites are you getting at a time?
>>>>>
>>>>> St.Ack
>>>>>
>>>>>> Rob.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: stack [mailto:stack@archive.org]
>>>>>> Sent: Monday, January 03, 2005 8:34 PM
>>>>>> To: Rob Eger
>>>>>> Subject: Re: CME issue with RobegerFrontier
>>>>>>
>>>>>>
>>>>>> Rob Eger wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> St.Ack,
>>>>>>>
>>>>>>> I replied directly to you through the list interface itself, so I just
>>>>>>> wanted to email you directly to make sure it went through.
>>>>>>>
>>>>>>> Wasn't sure what you meant exactly, so any help would be appreciated.
>>>>>>> I'll make changes to my version here and try it out.
Nobody/Anonymous
None
1.4.2
Public
Date: 2007-03-14 00:19
Date: 2005-04-15 01:36 Logged In: YES
| Field | Old Value | Date | By |
|---|---|---|---|
| artifact_group_id | None | 2005-09-23 18:23 | gojomo |
| status_id | Open | 2005-04-15 01:36 | stack-sf |
| resolution_id | None | 2005-04-15 01:36 | stack-sf |
| close_date | - | 2005-04-15 01:36 | stack-sf |