Share

Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 NPE je-2.0 entryToObject(SerialBinding.java:82) - ID: 1219854
Last Update: Comment added ( karl-ia )

Seen below on 32-bit xeon and on 64-bit opteron in
heritrix_out.log running infiniteurl.

Exception in thread "ToeThread #197"
java.lang.NullPointerException
at java.lang.System.arraycopy(Native Method)
at
com.sleepycat.bind.serial.SerialBinding.entryToObject(SerialBinding.java:82
)
at
org.archive.crawler.frontier.BdbMultipleWorkQueues.get(BdbMultipleWorkQueue
s.java:228)
at
org.archive.crawler.frontier.BdbWorkQueue.peekItem(BdbWorkQueue.java:91)
at
org.archive.crawler.frontier.WorkQueue.peek(WorkQueue.java:144)
at
org.archive.crawler.frontier.WorkQueueFrontier.next(WorkQueueFrontier.java:
505)
at
org.archive.crawler.framework.ToeThread.run(ToeThread.java:141)


Michael Stack ( stack-sf ) - 2005-06-13 18:04

9

Closed

Fixed

Karl Thiessen

None

1.6.0

Public


Comments ( 12 )

Date: 2007-03-14 00:54
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-440 -- please add further
comments at that location.


Date: 2005-07-22 20:49
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

2.0.54 bdbje looks like it has a fix for this issue. I
can't reproduce after running in infiniteurl on dev02 with
define 'export LD_ASSUME_KERNEL=2.4.1' (This used reproduce
the NOTFOUND on every 3rd run or so using
je-2_0_42_archive_org_12562.jar, the one that the below
commit is replacing, and on every run after a few minutes
using even earlier versions than
je-2_0_42_archive_org_12562.jar). Assigning Karl to test
and close.

Fix for '[ 1219854 ] NPE je-2.0
entryToObject(SerialBinding.java:82)'
Commit of 2.0.54 bdbje. Has fixes for our 1219854 and for
the jmx serialization issue we reported (I ran it on dev02
for an hour in the worst scenario, with use 2.4 kernel
define, and no 'NOT FOUND' messages reported to
heritrix_out.log). From its release notes
http://www.sleepycat.com/je_update/2.0.54/if.2.0.54.html:

# In rare circumstances, an application could receive a
DatabaseException with the exception message "*** 12562 ***
not owned after acquire ...". This was due to the fact that
JE's internal latching implementation did not take into
account the possibility that a thread waiting on a monitor
can receive a spurious wakeup, as described in JDK 1.5
Object.wait(). JE now detects and handles that situation.
[#12562](2.0.54)

# In rare cases, when the application is actively deleting
records, a call to Cursor.getSearchKeyRange() could
incorrectly return OperationStatus.NOTFOUND. [#12736](2.0.54)

# Made several statistics classes Serializable so that they
can be returned when JE is invoked over over RMI. One such
use case is when JMX is used over RMI. [#12766](2.0.54)



Date: 2005-07-22 20:48
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

2.0.54 bdbje looks like it has a fix for this issue. I
can't reproduce after running in infiniteurl on dev02 with
define 'export LD_ASSUME_KERNEL=2.4.1' (This used reproduce
the NOTFOUND on every 3rd run or so using
je-2_0_42_archive_org_12562.jar, the one that the below
commit is replacing, and on every run after a few minutes
using even earlier versions than
je-2_0_42_archive_org_12562.jar). Assigning Karl to test
and close.

Fix for '[ 1219854 ] NPE je-2.0
entryToObject(SerialBinding.java:82)'
Commit of 2.0.54 bdbje. Has fixes for our 1219854 and for
the jmx serialization issue we reported (I ran it on dev02
for an hour in the worst scenario, with use 2.4 kernel
define, and no 'NOT FOUND' messages reported to
heritrix_out.log). From its release notes
http://www.sleepycat.com/je_update/2.0.54/if.2.0.54.html:

# In rare circumstances, an application could receive a
DatabaseException with the exception message "*** 12562 ***
not owned after acquire ...". This was due to the fact that
JE's internal latching implementation did not take into
account the possibility that a thread waiting on a monitor
can receive a spurious wakeup, as described in JDK 1.5
Object.wait(). JE now detects and handles that situation.
[#12562](2.0.54)

# In rare cases, when the application is actively deleting
records, a call to Cursor.getSearchKeyRange() could
incorrectly return OperationStatus.NOTFOUND. [#12736](2.0.54)

# Made several statistics classes Serializable so that they
can be returned when JE is invoked over over RMI. One such
use case is when JMX is used over RMI. [#12766](2.0.54)



Date: 2005-07-14 00:01
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

(formally assigning to stack since he's been working on this)


Date: 2005-07-06 05:27
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

New jar from sleepycat. Below is commit after testing.

New jar to address '[ 1219854 ] NPE je-2.0
entryToObject(SerialBinding.java:82)'
Sleepycat sent us new jar this evenign. Ran 3 tests over on
dev02: One w/ oldjar, two w/ new. Run w/ old jar turned up
41 retry attempts running for 30minutes. First run w/ new
jar produced one only. Second run of 30 minutesturned up no
instances. Checking in to see how does in production crawls.
* .classpath
* project.properties
Refer to new je jar.
* lib/je-12736-07-05-05.jar
Added. Replaces...
* lib/je-2_0_42_archive_org_12562.jar
Removed.


Date: 2005-06-17 18:58
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Tried HEAD code on opterontest2 w/ kernel 2.4 flag. Here is
what output now looks like:


06/17/2005 18:47:06 +0000 SEVERE
org.archive.crawler.frontier.BdbMultipleWorkQueues get See
'1219854 NPE je-2.0 entryToObject...'. OperationStatus was
not SUCCESS: OperationStatus.NOTFOUND, headKey b5e3d20fa9fb87e4
06/17/2005 18:47:06 +0000 SEVERE
org.archive.crawler.frontier.BdbWorkQueue peekItem Trying
get #2 in queue 180.localhost#8081 with 829 items using key
b5e3d20fa9fb87e4


Only one instance in first test run

Two in second test run but some time in betwixt. They don't
come in groups as was said earlier (Turned on logging of
queue creation; this puts timestamps between the instances
so I can see now they're not actually grouped)..


Date: 2005-06-17 03:58
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

More debug output.checking/refactoring/recovery. All
unexpected states counter to design are logged SEVERE;
however, even a failure after all retries simply returns a
null. This should cause queue to be mistaken for empty -- a
more recoverable situation than falling out of rotation
entirely. Commit comment:

Debug work for [ 1219854 ] NPE je-2.0
entryToObject(SerialBinding.java:82)
* BdbMultipleWorkQueues.java
move get() retries out to caller (where more useful info
is available) -- but 'severe'ly log every error
change calculateInsertKey() to make first 64 bits purely
classKey-based
add logging if item targetted by delete() not present
* BdbWorkQueue.java
truncate 'origin' key to 8 bytes (no need for zero padding)
in peekItem, retry get() multiple times, logging each
try or eventual failure 'severe'ly
* WorkQueue.java
synchronize getCount() access to long count
* ArchiveUtils.java
utility function to check a byte array is a prefix of
another




Date: 2005-06-17 03:55
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Upping priority. More serious problems here.

Upon failures pre-mitigating code, affected queue was left
in 'purgatory' state -- no longer in normal frontier
rotation, but still in allQueues and able to accept new
URIs. Nothing that I can see would jar it back into
rotation... so for every one of these errors encountered,
we've got a living-dead queue.

Also: if for whatever reason some entries that should be in
a certain key range aren't momentarily, we'd only see the
failure status if hitting the 'end of the database (from the
last some later queue-range). Earlier in the keyspace,
missing entries would just cause an entry to be grabbed from
the next queue range. This would likely cause strange
effects and corruption as the frontier notes it's received a
CrawlURI that should have been on another queue (as if in
the IP politeness case) -- and then tries to delete it from
where it was and reinsert it where it should have been.

Adding more/different debug output, status-checking, retry,
and recovery code...


Date: 2005-06-16 18:21
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Committed mitigating code. See message below. Lowering
priority.

Committing lookup retry to mitigate effects of '[ 1219854 ]
NPE je-2.0 entryToObject(SerialBinding.java:82)'
* src/conf/heritrix.properties
Show in heritrix_out.log retries on queue lookup so we
can keep an eye
on frequency (We used to show an exception in logs when
this occurred).
Added:
org.archive.crawler.frontier.BdbMultipleWorkQueues.level = INFO
*
src/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java
(LOGGER): Added.
(get): Refactored so easy to do retry.
(getNextNearestItem): Former guts of 'get' moved here to
facilitate retry.
(tallyAverageEntrySize): Commented it in. Put it behind a
check for Level.FINE so easy to enable this useful stat.
* src/java/org/archive/crawler/frontier/BdbWorkQueue.java
Javadoc. Added optional logging of inserts.
(getKeyPrefixHex): Added. Utility method.
* src/java/org/archive/crawler/frontier/WorkQueue.java
Javadoc.



Date: 2005-06-16 17:46
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

A retry immediately after the fail succeeds. Going to
commit logged retries as workaround till we get substance
back from sleepycat folks. Below is what I sent them:

Linda Lee wrote:

>
> St.Ack,
>
>> A very occasional call to cursor.getSearchKeyRange
returns OperationStatus.NOTFOUND though I know for sure the
lookup should succeed. Odd is that if I turn around and
immediately retry, the query returns
OperationStatus.SUCCESS. Please advise.
>>
>> I'm currently experiencing this behavior using je-2.0
prerelease but we're fairly certain we've experienced the
same issue w/ 1.7.x.
>
>
> Hmm, we'll start with some basic questions about usage
characteristics,
> and go from there to figure out how to approach the problem.

I should have provided more detail on submission.

>
> - Is the data static, or is it being modified during this
retrieval period?
> By modified, I mean:
> - are there any new records inserted

> - are records updated

> - are records deleted

For sure on all above -- deletes less often.

>
> - Are you using transactions? (I think not)

You are right.

> - If it's the snippet of code I saw before, the cursor is
brand-new, and is
> allocated locally for this method?
>
Its same basic code, refactored some. Now I check status as
you suggested.

If I see it come back NOT_FOUND, I redo the query even
allocating a new cursor.

I've pasted code on the end.

> - Have you noticed anything interesting about the key
values that have this problem?

Nothing obvious.

We're trying to find the beginning of a range. In my
tracing output I see that there are actually hundreds of
entries inside the range I'm doing a search for so it
'should' succeed.

>
> - What other types of access are being made to the
database at the time?
>
The above cited insert and updates -- rare deletes.

I see this on 32-bit xeon and on dual opteron in 32-bit and
64-bit mode. For sure its a rare occurrance. We're using
the pre-release 2.0 that Charles gave us. I'm fairly sure
we've seen this in 1.7.x because ye told me ye saw the NPEs
in logs I'd sent ye. I have a perverse test scenario that
can trigger the issue by running on a 64-bit 2.6 opteron
setting LD_ASSUME_KERNEL=2.4.19 on our application --
least, the manifestation is the same. In this scenario, an
immediate retry gets what I know to be in the db.

I'm going to commit a retry of the lookup after failure to
our HEAD code. That the second lookup works make this issue
low priority for us.

Some other notes:

+ On 64-bit linux and java: We've given up on trying to make
this work, at least for the moment. Running our rabid test
case on 64-bit dual opteron, the crawler OOME's after
minutes, no matter how much of a heap whereas in 32-bit
mode, it'll run days and days.
+ The later, custom 1.7.x ye sent us (12562 related) would
OOME running our rabid test case on 32-bit linux after a
time, whereas earlier 1.7.x versions did not do this.
Moving to 2.0 prerelease, the OOME'ing went away.

Here's the code as it is now:

public CrawlURI get(DatabaseEntry headKey)
throws DatabaseException {
DatabaseEntry result = new DatabaseEntry();

// From Linda Lee of sleepycat:
// "You want to check the status returned from
Cursor.getSearchKeyRange
// to make sure that you have
OperationStatus.SUCCESS. In that case,
// you have found a valid data record, and
result.getData()
// (called by internally by the binding code, in
this case) will be
// non-null. The other possible status return is
// OperationStatus.NOTFOUND, in which case no data
record matched
// the criteria. "
OperationStatus status = getNextNearestItem(headKey,
result);
CrawlURI retVal = null;
if (status != OperationStatus.SUCCESS) {
for (int i = 0; i < 10; i++) {
status = getNextNearestItem(headKey, result);
LOGGER.severe("Cycle " + Integer.toString(i) +
" status " + status + " " +

BdbWorkQueue.getHeadKeyHex(headKey.getData()));
if (status == OperationStatus.SUCCESS) {
break;
}
}
if (status != OperationStatus.SUCCESS) {
throw new DatabaseException("See '1219854
NPE je-2.0 "
+ "entryToObject...'. OperationStatus "
+ " was not SUCCESS: "
+ status
+ ", headKey "
+
BdbWorkQueue.getHeadKeyHex(headKey.getData()));
}
}
retVal =
(CrawlURI)crawlUriBinding.entryToObject(result);
retVal.setHolderKey(headKey);
return retVal;
}

protected OperationStatus
getNextNearestItem(DatabaseEntry headKey,
DatabaseEntry result)
throws DatabaseException {
Cursor cursor = this.pendingUrisDB.openCursor(null,
null);
OperationStatus status =
cursor.getSearchKeyRange(headKey, result, null);
cursor.close();
return status;
}



Date: 2005-06-15 14:18
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Changed code to check for status on getSearchKeyRange. Now
exception looks like below (From crawling014 this morning
doing the AUS crawl). Seems to come in a flurry.

java.io.IOException: See '1219854 NPE je-2.0
entryToObject...'. OperationStatus was not SUCCESS:
OperationStatus.NOTFOUND at
org.archive.crawler.frontier.BdbWorkQueue.peekItem(BdbWorkQueue.java:93)
at
org.archive.crawler.frontier.WorkQueue.peek(WorkQueue.java:144)
at
org.archive.crawler.frontier.WorkQueueFrontier.next(WorkQueueFrontier.java:515)
at
org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)Caused
by: com.sleepycat.je.DatabaseException: See '1219854 NPE
je-2.0 entryToObject...'. OperationStatus was not SUCCESS:
OperationStatus.NOTFOUND at
org.archive.crawler.frontier.BdbMultipleWorkQueues.get(BdbMultipleWorkQueues.java:241)
at
org.archive.crawler.frontier.BdbWorkQueue.peekItem(BdbWorkQueue.java:91)
... 3 moreException in thread "ToeThread #98: "
java.lang.RuntimeException: java.io.IOException: See
'1219854 NPE je-2.0 entryToObject...'. OperationStatus was
not SUCCESS: OperationStatus.NOTFOUND at
org.archive.crawler.frontier.WorkQueue.peek(WorkQueue.java:148)
at
org.archive.crawler.frontier.WorkQueueFrontier.next(WorkQueueFrontier.java:515)
at
org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)Caused
by: java.io.IOException: See '1219854 NPE je-2.0
entryToObject...'. OperationStatus was not SUCCESS:
OperationStatus.NOTFOUND at
org.archive.crawler.frontier.BdbWorkQueue.peekItem(BdbWorkQueue.java:93)
at
org.archive.crawler.frontier.WorkQueue.peek(WorkQueue.java:144)
... 2 moreCaused by: com.sleepycat.je.DatabaseException:
See '1219854 NPE je-2.0 entryToObject...'. OperationStatus
was not SUCCESS: OperationStatus.NOTFOUND at
org.archive.crawler.frontier.BdbMultipleWorkQueues.get(BdbMultipleWorkQueues.java:241)
at
org.archive.crawler.frontier.BdbWorkQueue.peekItem(BdbWorkQueue.java:91)
... 3 more

Bdb people say that this exception was in logs we gave them
from 1.7.1 output.

This exception seems to come easier when export
LD_ASSUME_KERNEL=2.4.19 is set.


Date: 2005-06-13 22:09
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

From sleepycat peoples:



>
>
> Here is the code:
>
>
> /**
> * Get the next nearest item after the given key. Relies on
> * external discipline to avoid asking for something
from an
> * origin where there are no associated items -- otherwise
> * could get first item of next 'queue' by mistake.
> *
> * TODO: hold within a queue's range
> *
> * @param headKey
> * @return CrawlURI.
> * @throws DatabaseException
> */
> public CrawlURI get(DatabaseEntry headKey) throws
DatabaseException {
> DatabaseEntry result = new DatabaseEntry();
> Cursor cursor = pendingUrisDB.openCursor(null,null);
> cursor.getSearchKeyRange(headKey, result, null);
> cursor.close();
> CrawlURI retVal = (CrawlURI)
crawlUriBinding.entryToObject(result);
> retVal.setHolderKey(headKey);
> return retVal;
> }
>
> Its the 'CrawlURI retVal = (CrawlURI)
crawlUriBinding.entryToObject(result);' that is doing the
NPE. Should I be checking status on the new DE call? To
see if no object created? On the cursor operation?


You want to check the status returned from
Cursor.getSearchKeyRange to make sure that you have
OperationStatus.SUCCESS. In that case, you have found a
valid data record, and result.getData() (called by
internally by the binding code, in this case) will be
non-null. The other possible status return is
OperationStatus.NOTFOUND, in which case no data record
matched the criteria.

This is rather un-Java like, to return a status code rather
than an exception, but uh, let me assure you that there was
a long and colorful Sleepycat-internal debate about
compatibility with the C product, etc, that lies behind this
particular api choice.

This may lead you on to a second question, as to why the
record doesn't exist, but it does explain the null pointer
exception.

Linda


I committed our throwing an exception if this happens
reporting OperationStatus.


Attached File

No Files Currently Attached

Changes ( 8 )

Field Old Value Date By
close_date - 2005-12-02 17:14 stack-sf
status_id Open 2005-12-02 17:14 stack-sf
artifact_group_id None 2005-09-23 18:29 gojomo
assigned_to stack-sf 2005-07-22 20:48 stack-sf
resolution_id None 2005-07-22 20:48 stack-sf
assigned_to nobody 2005-07-14 00:01 gojomo
priority 5 2005-06-17 03:58 gojomo
priority 6 2005-06-16 18:21 stack-sf