1. Summary
  2. Files
  3. Support
  4. Report Spam
  5. Create account
  6. Log in

Ticket #467 (closed defect: fixed)

Opened 16 months ago

Last modified 16 months ago

IllegalStateException trying to access lexicon index using RWStore with recycling

Reported by: thompsonbry Owned by: thompsonbry
Priority: major Milestone:
Component: RWStore Version: BIGDATA_RELEASE_1_0_4
Keywords: TransactionService Cc: thompsonbry

Description

We have a new failure mode for the RWStore when the recycler is enabled in the TestMROWTransactions stress test. This error is showing up in CI now that we have brought down the change set from the RWSTORE_1_1_0_DEBUG branch to the 1.0.x and 1.1.x maintenance branches. The error appears in both branches.

junit.framework.AssertionFailedError: Test failed: firstCause=java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.IllegalStateException, retentionMillis=1, nreaderThreads=16, nwriters=20, nreaders=400
	at com.bigdata.rdf.sail.TestMROWTransactions.domultiple_csem_transaction2(TestMROWTransactions.java:291)
	at com.bigdata.rdf.sail.TestMROWTransactionsWithHistory.test_multiple_csem_transaction_withHistory_stress(TestMROWTransactionsWithHistory.java:32)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.IllegalStateException
	at com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator.checkFuture(BlockingBuffer.java:1481)
	at com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator._hasNext(BlockingBuffer.java:1658)
	at com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator.hasNext(BlockingBuffer.java:1521)
	at com.bigdata.striterator.AbstractChunkedResolverator._hasNext(AbstractChunkedResolverator.java:327)
	at com.bigdata.striterator.AbstractChunkedResolverator.hasNext(AbstractChunkedResolverator.java:303)
	at com.bigdata.rdf.sail.TestMROWTransactions$1Reader.call(TestMROWTransactions.java:196)
	at com.bigdata.rdf.sail.TestMROWTransactions$1Reader.call(TestMROWTransactions.java:172)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator.checkFuture(BlockingBuffer.java:1434)
Caused by: java.lang.IllegalStateException
	at com.bigdata.rdf.lexicon.LexiconRelation.getId2TermIndex(LexiconRelation.java:802)
	at com.bigdata.rdf.lexicon.LexiconRelation.getTerms(LexiconRelation.java:1898)
	at com.bigdata.rdf.store.BigdataStatementIteratorImpl.resolveChunk(BigdataStatementIteratorImpl.java:166)
	at com.bigdata.rdf.store.BigdataStatementIteratorImpl.resolveChunk(BigdataStatementIteratorImpl.java:33)
	at com.bigdata.striterator.AbstractChunkedResolverator$ChunkConsumerTask.call(AbstractChunkedResolverator.java:232)
	at com.bigdata.striterator.AbstractChunkedResolverator$ChunkConsumerTask.call(AbstractChunkedResolverator.java:191)

The test is:

 public void test_multiple_csem_transaction_withHistory_stress() throws Exception {

        final Random r = new Random();
        
        for (int i = 0; i < 100; i++) {

            final int nreaderThreads = r.nextInt(19) + 1;
            
            log.warn("Trial: " + i + ", nreaderThreads=" + nreaderThreads);

            domultiple_csem_transaction2(1/* retentionMillis */,
                    nreaderThreads, 20/* nwriters */, 400/* nreaders */);

        }
        
    }

Change History

Changed 16 months ago by thompsonbry

Modified the stress test to include the IIndexManager object in the report. There are some indications that the recently observed failures might be in the EmbeddedFederation? mode and not related to the RWStore at all.

Committed revision r5944 (1.0.x and 1.1.x).

Changed 16 months ago by thompsonbry

Note: Martyn has observed this exception once against 1.0.x outside of CI.

Modified RWStore to enable the assert that commitPointsRecycled == commitPointsRemoved.

Added assert to RWStore that the lastCommitTime is strictly LT the latestReleasableTime (ie., that we never release the lastCommitTime).

Modified TestMROWTransactionsWithHistory to use a random retention time. The retention time range is (1,11,21,...,91).

Committed revision r5945 (1.0.x and 1.1.x).

Changed 16 months ago by thompsonbry

Raised log level to WARN when there is no commit record index for the given timestamp. This is on the code path that is being taken for [1].

Committed revision r5947 (1.0.x and 1.1.x).

Changed 16 months ago by thompsonbry

We have obtained a trace which clearly shows an error. The transaction 1328196805842 is opened and reads on commit time 1328196804421. The thread then sleeps for a few seconds.
{{
INFO : 6500 192.168.1.108 test-reader-pool17 com.bigdata.journal.Journal$1.activateTx(Journal.java:357): OPEN : txId=1328196805842, readsOnCommitTime=1328196804421
}}}
Before the reader thread resumes, we see the transaction service reporting a lastestReleasableTime of 1328196805295. This timestamp is AFTER the commit time on which the transaction is reading, which is 1328196804421. The RWStore then reaches back into the AbstractJournal? and correctly removes the commit record entries LT the lastestReleasableTime. This includes the commit record entry which reads on commitTime=1328196804421. However, that is the commit time on which our reader is reading.

INFO : 9000      test-writer-pool1 com.bigdata.rwstore.RWStore.checkDeferredFrees(RWStore.java:2503): lastCommitTime=1328196807014, latestReleasableTime=1328196805295, lastDeferredReleaseTime=1328196803374, activeTxCount=11
WARN : 9000      test-writer-pool1 com.bigdata.journal.AbstractJournal.removeCommitRecordEntries(AbstractJournal.java:4401): com.bigdata.journal.CommitRecordIndex$Entry@148e798{commitTime=1328196804421,addr=-211819197103706}

The reader wakes up and attempts to locate some indices. This fails because the commit record index entry for the commit time on which the reader is reading has been deleted. The first remaining commit record index entry is for AFTER the commit time on which the reader is reading.

WARN : 9172      test-reader-pool17 com.bigdata.journal.AbstractJournal.getCommitRecord(AbstractJournal.java:3351): No commit record: commitTime=1328196805842, lastReleaseTime=1328196804421
WARN : 9172      test-reader-pool17 com.bigdata.journal.AbstractJournal.getCommitRecord(AbstractJournal.java:3353): CommitRecordIndex: size=3, checkpoint=Checkpoint{indexType=BTree,height=0,nnodes=0,nleaves=1,nentries=3,counter=0,addrRoot=-281612415664016,addrMetadata=-12884901080,addrBloomFilter=0,addrCheckpoint=-35502199668516}, CRI.lastCommitTime=0
WARN : 9172      test-reader-pool17 com.bigdata.journal.AbstractJournal.getCommitRecord(AbstractJournal.java:3367): com.bigdata.journal.CommitRecordIndex$Entry@110278e{commitTime=1328196806280,addr=-212334593179226}
WARN : 9172      test-reader-pool17 com.bigdata.journal.AbstractJournal.getCommitRecord(AbstractJournal.java:3367): com.bigdata.journal.CommitRecordIndex$Entry@194e776{commitTime=1328196807014,addr=-212785564745306}
WARN : 9172      test-reader-pool17 com.bigdata.journal.AbstractJournal.getCommitRecord(AbstractJournal.java:3367): com.bigdata.journal.CommitRecordIndex$Entry@e80740{commitTime=1328196808233,addr=-213223651409498}

At this point the test case fails since it can not locate the desired application level index.

Changed 16 months ago by thompsonbry

  • owner changed from martyncutcher to thompsonbry
  • status changed from new to accepted

We have tracked down the problem above into a bug in AbstractTransactionService#updateReleaseTime?(long tx). That method is invoked out of IJournal#commit(long tx) and IJournal#abort(long tx).

The problem is that updateReleaseTime is using the txId rather than the readsOnCommitTime for the transaction when computing the new value of the releaseTime.

                earliestTxStartTime = startTimeIndex.decodeKey(startTimeIndex
                        .keyAt(0L));

....

            final long releaseTime = Math.min(lastCommitTime - 1, Math.min(
                    earliestTxStartTime - 1, now - minReleaseAge));

Thus, the releaseTime is using the txId, which is a positive integer GTE the commitTime on which the transaction is reading. This means that the transaction service is willing to report a releaseTime which is greater than the commit time on which some other open transaction may be reading. When this occurs, the other transaction will fail to locate the commit record for the commit time on which it is reading. This is the root cause of the exception for this ticket.

Changed 16 months ago by thompsonbry

  • keywords TransactionService added

Changed 16 months ago by thompsonbry

Martyn and I have talked through how to address this problem. The activeTx map has the necessary information in the TxState? object. However, the entry for a Tx is removed from this map prior to calling updateReleaseTime(). Further, deactivateTx() is invoked without the global lock so it is possible for more than one transaction to be removed from the activeTx map before updateReleaseTime() is called (the code is written in this manner to avoid some lock ordering problems). This means that in the case where 2 transactions commit or abort concurrently that we would be unable to recover the TxState? for one of them inside of updateReleaseTime(). While that could be taken as a signal that the transaction is no longer active and could be ignored, I am somewhat uncomfortable building that interaction into the code. Instead, my proposal is to modify the startTimeIndex.

The startTimeIndex currently stores the absolute value of the txId. The startTimeIndex ensures that we have a total order over the transaction identifiers and can easily report the oldest txId. Note that read-write transactions are assigned negative txIds while read-only transactions are assigned positive txIds. However, in both cases the txId is a timestamp GTE the commitTime on which the tx will read. The key to the startTimeIndex is the absolute value of the txId in order to place both read-only and read-write transaction identifiers into a single total ordering. The method findUnusedTimestamp() ensures that we never use the same absolute value for a read-only and a read-write transaction.

The suggested change is to associate the readsOnCommitTime with the absolute value of the transaction identifier in the startTimeIndex. This will give us the information that we need in updateReleaseTime() to compute the new release time based on the commit time that the transaction was reading on rather than the transaction id.

Changed 16 months ago by thompsonbry

Cloned CommitTimeIndex? to create TxId?2CommitTimeIndex and then modified the latter to provide a map from the txId to the commit time on which the transaction is reading. (CommitTimeIndex? is also used by the DistributedTransactionService? for a different purpose, which is why I cloned the class.)

Modified getCounters() for AbstractTransactionService? and DistributedTransactionService? to (a) declare the ICounterSetAccess interface; (b) NOT use the synchronized keyword and instead return a new CounterSet? instance for each request; and (c) to not use methods which require synchronized access to either the startTimeIndex (AbstractTransactionService?) or the commitTime index (DistributedTransactionService?).

Modified AbstractTransactionService#getStartTime?() to return 0L rather than -1 for readCommitTime when there are no commit points on the Journal.

Updated TestTransactionService? to correct two tests which were predicting the behavior demonsrated by the bug. I extended the test_newTx_readOnly_historyGone() significantly and also create a second variant of that test which covers the case when there are no more active transactions after a commit/abort.

The transaction service and RWJournal test suites are green.

Committed revision r5948 (1.0.x and 1.1.x).

Changed 16 months ago by thompsonbry

  • status changed from accepted to closed
  • resolution set to fixed

This has run clean through over 320 trials.

Note: See TracTickets for help on using tickets.