When the disk is full, sqlite returns an error:
Sep 18 13:42:02 SC-2 osafimmpbed: ER SQL statement ('COMMIT TRANSACTION') failed because: disk I/O error
Sep 18 13:42:02 SC-2 osafimmnd[13067]: NO Invalid error reported implementer 'OpenSafImmPBE', Ccb 321 will be aborted
Sep 18 13:42:02 SC-2 osafimmnd[13067]: NO Ccb 321 ABORTED (TraceC)
Sep 18 13:42:02 SC-2 osafimmpbed: WA Failed to find CCB object for 141/321
Due to continuous CCB operations (even though the disk is full), the 1PBE logs the following messages for more than 3 hours:
messages:Sep 18 17:58:46 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:46 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:47 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:47 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:22 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:23 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:23 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:24 SC-2 osafimmpbed: WA Sqlite db locked by other thread
After freeing the space, the PBE is still stuck in "Sqlite db locked by other thread".
This prevents any further operations.
Once the PBE is killed, imm.db is regenerated and the CCB operations are applied.
Solution (1PBE):
For the 1PBE case, which is not multi-threaded, if the "sqlite db locked" case is reached, abort the PBE and let it be regenerated (instead of blocking the PBE process).
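The proposal above can be sketched roughly as follows. This is not the actual patch and not osafimmpbed code: `PbeMode`, `LockAction`, `on_lock_state` and `handle_lock_state` are hypothetical names, and the 2PBE branch assumes that in 2PBE the lock may be held legitimately, so retrying is reasonable there.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical names throughout; not the actual osafimmpbed code or patch.
enum class PbeMode { OnePbe, TwoPbe };
enum class LockAction { Proceed, RetryLater, AbortProcess };

// Decide what to do when commit processing finds the PBE lock in a given state.
LockAction on_lock_state(PbeMode mode, bool locked) {
  if (!locked) return LockAction::Proceed;
  // Assumption: in 2PBE the lock can be held legitimately, so retry later.
  if (mode == PbeMode::TwoPbe) return LockAction::RetryLater;
  // In 1PBE nothing else will release the lock; waiting would block forever.
  return LockAction::AbortProcess;
}

// Act on the decision: in 1PBE, exit so the PBE can be restarted.
void handle_lock_state(PbeMode mode, bool locked) {
  if (on_lock_state(mode, locked) == LockAction::AbortProcess) {
    assert(mode == PbeMode::OnePbe);  // only the 1PBE path aborts
    std::abort();
  }
}
```

Keeping the decision in a pure function separate from the `std::abort()` call makes the policy checkable without killing the process.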
Diff:
Question: How can this happen in the 1PBE case, when only one user thread is using the sqlite instance?
Another relevant question is why/when do you observe this now?
The test case or test setup must be special somehow.
With only one thread this case should be impossible.
It suggests heap corruption could be the cause.
Some years ago we did see problems, although not exactly of this kind, in conjunction with
repeated failovers, where the new PBE managed to start while the old PBE (on the other SC) was
still executing (slow to terminate). But the distributed file-level protection uses file system locking,
and the symptoms should be different.
I guess it could be that the PBE-level message "Sqlite db locked by other thread" is plain wrong,
i.e. misleading.
I looked at the code and the error message is correct but the "lock" is the PBE "spin lock" created
for handling 2PBE. The fact that it finds it locked in 1PBE means there is a logical bug somewhere
in 1PBE.
Most likely some error case where there is a bailout from commit processing without correct cleanup.
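A minimal sketch of the suspected failure mode, assuming the hypothesis above is right (the bug itself was not located). `PbeLock`, `PbeLockGuard`, `commit_leaky` and `commit_guarded` are hypothetical stand-ins, not names from the PBE code. The first function shows how an early return on a failed commit could leave the lock set; the second shows a scope guard that releases it on every exit path.

```cpp
#include <cassert>

// Hypothetical stand-in for the PBE "spin lock" flag; single-threaded as in 1PBE.
struct PbeLock {
  bool locked = false;
};

// Scope guard: takes the lock on construction, releases it on any scope exit.
class PbeLockGuard {
 public:
  explicit PbeLockGuard(PbeLock& lock) : lock_(lock) {
    assert(!lock_.locked && "lock leaked by an earlier bailout");
    lock_.locked = true;
  }
  ~PbeLockGuard() { lock_.locked = false; }
  PbeLockGuard(const PbeLockGuard&) = delete;
  PbeLockGuard& operator=(const PbeLockGuard&) = delete;

 private:
  PbeLock& lock_;
};

// Suspected buggy shape: the error path returns without unlocking.
bool commit_leaky(PbeLock& lock, bool commit_ok) {
  lock.locked = true;
  if (!commit_ok) return false;  // bailout: lock stays set
  lock.locked = false;
  return true;
}

// Same flow with the guard: the destructor runs on the error path too.
bool commit_guarded(PbeLock& lock, bool commit_ok) {
  PbeLockGuard guard(lock);
  if (!commit_ok) return false;  // lock is still released here
  return true;
}
```

After a failed `commit_leaky` call the flag remains set, so the next commit attempt would see the lock as held even though no other thread exists. That would match the repeated "locked by other thread" warnings.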
Changed ticket slogan to describe the problem.
I nack'ed the patch because the imm service already has a restart mechanism for the PBE if
it gets stuck and the symptom shown here must result from a bug (if this truly is on 1PBE).
If there is not enough information to locate the bug, then the problem needs to be reproduced
with trace.
If it cannot be reproduced, then we close the ticket as not reproducible.
Since the problem is not reproducible, closing the defect.