
#1526 imm: 1PBE can see db as locked

5.0.2
not-reproducible
None
defect
imm
tools
4.6
major
2016-11-07
2015-10-07
No

When the disk is full, sqlite returns an error:

Sep 18 13:42:02 SC-2 osafimmpbed: ER SQL statement ('COMMIT TRANSACTION') failed because: disk I/O error
Sep 18 13:42:02 SC-2 osafimmnd[13067]: NO Invalid error reported implementer 'OpenSafImmPBE', Ccb 321 will be aborted
Sep 18 13:42:02 SC-2 osafimmnd[13067]: NO Ccb 321 ABORTED (TraceC)
Sep 18 13:42:02 SC-2 osafimmpbed: WA Failed to find CCB object for 141/321

Because CCB operations continued while the disk was full, the 1PBE logged the following messages for more than 3 hours:

messages:Sep 18 17:58:46 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:46 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:47 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:47 SC-2 osafimmpbed: WA Sqlite db locked by other thread.


----

messages.7:Sep 18 14:22:22 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:23 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:23 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:24 SC-2 osafimmpbed: WA Sqlite db locked by other thread

Even after disk space was freed, the PBE remained stuck in the "Sqlite db locked by other thread" state.
This prevented any further operations.
Once the PBE was killed, imm.db was regenerated and the CCB operations were applied.

Solution(1PBE):

For the 1PBE case, which is not multi-threaded: if the "sqlite db locked" case is reached, abort the PBE and let it be regenerated (instead of blocking the PBE process).
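
The proposal above can be illustrated with a minimal sketch. It is not the OpenSAF code: `begin_trans`, `DbLockedError`, and the `is_locked` and `wait` parameters are hypothetical names chosen for illustration. The idea is only that the begin-transaction step retries a bounded number of times and then gives up, instead of logging the warning indefinitely.

```python
class DbLockedError(Exception):
    """Raised when the db is still seen as locked (hypothetical name)."""


def begin_trans(is_locked, max_attempts=3, wait=lambda: None):
    """Try to start a transaction; give up after max_attempts.

    is_locked: callable returning True while the db is seen as locked.
    wait: callable invoked between attempts (for example a short sleep).
    Returns the number of retries that were needed.
    """
    for attempt in range(max_attempts):
        if not is_locked():
            return attempt
        wait()
    raise DbLockedError("db still locked after %d attempts" % max_attempts)
```

A caller would catch `DbLockedError` at the top level and exit the process, leaving it to whatever supervises the PBE to start a fresh instance.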

Discussion

  • Neelakanta Reddy

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,3 @@
    -
     when the disk is full the sqlite will return error.
    
     Sep 18 13:42:02 SC-2 osafimmpbed: ER SQL statement ('COMMIT TRANSACTION') failed because:  disk I/O error
    @@ -27,6 +26,6 @@
    
     Solution(1PBE):
    
    -For the 1PBE case, which is not multi threaded, them if the sqlite db locked case is reached abort the PBE and let the PBE be re-generated(instead of blocking the PBE process).
    +For the 1PBE case, which is not multi threaded, if the sqlite db locked case is reached abort the PBE and let the PBE be re-generated(instead of blocking the PBE process).
    
     
  • Neelakanta Reddy

    • summary: imm: abort the 1PBE when pbeBeginTrans sees db as locked --> imm: exit the 1PBE when pbeBeginTrans sees db as locked
    • status: accepted --> review
     
  • Anders Bjornerstedt

    Question: How can this happen in the 1PBE case, when only one user thread is using the sqlite instance?

    Another relevant question is why/when you observe this now.
    The test case or test setup must be special somehow.

    With only one thread this case should be impossible.
    It suggests heap corruption could be the cause.

    Some years ago we did see problems, although not exactly of this kind, in conjunction with
    repeated failovers, where the new PBE managed to start while the old PBE (on the other SC) was
    still executing (slow to terminate). But the distributed file-level protection uses file system locking,
    and the symptoms should be different.

     
  • Anders Bjornerstedt

    I guess it could be that the PBE-level message "Sqlite db locked by other thread" is plain wrong,
    i.e. misleading.

     
  • Anders Bjornerstedt

    I looked at the code and the error message is correct, but the "lock" is the PBE "spin lock" created
    for handling 2PBE. The fact that it finds it locked in 1PBE means there is a logical bug somewhere
    in 1PBE.

    Most likely there is some error case where commit processing bails out without correct cleanup.
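
The kind of bug suspected here can be shown with a toy model. This is a sketch under assumptions, not the PBE implementation: `PbeTransaction`, `commit_leaky`, and `commit_safe` are hypothetical names, and the "lock" is modeled as a plain boolean flag. It shows how a single-threaded process can find its own flag set after a failed commit, and how releasing the flag on every exit path avoids that.

```python
class PbeTransaction:
    """Toy model of a single-threaded begin/commit pair guarded by a flag."""

    def __init__(self):
        self.locked = False

    def begin(self):
        if self.locked:
            # Same symptom as in the log: the flag was left set earlier.
            raise RuntimeError("Sqlite db locked by other thread.")
        self.locked = True

    def commit_leaky(self, do_commit):
        do_commit()           # if this raises, the flag is never cleared
        self.locked = False

    def commit_safe(self, do_commit):
        try:
            do_commit()
        finally:
            self.locked = False  # cleared on success and on error
```

With `commit_leaky`, a commit that fails (for example with a disk I/O error) leaves `locked` set, so every later `begin` fails even after the disk problem is resolved. With `commit_safe`, the next `begin` succeeds.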

     
  • Anders Bjornerstedt

    Changed ticket slogan to describe the problem.

     
  • Anders Bjornerstedt

    • summary: imm: exit the 1PBE when pbeBeginTrans sees db as locked --> imm: 1PBE can see db as locked
     
  • Anders Bjornerstedt

    I nack'ed the patch because the imm service already has a restart mechanism for the PBE if
    it gets stuck, and the symptom shown here must result from a bug (if this truly is on 1PBE).

    If there is not enough information to locate the bug, then the problem needs to be reproduced
    with trace.

    If it cannot be reproduced, then we close the ticket as not reproducible.

     
  • Anders Bjornerstedt

    • status: review --> accepted
     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.2 --> 4.6.2
     
  • Neelakanta Reddy

    • status: accepted --> assigned
     
  • Mathi Naickan

    Mathi Naickan - 2016-05-04
    • Milestone: 4.6.2 --> 4.7.2
     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Neelakanta Reddy

    • status: assigned --> not-reproducible
     
  • Neelakanta Reddy

    Since the problem is not reproducible, closing the defect.

     
