#19 IMM: PBE should periodically audit the imm.db file

4.7.FC
fixed
None
enhancement
imm
-
4.2.0
major
2015-08-25
2013-05-07
No

The Imm Persistent Back-End (PBE) writes transactions/CCBs incrementally
to an sqlite file "imm.db". This file resides on a replicated file
system. The replicated file system guards against hardware problems
such as failure of the disk or the host where the disk resides.

But there is always a risk of the imm.db file being corrupted
accidentally. This could be due to bugs in the PBE; or due to
network partitioning of the cluster causing two PBEs to
concurrently write to the same file; or accidents with the
backup and restore framework; or problems in the very complex
stack that makes up the shared filesystem (drbd, journaling,
nfs, sqlite recovery).

The problem is that the imm.db file is logically a single
point of failure at cluster start.

If the imm.db file is corrupted for whatever reason, then
this may not be discovered until the critical time when it
is needed for a cluster restart.

This enhancement proposes that the PBE shall have some form
of periodic audit of the existing imm.db file.

One possibility is for the PBE to periodically copy the imm.db
file to a local tmp directory. During the copy the PBE will
buffer & delay the regular user requests (CCBs & PRTA updates).
As soon as the copy has been made, a "pseudo loading" will
be invoked using the copy of imm.db. In essence the immloader
is invoked such that it reads the imm.db in exactly the way
it does during loading, but does not try to actually load
anything towards the immsv.
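
The copy-then-read idea above can be sketched roughly as follows. This is only an illustration in Python, not the PBE or immloader code: `audit_copy` is a hypothetical helper, the generic walk over every table stands in for the loader's real read path, and sqlite's built-in `PRAGMA integrity_check` is an extra check that the ticket does not mention. Buffering of user requests during the copy is not shown.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def audit_copy(db_path):
    """Copy db_path to a temp directory and read the copy; return problems found.

    Hypothetical helper: a stand-in for the "pseudo loading" described above.
    """
    problems = []
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / "imm.db.audit"
        # A real PBE would buffer CCBs and PRTA updates while this copy is made.
        shutil.copy2(db_path, copy_path)
        conn = sqlite3.connect(str(copy_path))
        try:
            # sqlite's own structural check; yields a single row "ok" when clean.
            rows = [r[0] for r in conn.execute("PRAGMA integrity_check")]
            if rows != ["ok"]:
                problems.extend(rows)
            # Read every row of every table, as a loader would, without
            # sending anything to the IMM service.
            tables = [r[0] for r in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'")]
            for name in tables:
                quoted = '"' + name.replace('"', '""') + '"'
                for _ in conn.execute("SELECT * FROM " + quoted):
                    pass
        except sqlite3.DatabaseError as exc:
            # Raised for unreadable or malformed files.
            problems.append(str(exc))
        finally:
            conn.close()
    return problems
```

An empty list means the copy could be read end to end; any entry would be a reason to regenerate the file.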

Note that this level of audit will only catch consistency problems
in the PBE/sqlite representation of the imm data.
Loading may fail on higher levels, by failing checks inside
the immsv or applications (failing validation by OIs).

The point of this is to discover an inconsistency earlier,
when the problem has hopefully not impacted the executing
cluster. If a problem is detected, then the PBE will restart
and generate a new version of the imm.db file.

Migrated from:
http://devel.opensaf.org/ticket/2451


The audit could actually verify snapshot value equality between the sqlite representation
in PBE and the in-memory representation in immsv. By initializing an iterator
towards the immsv during the short stop period for mutations enforced during the
file copy, the iterator will take a snapshot of the in-memory representation.

That snapshot should reflect all committed CCBs and PRTA updates. The same values
should be committed to the PBE representation.


http://list.opensaf.org/pipermail/devel/2012-February/021139.html

The fix for this enhancement should be based on an improvement of verifyPbeState(..)
in imm_pbe_dump.cc.
That function is executed each time the PBE re-attaches to the imm.db file.
Currently it is very weak. It should ideally verify the state of all persistent
objects both ways. All objects that exist in the imm.db must exist in the imm and
have the same state; and all persistent objects that exist in the imm must exist in
the imm.db file and have the same state.
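
The two-way check described above can be illustrated with a small sketch. It is not verifyPbeState(..) itself: `compare_states` is a hypothetical name, and both sides are assumed to have already been read into dictionaries keyed by object DN, with each value holding that object's attribute values.

```python
def compare_states(db_objects, imm_objects):
    """Compare persistent objects from imm.db against the in-memory imm.

    Both arguments are assumed to map an object DN to a dict of attribute
    values (a hypothetical representation chosen for this sketch).
    """
    db_dns = set(db_objects)
    imm_dns = set(imm_objects)
    return {
        # In imm.db but not in the imm.
        "missing_in_imm": sorted(db_dns - imm_dns),
        # Persistent in the imm but not in imm.db.
        "missing_in_db": sorted(imm_dns - db_dns),
        # Present on both sides, but with different state.
        "mismatched": sorted(
            dn for dn in db_dns & imm_dns
            if db_objects[dn] != imm_objects[dn]
        ),
    }
```

The audit passes only if all three lists are empty; checking in one direction alone would miss objects that exist on only the other side.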

This same function could be periodically invoked by the immnd-coord using an admin-op
towards the pbe. This should only be done during periods when there is a lull in
persistence traffic. The frequency can be quite low, but could also be increased
in relation to write traffic.
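
One possible reading of this scheduling rule is sketched below. The function name, all thresholds and the formula that shortens the interval are assumptions made for illustration; the ticket only states that audits should wait for a lull and may become more frequent with more write traffic.

```python
def audit_due(now, last_audit, last_write, writes_since_audit,
              base_interval=3600.0, min_interval=300.0,
              quiet_period=10.0, writes_per_step=1000):
    """Decide whether an audit should be requested now (times in seconds).

    All default values are hypothetical and not taken from OpenSAF.
    """
    # Only audit during a lull: no persistent write in the last quiet_period.
    if now - last_write < quiet_period:
        return False
    # More writes since the last audit shorten the interval, down to a floor.
    interval = max(min_interval,
                   base_interval / (1 + writes_since_audit / writes_per_step))
    return now - last_audit >= interval
```

With these defaults an idle system is audited about once an hour, while 5000 writes since the last audit reduce the interval to ten minutes.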

Finally, there is a point in closing and re-opening the imm.db file before performing
the verification. This is to protect against accidental removal of the file (inode).
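
The reason re-opening helps is that on POSIX systems a process can keep reading and writing through an open descriptor after the file has been unlinked or replaced, so an audit over the old handle would pass while the path used at cluster start is gone. A minimal sketch of detecting this without re-opening, assuming a POSIX filesystem; `file_still_linked` is a hypothetical helper, not part of the PBE:

```python
import os


def file_still_linked(fd, path):
    """Return True if the open descriptor fd still refers to the file at path."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        # The path no longer exists: the file was removed.
        return False
    opened = os.fstat(fd)
    # Same device and inode means the path still names the file we hold open.
    return (on_disk.st_dev, on_disk.st_ino) == (opened.st_dev, opened.st_ino)
```

Closing and re-opening by path, as proposed above, achieves the same protection implicitly, because the re-open fails or attaches to whatever file is now at that path.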


Related

Tickets: #1665
Tickets: #1668
Tickets: #19
Wiki: NEWS-4.7.0

Discussion

  • Anders Bjornerstedt

    • assigned_to: Zoran Milinkovic --> Anders Bjornerstedt
     
  • Anders Bjornerstedt

    • status: assigned --> accepted
     
  • Anders Bjornerstedt

    • status: accepted --> assigned
    • Milestone: 4.4.FC --> 4.5.FC
     
  • Anders Bjornerstedt

    • status: assigned --> unassigned
    • assigned_to: Anders Bjornerstedt --> nobody
    • Milestone: 4.5.FC --> future
     
  • Anders Bjornerstedt

    • Milestone: future --> 4.6.FC
     
  • Anders Bjornerstedt

    There is a need for off-line tests (regression tests) that verify
    imm content, for an on-line audit as tracked by this ticket, or both.

    See:
    http://sourceforge.net/p/opensaf/tickets/1001/

     
  • Anders Bjornerstedt

    • Milestone: 4.6.FC --> future
     
  • Zoran Milinkovic

    • status: unassigned --> accepted
    • assigned_to: Zoran Milinkovic
    • Milestone: future --> 4.7-Tentative
     
  • Zoran Milinkovic

    • status: accepted --> review
     
  • Zoran Milinkovic

    The first part of the ticket:

    changeset: 6614:1cab75dd421a
    tag: tip
    user: Zoran Milinkovic <zoran.milinkovic@ericsson.com>
    date: Tue Jun 09 15:37:08 2015 +0200
    summary: immtools: audit no dangling references in PBE file [#19]

     


  • Zoran Milinkovic

    • status: review --> accepted
     
  • Zoran Milinkovic

    • status: accepted --> review
     
  • Zoran Milinkovic

    • status: review --> fixed
     
  • Zoran Milinkovic

    default(4.7):

    changeset: 6753:51e030423a82
    user: Zoran Milinkovic <zoran.milinkovic@ericsson.com>
    date: Fri Aug 14 09:35:06 2015 +0200
    summary: immtools: add new checks for PBE audit [#19]

    changeset: 6754:bb3f06f4a606
    tag: tip
    user: Zoran Milinkovic <zoran.milinkovic@ericsson.com>
    date: Fri Aug 14 09:38:27 2015 +0200
    summary: immtools: allow immdump to audit PBE when OpenSAF is down [#19]

     


