|
From: Anders B. <and...@er...> - 2014-10-10 14:46:57
|
Summary: IMM (2PBE): Fix incorrect transaction abort in slave PBE [#1057]
Review request for Trac Ticket(s): 1057
Peer Reviewer(s): Neel;Zoran
Pull request to:
Affected branch(es): 4.4; 4.5; default(4.6)
Development branch:
--------------------------------
Impacted area Impact y/n
--------------------------------
Docs n
Build system n
RPM/packaging n
Configuration files n
Startup scripts n
SAF services n
OpenSAF services n
Core libraries n
Samples n
Tests n
Other n
Comments (indicate scope for each "y" above):
---------------------------------------------
changeset 9e75d7c5fa2e9d2fe80e957b79c12c2c80e73d0d
Author: Anders Bjornerstedt <and...@er...>
Date: Fri, 10 Oct 2014 14:23:34 +0200
IMM (2PBE): Fix incorrect transaction abort in slave PBE [#1057]
The main problem observed in #1057 is that the PBE slave gets repeatedly
restarted. The initial and main cause of the restarts is a dynamic deadlock
in the slave PBE between its main thread and the runtime thread.
<background start> A bit of background is that the primary PBE executes all
CCB and PRT jobs in the main thread. For each CCB/PRT-job it first asks the
slave PBE if it is ready to prepare the sqlite transaction for the next
identified ccb, i.e. if it has completed the commit of the sqlite
transaction for any prior ccb/prt-op AND if it has received all needed data
via its main thread for this next identified ccb. PRT-ops are identified via
the "weak" ccb-ids in the high order range. If the slave replies ok, then it
is not only rady, it has already started the prepare.
The primary PBE then prepares and commits the sqlite transaction for the
CCB/PRT and replies with OK back to imm-ram. The CCB/PRT is commited in imm-
ram on reception of ok from primary PBE (for PRTO-creates and PRTA-updates
imm-ram needs ack from both primary and slave).
Finally, for CCBs and PRTO-deletes, the slave-PBE gets the completed and
apply callbacks as applier via its main thread. PRTO-deletes are handles as
full blown CCBs (but with "weak" ccb-ids) because PRTO-deletes may be
cascading, involving many deletes and are already handled that way in 1PBE
and by the primary.
PRTO-creates and PRTA-updates are handled in a slightly different way. Both
slave PBE and primary PE (and 1PBE) get only the operation-callback (create
or modify). No completed/apply callback is generated for PRTO-create or
PRTA-update. In 2-PBE the primary here also asks the slave if it can prepare
and commit the PRT-op. If the slave can prepare (if it has completed the
previous job and if it has received the matching create/modify callback via
the main thread), then the slave replies ok does the sqlite prepare AND
commit. If the primary receives the ok from the slave, it does the prepare
and commit to sqlite. Imm-ram will only commit the a PRTO-create/PRTA-update
if it gets ok reply from *both* primary and slave. Primary will only commit
if it receives ok on prepare from slave and slave will only prepare and
commit if it receives the prepare request from primary. <background end>
The dynamic deadlock of the slave occurs as follows: A regular CCB is
commited, the slave replies ok to primary, does the sqlite prepare (in the
runtime thread) but not the commit. It then goes into general poll normally
to get the completed callback via main thread. Before the completed callback
arrives, the slave instead receives the PRT-op over the main thread, the
completed callback for the regular ccb is behind this PRT-op in the message
queue for the main thread at the PBE slave.
The primary-pbe has received ok from slave on the regular ccb. The
primaryprepares and commits it to sqlite, then goes back to poll and
receives the PRT-op. The primary asks the slave if it can prepare the prt-
op. The Slave receives the prepare request on its realtime thread, discovers
that it is still waiting for the previos regular ccb to commit. Here we have
the dynamic deadlock.
The primary times out requestiong prepare after 3 seconds and replies with
failure on the PRT-op to imm-ram. Imm-ram aborts the PRT-op.
The slave times out after 5 seconds (in the main thread inside the prto-
create/modify callback) waiting for the prepare to be confirmed, but which
is blocked by the pending commit for the prior regular ccb, which in tern is
in the in-queue for the main-thread. The slave timeout should have unwound
cleanly if there was not any bug in the slave PBE. But here the slave PBE
has the main bug. On timeout it not only discards the prt-op, it also aborts
the >>current open sqlite transaction<<. The current open sqlite transaction
is not the prt-op transaction it is the regular ccb-transaction that is
prepared and waiting on that completed callback.
When the slave pbe returns with failure) from the prt-op callback and goes
into poll, it receives the completed callback for the regular ccb on the
main thread. The protocol inconsistency is detected because it gets
completed for a CCB that is gone (aborted). This causes the slave to exit.
An additonal problem here is that the restarting slave PBE should regenerate
the sqlite file (since it never commited the regular ccb). But the slave PBE
re-attaches without regenerating the file. So besides the main bug, there is
a secondary bug in the mechanism that determines iof the slae needs to
regenerate its file at restart.
The main fix done here is to avoid aborting the wrong transaction when the
prt-op fails in the slave. This was a pure bug in the PBE slave logic.
A secondary fix is to ensure that any slave PBE restart after failure to
commit a prto-op will result in regeneration of the file at the slave. This
is a fix in IMMND. Done in the next patch.
Finally some cleanup is also done, mainly in immnd_fevs_local_checks and for
the PBE to return ERR_NO_RESOURCES rather than ERR_BAD_OPERATION because the
latter is displayed as "Validation error" by the tools.
changeset 92a4e85a7598a7ed7fa6c616c89fc789d6e021cd
Author: Anders Bjornerstedt <and...@er...>
Date: Fri, 10 Oct 2014 15:35:26 +0200
IMM (2PBE): Slave PBE regenerates file when restarted soon after PRT error
[#1057]
If 2PBE-slave restarts in close proximity to PRT problems, then it should
regenerate the pbe-file. This is not mainly to correct for any dropped PRT
updates. That is supposed to be covered by immModel_pbeIsInSync().
It is rather to compensate for the risk that PRT problems (non OK reply on a
PRT op) are a symptom of bigger problems, that could also compromize regular
CCB consistency.
changeset e671a56961d63639610df1cf17902f5575543ae1
Author: Anders Bjornerstedt <and...@er...>
Date: Fri, 10 Oct 2014 16:07:50 +0200
IMM (PBE): Return NO_RESOURCES to avoid "validation error" from PBE [#1057]
This patch is not a pure 2PBE patch. It does some cleanup of minor problems
discovered while troubleshooting #1057.
1) Instead of the PBE returning BAD_OPERATION or FAILED_OPERATION on
failures to complete CCBs or PRTO operations, NO_RESOURCES is returned. This
is both more correct and avoids the front end/user seeing the error reported
as "validation error". Eg: osafimmnd[407]: NO Validation error
(BAD_OPERATION) reported by implementer 'OpenSafImmPBE', Ccb 66 will be
aborted
2) immnd_fevs_local_checks had a bug when loging errors due to wrong order
in deallocating the unpacked message and using it. Eg: osafimmnd[407]: NO
Precheck of fevs message of type <0> failed with ERROR:9
3) immdump/PBE copes with ERR_NO_RESOURCES same as ERR_TRY_AGAIN when
received from saImmOmSearchInitilalize. This avoids unnecessary extra
restarts of PBE. Eg: osafimmpbed: ER Failed on saImmOmSearchInitialize:18 -
exiting
Complete diffstat:
------------------
osaf/libs/common/immsv/immpbe_dump.cc | 7 +--
osaf/services/saf/immsv/immnd/immnd_evt.c | 40 ++++++++++++++++----
osaf/services/saf/immsv/immpbed/immpbe_daemon.cc | 115 ++++++++++++++++++++++++++++++++++----------------------------
3 files changed, 97 insertions(+), 65 deletions(-)
Testing Commands:
-----------------
Requires both ccb generation and PRT generation.
CCB generation can be accomplished with immpopulate or immcfg.
PRT generation I generated by hacking immapplier.
Testing, Expected Results:
--------------------------
Concurrently generate both CCBs and PRT-ops.
The occasional CCB or PRT-op could get aborted.
But no "major" problems like PBE restarts should be seen.
Conditions of Submission:
-------------------------
Ack from Neel.
Arch Built Started Linux distro
-------------------------------------------
mips n n
mips64 n n
x86 n n
x86_64 n n
powerpc n n
powerpc64 n n
Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]
Your checkin has not passed review because (see checked entries):
___ Your RR template is generally incomplete; it has too many blank entries
that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files
(i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests.
Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes
like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other
cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is
too much content into a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent;
Instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded
commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication
of what has changed between each re-send.
___ You have failed to adequately and individually address all of the
comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
___ Your computer have a badly configured date and time; confusing the
the threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results
for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch series
do not contain the patch that updates the Doxygen manual.
|