[devel] [PATCH 0 of 3] Review Request for IMM (2PBE): Fix incorrect transaction abort in slave PBE

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Summary: IMM (2PBE): Fix incorrect transaction abort in slave PBE [#1057]
Review request for Trac Ticket(s): 1057
Peer Reviewer(s): Neel;Zoran
Pull request to: 
Affected branch(es): 4.4; 4.5; default(4.6)
Development branch: 

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            n
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

changeset 9e75d7c5fa2e9d2fe80e957b79c12c2c80e73d0d
Author:	Anders Bjornerstedt <and...@er...>
Date:	Fri, 10 Oct 2014 14:23:34 +0200

	IMM (2PBE): Fix incorrect transaction abort in slave PBE [#1057]

	The main problem observed in #1057 is that the PBE slave gets repeatedly
	restarted. The initial and main cause of the restarts is a dynamic deadlock
	in the slave PBE between its main thread and the runtime thread.

	<background start> A bit of background is that the primary PBE executes all
	CCB and PRT jobs in the main thread. For each CCB/PRT-job it first asks the
	slave PBE if it is ready to prepare the sqlite transaction for the next
	identified ccb, i.e. if it has completed the commit of the sqlite
	transaction for any prior ccb/prt-op AND if it has received all needed data
	via its main thread for this next identified ccb. PRT-ops are identified via
	the "weak" ccb-ids in the high order range. If the slave replies ok, then it
	is not only rady, it has already started the prepare.

	The primary PBE then prepares and commits the sqlite transaction for the
	CCB/PRT and replies with OK back to imm-ram. The CCB/PRT is commited in imm-
	ram on reception of ok from primary PBE (for PRTO-creates and PRTA-updates
	imm-ram needs ack from both primary and slave).

	Finally, for CCBs and PRTO-deletes, the slave-PBE gets the completed and
	apply callbacks as applier via its main thread. PRTO-deletes are handles as
	full blown CCBs (but with "weak" ccb-ids) because PRTO-deletes may be
	cascading, involving many deletes and are already handled that way in 1PBE
	and by the primary.

	PRTO-creates and PRTA-updates are handled in a slightly different way. Both
	slave PBE and primary PE (and 1PBE) get only the operation-callback (create
	or modify). No completed/apply callback is generated for PRTO-create or
	PRTA-update. In 2-PBE the primary here also asks the slave if it can prepare
	and commit the PRT-op. If the slave can prepare (if it has completed the
	previous job and if it has received the matching create/modify callback via
	the main thread), then the slave replies ok does the sqlite prepare AND
	commit. If the primary receives the ok from the slave, it does the prepare
	and commit to sqlite. Imm-ram will only commit the a PRTO-create/PRTA-update
	if it gets ok reply from *both* primary and slave. Primary will only commit
	if it receives ok on prepare from slave and slave will only prepare and
	commit if it receives the prepare request from primary. <background end>

	The dynamic deadlock of the slave occurs as follows: A regular CCB is
	commited, the slave replies ok to primary, does the sqlite prepare (in the
	runtime thread) but not the commit. It then goes into general poll normally
	to get the completed callback via main thread. Before the completed callback
	arrives, the slave instead receives the PRT-op over the main thread, the
	completed callback for the regular ccb is behind this PRT-op in the message
	queue for the main thread at the PBE slave.

	The primary-pbe has received ok from slave on the regular ccb. The
	primaryprepares and commits it to sqlite, then goes back to poll and
	receives the PRT-op. The primary asks the slave if it can prepare the prt-
	op. The Slave receives the prepare request on its realtime thread, discovers
	that it is still waiting for the previos regular ccb to commit. Here we have
	the dynamic deadlock.

	The primary times out requestiong prepare after 3 seconds and replies with
	failure on the PRT-op to imm-ram. Imm-ram aborts the PRT-op.

	The slave times out after 5 seconds (in the main thread inside the prto-
	create/modify callback) waiting for the prepare to be confirmed, but which
	is blocked by the pending commit for the prior regular ccb, which in tern is
	in the in-queue for the main-thread. The slave timeout should have unwound
	cleanly if there was not any bug in the slave PBE. But here the slave PBE
	has the main bug. On timeout it not only discards the prt-op, it also aborts
	the >>current open sqlite transaction<<. The current open sqlite transaction
	is not the prt-op transaction it is the regular ccb-transaction that is
	prepared and waiting on that completed callback.

	When the slave pbe returns with failure) from the prt-op callback and goes
	into poll, it receives the completed callback for the regular ccb on the
	main thread. The protocol inconsistency is detected because it gets
	completed for a CCB that is gone (aborted). This causes the slave to exit.

	An additonal problem here is that the restarting slave PBE should regenerate
	the sqlite file (since it never commited the regular ccb). But the slave PBE
	re-attaches without regenerating the file. So besides the main bug, there is
	a secondary bug in the mechanism that determines iof the slae needs to
	regenerate its file at restart.

	The main fix done here is to avoid aborting the wrong transaction when the
	prt-op fails in the slave. This was a pure bug in the PBE slave logic.

	A secondary fix is to ensure that any slave PBE restart after failure to
	commit a prto-op will result in regeneration of the file at the slave. This
	is a fix in IMMND. Done in the next patch.

	Finally some cleanup is also done, mainly in immnd_fevs_local_checks and for
	the PBE to return ERR_NO_RESOURCES rather than ERR_BAD_OPERATION because the
	latter is displayed as "Validation error" by the tools.

changeset 92a4e85a7598a7ed7fa6c616c89fc789d6e021cd
Author:	Anders Bjornerstedt <and...@er...>
Date:	Fri, 10 Oct 2014 15:35:26 +0200

	IMM (2PBE): Slave PBE regenerates file when restarted soon after PRT error
	[#1057]

	If 2PBE-slave restarts in close proximity to PRT problems, then it should
	regenerate the pbe-file. This is not mainly to correct for any dropped PRT
	updates. That is supposed to be covered by immModel_pbeIsInSync().

	It is rather to compensate for the risk that PRT problems (non OK reply on a
	PRT op) are a symptom of bigger problems, that could also compromize regular
	CCB consistency.

changeset e671a56961d63639610df1cf17902f5575543ae1
Author:	Anders Bjornerstedt <and...@er...>
Date:	Fri, 10 Oct 2014 16:07:50 +0200

	IMM (PBE): Return NO_RESOURCES to avoid "validation error" from PBE [#1057]

	This patch is not a pure 2PBE patch. It does some cleanup of minor problems
	discovered while troubleshooting #1057.

	1) Instead of the PBE returning BAD_OPERATION or FAILED_OPERATION on
	failures to complete CCBs or PRTO operations, NO_RESOURCES is returned. This
	is both more correct and avoids the front end/user seeing the error reported
	as "validation error". Eg: osafimmnd[407]: NO Validation error
	(BAD_OPERATION) reported by implementer 'OpenSafImmPBE', Ccb 66 will be
	aborted

	2) immnd_fevs_local_checks had a bug when loging errors due to wrong order
	in deallocating the unpacked message and using it. Eg: osafimmnd[407]: NO
	Precheck of fevs message of type <0> failed with ERROR:9

	3) immdump/PBE copes with ERR_NO_RESOURCES same as ERR_TRY_AGAIN when
	received from saImmOmSearchInitilalize. This avoids unnecessary extra
	restarts of PBE. Eg: osafimmpbed: ER Failed on saImmOmSearchInitialize:18 -
	exiting

Complete diffstat:
------------------
 osaf/libs/common/immsv/immpbe_dump.cc            |    7 +--
 osaf/services/saf/immsv/immnd/immnd_evt.c        |   40 ++++++++++++++++----
 osaf/services/saf/immsv/immpbed/immpbe_daemon.cc |  115 ++++++++++++++++++++++++++++++++++----------------------------
 3 files changed, 97 insertions(+), 65 deletions(-)

Testing Commands:
-----------------
Requires both ccb generation and PRT generation.
CCB generation can be accomplished with immpopulate or immcfg.
PRT generation I generated by hacking immapplier.

Testing, Expected Results:
--------------------------
Concurrently generate both CCBs and PRT-ops. 
The occasional CCB or PRT-op could get aborted.
But no "major" problems like PBE restarts should be seen.

Conditions of Submission:
-------------------------
Ack from Neel.

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      n          n
powerpc     n          n
powerpc64   n          n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer have a badly configured date and time; confusing the
    the threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.

[devel] [PATCH 0 of 3] Review Request for IMM (2PBE): Fix incorrect transaction abort in slave PBE

[devel] [PATCH 0 of 3] Review Request for IMM (2PBE): Fix incorrect transaction abort in slave PBE [#1057]