The incident is similar to that in ticket #517.
https://sourceforge.net/p/opensaf/tickets/517/
Part of the problem is that the AMFND goes down due to a too short timeout
(10 seconds) on an om-handle initialize. In general, an IMMND sync
of 300K objects or more could take up to 60 seconds. Put another way,
if a sync takes longer than 60 seconds then the system is configured wrong
or has too much imm data so that it is out of bounds for what OpenSAF
currently tries to support.
This ticket deals with why the sync took unexpectedly long in this particular
case. The imm data was small enough that other syncs took just a few seconds.
The problem discovered is a temporary service internal deadlock between the
PBE and the IMMND.
The sync is blocked from actually starting because it is waiting on the
outcome of one or more CCBs in critical state (i.e. being processed by the PBE).
Removal of that blocking is itself covered by enhancement ticket (#31):
https://sourceforge.net/p/opensaf/tickets/31/
The PBE, it turns out, is being restarted at the new active SC due to an SC failover.
A restarted PBE in this context is forced to regenerate the imm.db sqlite file.
Regenerating the sqlite file is effectively an immdump which tries to obtain
a "dump iterator" from IMMND. The dump iterator is special in that it iterates
over persistent objects only. It also allocates a new epoch so that
all persistent modifications done after the dump snapshot are covered by a new
epoch. But the epoch allocation fails because there is a sync ongoing.
So the PBE is stuck in a TRY_AGAIN loop to obtain the dump iterator.
This explains the deadlock between PBE and IMMND.
The PBE times out after 20 seconds, exits and gets restarted.
The sync also times out on lack of progress after 20 seconds and is aborted.
The restarted PBE then succeeds in obtaining the dump iterator and the
next sync attempt succeeds.
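
The sequence above can be illustrated with a toy model. This is only a
sketch and not OpenSAF code: the class ImmndModel, the function simulate
and the event strings are hypothetical names made up for illustration. It
also assumes that the critical CCBs get resolved once the restarted PBE has
obtained the dump iterator and re-attached.

```python
from dataclasses import dataclass

TRY_AGAIN, OK = "TRY_AGAIN", "OK"


@dataclass
class ImmndModel:
    """Hypothetical, heavily simplified view of the IMMND state."""
    sync_pending: bool = True   # a sync has been requested
    critical_ccbs: int = 1      # CCBs whose outcome is owned by the PBE

    def sync_can_start(self) -> bool:
        # The sync waits for the outcome of all CCBs in critical state.
        return self.critical_ccbs == 0

    def request_dump_iterator(self) -> str:
        # The dump iterator needs a new epoch, refused while a sync is ongoing.
        return TRY_AGAIN if self.sync_pending else OK


def simulate(timeout_s: int = 20) -> list:
    immnd = ImmndModel()
    events = []
    # Each side polls once per second; neither can make progress.
    for second in range(timeout_s):
        if immnd.sync_can_start() or immnd.request_dump_iterator() == OK:
            events.append((second, "progress"))
            return events
    # Both sides give up after the timeout.
    immnd.sync_pending = False
    events.append((timeout_s, "sync aborted, PBE restarted"))
    if immnd.request_dump_iterator() == OK:
        events.append((timeout_s, "dump iterator obtained"))
        # Assumption: the re-attached PBE resolves the critical CCBs.
        immnd.critical_ccbs = 0
    immnd.sync_pending = True   # next sync attempt
    if immnd.sync_can_start():
        events.append((timeout_s, "next sync started"))
    return events
```

In this model nothing happens during the first 20 seconds, which is the
time lost to the deadlock before the timeouts break it.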
The AMF should of course be fixed to have a longer wait.
But in this case it would first have waited 20 seconds due to this deadlock
and then waited for a successfully started sync to complete. Since the actual
sync could take up to 60 seconds, the total wait could end up clearly
above 60 seconds. It is in principle possible that the imm internal
deadlock could re-occur in the next sync attempt.
So this problem needs to be fixed.
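
As a rough worked example of the wait budget, using only the figures quoted
in this ticket (20 second deadlock timeout, up to 60 seconds for the sync).
The function name worst_case_wait_s and the deadlock_rounds parameter are
made up for illustration.

```python
DEADLOCK_TIMEOUT_S = 20   # timeout that currently breaks the deadlock
MAX_SYNC_S = 60           # upper bound for a supported sync


def worst_case_wait_s(deadlock_rounds: int = 1) -> int:
    """Time a waiting client may see: lost deadlock rounds plus one full sync."""
    return deadlock_rounds * DEADLOCK_TIMEOUT_S + MAX_SYNC_S
```

With one deadlock round this gives 80 seconds, and 100 seconds if the
deadlock re-occurs once, so a 60 second wait in AMF would not be enough on
its own.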
A solution is to detect this situation in the IMMND and to allow the
dump iterator to proceed without generating an epoch. Since there is a sync
in progress, no persistent writes are allowed anyway. So the epoch should
not strictly be necessary in this case. The sync itself generates a new epoch
and the dump iterator should be able to share the epoch shift done by the
sync iterator.
The dump also cannot take longer than the sync since the sync is still waiting
on the outcome of critical CCBs.
When enhancement ticket #31 is done, this issue (the shared epoch) needs to
be looked into again, because the sync can then complete before the PBE has
finished regenerating the file and re-attached.
The proposed solution for this ticket, to allow the dump iterator
to proceed despite on-going sync, does not work. I tested this but the
PBE just got stuck again in a TRY_AGAIN loop, this time for setting
admin-owner.
Instead the solution will be to abort the sync as soon as this kind of
deadlock is detected. Such an abort will be done much faster than the 20
second timeout that breaks the deadlock currently.
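
A minimal sketch of the idea, not the actual patch. The detection
condition used here (sync pending, at least one CCB in critical, PBE asking
for the dump iterator) is my reading of the description above, and the
names deadlock_detected and on_dump_iterator_request plus the state
dictionary are hypothetical. Answering the triggering request with
TRY_AGAIN after the abort is a modelling choice.

```python
TRY_AGAIN, OK = "TRY_AGAIN", "OK"


def deadlock_detected(sync_pending: bool, critical_ccbs: int,
                      pbe_wants_dump: bool) -> bool:
    # The sync cannot start (critical CCBs) and the PBE cannot regenerate
    # the file (sync pending): each side is waiting on the other.
    return sync_pending and critical_ccbs > 0 and pbe_wants_dump


def on_dump_iterator_request(state: dict) -> str:
    """Handle a dump iterator request from the PBE (hypothetical handler)."""
    if deadlock_detected(state["sync_pending"], state["critical_ccbs"], True):
        # Abort the sync at once instead of waiting for the 20 second timeout.
        state["sync_pending"] = False
        state["sync_aborts"] = state.get("sync_aborts", 0) + 1
        return TRY_AGAIN   # the next retry from the PBE will succeed
    # A sync that is actually running is left alone; the PBE keeps retrying.
    return TRY_AGAIN if state["sync_pending"] else OK
```

With this the deadlock is broken on the first dump iterator request, and
the second request from the PBE gets through.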
changeset: 4464:5b11bb1c2c43
tag: tip
parent: 4461:ca254d6398cc
user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
date: Mon Sep 02 18:25:55 2013 +0200
summary: IMM: Abort sync when deadlock detected between sync and PBE (#556)
changeset: 4463:f274e8365589
branch: opensaf-4.3.x
parent: 4460:a9bca63837f8
user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
date: Mon Sep 02 18:25:55 2013 +0200
summary: IMM: Abort sync when deadlock detected between sync and PBE (#556)
changeset: 4462:f420996d5992
branch: opensaf-4.2.x
parent: 4459:0504d98346dd
user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
date: Mon Sep 02 18:25:55 2013 +0200
summary: IMM: Abort sync when deadlock detected between sync and PBE (#556)