It seems that AL can't shut down in a reasonable time if
an application (opensm) is running.
The 'ibal stop' script removes the VPD module first. In
the module_cleanup() procedure, we call
ib_deregister_ca() for every CA. This call is supposed
to trigger the CA shutdown process, including all the
object tree beneath the CA. It is supposed to be
synchronous and finish in reasonable time.
After ib_deregister_ca() returns, the VPD proceeds to
shut down its own state, and expects that all the
resources are closed. However, some race condition
happens (wrong reference counting?) and the AL shutdown
process keeps running in the async thread.
As a result, the VPD is removed (with errors!) before
AL activity has stopped, which eventually causes a kernel crash.
Logged In: YES
user_id=784019
Note that while the CA is opened and in use, the VPD cannot
be removed from the system. AL cannot force users to
release the CA resources, and it is extremely difficult, if not
impossible, for AL to release them on the user's behalf. All
applications using the CA must exit first. AL will likely need to
add additional reference counting on the VPD while a user-
mode app is accessing the CA to prevent its removal.
Logged In: YES
user_id=742384
First of all, the VPD must be removed before AL because it
uses the AL symbols.
We have examined the option to add module reference counting
on the VPD, for example, on every CA open. However, this
approach is infeasible because AL keeps every registered CA
permanently open.
I think that the architecture is correct, and the kernel
crashes are a matter of bugs that must be fixed. There is
a variety of ways to let applications know that there is no
longer a driver below them (async events, SIGSEGV, etc.);
I am not afraid of killing the user app here.
We have to handle this case anyway, as well as hot plug-out
of an HCA.
Logged In: YES
user_id=784019
The general policy that we have used is that kernel mode
users are notified of the removal through a PnP callback. The
kernel clients are expected to clean up their resources, so
that the CA can be closed by AL.
For user-mode clients, we don't want to maintain that same
trust. (User-mode clients are still notified via PnP, but there
is no guarantee that they will close the CA.) If we don't care
that the user-mode app crashes, we can probably have the
kernel proxy code perform the cleanup automatically.
We will look into the issue further. The time delay mentioned
in the original bug results from waiting for the reference count
on the CA to drop to zero; the wait times out instead.
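The automatic cleanup by the kernel proxy suggested above could look roughly like the following user-space sketch. It assumes the proxy keeps a per-process table of open CA handles. All of the names here (struct proxy_ctx, proxy_open_ca(), proxy_close_ca(), proxy_cleanup()) are hypothetical and are not the actual proxy interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 16

/* Hypothetical CA object; open_count models the references held on it. */
struct ca {
    int open_count;
};

/* Hypothetical per-process proxy context tracking what the app opened. */
struct proxy_ctx {
    struct ca *handles[MAX_HANDLES];
};

/* Record an open on behalf of the user-mode app; returns a handle or -1. */
int proxy_open_ca(struct proxy_ctx *ctx, struct ca *ca)
{
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (ctx->handles[i] == NULL) {
            ctx->handles[i] = ca;
            ca->open_count++;
            return i;
        }
    }
    return -1;
}

/* Close one handle; returns 0 on success, -1 if the handle is not open. */
int proxy_close_ca(struct proxy_ctx *ctx, int h)
{
    if (h < 0 || h >= MAX_HANDLES || ctx->handles[h] == NULL)
        return -1;
    assert(ctx->handles[h]->open_count > 0);
    ctx->handles[h]->open_count--;
    ctx->handles[h] = NULL;
    return 0;
}

/*
 * Called when the CA is being removed or the process goes away:
 * close everything the app left open, without trusting the app to do it.
 * Returns the number of handles that had to be closed.
 */
int proxy_cleanup(struct proxy_ctx *ctx)
{
    int closed = 0;

    for (int i = 0; i < MAX_HANDLES; i++) {
        if (proxy_close_ca(ctx, i) == 0)
            closed++;
    }
    return closed;
}
```

Because the table lives in the proxy and not in the application, the references can be dropped even if the user-mode client ignores the PnP notification. Any later call from the app on a stale handle simply fails.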