#226 'ibal stop' crashes the kernel when opensm is working

Milestone: B2.0
Status: open
Priority: 8
Updated: 2004-05-10
Created: 2004-01-20
Private: No

It seems that AL can't shut down in reasonable time if
an application (opensm) is running.

The 'ibal stop' script removes the VPD module first. In
the module_cleanup() procedure, we call
ib_deregister_ca() for every CA. This call is supposed
to trigger the CA shutdown process, including the whole
object tree beneath the CA. It is supposed to be
synchronous and to finish in reasonable time.

After ib_deregister_ca() returns, the VPD proceeds to
shut down its stuff, and expects that all the
resources are closed. However, some race condition
happens (wrong reference counting?) and the AL shutdown
process keeps running in the async thread.

This leads to removing the VPD (with errors!) prior to
stopping the AL activities, and eventually, a kernel crash.
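
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct ca_model, ca_deregister_sync() and vpd_cleanup() are hypothetical names, not the real AL or VPD code, and the polling loop stands in for whatever wait AL actually performs. The point it shows is that driver state is released only when deregistration reports that every reference is gone.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a CA object; not the real AL structure. */
struct ca_model {
    int refcnt;        /* outstanding references held by users of the CA */
    bool hw_released;  /* true once the driver-side state is torn down */
};

/* Stand-in for a client making progress: drops one reference. */
void drop_one_ref(struct ca_model *ca)
{
    assert(ca->refcnt > 0);
    ca->refcnt--;
}

/*
 * Poll up to max_polls times for the reference count to reach zero.
 * client_progress may be NULL to model a client that never lets go.
 * Returns 0 on a complete shutdown, -ETIMEDOUT if references remain.
 */
int ca_deregister_sync(struct ca_model *ca, int max_polls,
                       void (*client_progress)(struct ca_model *))
{
    for (int i = 0; i < max_polls && ca->refcnt > 0; i++) {
        if (client_progress != NULL)
            client_progress(ca);
    }
    return ca->refcnt == 0 ? 0 : -ETIMEDOUT;
}

/* Model of the cleanup path: free driver state only after a clean deregister. */
int vpd_cleanup(struct ca_model *ca, int max_polls,
                void (*client_progress)(struct ca_model *))
{
    int rc = ca_deregister_sync(ca, max_polls, client_progress);
    if (rc != 0)
        return rc;            /* references remain: leave driver state alone */
    ca->hw_released = true;
    return 0;
}
```

A real module exit routine cannot return an error to refuse the unload, so the model only illustrates the invariant (no teardown while references remain), not how the kernel would enforce it.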

Discussion

  • Eddie Bortnikov

    Eddie Bortnikov - 2004-01-20
    • milestone: --> 335028
     
  • Sean Hefty

    Sean Hefty - 2004-01-20

    Note that while the CA is opened and in use, the VPD cannot
    be removed from the system. AL cannot force users to
    release the CA resources, and it is extremely difficult, if not
    impossible, for AL to release them on the user's behalf. All
    applications using the CA must exit first. AL will likely need to
    add additional reference counting on the VPD while a
    user-mode app is accessing the CA to prevent its removal.
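
A minimal sketch of the reference-counting idea in the comment above, written as a user-space model. Everything here is an assumption for illustration: struct vpd_model and the vpd_user_open(), vpd_user_close() and vpd_try_remove() helpers are hypothetical and do not exist in AL or the VPD. Only opens made on behalf of user-mode apps are counted, and removal is refused while any of them is outstanding.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the VPD module's usage state. */
struct vpd_model {
    int user_opens;   /* CA opens made on behalf of user-mode apps */
    int unloading;    /* nonzero once removal has started */
};

/* Called when a user-mode app opens the CA. Fails once removal has begun. */
int vpd_user_open(struct vpd_model *vpd)
{
    if (vpd->unloading)
        return -ENODEV;
    vpd->user_opens++;
    return 0;
}

/* Called when a user-mode app closes the CA or exits. */
void vpd_user_close(struct vpd_model *vpd)
{
    assert(vpd->user_opens > 0);
    vpd->user_opens--;
}

/* Removal is refused while any user-mode open is outstanding. */
int vpd_try_remove(struct vpd_model *vpd)
{
    if (vpd->user_opens > 0)
        return -EBUSY;
    vpd->unloading = 1;
    return 0;
}
```

In this model the stop script would see -EBUSY until opensm exits, instead of racing with the AL shutdown.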

     
  • Eddie Bortnikov

    Eddie Bortnikov - 2004-01-21

    To begin with, the VPD must be removed first because it uses
    the AL symbols.

    We have examined the option to add module reference counting
    on the VPD, for example, on every CA open. However, this
    approach is infeasible because AL keeps every registered CA
    permanently open.

    I think that the architecture is correct, and the kernel
    crashes are a matter of bugs that must be fixed. There is
    a variety of ways to let applications know that there is no
    longer a driver below them (async events, SIGSEGV etc.);
    I am not afraid of killing the user app here.

    We have to handle this case anyway, as well as hot plug-out
    of an HCA.

     
  • Sean Hefty

    Sean Hefty - 2004-01-21

    The general policy that we have used is that kernel mode
    users are notified of the removal through a PnP callback. The
    kernel clients are expected to clean up their resources, so
    that the CA can be closed by AL.

    For user-mode clients, we don't want to maintain that same
    trust. (User-mode clients are still notified via PnP, but there
    is no guarantee that they will close the CA.) If we don't care
    that the user-mode app crashes, we can probably have the
    kernel proxy code perform the cleanup automatically.

    We will look into the issue further. The time delay mentioned
    in the original bug is a result of waiting for the reference count
    on the CA to drop to zero; the wait times out instead.
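
The removal policy described in this comment can be sketched as a user-space model. The names below (struct client_model, ca_remove_notify(), close_on_remove()) are hypothetical and are not the AL PnP interface; the sketch assumes the option where the kernel proxy closes the CA for user-mode clients that ignore the notification, while kernel clients are trusted to close it themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum client_kind { KERNEL_CLIENT, USER_CLIENT };

/* Hypothetical model of one client holding the CA open. */
struct client_model {
    enum client_kind kind;
    bool ca_open;
    void (*on_remove)(struct client_model *);  /* removal callback, may be NULL */
};

/* A well-behaved removal callback: the client closes the CA itself. */
void close_on_remove(struct client_model *c)
{
    c->ca_open = false;
}

/*
 * Notify every client of the removal, then close the CA on behalf of any
 * user-mode client that still holds it. Kernel clients are not force-closed.
 * Returns the number of clients that still hold the CA open afterwards.
 */
size_t ca_remove_notify(struct client_model *clients, size_t n)
{
    size_t still_open = 0;

    assert(clients != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct client_model *c = &clients[i];

        if (c->on_remove != NULL)
            c->on_remove(c);
        if (c->ca_open && c->kind == USER_CLIENT)
            c->ca_open = false;   /* proxy cleans up for the user-mode app */
        if (c->ca_open)
            still_open++;
    }
    return still_open;
}
```

A nonzero return value corresponds to the situation in the original report: some kernel client kept its reference, so the CA cannot be closed yet.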

     
  • Sean Hefty

    Sean Hefty - 2004-05-06
    • milestone: 335028 --> 201622
     
  • Sean Hefty

    Sean Hefty - 2004-05-10
    • milestone: 201622 --> B2.0
     
