I have observed strange memory registration behaviour
in DAPL with sf-ibal-cs1.219 and the Mellanox driver
sf-tvpd-3.0-rc1: dat_lmr_create() succeeds only 32
times and returns DAT_INVALID_STATE_LMR_IN_USE on the
33rd call.
A close look at the DAPL code reveals a long-standing bug.
In dat_lmr_free(), there is a code segment like:
......
dapls_ib_mr_deregister(lmr);
......
dapls_hash_remove( ...., lmr->param.lmr_context, ...);
......
The problem is that dapls_ib_mr_deregister() resets the
field lmr->param.lmr_context to 0, so the later
dapls_hash_remove() call is given 0 instead of the real
lmr_context. As a result, the real lmr_context is never
removed from the hash table. When the same lmr_context
is generated again, dat_lmr_create() returns the error
because the lmr_context is still in the hash table.
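
To make the ordering problem concrete, here is a minimal
sketch in C. It is not the DAPL source: struct toy_lmr,
toy_mr_deregister() and the two toy_key_removed_*()
helpers are hypothetical stand-ins, and the only
behaviour modelled is the one described above, namely
that deregistration clears the context field. Saving the
key before deregistering is one possible fix, shown here
as an assumption rather than as the actual patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the LMR object; it carries only
 * the field discussed above. */
struct toy_lmr {
    uint32_t lmr_context;
};

/* Models the side effect described in the post: deregistering
 * the memory region resets the context field to 0. */
static void toy_mr_deregister(struct toy_lmr *lmr)
{
    lmr->lmr_context = 0;
}

/* Buggy order: the field is read after deregistration, so the
 * key handed to the hash removal would be 0. */
static uint32_t toy_key_removed_buggy(struct toy_lmr *lmr)
{
    toy_mr_deregister(lmr);
    return lmr->lmr_context;
}

/* Possible fix: remember the key before deregistration and use
 * the saved copy for the hash removal. */
static uint32_t toy_key_removed_fixed(struct toy_lmr *lmr)
{
    uint32_t saved = lmr->lmr_context;
    toy_mr_deregister(lmr);
    return saved;
}
```

With a region whose context is 0x1234, the buggy helper
yields 0 and the fixed helper yields 0x1234, which is the
key that actually sits in the hash table.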
The bug does not show itself with previous versions of
the code. I guess it is related to the Mellanox driver.
The new driver (sf-tvpd-3.0-rc1) generates only 32
different lkeys when the low-level memory registration
function is called, so at the 33rd call the returned
lmr_context is identical to the one returned at the
first call. I guess the older drivers generate a much
larger set of lkeys (not verified yet).
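
The interaction between the 32-key cycle and the stale
hash entries can be reproduced with a small simulation.
This is a sketch under stated assumptions, not the real
driver or DAPL code: sim_register() is a hypothetical
driver that hands out keys 1..32 in a fixed cycle, the
hash table is modelled as a plain boolean array, and a
return value of 0 from sim_create() stands in for
DAT_INVALID_STATE_LMR_IN_USE.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_NUM_KEYS 32u   /* number of distinct lkeys reported above */

struct sim_state {
    bool     in_table[SIM_NUM_KEYS + 1]; /* index is the key, keys are 1..32 */
    uint32_t next;                       /* registrations issued so far */
};

/* Hypothetical driver: returns keys 1..32 in a repeating cycle. */
static uint32_t sim_register(struct sim_state *s)
{
    return (s->next++ % SIM_NUM_KEYS) + 1;
}

/* Create: fails (returns 0) if the key is still in the table. */
static uint32_t sim_create(struct sim_state *s)
{
    uint32_t key = sim_register(s);
    if (s->in_table[key])
        return 0;
    s->in_table[key] = true;
    return key;
}

/* Free: with save_key_first the real key is removed; without it
 * the removal uses the zeroed field, so nothing is removed. */
static void sim_free(struct sim_state *s, uint32_t key, bool save_key_first)
{
    uint32_t zeroed_field = 0;  /* value of the field after deregistration */
    uint32_t to_remove = save_key_first ? key : zeroed_field;
    if (to_remove != 0)
        s->in_table[to_remove] = false;
}

/* Runs create/free cycles and returns how many creates succeeded
 * before the first failure (at most `attempts`). */
static unsigned sim_run(unsigned attempts, bool save_key_first)
{
    struct sim_state s;
    unsigned ok = 0;

    memset(&s, 0, sizeof s);
    for (unsigned i = 0; i < attempts; i++) {
        uint32_t key = sim_create(&s);
        if (key == 0)
            break;
        ok++;
        sim_free(&s, key, save_key_first);
    }
    return ok;
}
```

In this model sim_run(100, false) stops after 32
successful creates, matching the observed failure on the
33rd call, while sim_run(100, true) completes all 100.
A driver with a much larger key space would hide the
stale entries for far longer, which is consistent with
the guess about older drivers.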