DistributedLockManager: parallel lock/unlock

2005-08-08
2012-09-06
  • We implemented a solution that uses the JGroups DistributedLockManager as a distributed lock, and we ran into the following scenario:

    I had the following setup:
    Two nodes, each with its own DistributedLockManager. Every lock manager gets the local address of the owning node as its managerId, and when locking, the local address is also used as the owner.
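
    For reference, here is a minimal sketch of how such a lock manager might be wired up. It is only a sketch against the JGroups 2.x building blocks; the group name and the default channel properties are placeholders, not our actual configuration:

    import org.jgroups.ChannelException;
    import org.jgroups.JChannel;
    import org.jgroups.blocks.DistributedLockManager;
    import org.jgroups.blocks.VotingAdapter;

    public class LockSetupSketch
    {
        public static DistributedLockManager create() throws ChannelException
        {
            // One lock manager per node; the node's local address doubles as the
            // managerId and as the owner later passed to lock()/unlock().
            JChannel channel = new JChannel();    // our real stack uses different properties
            channel.connect("partition-locks");   // group name is a placeholder
            VotingAdapter voting = new VotingAdapter(channel);
            return new DistributedLockManager(voting, channel.getLocalAddress());
        }
    }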

    The two nodes are configured in a primary/secondary arrangement, and the lock is used to ensure that only one node is active at any point in time. In addition, the configuration is such that if both nodes are alive, one of them is recognized as primary and will try to lock.
    If for some reason the primary node goes down, the secondary node receives a notification (through the view mechanism, sketched further below) and tries to lock. This also works well.
    The problem arises when the primary tries to regain control over the lock: at the same time, the secondary node detects that the primary is alive again and tries to unlock. When this happens the result is unpredictable. In some cases the lock/unlock operations complete simultaneously, but in other cases the system runs into a deadlock where one node is constantly unable to release and the other is unable to lock.
    I dug into the code and saw that the problem is caused by the two-phase voting adapter implementation (prepare/commit).

    If I return true from the "prepare" method when the same lock is being released, the issue is solved.

    Am I overlooking something else?

    Can you suggest another workaround for the problem?
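
    For context, the failover detection that drives these lock/unlock attempts is ordinary JGroups view handling. The following is only an illustrative sketch: the "first member of the view acts as primary" rule is an assumed example policy, not necessarily our real election rule, and registration of the listener with the channel/adapter is omitted:

    import org.jgroups.Address;
    import org.jgroups.MembershipListener;
    import org.jgroups.View;

    public class PrimaryWatcher implements MembershipListener
    {
        private final Address m_localAddress;

        public PrimaryWatcher(Address localAddress)
        {
            m_localAddress = localAddress;
        }

        public void viewAccepted(View newView)
        {
            // Example policy: the first member of the view acts as primary.
            Address primary = (Address) newView.getMembers().get(0);
            if (m_localAddress.equals(primary))
            {
                // we are (again) primary: run the locking loop below
            }
            else
            {
                // we are secondary: run the unlocking loop below if we hold the lock
            }
        }

        public void suspect(Address suspectedMember) { }

        public void block() { }
    }

    The actual locking and unlocking loops are: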

    // locking the partition
    while (true)
    {
        try
        {
            m_lockManager.lock(partition.getName(), m_nodeName, LOCK_TIMEOUT);
            m_log.info(CLASS_NAME, "computeActiveNodes", "LOCKED_PARTITION", new Object[] { partition.getName() });
            break;
        }
        catch (ChannelException ex)
        {
            // ignore and retry
        }
        catch (LockNotGrantedException ex)
        {
            if (m_log.isInfoEnabled())
            {
                m_log.info(CLASS_NAME, "computeActiveNodes", "UNABLE_TO_LOCK_PARTITION", new Object[] { partition.getName() });
            }
        }
    }

    // unlocking the partition
    while (true)
    {
        try
        {
            if (m_log.isInfoEnabled())
            {
                m_log.info(CLASS_NAME, "computeActiveNodes", "UNLOCK_PARTITION", new Object[] { partition.getName() });
            }
            m_lockManager.unlock(partition.getName(), m_nodeName);
            break;
        }
        catch (ChannelException ex)
        {
            // ignore and retry
        }
        catch (LockNotReleasedException ex)
        {
            // ignore and retry
        }
    }
    Thanks

    Shachar

     
    • No one replied; however, I was able to figure out the problem.

      The DistributedLockManager was configured to work with a PullPushAdapter. The PullPushAdapter was connected to the channel before the lock manager was attached to it, so in some cases an unlock operation would fail: the unlock request was received by the adapter but was dropped with
      "received a messages tagged with identifier=lockmgr, but there is no registration for that identifier. Will drop message"
      For some reason I couldn't see this message earlier; it was only printed after I changed the location of some of the jars (due to class-loading restrictions ...)
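
      In other words, the listener for the lock manager's identifier has to be registered with the PullPushAdapter before messages for that identifier can arrive. Below is a hedged sketch of the corrected wiring order using the JGroups 2.x building blocks; the "lockmgr" identifier is taken from the warning above, and the group name and node name are placeholders:

      import org.jgroups.ChannelException;
      import org.jgroups.JChannel;
      import org.jgroups.blocks.DistributedLockManager;
      import org.jgroups.blocks.PullPushAdapter;
      import org.jgroups.blocks.VotingAdapter;

      public class LockWiringSketch
      {
          public static DistributedLockManager create(String nodeName) throws ChannelException
          {
              JChannel channel = new JChannel();      // real stack/properties omitted
              PullPushAdapter adapter = new PullPushAdapter(channel);

              // Attach the voting/lock machinery to the adapter *before* the channel
              // joins the group, so no vote or unlock request can be received for an
              // identifier that has no registered listener yet.
              VotingAdapter voting = new VotingAdapter(adapter, "lockmgr");
              DistributedLockManager lockManager = new DistributedLockManager(voting, nodeName);

              channel.connect("partition-locks");     // connect only after registration
              return lockManager;
          }
      }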