Menu

#82 waitForCancelled and waitForAnnounced can cause bad behavior

closed-fixed
networking (37)
5
2014-11-06
2011-01-04
No

In areas of shaky WiFi coverage where the WiFi goes up and down a lot, we sometimes get into a state where one thread is waiting for a cancel and another thread is waiting for an announce, but neither ever succeeds.

This is particularly bad, because the waitForCancelled and waitForAnnounced functions are spinning while loops, and this can easily cause the CPU to get pegged and bring the system to a halt.

Is there any plan to remove the blocking code and replace it with an event-driven model, so that at least if something gets into a bad state, only JmDNS will be affected, and not the whole machine (spinning while loops are a bad thing.)

I'm still looking into this -- it's really hard to reproduce, and I'm not sure what event actually triggers this behavior. It seems that calling recoverTask from any other step will cause this problem, since it will try to cancel, but cancelling runs on the same stateTimer, and since the Prober is waiting for the canceller, bu tthe canceller can never start because the Prober is blocking the stateTimer thread.

These are stack traces from the misbehaving threads:

"JmDNS(192-168-1-3).State.Timer" prio=1 tid=13 RUNNABLE
| group="main" sCount=1 dsCount=0 s=Y obj=0x47cccdf8 self=0x254ac8
| sysTid=4176 nice=19 sched=0/0 cgrp=bg_non_interactive handle=2753408
| schedstat=( 150118100032 192908634375 146559 )
at java.util.concurrent.locks.AbstractQueuedSynchronizer.getState(AbstractQueuedSynchronizer.java:~497)
at java.util.concurrent.locks.ReentrantLock$Sync.nonfairTryAcquire(ReentrantLock.java:106)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.tryAcquire(ReentrantLock.java:189)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:415)
at javax.jmdns.impl.DNSStatefulObject$DefaultImplementation.waitForCanceled(DNSStatefulObject.java:266)
at javax.jmdns.impl.HostInfo.waitForCanceled(HostInfo.java:395)
at javax.jmdns.impl.JmDNSImpl.waitForCanceled(JmDNSImpl.java:456)
at javax.jmdns.impl.JmDNSImpl.recover(JmDNSImpl.java:1369)
at javax.jmdns.impl.tasks.state.Prober.recoverTask(Prober.java:144)
at javax.jmdns.impl.tasks.state.DNSStateTask.run(DNSStateTask.java:145)
at java.util.Timer$TimerImpl.run(Timer.java:289)

"SyncService worker" prio=1 tid=8 RUNNABLE
| group="main" sCount=1 dsCount=0 s=Y obj=0x47c950a8 self=0x261808
| sysTid=3099 nice=19 sched=0/0 cgrp=bg_non_interactive handle=2573336
| schedstat=( 209572919166 231530661529 214000 )
at java.util.concurrent.locks.AbstractQueuedSynchronizer.getState(AbstractQueuedSynchronizer.java:~497)
at java.util.concurrent.locks.ReentrantLock$Sync.nonfairTryAcquire(ReentrantLock.java:106)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.tryAcquire(ReentrantLock.java:189)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:415)
at javax.jmdns.impl.DNSStatefulObject$DefaultImplementation.waitForAnnounced(DNSStatefulObject.java:233)
at javax.jmdns.impl.ServiceInfoImpl.waitForAnnounced(ServiceInfoImpl.java:929)
at javax.jmdns.impl.JmDNSImpl.registerService(JmDNSImpl.java:793)
at com.doubleTwist.sync.SyncService.startZeroconf(SyncService.java:566)
at com.doubleTwist.sync.SyncService.startServer(SyncService.java:604)
at com.doubleTwist.sync.SyncService.startOrStopServer(SyncService.java:493)
at com.doubleTwist.sync.SyncService.access$800(SyncService.java:79)
at com.doubleTwist.sync.SyncService$SyncHandler.handleMessage(SyncService.java:1025)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:123)
at com.doubleTwist.androidPlayer.MusicUtils$BackgroundWorker.run(MusicUtils.java:1842)
at java.lang.Thread.run(Thread.java:1096)

Discussion

1 2 3 > >> (Page 1 of 3)
  • Pierre Frisch

    Pierre Frisch - 2011-01-04

    I can understand how it could get deadlocked but I don't see how it could peg the cpu. The loop has a 1s timeout so at most you get 2 calls per seconds.

    I will have a look at the deadlock but without some more concrete test case this is going to be hard to squash.

    Pierre

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-04
    • assigned_to: nobody --> spearway
    • status: open --> open-accepted
     
  • Jason LeBrun

    Jason LeBrun - 2011-01-04

    The loop has a 1s timeout *iff* the lock is held somewhere else. If the lock is not held, it's immediately acquired, the state of finished is checked, and then the lock is released, which results in spinning.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-04

    It seems like the recover() method must be called on a different thread so that the Prober task can complete, allowing for the newly started canceler in the unregisterAllServices() call to start, so that the waitForCanceled() call will finish.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-04

    Ok I can fix the spin easily. The intent was to limit the spin to once per second.

    Pierre

    commit -m "waitForCancelled and waitForAnnounced can cause bad behavior - ID: 3151408" /Users/pierre/Projects/workspace/jmdns/CHANGELOG.txt /Users/pierre/Projects/workspace/jmdns/src/main/java/javax/jmdns/impl/DNSStatefulObject.java
    Sending /Users/pierre/Projects/workspace/jmdns/CHANGELOG.txt
    Sending /Users/pierre/Projects/workspace/jmdns/src/main/java/javax/jmdns/impl/DNSStatefulObject.java
    Transmitting file data ...
    Committed revision 257.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-04

    That's essentially what we have done. This fixes the issue with the CPU taking over, which is good. However, the deadlock still occurs. I can get it to happen pretty consistently by turning off WiFi in the middle of the constructor for JmDNSImpl (during the register step).

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    Do you mean Thread.sleep() instead of wait() in your commit #257?

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-05

    Sorry I went too fast

    Pierre

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-05

    I have done more testing and I think I have a fix. I have checked it in. Could you please test it?

    Thank you

    Pierre

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-05
    • status: open-accepted --> pending-accepted
     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05
    • status: pending-accepted --> open-accepted
     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    Good, this looks like the direction we were heading, it definitely fixes the deadlock-on-stateTimer problem.

    As a background, here's our usage:

    1. User elects to start using the mDNS-based service. The service manager calls JmDNS.create, and then JmDNS.registerService(ourServiceInfo);

    2. User elects to stop using the mDNS-based service. The service manager calls JmDNS.close().

    So, a very basic usage.

    The things I'm testing right now are killing the network connection at various steps. Here are the results with the trunk of a few of these tests.

    * Killing the WiFi connection after the JmDNS object is fully created
    **RESULT: The close call never completes, the method is setting at waitForCanceled. I've added some logging in the while loop, so this is printed every second:

    W/t$DefaultImplementation( 2670): waitForCanceled: DNS: 10-0-2-183 state: canceling 1 task: null (1203627144)
    W/t$DefaultImplementation( 2670): waitForCanceled: DNS: 10-0-2-183 state: canceling 2 task: null (1203291896)

    My hypothesis is that the waitFor loops should abort somehow if the task is null, since there is no task that will result in the waited state to be reached? Does this sound reasonable?

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    Another thing we've noticed: if recoverTask fails when trying to create the new multicastSocket, a message is printed, but nothing more happened. Should we somehow abort the task altogether in this case? We're not sure what the proper expected behavior here is.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-05

    The first issue is easy to fix even though I am not sure what the correct behavior is. In the __recover() method we call this.waitForCanceled(0); The 0 means that we wait forever. The idea was that we did not want to start the recovery until it is clean. I am starting to think that we should put a timeout there this.waitForCanceled(DNSConstants.CLOSE_TIMEOUT); and if the cleanup does successfully complete bail out and probably callback the application.

    What I am saying is the 2 problems are not independent and we should have a better handling.

    Any idea?

    Pierre

    PS: I am a little bit concerned about your design. Starting an mDNS incur a penalty of at least 6 seconds so if you create JmDNS when the user ask for the registration there will be 9 to 10 seconds before the service is visible. This has been discussed before and cannot be compressed. The way MacOSX and IOS run around that problem is by having a daemon running in the background on a permanent basis. That way you only need to register the service which take 3 to 4 seconds.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    I will try to waitForCanceled timeout.

    Would it be appropriate to leave the waitForCanceled/waitForProbing methods if the task goes null? I am doing this and it seems to fix a lot of issues. Does it make sense to continue waiting when there's no longer a task to wait for? Would doing this cause bad behavior under other cirsumstances? In particular, I've added:

    public boolean waitForCanceled(long timeout) {
    logger.log(Level.WARNING, "waitForCanceled");
    if (!this.isCanceled()) {
    try {
    boolean finished = false;
    long end = (timeout > 0 ? System.currentTimeMillis() + timeout : Long.MAX_VALUE);
    while (!finished) {
    if(this._task == null) {
    return true;
    }

    and

    @Override
    public boolean waitForAnnounced(long timeout) {
    logger.log(Level.WARNING, "waitForAnnounced");
    if (!this.isAnnounced() && !this.willCancel()) {
    try {
    boolean finished = false;
    long end = (timeout > 0 ? System.currentTimeMillis() + timeout : Long.MAX_VALUE);
    while (!finished) {
    if(this._task == null) {
    return false;
    }

    As for your PS comment... I think we are OK. We start/register the JmDNS service in a separate thread, so it doesn't block any other behavior of the application. We can not keep the daemon running constantly since this is a mobile device and the network state changes often. The start up delay has not presented a user issue thus far.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-05

    You need to be careful the task adds itself to the object so there may be window before the task is not null. I would do the test for the null task after the first timeout to give all threads time to run and do what they have to do. But I like your idea.

    There is still a problem of what we do in the recovery if we cannot recover. I think we should have a call back I will have a look at it.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    Right now, we are just calling .close() if the recover fails, which seems to work. However the callback idea seems like a good plan.

    Calling .close() could cause the service to go away in case of transient network failures, however it's hard to test this.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    One other thing, we are calling cancel(); just before calling recoverTask. I think this is correct, since the current task is essentially dead; the recover() method will create a new task if necessary.

    Wondering your thoughts on this, as well.

    So, this is the change:

    diff --git a/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java b/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
    index a10f8a7..7004bfe 100644
    --- a/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
    +++ b/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
    @@ -142,6 +142,7 @@ public abstract class DNSStateTask extends DNSTask {
    }
    } catch (Throwable e) {
    logger1.log(Level.WARNING, this.getName() + ".run() exception ", e);
    + cancel();
    this.recoverTask(e);
    }

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    Given your comments about the transient "initial null" potentially causing a problem when aborting if _task==null.

    Would a better plan to have removeAssociation always set a task to the DNSState.CANCELED state? This would cause any wait loops to exit.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-05

    I'm experimenting with a callback-based version of the code instead of the waitFor<State> blockers.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-05

    Let me know how it works I am very interested as this would be need to get JmmDNS the multi-hommed feature of JmDNS working properly.

    We also need to get HostInfo multi-hommed aware.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-07

    I have made a basic event-based version, and it seems to be working well.

    The way I've done it is all DNSStatefulObjects now have two additional functions: registerSignal(String name, Runnable callback) and signal(String name).

    The only signals I use right now are announced and canceled, and they can be used equally internally and externally to the library. This should probably be improved so that clients do not accidentally clobber internally handled signals.

    Functions that are of the form:

    operationA;
    waitForAnnounced;
    operationB;

    are now:
    registerSignal("announced", operationB_runnable);
    operationA;

    It needs a little more work so that you can set multiple callbacks for a signal, and handle queueing of multiple operations correctly, synchronize correctly, etc.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-07

    On further though, It looks like probably the external clients should use the exposed ServiceTypeAdded/ServiceAdded listeners, and the signals should remain completely internal.

    I originally wrote separate signal handler types, but the code got tedious so I switched to a <String, Runnable> map approach to reduce the code burden. It's less type-safe, but much more concise.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-10

    My final first draft of evented JmDNS. No longer uses complicated "signal" mechanism. See comments for details.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-10

    There are a few glitches in the patch I just posted, a new one will be coming shortly.

     
1 2 3 > >> (Page 1 of 3)