Menu

#82 waitForCancelled and waitForAnnounced can cause bad behavior

closed-fixed
networking (37)
5
2014-11-06
2011-01-04
No

In areas of shaky WiFi coverage where the WiFi goes up and down a lot, we sometimes get into a state where one thread is waiting for a cancel and another thread is waiting for an announce, but neither ever succeeds.

This is particularly bad, because the waitForCancelled and waitForAnnounced functions are spinning while loops, and this can easily cause the CPU to get pegged and bring the system to a halt.

Is there any plan to remove the blocking code and replace it with an event-driven model, so that at least if something gets into a bad state, only JmDNS will be affected, and not the whole machine (spinning while loops are a bad thing.)

I'm still looking into this -- it's really hard to reproduce, and I'm not sure what event actually triggers this behavior. It seems that calling recoverTask from any other step will cause this problem, since it will try to cancel, but cancelling runs on the same stateTimer, and since the Prober is waiting for the canceller, bu tthe canceller can never start because the Prober is blocking the stateTimer thread.

These are stack traces from the misbehaving threads:

"JmDNS(192-168-1-3).State.Timer" prio=1 tid=13 RUNNABLE
| group="main" sCount=1 dsCount=0 s=Y obj=0x47cccdf8 self=0x254ac8
| sysTid=4176 nice=19 sched=0/0 cgrp=bg_non_interactive handle=2753408
| schedstat=( 150118100032 192908634375 146559 )
at java.util.concurrent.locks.AbstractQueuedSynchronizer.getState(AbstractQueuedSynchronizer.java:~497)
at java.util.concurrent.locks.ReentrantLock$Sync.nonfairTryAcquire(ReentrantLock.java:106)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.tryAcquire(ReentrantLock.java:189)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:415)
at javax.jmdns.impl.DNSStatefulObject$DefaultImplementation.waitForCanceled(DNSStatefulObject.java:266)
at javax.jmdns.impl.HostInfo.waitForCanceled(HostInfo.java:395)
at javax.jmdns.impl.JmDNSImpl.waitForCanceled(JmDNSImpl.java:456)
at javax.jmdns.impl.JmDNSImpl.recover(JmDNSImpl.java:1369)
at javax.jmdns.impl.tasks.state.Prober.recoverTask(Prober.java:144)
at javax.jmdns.impl.tasks.state.DNSStateTask.run(DNSStateTask.java:145)
at java.util.Timer$TimerImpl.run(Timer.java:289)

"SyncService worker" prio=1 tid=8 RUNNABLE
| group="main" sCount=1 dsCount=0 s=Y obj=0x47c950a8 self=0x261808
| sysTid=3099 nice=19 sched=0/0 cgrp=bg_non_interactive handle=2573336
| schedstat=( 209572919166 231530661529 214000 )
at java.util.concurrent.locks.AbstractQueuedSynchronizer.getState(AbstractQueuedSynchronizer.java:~497)
at java.util.concurrent.locks.ReentrantLock$Sync.nonfairTryAcquire(ReentrantLock.java:106)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.tryAcquire(ReentrantLock.java:189)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:415)
at javax.jmdns.impl.DNSStatefulObject$DefaultImplementation.waitForAnnounced(DNSStatefulObject.java:233)
at javax.jmdns.impl.ServiceInfoImpl.waitForAnnounced(ServiceInfoImpl.java:929)
at javax.jmdns.impl.JmDNSImpl.registerService(JmDNSImpl.java:793)
at com.doubleTwist.sync.SyncService.startZeroconf(SyncService.java:566)
at com.doubleTwist.sync.SyncService.startServer(SyncService.java:604)
at com.doubleTwist.sync.SyncService.startOrStopServer(SyncService.java:493)
at com.doubleTwist.sync.SyncService.access$800(SyncService.java:79)
at com.doubleTwist.sync.SyncService$SyncHandler.handleMessage(SyncService.java:1025)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:123)
at com.doubleTwist.androidPlayer.MusicUtils$BackgroundWorker.run(MusicUtils.java:1842)
at java.lang.Thread.run(Thread.java:1096)

Discussion

<< < 1 2 3 > >> (Page 2 of 3)
  • Jason LeBrun

    Jason LeBrun - 2011-01-11

    Much happier with thise one.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-11

    A few more tweaks so that cleanup happens in the correct place.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-11

    Have you tried running the unit test?

    Pierre

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-11

    The easiest is to ru:n mvn test

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-11

    No I haven't, I'll do that now.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-14

    Ok I have an implementation that appears too work and is quite a bit more efficient. Could you try it?

    Thanks

    Pierre

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-14
    • status: open-accepted --> pending-accepted
     
  • Nobody/Anonymous

    Well, this is better than the spinwait, but it will still leak threads if connectivity goes away during one of the semaphore waits.

    My goal in the implementation I made was to never block execution waiting for a condition to be met. So, in the case where the client discards the JmDNS object, it will always be cleaned up and collected when the client discards it.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-14
    • status: pending-accepted --> open-accepted
     
  • Jason LeBrun

    Jason LeBrun - 2011-01-14

    Sorry, that last comment was me. Did not realize that I wasn't logged in.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-14

    Why would it leak? There is a timeout and if the timeout expires the Thread is released. This is why I am using Semaphores and not wait/notify. I may have missed something could you be more explicit?

    There are time we need to block the thread. In particular when we are registering services we need to wait for JmDNS to be announced before we can probe for the service and announce it. The previous implementation of JmDNS was blocking until it was announced and I added that so that we can start listening while JmDNS starts but then and publish need to wait.

    Pierre

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-14

    Most of the internal calls to waitForAnnounced/waitForCanceled specify 0, which is effectively an infinite timeout. So these threads will block forever if they can't finish.

    The thread does not need to be blocked to successfully wait for a particular condition before doing something. If a callback-based method is used, then you simply call the callbacks when the particular announced state is reached.

    In my solution, for example, if you called registerService and JmDNS was not announced,the desired service registration was queued. When the annouce state was finally reach, the queue was processed and the service registrations were announced. I used a single callback per DNSStatefulObject that decided what to do next based on the current state of the object once a "settled" state was reached. This allowed it to process pending operations in the right order, for example, ensuring that an attempt was made to unregister services before closing everything.

    Furthermore, the client doesn't need to block, since there are serviceAnnounced/serviceResolved/serviceRemoved callback listeners that can be registered to tell the client that the service is now registered and that it can proceed with doing what it needs to do after service registration.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-14

    My mistake I thought I added a timeout on all of them. Fixed.

    The problem is that you redefining the API and that is why the unit tests are failing. If we make all the call asynchronous this will break a lot of existing code. I am not saying this is a bad solution but this will change things quite radically. I still like the idea but this is BIG change as to how JmDNS works with consequences on the client code.

    Pierre

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-14

    I agree completely about the API change. When I ran your unit tests and they all failed, I quickly learned why, and realized that it wouldn't be shippable since it could potentially break existing clients. I was looking at remedying that, but then got distracted and hadn't finished before you checked in your other patch.

    My planned remedy was to rename my versions of the now non-blocking API calls, and have the original method names still exhibit blocking behavior. Then the original blocking calls could just call the new calls (registerServiceNB or something) and then after calling it, block until the desired condition occurs.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-19

    Even with the timeout, I still end up with threads blocked forever at the acquire() step if connectivity is lost.

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-19

    Do you have a thread dump? This is not supposed to happen.

    Pierre

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-19

    It happens if I call close() from another thread while the first thread is in the middle of waiting for a registerService() call to finish. The registerService() call blocks forever:

    at java.util.concurrent.Semaphore.acquire(Semaphore.java:284)

    Should the acquire() be a tryAcquire(timeout) ?

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-19

    The timeout is simply not working because the timer is started after the semaphore acquisition and therefore never executed…

    We should always write unit test…

    Pierre

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-19
    • status: open-accepted --> pending-fixed
     
  • Jason LeBrun

    Jason LeBrun - 2011-01-19
    • status: pending-fixed --> open-fixed
     
  • Jason LeBrun

    Jason LeBrun - 2011-01-19

    Why are you using a whole separate timer, instead of just using the tryAcquire(timeout) call of Semaphores?

     
  • Pierre Frisch

    Pierre Frisch - 2011-01-19
    • status: open-fixed --> pending-fixed
     
  • Pierre Frisch

    Pierre Frisch - 2011-01-19

    You are absolutely right. Done

     
  • Nobody/Anonymous

    It's working well. I'm attaching a patch to fix one more minor thing, which is that you can get into a state where the timers aren't cleaned up if you close() while the service is in the middle of recovering. I fixed this by creating a new closing() and closed() states, which will be helpful for other things too, I think.

     
  • Jason LeBrun

    Jason LeBrun - 2011-01-20

    Add closing and closed states so that we can always clean up properly when close is issued.

     
<< < 1 2 3 > >> (Page 2 of 3)