In areas of shaky WiFi coverage where the WiFi goes up and down a lot, we sometimes get into a state where one thread is waiting for a cancel and another thread is waiting for an announce, but neither ever succeeds.
This is particularly bad, because the waitForCancelled and waitForAnnounced functions are spinning while loops, and this can easily cause the CPU to get pegged and bring the system to a halt.
Is there any plan to remove the blocking code and replace it with an event-driven model, so that at least if something gets into a bad state, only JmDNS will be affected, and not the whole machine (spinning while loops are a bad thing.)
I'm still looking into this -- it's really hard to reproduce, and I'm not sure what event actually triggers this behavior. It seems that calling recoverTask from any other step will cause this problem, since it will try to cancel, but cancelling runs on the same stateTimer, and since the Prober is waiting for the canceller, bu tthe canceller can never start because the Prober is blocking the stateTimer thread.
These are stack traces from the misbehaving threads:
"JmDNS(192-168-1-3).State.Timer" prio=1 tid=13 RUNNABLE
| group="main" sCount=1 dsCount=0 s=Y obj=0x47cccdf8 self=0x254ac8
| sysTid=4176 nice=19 sched=0/0 cgrp=bg_non_interactive handle=2753408
| schedstat=( 150118100032 192908634375 146559 )
at java.util.concurrent.locks.AbstractQueuedSynchronizer.getState(AbstractQueuedSynchronizer.java:~497)
at java.util.concurrent.locks.ReentrantLock$Sync.nonfairTryAcquire(ReentrantLock.java:106)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.tryAcquire(ReentrantLock.java:189)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:415)
at javax.jmdns.impl.DNSStatefulObject$DefaultImplementation.waitForCanceled(DNSStatefulObject.java:266)
at javax.jmdns.impl.HostInfo.waitForCanceled(HostInfo.java:395)
at javax.jmdns.impl.JmDNSImpl.waitForCanceled(JmDNSImpl.java:456)
at javax.jmdns.impl.JmDNSImpl.recover(JmDNSImpl.java:1369)
at javax.jmdns.impl.tasks.state.Prober.recoverTask(Prober.java:144)
at javax.jmdns.impl.tasks.state.DNSStateTask.run(DNSStateTask.java:145)
at java.util.Timer$TimerImpl.run(Timer.java:289)
"SyncService worker" prio=1 tid=8 RUNNABLE
| group="main" sCount=1 dsCount=0 s=Y obj=0x47c950a8 self=0x261808
| sysTid=3099 nice=19 sched=0/0 cgrp=bg_non_interactive handle=2573336
| schedstat=( 209572919166 231530661529 214000 )
at java.util.concurrent.locks.AbstractQueuedSynchronizer.getState(AbstractQueuedSynchronizer.java:~497)
at java.util.concurrent.locks.ReentrantLock$Sync.nonfairTryAcquire(ReentrantLock.java:106)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.tryAcquire(ReentrantLock.java:189)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:415)
at javax.jmdns.impl.DNSStatefulObject$DefaultImplementation.waitForAnnounced(DNSStatefulObject.java:233)
at javax.jmdns.impl.ServiceInfoImpl.waitForAnnounced(ServiceInfoImpl.java:929)
at javax.jmdns.impl.JmDNSImpl.registerService(JmDNSImpl.java:793)
at com.doubleTwist.sync.SyncService.startZeroconf(SyncService.java:566)
at com.doubleTwist.sync.SyncService.startServer(SyncService.java:604)
at com.doubleTwist.sync.SyncService.startOrStopServer(SyncService.java:493)
at com.doubleTwist.sync.SyncService.access$800(SyncService.java:79)
at com.doubleTwist.sync.SyncService$SyncHandler.handleMessage(SyncService.java:1025)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:123)
at com.doubleTwist.androidPlayer.MusicUtils$BackgroundWorker.run(MusicUtils.java:1842)
at java.lang.Thread.run(Thread.java:1096)
I can understand how it could get deadlocked but I don't see how it could peg the cpu. The loop has a 1s timeout so at most you get 2 calls per seconds.
I will have a look at the deadlock but without some more concrete test case this is going to be hard to squash.
Pierre
The loop has a 1s timeout *iff* the lock is held somewhere else. If the lock is not held, it's immediately acquired, the state of finished is checked, and then the lock is released, which results in spinning.
It seems like the recover() method must be called on a different thread so that the Prober task can complete, allowing for the newly started canceler in the unregisterAllServices() call to start, so that the waitForCanceled() call will finish.
Ok I can fix the spin easily. The intent was to limit the spin to once per second.
Pierre
commit -m "waitForCancelled and waitForAnnounced can cause bad behavior - ID: 3151408" /Users/pierre/Projects/workspace/jmdns/CHANGELOG.txt /Users/pierre/Projects/workspace/jmdns/src/main/java/javax/jmdns/impl/DNSStatefulObject.java
Sending /Users/pierre/Projects/workspace/jmdns/CHANGELOG.txt
Sending /Users/pierre/Projects/workspace/jmdns/src/main/java/javax/jmdns/impl/DNSStatefulObject.java
Transmitting file data ...
Committed revision 257.
That's essentially what we have done. This fixes the issue with the CPU taking over, which is good. However, the deadlock still occurs. I can get it to happen pretty consistently by turning off WiFi in the middle of the constructor for JmDNSImpl (during the register step).
Do you mean Thread.sleep() instead of wait() in your commit #257?
Sorry I went too fast
Pierre
I have done more testing and I think I have a fix. I have checked it in. Could you please test it?
Thank you
Pierre
Good, this looks like the direction we were heading, it definitely fixes the deadlock-on-stateTimer problem.
As a background, here's our usage:
1. User elects to start using the mDNS-based service. The service manager calls JmDNS.create, and then JmDNS.registerService(ourServiceInfo);
2. User elects to stop using the mDNS-based service. The service manager calls JmDNS.close().
So, a very basic usage.
The things I'm testing right now are killing the network connection at various steps. Here are the results with the trunk of a few of these tests.
* Killing the WiFi connection after the JmDNS object is fully created
**RESULT: The close call never completes, the method is setting at waitForCanceled. I've added some logging in the while loop, so this is printed every second:
W/t$DefaultImplementation( 2670): waitForCanceled: DNS: 10-0-2-183 state: canceling 1 task: null (1203627144)
W/t$DefaultImplementation( 2670): waitForCanceled: DNS: 10-0-2-183 state: canceling 2 task: null (1203291896)
My hypothesis is that the waitFor loops should abort somehow if the task is null, since there is no task that will result in the waited state to be reached? Does this sound reasonable?
Another thing we've noticed: if recoverTask fails when trying to create the new multicastSocket, a message is printed, but nothing more happened. Should we somehow abort the task altogether in this case? We're not sure what the proper expected behavior here is.
The first issue is easy to fix even though I am not sure what the correct behavior is. In the __recover() method we call this.waitForCanceled(0); The 0 means that we wait forever. The idea was that we did not want to start the recovery until it is clean. I am starting to think that we should put a timeout there this.waitForCanceled(DNSConstants.CLOSE_TIMEOUT); and if the cleanup does successfully complete bail out and probably callback the application.
What I am saying is the 2 problems are not independent and we should have a better handling.
Any idea?
Pierre
PS: I am a little bit concerned about your design. Starting an mDNS incur a penalty of at least 6 seconds so if you create JmDNS when the user ask for the registration there will be 9 to 10 seconds before the service is visible. This has been discussed before and cannot be compressed. The way MacOSX and IOS run around that problem is by having a daemon running in the background on a permanent basis. That way you only need to register the service which take 3 to 4 seconds.
I will try to waitForCanceled timeout.
Would it be appropriate to leave the waitForCanceled/waitForProbing methods if the task goes null? I am doing this and it seems to fix a lot of issues. Does it make sense to continue waiting when there's no longer a task to wait for? Would doing this cause bad behavior under other cirsumstances? In particular, I've added:
public boolean waitForCanceled(long timeout) {
logger.log(Level.WARNING, "waitForCanceled");
if (!this.isCanceled()) {
try {
boolean finished = false;
long end = (timeout > 0 ? System.currentTimeMillis() + timeout : Long.MAX_VALUE);
while (!finished) {
if(this._task == null) {
return true;
}
and
@Override
public boolean waitForAnnounced(long timeout) {
logger.log(Level.WARNING, "waitForAnnounced");
if (!this.isAnnounced() && !this.willCancel()) {
try {
boolean finished = false;
long end = (timeout > 0 ? System.currentTimeMillis() + timeout : Long.MAX_VALUE);
while (!finished) {
if(this._task == null) {
return false;
}
As for your PS comment... I think we are OK. We start/register the JmDNS service in a separate thread, so it doesn't block any other behavior of the application. We can not keep the daemon running constantly since this is a mobile device and the network state changes often. The start up delay has not presented a user issue thus far.
You need to be careful the task adds itself to the object so there may be window before the task is not null. I would do the test for the null task after the first timeout to give all threads time to run and do what they have to do. But I like your idea.
There is still a problem of what we do in the recovery if we cannot recover. I think we should have a call back I will have a look at it.
Right now, we are just calling .close() if the recover fails, which seems to work. However the callback idea seems like a good plan.
Calling .close() could cause the service to go away in case of transient network failures, however it's hard to test this.
One other thing, we are calling cancel(); just before calling recoverTask. I think this is correct, since the current task is essentially dead; the recover() method will create a new task if necessary.
Wondering your thoughts on this, as well.
So, this is the change:
diff --git a/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java b/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
index a10f8a7..7004bfe 100644
--- a/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
+++ b/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
@@ -142,6 +142,7 @@ public abstract class DNSStateTask extends DNSTask {
}
} catch (Throwable e) {
logger1.log(Level.WARNING, this.getName() + ".run() exception ", e);
+ cancel();
this.recoverTask(e);
}
Given your comments about the transient "initial null" potentially causing a problem when aborting if _task==null.
Would a better plan to have removeAssociation always set a task to the DNSState.CANCELED state? This would cause any wait loops to exit.
I'm experimenting with a callback-based version of the code instead of the waitFor<State> blockers.
Let me know how it works I am very interested as this would be need to get JmmDNS the multi-hommed feature of JmDNS working properly.
We also need to get HostInfo multi-hommed aware.
I have made a basic event-based version, and it seems to be working well.
The way I've done it is all DNSStatefulObjects now have two additional functions: registerSignal(String name, Runnable callback) and signal(String name).
The only signals I use right now are announced and canceled, and they can be used equally internally and externally to the library. This should probably be improved so that clients do not accidentally clobber internally handled signals.
Functions that are of the form:
operationA;
waitForAnnounced;
operationB;
are now:
registerSignal("announced", operationB_runnable);
operationA;
It needs a little more work so that you can set multiple callbacks for a signal, and handle queueing of multiple operations correctly, synchronize correctly, etc.
On further though, It looks like probably the external clients should use the exposed ServiceTypeAdded/ServiceAdded listeners, and the signals should remain completely internal.
I originally wrote separate signal handler types, but the code got tedious so I switched to a <String, Runnable> map approach to reduce the code burden. It's less type-safe, but much more concise.
My final first draft of evented JmDNS. No longer uses complicated "signal" mechanism. See comments for details.
There are a few glitches in the patch I just posted, a new one will be coming shortly.