JmDNS / Bugs / #82 waitForCancelled and waitForAnnounced can cause bad behavior

Pierre Frisch - 2011-01-04

I can understand how it could get deadlocked but I don't see how it could peg the cpu. The loop has a 1s timeout so at most you get 2 calls per seconds.

I will have a look at the deadlock but without some more concrete test case this is going to be hard to squash.

Pierre

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-04

assigned_to: nobody --> spearway

status: open --> open-accepted
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-04

The loop has a 1s timeout *iff* the lock is held somewhere else. If the lock is not held, it's immediately acquired, the state of finished is checked, and then the lock is released, which results in spinning.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-04

It seems like the recover() method must be called on a different thread so that the Prober task can complete, allowing for the newly started canceler in the unregisterAllServices() call to start, so that the waitForCanceled() call will finish.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-04

Ok I can fix the spin easily. The intent was to limit the spin to once per second.

Pierre

commit -m "waitForCancelled and waitForAnnounced can cause bad behavior - ID: 3151408" /Users/pierre/Projects/workspace/jmdns/CHANGELOG.txt /Users/pierre/Projects/workspace/jmdns/src/main/java/javax/jmdns/impl/DNSStatefulObject.java
Sending /Users/pierre/Projects/workspace/jmdns/CHANGELOG.txt
Sending /Users/pierre/Projects/workspace/jmdns/src/main/java/javax/jmdns/impl/DNSStatefulObject.java
Transmitting file data ...
Committed revision 257.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-04

That's essentially what we have done. This fixes the issue with the CPU taking over, which is good. However, the deadlock still occurs. I can get it to happen pretty consistently by turning off WiFi in the middle of the constructor for JmDNSImpl (during the register step).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

Do you mean Thread.sleep() instead of wait() in your commit #257?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-05

Sorry I went too fast

Pierre

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-05

I have done more testing and I think I have a fix. I have checked it in. Could you please test it?

Thank you

Pierre

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-05

status: open-accepted --> pending-accepted
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

status: pending-accepted --> open-accepted
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

Good, this looks like the direction we were heading, it definitely fixes the deadlock-on-stateTimer problem.

As a background, here's our usage:

1. User elects to start using the mDNS-based service. The service manager calls JmDNS.create, and then JmDNS.registerService(ourServiceInfo);

2. User elects to stop using the mDNS-based service. The service manager calls JmDNS.close().

So, a very basic usage.

The things I'm testing right now are killing the network connection at various steps. Here are the results with the trunk of a few of these tests.

* Killing the WiFi connection after the JmDNS object is fully created
**RESULT: The close call never completes, the method is setting at waitForCanceled. I've added some logging in the while loop, so this is printed every second:

W/t$DefaultImplementation( 2670): waitForCanceled: DNS: 10-0-2-183 state: canceling 1 task: null (1203627144)
W/t$DefaultImplementation( 2670): waitForCanceled: DNS: 10-0-2-183 state: canceling 2 task: null (1203291896)

My hypothesis is that the waitFor loops should abort somehow if the task is null, since there is no task that will result in the waited state to be reached? Does this sound reasonable?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

Another thing we've noticed: if recoverTask fails when trying to create the new multicastSocket, a message is printed, but nothing more happened. Should we somehow abort the task altogether in this case? We're not sure what the proper expected behavior here is.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-05

The first issue is easy to fix even though I am not sure what the correct behavior is. In the __recover() method we call this.waitForCanceled(0); The 0 means that we wait forever. The idea was that we did not want to start the recovery until it is clean. I am starting to think that we should put a timeout there this.waitForCanceled(DNSConstants.CLOSE_TIMEOUT); and if the cleanup does successfully complete bail out and probably callback the application.

What I am saying is the 2 problems are not independent and we should have a better handling.

Any idea?

Pierre

PS: I am a little bit concerned about your design. Starting an mDNS incur a penalty of at least 6 seconds so if you create JmDNS when the user ask for the registration there will be 9 to 10 seconds before the service is visible. This has been discussed before and cannot be compressed. The way MacOSX and IOS run around that problem is by having a daemon running in the background on a permanent basis. That way you only need to register the service which take 3 to 4 seconds.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

I will try to waitForCanceled timeout.

Would it be appropriate to leave the waitForCanceled/waitForProbing methods if the task goes null? I am doing this and it seems to fix a lot of issues. Does it make sense to continue waiting when there's no longer a task to wait for? Would doing this cause bad behavior under other cirsumstances? In particular, I've added:

public boolean waitForCanceled(long timeout) {
logger.log(Level.WARNING, "waitForCanceled");
if (!this.isCanceled()) {
try {
boolean finished = false;
long end = (timeout > 0 ? System.currentTimeMillis() + timeout : Long.MAX_VALUE);
while (!finished) {
if(this._task == null) {
return true;
}

and

@Override
public boolean waitForAnnounced(long timeout) {
logger.log(Level.WARNING, "waitForAnnounced");
if (!this.isAnnounced() && !this.willCancel()) {
try {
boolean finished = false;
long end = (timeout > 0 ? System.currentTimeMillis() + timeout : Long.MAX_VALUE);
while (!finished) {
if(this._task == null) {
return false;
}

As for your PS comment... I think we are OK. We start/register the JmDNS service in a separate thread, so it doesn't block any other behavior of the application. We can not keep the daemon running constantly since this is a mobile device and the network state changes often. The start up delay has not presented a user issue thus far.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-05

You need to be careful the task adds itself to the object so there may be window before the task is not null. I would do the test for the null task after the first timeout to give all threads time to run and do what they have to do. But I like your idea.

There is still a problem of what we do in the recovery if we cannot recover. I think we should have a call back I will have a look at it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

Right now, we are just calling .close() if the recover fails, which seems to work. However the callback idea seems like a good plan.

Calling .close() could cause the service to go away in case of transient network failures, however it's hard to test this.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

One other thing, we are calling cancel(); just before calling recoverTask. I think this is correct, since the current task is essentially dead; the recover() method will create a new task if necessary.

Wondering your thoughts on this, as well.

So, this is the change:

diff --git a/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java b/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
index a10f8a7..7004bfe 100644
--- a/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
+++ b/src/main/java/javax/jmdns/impl/tasks/state/DNSStateTask.java
@@ -142,6 +142,7 @@ public abstract class DNSStateTask extends DNSTask {
}
} catch (Throwable e) {
logger1.log(Level.WARNING, this.getName() + ".run() exception ", e);
+ cancel();
this.recoverTask(e);
}

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

Given your comments about the transient "initial null" potentially causing a problem when aborting if _task==null.

Would a better plan to have removeAssociation always set a task to the DNSState.CANCELED state? This would cause any wait loops to exit.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-05

I'm experimenting with a callback-based version of the code instead of the waitFor<State> blockers.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pierre Frisch - 2011-01-05

Let me know how it works I am very interested as this would be need to get JmmDNS the multi-hommed feature of JmDNS working properly.

We also need to get HostInfo multi-hommed aware.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-07

I have made a basic event-based version, and it seems to be working well.

The way I've done it is all DNSStatefulObjects now have two additional functions: registerSignal(String name, Runnable callback) and signal(String name).

The only signals I use right now are announced and canceled, and they can be used equally internally and externally to the library. This should probably be improved so that clients do not accidentally clobber internally handled signals.

Functions that are of the form:

operationA;
waitForAnnounced;
operationB;

are now:
registerSignal("announced", operationB_runnable);
operationA;

It needs a little more work so that you can set multiple callbacks for a signal, and handle queueing of multiple operations correctly, synchronize correctly, etc.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-07

On further though, It looks like probably the external clients should use the exposed ServiceTypeAdded/ServiceAdded listeners, and the signals should remain completely internal.

I originally wrote separate signal handler types, but the code got tedious so I switched to a <String, Runnable> map approach to reduce the code burden. It's less type-safe, but much more concise.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-10

My final first draft of evented JmDNS. No longer uses complicated "signal" mechanism. See comments for details.

evented_jmdns.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jason LeBrun - 2011-01-10

There are a few glitches in the patch I just posted, a new one will be coming shortly.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

waitForCancelled and waitForAnnounced can cause bad behavior

Group

Searches

Help

#82 waitForCancelled and waitForAnnounced can cause bad behavior

Discussion