Deadlock detection

Developers
2013-11-17
2013-11-18
  • Guus Bonnema
    Guus Bonnema
    2013-11-17

    Hi Paul,

    I need to confirm the inner workings of deadlock detection as they are now.

    deadlockTest == true?
    First I would like to know:
    deadlockTest only has an initial value, and subnet switches it off (of course), but it is not set anywhere else. So I must assume deadlockTest is always true: correct?

    Deadlock detection procedure
    1. Every 0.5 second (500 ms) the await on CountDownLatch terminates.
    2. If so (and deadlockTest is true) and there is no error and no abort, then we scan all components. In the meantime, each component we look at, we gather information about the component and create strings with that information into a container. We look for one component that has status ACTIVE or LONG_WAIT. If so, no deadlock and we terminate the search, for a next round of waiting.
    3. If we found an active or waiting component, we wait again for 500 ms.
    4. If we found no active or waiting component, we wait a second time with "possible deadlock" enabled.
    5. if the "possible deadlock" is already enabled we have a winner! That is, we have a deadlock. So after 2x 500 ms successive waits we find no active or waiting component we start using the information we gathered on each loop.
    6. This waiting takes as long as the job is running which for an online component could be "forever".

    Could you confirm this picture of deadlock detection is accurate (enough)?

    Queries
    If so, I have the following queries:

    A. We gather of information on each component on every cycle of the wait, while looking for active / waiting components. If a job runs for 60 minutes, the we do this a total or 60 x 60 x 2 times = 7200 times! We only use the information if a deadlock really occurs: is this correct?

    B. The deadlock detection seems to me to look pretty complex. However, it should be possible to structure the methods in a way to make the main line stand out more, and the deadlock detection be less intrusive. Of course the processing would be the same, it would just be easier to follow. Would you approve of such a change?

    C. If deadlock detection is necessary in production, why make it optional?

    Regards, Guus.

     
  • Guus Bonnema
    Guus Bonnema
    2013-11-18

    I am progressing on understanding the logic in Network. I find that many processes within Network are based on pulling information from component and doing stuff that needs information from component. All together this complicates Network a lot, more than I think it should.

    Paul: I am not criticizing, but I do think improvement is possible.

    Example's of stuff the Network is doing, where I think Component should be doing it are:

    1. managing the timeouts (decrement, signalling timeout).
    2. managing the status of components (signalling change of status per component)

    I suspect there is more, but these stand out at the moment because I am looking at deadlock detection.

    proposal
    What I propose is to reverse the flow of information, which currently the Network pulls in, to the Components pushing out to anyone interested. Remark that I describe this mechanism for Network, but it might also work for any other objects that need information about Components (like intermediate reporting, logging, tracing, etc).

    Have the Network be an observer of all components. The components will all implement Observable, making each component observable to all subscribers. The Network or a specialized object that works for the Network will subscribe to all components to receive message when things get interesting.

    status change
    When a status change occurs, the component signals the subscribers what the new status is. The subscribers then update the information about the status they need.

    abort or error
    When an abort or error occurs, the component signals the subscribers, and the subscribers do whatever is necessary to gracefully terminate the network or if necessary end the network abruptly (like if you want a dump of the current situation).

    TimeoutHandler
    Every now and again (like every 500 ms) the TimeoutHandler will update the time remaining and when 0 or less, it signals its subscribers (probably Component or an object that works for Component) that the timeout expired. The Component will then decide what to do with the timeout: will it retry? will it throw an Error or Exception? It is the responsibility of the Component to decide and act. Subsequently it can signal the Network if it decides to quit on Error. Now the Network must decide what to do: ignore? Report and continue? Abort? This is the responsibility of the Network.

    Detecting a deadlock
    Detecting deadlock has become easy. Each time a Component changes status, the network, or an object working for the Network, will subtract 1 from the old status and add 1 to the new status. Periodically, with intervals that the Network decides on (or the user specifies), the Network will check the status of all components. So, if ACTIVE == LONG_WAIT == 0, for X timeslots in a row, we have a deadlock. X to be determined by Network or the designer of the Network.

    Network's responsibilities
    This way the Network is no longer responsible for updating the components timeout, managing its status or gathering info. The components will send their information to the Network when anything interesting happens.

    This would be my way of checking deadlock. It would releave the Network of responsibilities that were not it's responsibilities in the first place. Additionally it should simplify the logic in Network and friends.

    Let me know what you think.

    Kind regards, Guus.

     
  • Guus Bonnema
    Guus Bonnema
    2013-11-18

    Hi Paul,

    I am having second thoughts on my proposal. The point is that the approach of Observer - Observable (a pattern) works fine in sequential environments, but presents some serious concurrency problems in our environment.

    Let me think about it some more and I will search whether others have found a solution to the problem. Will get back to you shortly.

    Kind regards, Guus.

     
  • Sounds good generally - as long as it doesn't increase the overhead significantly :-)

    I do have one comment about deadlockTest - I don't feel it should be private with getters and setters - I really feel it should be static. This is based on a feeling (which may not fit Java thinking) that it is not a "settable" attribute of Network - it is built-in.

    My reasons:

    a) deadlockTest is true in Network in all situations, except when doing debugging (is there a compile test for DEBUG?)

    b) deadlockTest is false for subnets, because it is only meaningful at the level of the network as a whole

    c) you might want to change its settings if e.g. you are doing deadlock testing! But that's certainly not normal.

    When I experimentally changed deadlockTest to static, everything compiled OK, except for 2 tests in testsrc where it is explicitly set to true - and I have no idea why that code is there, as this is the default case anyway!

    Hope this makes some sense!

    Hope the root canal wasn't too bad!

    Talk again soon,

    Paul

     
    • Guus Bonnema
      Guus Bonnema
      2013-11-18

      Hi Paul,

      The root canal was hmmmmm. Started out without (verdoving) (where you get a shot so you dont feel it), because the nerve was dead. The nerve disagreed! So I got my (verdoving) and she finished it fine. I was glad to get out of the dentist chair.

      deadlockTest
      Yes, your reasoning makes sense: although it should be a constant i.e. final. It doesn't really matter whether it is static or not, as long as it is final. (static defines fields on class level).

      I will make it private static final so it will always be true. Also, I will create a public static getter that will return false if this is a SubNet and the constant otherwise (which normally contains true).

      proposal
      the approach I propose above has one disadvantage that I don't like: it needs constant synchronization which could lead to bad concurrency.

      I still think the approach is solid, but it needs something different because of the concurrency issues. As I said, I need to think on it and will certainly get back to you on this.

      In the mean time, I found a pdf presentation that explains exactly the problem that I am having with it. I suspect you will either already have it or you will find it interesting.

      You can it at: http://chess.eecs.berkeley.edu/pubs/871/MakingConcurrencyMainstream_CONCUR.pdf

      let me know whether you like it. It describes exactly the problem with observer-observable design pattern in concurrency, and it describes a solution! However, I have some trouble implementing it, because it entails applying "Actor-oriented component architectures implemented in coordination languages that complement rather than replace existing languages."!

      Or flow based programming!

      Kind regards, Guus.

       
  • Guus Bonnema
    Guus Bonnema
    2013-11-18

    P.S. how can I program components in the system that is supposed to manage the components? If you know the answer to that one, problem solved!

     
  • I am sorry - I really wish people like Lee would read my book!!! He says correctly that the real world is concurrent, but then goes off into abstruse math (at least abstruse for me). This seems so much like trying to prove mathematically that a bumblebee can fly - the bumblebee just goes ahead and does it anyway! The real, basic, problem IMO is that people, esp. IT academics, get really uncomfortable when they can't control the exact sequence of events, so they try to over-control the sequence of everything.

    My personal advice FWIW is not to sweat it - leave the Network pulling when it needs to - most of the time it doesn't care what the Components are doing, whereas they could waste a lot of CPU time pushing stuff... Surely setting an int status variable is way cheaper than doing a notify...?

    Hope I haven't offended you - deadlock detection really doesn't have to be very precise - you just want to avoid false positives, but if the network hasn't done anything for a while, you've probably got a deadlock - and in my 40 years of experience with FBP, all our deadlocks were design issues!

     
  • Guus Bonnema
    Guus Bonnema
    2013-11-18

    Don't get me wrong: your detection as far as I can see is perfect. It responds to the symptoms of a deadlock, which is excellent.

    I am not out to change what it does (algorithm), but how it does it (implementation).

    OO ramble
    Its an OO thing: keeping responsibilities where they belong. This in itself does not cost extra resources (not to speak of), but decreases coupling and complexity. It is the cross thread coordination that costs extra resources.

    Cross thread communication
    And JavaFBP is all about cross thread coordination. Thats why, even your implementation has a lot of cross thread communication. In my proposal I was trying to minimize the cross thread communication by using the observer design pattern. And if it would work in concurrent programs, that would be fine. Then I got second thoughts and found this presentation of mister Lee. And now I am pondering a solution.

    The point is not whether Network is pulling or Component is pushing. The point is that Network is doing administration that it does not need to do. It is extra complexity that we can easily miss (like a root canal xD), without changing the algorithm.

    What I still need to figure out is how to reach minimize cross thread communication while make the remainder of the cross thread communication safe and performant.

    Let me chew on this a little longer and I will find a solution that we both like.

    Summary
    There are 2 reasons for wanting to keep the classes clean w.r.t. responsiblities:

    1. Complexity is lower
    2. less cross thread communication
    3. Less coupling between classes.

    Taking offence and bad idea's
    You did not offend me (never thought I would be saying that :) ). Discussion is necessary for bad idea's to change into good idea's and good idea's evolve into great idea's. (this is candidate for one of John Cowan postsignature thingies).

    Kind regards, Guus.

    P.S. Our discussions are all over the place: this one belongs in a different thread, I believe :)

     
  • Hi Guus, what are John's postsignature thingies? I think I missed that one!

    Clean interfaces are goodness for me too! So, if you are happy that these changes result in minimal increase in overhead, vas-y!

    PS anaesthetic? I think Brits spell it with ae, Yankees with e, Canadians can't decide. Did you know that Yankee was Dutch (Jan Kies, I believe).