From: <psc...@in...> - 2002-12-17 12:39:29
On 16 Dec 2002 at 15:48, David Jencks wrote:

> umm, I think I've been saying repeatedly that _all_ the sources I have
> seen on distributed system design say the _only_ reliable way to
> detect 1-5 is with timeouts on the server. So far no one has
> directly disputed this, just argued with the conclusion that to ensure
> reliability we should implement transaction timeouts on the server.
>
> Although many cases of 1-4 can be detected by the current connection
> keep-alive/timeout feature, I think this clearly relies very heavily
> on unwarranted assumptions about client architecture and good behavior
> of clients.

As far as I can tell, there are two methods of deciding whether a network transaction is still alive. Either we assume it is and ask it; if we get no response, we assume it died. Or we assume that if it is really old, it has probably died without falling over, and we make sure by driving a stake through its heart, otherwise known as the timeout method.

For network design, the second method may seem the better of the two. In this case, however, our primary concern isn't the network; it's data reliability and performance. I thought the whole discussion started because a stuck OAT (oldest active transaction) hurts performance. The real issue is how to prevent the OAT from getting stuck in the first place.

I don't mind having BOTH methods in place, as long as the timeout can be disabled. What I would like to see is this:

Two settings. The first is the timeout period, expressed in seconds; a maximum of something like a month's worth should be sufficient. The second says what to do about a transaction that exceeds it, with several options:

1) Ignore it
2) Warn: send a message of some kind to a DBA
3) Kill
4) Ignore read-only read committed (RORC) transactions

These would be additive, so for example, if you combine warn, kill and ignore RORC, the server would warn, wait the same period again and then kill, ignoring all read-only read committed transactions.
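To make the additive behaviour concrete, here is a minimal Python sketch of the decision the server might make for a single transaction. Everything named here (`TimeoutAction`, `Transaction`, `decide`, `MAX_TIMEOUT_SECONDS`) is hypothetical, as are three assumptions: "a month's worth" is read as 30 days, a timeout of 0 means the feature is disabled, and a transaction's age is measured from its start, so with WARN and KILL combined the warning fires after one period and the kill after two.

```python
from dataclasses import dataclass
from enum import IntFlag

# Hypothetical upper bound: "a month's worth" read as 30 days, in seconds.
MAX_TIMEOUT_SECONDS = 30 * 24 * 60 * 60


class TimeoutAction(IntFlag):
    """Additive actions; illustrative names, not an existing server setting."""
    IGNORE = 0       # option 1: do nothing about old transactions
    WARN = 1         # option 2: send a message to a DBA
    KILL = 2         # option 3: kill the transaction
    IGNORE_RORC = 4  # option 4: never touch read-only read committed transactions


@dataclass
class Transaction:
    tid: int
    age_seconds: int  # assumed: time since the transaction started
    read_only: bool
    read_committed: bool


def decide(tx: Transaction, timeout: int, actions: TimeoutAction) -> str:
    """Return 'none', 'warn' or 'kill' for one transaction."""
    if timeout == 0:
        return "none"  # assumed convention: 0 disables the timeout entirely
    if not 0 < timeout <= MAX_TIMEOUT_SECONDS:
        raise ValueError("timeout must be 0 (disabled) or 1..MAX_TIMEOUT_SECONDS")
    if actions & TimeoutAction.IGNORE_RORC and tx.read_only and tx.read_committed:
        return "none"
    warn = bool(actions & TimeoutAction.WARN)
    kill = bool(actions & TimeoutAction.KILL)
    if kill:
        # With WARN also set, the kill waits one more full period after the warning.
        kill_after = 2 * timeout if warn else timeout
        if tx.age_seconds >= kill_after:
            return "kill"
    if warn and tx.age_seconds >= timeout:
        return "warn"
    return "none"
```

A real implementation would also need to remember that a warning had already been sent rather than re-warning on every check; the sketch leaves that out so the policy itself stays visible.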