From: Roman R. <rro...@ac...> - 2002-12-16 23:24:19
Paul,

> The issue isn't the timeout, or timeout period, but whether this fixes the
> problem or just masks the problem. It's like a toothache, sure the
> aspirin takes away the pain (temporarily), but until you go to the dentist
> and get the tooth fixed, the problem is still there. In this case the
> stalling of the OAT is the toothache.

I would say that the toothache is the thing we have to cure, and we have a backup/restore medicine for it. However, the reason for this toothache is that you did not clean your teeth twice a day. And we want to enforce this teeth-cleaning procedure by reliably detecting that the teeth have not yet been cleaned.

> So what can be done about it, first we need to enumerate all of the ways
> of causing a transaction to fail in the middle.
>
> 1) The client machine crashes.
> 2) Client machine suffers a power off situation.
> 3) Network cable or device failure.
> 4) Application fails to commit/rollback the transaction before ending.
> 5) Application opens a writable transaction and leaves it open for an
> extended period of time.

In the theory of fault-tolerant systems, people define three classes of failures:

- failures the system can detect and tolerate;
- failures the system can only detect;
- failures the system cannot detect.

A crash of the client machine, a physical link failure, and ending a tx without commit/rollback seem to be failures the engine should detect and tolerate (by rolling back the tx and performing some additional steps). The question is whether we can reliably detect them.

As I understood from the discussion, we have a failure detector (FD) that sends some keep-alive data to the client along with the response to the client's query. If the client fails to respond to that keep-alive message, it is suspected (are there any additional checks of the suspected node?). As I understand it, this FD is not able to say whether a client with a long open transaction is still alive or not, because the server never sends a keep-alive message to the client on its own. Am I right?

If yes, then problem 5) belongs to the "undetectable faults". And, if yes, isn't it natural to extend the server FD to be able to "ping" a client (with the ping interval and max. pong delay specified in the server config) to detect whether the client is alive or not? This FD will be server-centric and will need a background thread running and pinging connections.

David proposes a different FD scheme where each client has to say to the server "I will be cleaning my teeth at 9:00am and 9:00pm" at the beginning of the transaction. And if the client fails to call the server between 9:00 and 9:05 am/pm and say "teeth are clean", the server suspects that client without any additional checks. This scheme seems to be more elegant than a busy server asking each client at 9:05 am/pm "hey, did you clean your teeth?"... The difference is like that between the regime in the army and in a kindergarten. Also this scheme tolerates more types of failures (at least problem 5) becomes a "detectable-tolerable fault").

However, what is not clear to me is whether the current FD is so bad that we have to replace it with something new. Can we create a list of failures that the current FD cannot detect? Right now only one type of failure is not detectable:

- long-running transaction with open socket.

Are there any others?

Fixing the client in this case will not solve this problem, because that assumes the client software is 100% correct and works as it was intended to work. Unfortunately this is not the case and will not be the case for a long time. Each system can fail, and fail-stop is the simplest class of failures. Byzantine faults are more severe faults, and we have to deal with them on the server. But personally I would just create a list of faults the engine cannot tolerate and put it somewhere on the web. As far as I know we are not writing an engine for a nuclear plant, so tolerating fail-stop faults should be enough.

Best regards,
Roman Rokytskyy
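
The two detector schemes discussed in this message (the server-centric ping with a maximum pong delay, and the client-declared check-in schedule) can be sketched roughly as follows. This is only a minimal illustration in Python, not the engine's code: the class names `PingDetector` and `LeaseDetector`, their methods, and the explicit `now` arguments are all hypothetical, and time is passed in by hand so the logic stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class PingDetector:
    """Server-centric scheme: the server pings, the client must pong in time.

    A background thread (not shown) would call ping() for every connection
    once per configured ping interval.
    """
    max_pong_delay: float                          # seconds the server waits for a pong
    pending: dict = field(default_factory=dict)    # client id -> time of unanswered ping

    def ping(self, client, now):
        # Remember the oldest unanswered ping for this client.
        self.pending.setdefault(client, now)

    def pong(self, client):
        # The client answered, so it is no longer under suspicion.
        self.pending.pop(client, None)

    def suspects(self, now):
        # Clients whose ping has gone unanswered for too long.
        return {c for c, t in self.pending.items() if now - t > self.max_pong_delay}


@dataclass
class LeaseDetector:
    """Client-declared scheme: the client promises when it will check in."""
    grace: float                                   # allowed lateness, e.g. the "9:00 to 9:05" window
    deadline: dict = field(default_factory=dict)   # tx id -> next promised check-in time

    def begin(self, tx, first_checkin):
        # At transaction start the client announces its first check-in time.
        self.deadline[tx] = first_checkin

    def check_in(self, tx, next_checkin):
        # "Teeth are clean": the client reports in and promises the next time.
        self.deadline[tx] = next_checkin

    def end(self, tx):
        # Commit or rollback: nothing left to monitor.
        self.deadline.pop(tx, None)

    def suspects(self, now):
        # Transactions whose promised check-in plus grace period has passed.
        return {tx for tx, d in self.deadline.items() if now > d + self.grace}
```

In the second sketch the server does no polling at all: it only compares the clock against promises the clients made, which is why a long-open transaction on a dead client (problem 5) shows up as a missed check-in.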