Re: [Firebird-devel] Max transaction duration

Paul,

> The issue isn't the timeout, or timeout period, but whether this fixes the
> problem or just masks the problem.  It's like a toothache, sure the
> aspirin takes away the pain (temporarily), but until you go to the dentist
> and get the tooth fixed, the problem is still there.  In this case the
> stalling of the OAT is the toothache.

I would say the toothache is the thing we have to cure, and we already have
a medicine for it: backup/restore. However, the cause of this toothache is
that you did not clean your teeth twice a day. We want to enforce the teeth
cleaning by reliably detecting that the teeth have not been cleaned yet.

> So what can be done about it, first we need to enumerate all of the ways
> of causing a transaction to fail in the middle.
>
> 1) The client machine crashes.
> 2) Client machine suffers a power off situation.
> 3) Network cable or device failure.
> 4) Application fails to commit/rollback the transaction before ending.
> 5) Application opens a writable transaction and leaves it open for an
> extended period of time.

In the theory of fault-tolerant systems, failures are divided into three
classes:
- failures the system can detect and tolerate;
- failures the system can only detect;
- failures the system cannot detect.

A crash of the client machine, a physical link failure, and ending a tx
without commit/rollback seem to be failures the engine should detect and
tolerate (by rolling back the tx and performing some additional steps).

The question is whether we can detect them reliably.

As I understood from the discussion, we have a failure detector (FD) that
sends some keep-alive data to the client together with the response to the
client's query. If the client fails to respond to that keep-alive message,
it is suspected (are there any additional checks of the suspected node?).

As I understand it, this FD cannot tell whether a client with a long-open
transaction is still alive, because the server never sends a keep-alive
message to the client on its own. Am I right?

If yes, then problem 5) belongs to the "undetectable faults". And in that
case, isn't it natural to extend the server FD so that it can "ping" a
client (with the ping interval and max. pong delay specified in the server
config) to detect whether the client is alive?

This FD will be server-centric and will need a background thread that runs
and pings the connections.
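
To make the server-centric idea concrete, here is a rough sketch of what one
pass of such a background thread might do. This is not Firebird code: the
names PingDetector, Connection, tick and on_pong are made up for
illustration, ping_interval and max_pong_delay stand in for the two config
values mentioned above, and time is passed in explicitly so the logic can be
followed without a real network or clock.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Connection:
    """Bookkeeping for one client connection (hypothetical structure)."""
    last_pong: float                    # when the client last answered
    ping_sent: Optional[float] = None   # when the unanswered ping went out


class PingDetector:
    """Server-centric failure detector: the server pings, the client pongs."""

    def __init__(self, ping_interval: float, max_pong_delay: float) -> None:
        self.ping_interval = ping_interval
        self.max_pong_delay = max_pong_delay
        self.connections: Dict[str, Connection] = {}

    def register(self, conn_id: str, now: float) -> None:
        self.connections[conn_id] = Connection(last_pong=now)

    def on_pong(self, conn_id: str, now: float) -> None:
        # The client answered: clear the outstanding ping.
        conn = self.connections[conn_id]
        conn.last_pong = now
        conn.ping_sent = None

    def tick(self, now: float) -> List[str]:
        """One pass of the background thread; returns suspected connections."""
        suspects = []
        for conn_id, conn in self.connections.items():
            if conn.ping_sent is not None:
                # A ping is outstanding: suspect the client if it is overdue.
                if now - conn.ping_sent > self.max_pong_delay:
                    suspects.append(conn_id)
            elif now - conn.last_pong >= self.ping_interval:
                # Assumption: a real server would write a ping packet here.
                conn.ping_sent = now
        return suspects
```

Note that the server has to touch every connection on every pass, whether or
not anything is wrong with it.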

David proposes a different FD scheme where each client has to tell the
server "I will be cleaning my teeth at 9:00am and 9:00pm" at the beginning
of the transaction. If the client fails to call the server between 9:00 and
9:05 am/pm and say "teeth are clean", the server suspects that client
without any additional checks.

This scheme seems more elegant than a busy server asking each client at
9:05 am/pm "hey, did you clean your teeth?"... The difference is like that
between the regime in the army and in a kindergarten. This scheme also
tolerates more types of failures (at least problem 5) becomes a "detectable
and tolerable fault").
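
As I read David's proposal, it could be sketched roughly as below. Again
this is only an illustration and not anyone's actual design: DeadlineDetector,
begin, check_in and the grace parameter (the "five minutes" in the example
above) are names I invented, and the clock value is passed in explicitly.

```python
from typing import Dict, List


class DeadlineDetector:
    """Client-declared schedule: the client promises to check in on time."""

    def __init__(self, grace: float) -> None:
        self.grace = grace                    # allowed lateness per check-in
        self.periods: Dict[str, float] = {}   # promised check-in period
        self.deadlines: Dict[str, float] = {} # next expected check-in time

    def begin(self, tx_id: str, period: float, now: float) -> None:
        # At transaction start the client declares how often it will report.
        self.periods[tx_id] = period
        self.deadlines[tx_id] = now + period

    def check_in(self, tx_id: str, now: float) -> None:
        # "Teeth are clean": push the deadline one period into the future.
        self.deadlines[tx_id] = now + self.periods[tx_id]

    def end(self, tx_id: str) -> None:
        # Commit or rollback: nothing left to watch.
        self.periods.pop(tx_id, None)
        self.deadlines.pop(tx_id, None)

    def suspects(self, now: float) -> List[str]:
        # No extra probing: a missed deadline alone makes the tx suspect.
        return [tx for tx, deadline in self.deadlines.items()
                if now > deadline + self.grace]
```

Here the server sends nothing on its own. It only compares stored deadlines
with the clock, so a client that holds a transaction open and stops
reporting is suspected even though its socket is still open.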

However, it is not clear to me whether the current FD is so bad that we have
to replace it with something new. Can we create a list of failures that the
current FD cannot detect?

Right now only one type of failure is not detectable:

- long-running transaction with open socket.

Are there any others?

Fixing the client will not solve this problem in this case, because that
assumes the client software is 100% correct and works as it was intended to
work. Unfortunately this is not the case and will not be the case for a
long time. Every system can fail, and fail-stop is the simplest class of
failures. Byzantine faults are more severe, and we have to deal with them
on the server.

But personally I would just create a list of the faults the engine cannot
tolerate and put it somewhere on the web. As far as I know we are not
writing an engine for a nuclear plant, so tolerating fail-stop faults should
be enough.

Best regards,
Roman Rokytskyy
