#33 Error handling rework

Milestone: V.1.1
Status: open
Priority: 5
Updated: 2012-06-03
Created: 2012-04-16
Private: No

I am trying to rework the logic around handling multiple connections at the same time, but I can't see how to finish it right now, so I thought it would be a good idea to record my findings for someone else to look at later. One of the biggest problems I face with the current logic is that there is no easy way to know what happened on an individual connection, which matters in many cases, especially commit/abort/prepare handling. I think we can teach pgxc_node_receive_responses() and pgxc_node_receive() to look at all connections and track success/failure for each connection individually. We already have some support for this in the form of the add_error_message() APIs, but it feels incomplete. Here are some general comments based on my observations.

1. We have two mechanisms to track errors: a) the "state" member in the connection handle and b) the "transaction_status" member in the connection handle. We need to clearly define the roles of these members and decide whether we really need both. If yes, why? If not, we should see if we can combine them. I would vote for combining them, because having two mechanisms is confusing.

2. In pgxc_node_receive_responses(), we throw an error for unexpected responses, but then continue with additional processing. Obviously, once the error is thrown, those next steps can never execute. We need to fix that: either don't throw the error there, or get rid of the unreachable steps.

3. ValidateAndCloseCombiner() looks for error messages but doesn't handle them. There are places where we call this function without any additional handling for errors that might have occurred. I wonder whether that's intentional or whether we are potentially missing some error handling.

4. As I said in the last F2F meeting, we need a mechanism to handle different errors from different nodes. The combiner logic today discards any subsequent errors on other connections, which can confuse the user. In the long run, we need a mechanism to log errors received from different nodes separately.

5. AbortTransaction should usually take care of any cleanup. That can include cleaning up the remote connections and getting them into a sane state so that subsequent commands can be sent without any issue. I did some of that during the refactoring work, but we need to make sure we don't add any special handling in elog.c etc.; all such things should be handled by AbortTransaction and the related functions.

I have a not-even-half-cooked patch for some of these issues. I don't feel comfortable sending it out because I don't want someone to base further work on it, but if you want, I can send it across.

Discussion

  • mason_s

    mason_s - 2012-04-16

    Regarding 2: this may be to make sure that all messages which may already have been sent to us and buffered are consumed. Otherwise, when we reuse the connection later, we will read these old messages while trying to execute another statement. An alternative may be to simply destroy the other connections (or all connections used by this transaction) and let the pooler create new ones later.

    Regarding 3: it may be that when an error occurred and we were trying to reuse connections and return them to the pooler, other error messages sometimes appeared on the connection after we had already sent an error message to the client. If we send those additional error messages too, then the next time the client sends a request and reads from the connection, it may read one of these stale error messages.

    Regarding 4: Andrei can answer better, but maybe it simply did not matter at that point; we encountered an error, so we just proxied the first one back. We could wait for all of the others, I suppose, and send all distinct error messages, including a count and/or the node names. That may contradict the change suggested for (2) above, namely that if some error occurs, we don't bother doing any additional processing.

    To summarize: in general there were issues with existing messages being buffered and then read in on subsequent requests, and we need to be mindful of that when reworking. Perhaps an easy way to handle this is: if there is an error, try to cancel the running query and simply destroy the existing connections.

  • mason_s

    mason_s - 2012-04-16

    I want to add additional comments.

    Another consideration is transaction handling. If we send an error, a client application may issue a ROLLBACK. So, we don't necessarily want to just destroy the connections.

    This gets trickier when we want to handle savepoints. To support them, we would want the code to keep consuming messages even after an error has occurred on another node, so that we can later roll back to a particular savepoint if we want to.

  • Koichi Suzuki

    Koichi Suzuki - 2012-06-03
    • assigned_to: nobody --> amitdkhan
  • Koichi Suzuki

    Koichi Suzuki - 2012-06-03
    • labels: 2084521 --> Coordinator
    • milestone: --> V.1.1
