[Postgres-xc-developers] Random failures in the regression test.

Hello,

I've been exasperated by random failures in the regression test, so I'd
like to eliminate their cause.

The issue is related to cancelling a failed query.
When a datanode reports an error for a query, the coordinator sends a cancel
request to the non-idle nodes, waits for those nodes to become ready, and
then asks them to roll back the transaction.

Where's the problem? Consider the following case.
1. Datanode A (PID 1) reports an error to coordinator A. ([1] 'E' message)
2. Coordinator A receives [1] and reports the error to the frontend. ([2] 'E'
message)
3. Coordinator A starts its abort processing and considers datanode A (PID
1) to be non-idle.
4. Coordinator A sends a cancel request for PID 1 to datanode A (PID 2).
([3] cancel message)
5. Datanode A (PID 1) reports ready to coordinator A. ([4] 'Z' message)
6. Coordinator A receives [4] and immediately sends "ROLLBACK TRANSACTION".
([5] 'Q' message)
7. Datanode A (PID 1) receives [5] and starts processing the query.
8. Datanode A (PID 2) receives [3].
9. Datanode A (PID 2) notifies PID 1 of [3].
10. Datanode A (PID 1) cancels the processing of [5] and reports an error to
coordinator A. ([6] 'E' message)
11. Coordinator A receives [6] and reports the error to the frontend. ([7] 'E'
message)

[7] produces unexpected output, so the test fails.
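
To make the ordering easier to follow, here is a small toy replay of the
steps above in Python. It is only a sketch and not Postgres-XC code: the
Backend class, its methods, and replay() are hypothetical names, and the
model assumes that a cancel arriving while the backend is idle is simply
ignored. The only thing that varies is whether [3] is delivered before or
after [5] starts.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Toy model of the datanode backend (PID 1). Not Postgres-XC code."""
    current: str = ""                          # query in progress, "" when idle
    sent: list = field(default_factory=list)   # messages sent to the coordinator

    def start(self, query: str) -> None:
        self.current = query

    def fail(self) -> None:
        self.sent.append(f"E ({self.current} failed)")

    def ready(self) -> None:
        self.current = ""
        self.sent.append("Z")

    def deliver_cancel(self) -> None:
        # Assumption of this model: a cancel that arrives while the backend
        # is idle is ignored; otherwise it aborts the current query.
        if self.current:
            self.sent.append(f"E ({self.current} cancelled)")
            self.ready()


def replay(cancel_arrives_late: bool) -> list:
    """Replay the scenario and return what the backend sent."""
    b = Backend()
    b.start("test query")
    b.fail()                          # step 1: [1] 'E'
    # step 4: cancel [3] is sent on a separate connection and is in flight
    b.ready()                         # step 5: [4] 'Z'
    if not cancel_arrives_late:
        b.deliver_cancel()            # lands while idle: no effect
    b.start("ROLLBACK TRANSACTION")   # steps 6-7: [5] 'Q'
    if cancel_arrives_late:
        b.deliver_cancel()            # steps 8-10: [6] 'E'
    else:
        b.ready()                     # ROLLBACK completes normally
    return b.sent


print(replay(cancel_arrives_late=False))
print(replay(cancel_arrives_late=True))
```

In the late case the list contains a second 'E' entry, which corresponds to
[6] and therefore to the unexpected [7] seen by the frontend.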

In an extreme case, [3] could even cancel the query that follows [5].

As far as I know, there's no way to know when the cancel request will be
processed, so I think we have no choice but to wait for an empirically
chosen duration after cancelling, as the attached patch does.
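
The following toy calculation (Python, not the attached patch) shows why
such a wait can only be empirical. All names and numbers are hypothetical:
the cancel is assumed to be sent at t=0 and to reach PID 1 after
cancel_latency_ms, the coordinator is assumed to wait grace_ms before
sending [5], and [5] is assumed to run for rollback_ms.

```python
def cancel_lands_on(cancel_latency_ms: int, grace_ms: int,
                    rollback_ms: int = 5) -> str:
    """Classify where cancel [3] arrives relative to ROLLBACK [5].

    Hypothetical timeline: [5] runs from grace_ms until
    grace_ms + rollback_ms, and [3] arrives at cancel_latency_ms.
    """
    rollback_start = grace_ms
    rollback_end = grace_ms + rollback_ms
    if cancel_latency_ms < rollback_start:
        return "idle (ignored)"    # cancel arrived before [5] was sent
    if cancel_latency_ms < rollback_end:
        return "ROLLBACK [5]"      # the race described above
    return "after [5]"             # may hit idle time or a later query


for latency in (2, 12, 20):
    print(latency, cancel_lands_on(latency, grace_ms=10))
```

The wait helps only while the cancel latency stays below grace_ms. Since the
latency is not bounded, no fixed value is guaranteed to be enough.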

Does anyone have another cool idea to solve this issue?

Regards.