From: Masataka S. <pg...@gm...> - 2014-01-24 09:26:00
Hello,

I've been exasperated by random test failures, so I'd like to pin down the cause of the issue.

The issue is related to the cancellation of a failed query. When a datanode reports an error for a query, the coordinator sends a cancel request to the nodes that are not idle, waits for those nodes to become ready, and then asks them to roll back the transaction.

Where's the problem? Consider the following case.

1. Datanode A (PID 1) reports an error to coordinator A. ([1] 'E' message)
2. Coordinator A receives [1] and reports an error to the frontend. ([2] 'E' message)
3. Coordinator A starts the abort process and considers datanode A (PID 1) not idle.
4. Coordinator A sends a cancel request for PID 1 to datanode A (PID 2). ([3] cancel message)
5. Datanode A (PID 1) reports ready to coordinator A. ([4] 'Z' message)
6. Coordinator A receives [4] and immediately sends "ROLLBACK TRANSACTION". ([5] 'Q' message)
7. Datanode A (PID 1) receives [5] and starts processing the query.
8. Datanode A (PID 2) receives [3].
9. Datanode A (PID 2) notifies PID 1 of [3].
10. Datanode A (PID 1) cancels the processing of [5] and reports an error to coordinator A. ([6] 'E' message)
11. Coordinator A receives [6] and reports an error to the frontend. ([7] 'E' message)

[7] produces unexpected output, and the test fails. In an extreme case, even the query that follows [5] could be cancelled by [3].

As far as I know, there is no way to know when the cancel request will actually be processed, so I think we have no choice but to wait for an empirically chosen duration after cancelling, as in the attached patch.

Does anyone have a better idea for solving this issue?

Regards.
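
To make the ordering problem in the steps above easier to see, here is a small toy model in Python. It is only a sketch and not Postgres-XC code: the function name `frontend_messages`, the `cancel_position` knob and the message strings are all made up for illustration. It assumes that a cancel request hits whatever PID 1 happens to be running at the moment it is delivered, and that the originally failed query is no longer running once [1] has been sent.

```python
def frontend_messages(cancel_position):
    """Return the error messages the frontend sees in a toy model of the race.

    cancel_position is a hypothetical knob: the index in `events` before
    which cancel request [3] finally reaches backend PID 1. It stands in
    for the unpredictable delay between steps 4 and 9.
    """
    # Assumption: the failed query has already ended when [1] was sent,
    # so nothing is running on PID 1 at the start of the timeline.
    running = None
    seen = ["E: original error"]          # steps 1-2: [1] forwarded as [2]

    # Backend-side events that follow the cancel being sent in step 4.
    events = [
        ("ready", None),                  # step 5: 'Z' message [4]
        ("start", "ROLLBACK"),            # steps 6-7: 'Q' message [5]
        ("finish", None),                 # ROLLBACK would complete normally
        ("start", "next query"),          # the frontend's following query
        ("finish", None),
        ("end", None),                    # nothing left to run
    ]

    for i, (kind, query) in enumerate(events):
        # Steps 8-10: the cancel is delivered just before event i and
        # aborts whatever is running at that moment, if anything.
        if i == cancel_position and running is not None:
            seen.append(f"E: canceled {running}")   # [6] forwarded as [7]
            running = None
        if kind == "start":
            running = query
        elif kind in ("ready", "finish"):
            running = None
    return seen


if __name__ == "__main__":
    for pos in range(6):
        print(pos, frontend_messages(pos))
```

In this model a cancel delivered at position 0 or 1 is harmless, position 2 cancels the ROLLBACK and produces the extra 'E' message, and position 4 cancels the following query, which is the extreme case mentioned above. Waiting after the cancel only makes the late positions less likely; it does not rule them out.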