From: Koichi S. <koi...@gm...> - 2014-02-13 09:10:28
I think it hits the point. I tested this patch several times and it seems to work fine. The delay (currently 10 ms) is short enough, and it is applied only when we need to cancel a statement. We should check this into master and all the STABLE branches, replacing the magic number with a meaningfully named constant.

Any thoughts?

---
Koichi Suzuki

2014-01-24 18:25 GMT+09:00 Masataka Saito <pg...@gm...>:
> Hello,
>
> As I've been exasperated by random failures, I'd like to eliminate the
> cause of the issue.
>
> This issue is related to cancellation of a failed query.
> When a datanode reports an error for a query, the coordinator sends a
> cancel request to the non-idle nodes, waits for those nodes to become
> ready, and then requests the nodes to roll back the transaction.
>
> Where's the problem? Consider the following case.
> 1. Datanode A (PID 1) reports an error to coordinator A. ([1] 'E' message)
> 2. Coordinator A receives [1] and reports an error to the frontend. ([2] 'E'
> message)
> 3. Coordinator A starts the abort process and thinks datanode A (PID 1) is
> not idle.
> 4. Coordinator A sends a cancel request for PID 1 to datanode A (PID 2).
> ([3] cancel message)
> 5. Datanode A (PID 1) reports ready to coordinator A. ([4] 'Z' message)
> 6. Coordinator A receives [4] and sends "ROLLBACK TRANSACTION" immediately.
> ([5] 'Q' message)
> 7. Datanode A (PID 1) receives [5] and starts processing the query.
> 8. Datanode A (PID 2) receives [3].
> 9. Datanode A (PID 2) notifies PID 1 of [3].
> 10. Datanode A (PID 1) cancels processing of [5] and reports an error to
> coordinator A. ([6] 'E' message)
> 11. Coordinator A receives [6] and reports an error to the frontend. ([7] 'E'
> message)
>
> [7] produces unexpected output and a test fails.
>
> In an extreme case, even the query that follows [5] could be cancelled
> by [3].
>
> As far as I know, there's no way to know when the cancel request gets
> processed, so I think we have no choice but to wait for an empirically
> chosen duration after cancelling, as in the attached patch.
>
> Does anyone have another cool idea to solve this issue?
>
> Regards.
>
> _______________________________________________
> Postgres-xc-developers mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
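
The ordering in steps 1 to 11 of the quoted message can be reproduced with a small toy model. The sketch below is Python, not Postgres-XC code: the `Backend` class, the `abort_sequence` function, and the message strings are hypothetical names introduced only for illustration. It also assumes the cancel request is actually delivered during the wait, which a fixed delay such as 10 ms makes likely but cannot guarantee.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Backend:
    """Toy stand-in for datanode backend PID 1 (hypothetical, not Postgres-XC code)."""
    running: Optional[str] = None                   # statement in progress, if any
    sent: List[str] = field(default_factory=list)   # messages sent to the coordinator

    def start(self, sql: str) -> None:
        self.running = sql

    def finish(self) -> None:
        # Statement completed normally: command complete, then ready ('Z').
        self.sent += [f"C {self.running}", "Z"]
        self.running = None

    def fail(self) -> None:
        # Statement failed on its own: error ('E'), then ready ('Z').
        self.sent += [f"E {self.running} failed", "Z"]
        self.running = None

    def deliver_cancel(self) -> None:
        # A cancel interrupts whatever is running *now*; an idle backend ignores it.
        if self.running is not None:
            self.sent += [f"E {self.running} canceled", "Z"]
            self.running = None


def abort_sequence(wait_for_cancel: bool) -> List[str]:
    """Return the messages the coordinator sees from the backend."""
    backend = Backend()
    backend.start("INSERT")
    backend.fail()                  # steps 1 and 5: 'E', then 'Z'
    # Step 4: the coordinator has sent a cancel request that is still in flight.
    if wait_for_cancel:
        backend.deliver_cancel()    # cancel lands while the backend is idle: ignored
        backend.start("ROLLBACK")
        backend.finish()
    else:
        backend.start("ROLLBACK")   # steps 6-7: ROLLBACK sent right after 'Z'
        backend.deliver_cancel()    # steps 8-10: the late cancel hits ROLLBACK
    return backend.sent


if __name__ == "__main__":
    print(abort_sequence(wait_for_cancel=False))
    print(abort_sequence(wait_for_cancel=True))
```

Without the wait, the second error in the returned list corresponds to message [6] in the quoted description; with the wait, the ROLLBACK completes normally. The model delivers the cancel at a fixed point rather than after a real delay, so it shows only why the ordering matters, not how long to wait.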