From: Koichi S. <koi...@gm...> - 2014-02-19 04:50:39
2014-02-14 18:35 GMT+09:00 Andrei Martsinchyk <and...@gm...>:
>
> 2014-02-14 9:04 GMT+02:00 Masataka Saito <pg...@gm...>:
>
>> Thank you for your clever suggestion.
>>
>> > - Make Cancel more selective and affect only a specific query. That
>> > means introducing an ID for each query, which must be known to the
>> > client, and a way to deliver it.
>> > - Introduce a procedure for changing the backend key. An old cancel
>> > won't affect such a backend.
>>
>> I prefer the 2nd idea. But these ideas seem to require touching the
>> libpq infrastructure, and if I understand correctly, it is used not
>> only for inter-node communication but also for communication between a
>> coordinator and a frontend. Unless we can separate them, I think it is
>> better not to change it.
>>
> XC already extends the PG client-server protocol and uses the extension
> in inter-node communication. The suggested feature does not have to be
> available to external clients and therefore does not need to be
> supported by libpq.
>
>> > - Before starting a new query, check if there is a pending cancel and
>> > remove it. It sounds ridiculous, "cancel cancel", but it may work if
>> > queries and cancels are issued synchronously from a single source.
>>
>> I'm afraid the hypothesis is wrong. As I pointed out in my first mail, a
>> cancel and the subsequent request are not serialized at the target
>> node. That means that even if the query started with no pending
>> cancel, it could still be interrupted by the cancel request.
>>
>
> I am not sure exactly how the Cancel request is handled. If the server
> creates a session and sends back an acknowledgement before PQcancel
> returns, it is synchronous enough. The node sends the next command after
> PQcancel returns, so the respective session has either already placed
> the interrupt request or can be found in the Proc array. Either can be
> cleaned. If the Cancel is not synchronous enough, OK - just another bad
> idea, ignore it.

We may be able to implement this by adding a new lock to synchronize them,
and adding a command through libpq to handle it.
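The race in Masataka's original report (quoted further down) and the backend-key idea he prefers can be illustrated with a toy, single-threaded model. This is a sketch under stated assumptions, not Postgres-XC code. `Backend`, `replay` and `rotate_keys` are hypothetical names. A cancel request is modelled as a `(pid, key)` pair, and it is honoured only when both values match and a query is running. The coordinator is assumed to learn each new key; a real protocol would have to report it back, perhaps in a message similar to `BackendKeyData`. For simplicity the key is rotated on every query, which also makes it behave like the per-query ID of the first idea.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Backend:
    """Hypothetical datanode backend; a toy model, not Postgres-XC code."""
    pid: int
    key: int
    rotate_keys: bool = False  # models the "change backend key" idea
    running: Optional[str] = None
    errors: List[str] = field(default_factory=list)
    _fresh_keys: Iterator[int] = field(default_factory=lambda: count(1000))

    def run(self, sql: str) -> None:
        """A 'Q' message arrives; optionally switch to a fresh cancel key first."""
        if self.rotate_keys:
            self.key = next(self._fresh_keys)
        self.running = sql

    def finish(self) -> None:
        """The query ends and the backend reports 'Z' (ready for query)."""
        self.running = None

    def cancel(self, pid: int, key: int) -> None:
        """A cancel request arrives over a separate connection."""
        if pid == self.pid and key == self.key and self.running is not None:
            self.errors.append(f"canceled: {self.running}")
            self.running = None


def replay(rotate_keys: bool) -> List[str]:
    """Replay the reported steps in the problematic order."""
    dn = Backend(pid=1, key=42, rotate_keys=rotate_keys)
    dn.run("failing query")          # step 1: this query reports 'E'
    stale = (dn.pid, dn.key)         # steps 3-4: cancel is sent with the current key
    dn.finish()                      # step 5: 'Z' arrives before the cancel lands
    dn.run("ROLLBACK TRANSACTION")   # steps 6-7: ROLLBACK is sent at once
    dn.cancel(*stale)                # steps 8-10: the late cancel is delivered
    return dn.errors


print(replay(rotate_keys=False))  # ['canceled: ROLLBACK TRANSACTION']
print(replay(rotate_keys=True))   # []
```

The model deliberately ignores the window in which a cancel and a key switch race each other on the datanode. Closing that window would presumably need the kind of synchronization a lock provides.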
Adding a lock can bring additional issues, so I think we should be careful
and take time to show that it is safe, too. On the other hand, we have long
suffered from this, mainly in the regression tests. Masataka's idea could be
a quick hack, but it looks useful too.

Regards;
---
Koichi Suzuki

>
>>
>> Regards.
>>
>>
>> On 14 February 2014 14:06, Andrei Martsinchyk
>> <and...@gm...> wrote:
>> >
>> > You are right, the temp objects are a problem.
>> > On the one hand, if we run a long query and there was an error on one
>> > node, we want to cancel it on the others to avoid unnecessary waiting.
>> > On the other hand, the query may be near its natural end, and the
>> > cancel may be late and hit the next query.
>> > Just throwing out ideas:
>> > - Make Cancel more selective and affect only a specific query. That
>> > means introducing an ID for each query, which must be known to the
>> > client, and a way to deliver it.
>> > - Introduce a procedure for changing the backend key. An old cancel
>> > won't affect such a backend.
>> > - Before starting a new query, check if there is a pending cancel and
>> > remove it. It sounds ridiculous, "cancel cancel", but it may work if
>> > queries and cancels are issued synchronously from a single source.
>> >
>> > On 14.02.2014 4:07, "Koichi Suzuki" <koi...@gm...> wrote:
>> >
>> >> I misunderstood the implication. Anyway, the additional wait is
>> >> separate from your suggestion.
>> >>
>> >> Disconnecting the connection as you suggested will bring another
>> >> problem, such as TEMPORARY objects in the subsequent queries. We do
>> >> not support TEMPORARY objects, but I believe we should be consistent
>> >> on this for future releases.
>> >>
>> >> Thoughts?
>> >> ---
>> >> Koichi Suzuki
>> >>
>> >>
>> >> 2014-02-14 2:30 GMT+09:00 Andrei Martsinchyk
>> >> <and...@gm...>:
>> >> > Hello,
>> >> >
>> >> > Postgres establishes a separate connection to deliver the Cancel
>> >> > command to the target session.
>> >> > On a heavily loaded node it may take fairly long.
>> >> > A longer sleep would help, but it means longer recovery after an
>> >> > error. A better solution is to remove the canceled connection from
>> >> > the pool and therefore not use it to handle subsequent queries.
>> >> >
>> >> >
>> >> > 2014-02-13 11:10 GMT+02:00 Koichi Suzuki <koi...@gm...>:
>> >> >>
>> >> >> I think it hits the point. I tested this patch several times and it
>> >> >> seems to work fine. The delay time (at present 10ms) is short
>> >> >> enough, and it is applied only when we need to cancel a statement.
>> >> >>
>> >> >> We should check this into master and all the STABLE branches,
>> >> >> replacing the magic number with a meaningful name.
>> >> >>
>> >> >> Any thoughts?
>> >> >> ---
>> >> >> Koichi Suzuki
>> >> >>
>> >> >>
>> >> >> 2014-01-24 18:25 GMT+09:00 Masataka Saito <pg...@gm...>:
>> >> >> > Hello,
>> >> >> >
>> >> >> > As I've been exasperated by random failures, I'm willing to
>> >> >> > tackle the cause of the issue.
>> >> >> >
>> >> >> > This issue is related to the cancel of a failed query.
>> >> >> > When a datanode reports an error for a query, the coordinator
>> >> >> > sends a cancel request to the non-idle nodes, waits for the nodes
>> >> >> > to get ready, and then requests the nodes to roll back the
>> >> >> > transaction.
>> >> >> >
>> >> >> > Where's the problem? Consider the following case.
>> >> >> > 1. Datanode A (PID 1) reports an error to coordinator A.
>> >> >> >    ([1] 'E' message)
>> >> >> > 2. Coordinator A receives [1] and reports an error to a frontend.
>> >> >> >    ([2] 'E' message)
>> >> >> > 3. Coordinator A starts the abort process and believes datanode A
>> >> >> >    (PID 1) is not idle.
>> >> >> > 4. Coordinator A sends a cancel request for PID 1 to datanode A
>> >> >> >    (PID 2). ([3] cancel message)
>> >> >> > 5. Datanode A (PID 1) reports ready to coordinator A.
>> >> >> >    ([4] 'Z' message)
>> >> >> > 6. Coordinator A receives [4] and immediately sends "ROLLBACK
>> >> >> >    TRANSACTION". ([5] 'Q' message)
>> >> >> > 7. Datanode A (PID 1) receives [5] and starts processing the query.
>> >> >> > 8. Datanode A (PID 2) receives [3].
>> >> >> > 9. Datanode A (PID 2) notifies PID 1 of [3].
>> >> >> > 10. Datanode A (PID 1) cancels processing of [5] and reports an
>> >> >> >     error to coordinator A. ([6] 'E' message)
>> >> >> > 11. Coordinator A receives [6] and reports an error to a frontend.
>> >> >> >     ([7] 'E' message)
>> >> >> >
>> >> >> > [7] produces unexpected output, and a test fails.
>> >> >> >
>> >> >> > In the extreme case, the query following [5] could even be
>> >> >> > cancelled by [3].
>> >> >> >
>> >> >> > As far as I know, there's no way to know when the cancel request
>> >> >> > will be processed, so I think we have no choice but to wait an
>> >> >> > empirically chosen duration after cancelling, as the attached
>> >> >> > patch does.
>> >> >> >
>> >> >> > Does anyone have another cool idea to solve this issue?
>> >> >> >
>> >> >> > Regards.
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Postgres-xc-developers mailing list
>> >> >> > Pos...@li...
>> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>> >> >> >
>> >> >>
>> >> >
>> >> > --
>> >> > Andrei Martsinchyk
>> >> >
>> >> > StormDB - http://www.stormdb.com
>> >> > The Database Cloud
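The thread discusses two mitigations: Masataka's fixed post-cancel delay (10 ms in the patch) and Andrei's suggestion to drop a canceled connection from the pool. A small pool sketch can contrast them. Everything here is hypothetical. `Pool`, `Conn`, `cancel_sent` and the constant name `CANCEL_SETTLE_DELAY_SEC` are illustrative only, and creating a `Conn` in `acquire()` merely stands in for opening a new backend connection.

```python
import time
from dataclasses import dataclass, field
from typing import List

# Hypothetical name for the 10 ms "magic number" in Masataka's patch.
CANCEL_SETTLE_DELAY_SEC = 0.010


@dataclass
class Conn:
    pid: int
    cancel_sent: bool = False  # set when a cancel request was issued for this backend


@dataclass
class Pool:
    discard_canceled: bool = True
    idle: List[Conn] = field(default_factory=list)
    next_pid: int = 100

    def acquire(self) -> Conn:
        if self.idle:
            return self.idle.pop()
        self.next_pid += 1
        return Conn(pid=self.next_pid)  # stands in for opening a new backend

    def release(self, conn: Conn) -> None:
        if conn.cancel_sent:
            if self.discard_canceled:
                # Andrei's idea: never reuse the backend. A real pool would close
                # the connection here, so a late cancel finds no backend to interrupt.
                return
            # Patch-style mitigation: give the cancel time to land, then reuse.
            # This narrows the race window but does not close it.
            time.sleep(CANCEL_SETTLE_DELAY_SEC)
            conn.cancel_sent = False
        self.idle.append(conn)


for discard in (True, False):
    pool = Pool(discard_canceled=discard)
    first = pool.acquire()
    first.cancel_sent = True
    pool.release(first)
    print(discard, pool.acquire().pid == first.pid)
# True False
# False True
```

As Koichi points out above, discarding has its own cost. Any session state on the dropped connection, such as TEMPORARY objects, is lost for subsequent queries.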