Re: [Postgres-xc-developers] Random failures in the regression test.


2014-02-14 18:35 GMT+09:00 Andrei Martsinchyk <and...@gm...>:
>
>
>
> 2014-02-14 9:04 GMT+02:00 Masataka Saito <pg...@gm...>:
>
>> Thank you for your clever suggestion.
>>
>> > - Make Cancel more selective so that it affects only a specific query.
>> > That means introducing an ID for each query, which must be known to the
>> > client, and a way to deliver it.
>> > - Introduce a procedure for changing the backend key. An old cancel won't
>> > affect such a backend.
>>
>> I prefer the 2nd idea. But these ideas seem to require touching the libpq
>> infrastructure and, if I understand correctly, it is used not only for
>> inter-node communication but also for communication between a coordinator
>> and a frontend. Unless we can separate them, I think it is better not to
>> change it.
>>
> XC already extends the PG client-server protocol and uses the extension in
> inter-node communication. The suggested feature does not have to be
> available to external clients and therefore does not need to be supported by
> libpq.
>
>>
>> > - Before starting a new query, check whether there is a pending cancel
>> > and remove it. "Cancelling a cancel" sounds ridiculous, but it may work
>> > if queries and cancels are issued synchronously from a single source.
>>
>> I'm afraid that hypothesis is wrong. As I pointed out at first, the cancel
>> and the subsequent request are not serialized at the target node. That
>> means that even if the query started with no pending cancel, it could
>> still be interrupted by the cancel request.
>>
>
> I am not sure exactly how a Cancel request is handled. If the server creates
> a session and sends back an acknowledgement before PGcancel returns, it is
> synchronous enough. The node sends the next command after PGcancel returns,
> so the respective session has either already placed the interrupt request or
> can be found in the Proc array. Either can be cleaned up. If the Cancel is
> not synchronous enough, OK - just another bad idea, ignore it.

We may be able to implement this by adding a new lock to synchronize
them and adding a command through libpq to handle it. Adding a lock
can bring additional issues, so I think we should be careful and take
the time to show that it is safe too.
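
A minimal sketch of the "lock plus clear pending cancel" idea follows. It is
a toy model in Python, not Postgres-XC code: the class `GuardedSession` and
all of its methods are hypothetical names invented for illustration, and the
lock stands in for whatever synchronization the real backend would need.

```python
import threading


class GuardedSession:
    """Toy model of a backend session (hypothetical, not Postgres-XC code)."""

    def __init__(self):
        self._lock = threading.Lock()   # serializes cancel vs. query start
        self._cancel_pending = False
        self._running = None

    def request_cancel(self):
        # Called on behalf of the separate cancel connection.
        with self._lock:
            self._cancel_pending = True

    def begin_query(self, query):
        # Drop any cancel that was aimed at the previous query.
        # Returns True if a stale cancel was discarded.
        with self._lock:
            dropped = self._cancel_pending
            self._cancel_pending = False
            self._running = query
        return dropped

    def check_interrupt(self):
        # Called periodically while the query runs.
        # Returns True if the running query was interrupted.
        with self._lock:
            if self._cancel_pending:
                self._cancel_pending = False
                self._running = None
                return True
            return False
```

In this model a cancel that arrives before `begin_query` is discarded, but a
cancel that arrives after it still interrupts the new query. That is the
remaining window Masataka describes above, so the sketch only helps if the
cancel is known to have been delivered before the next command is sent.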

On the other hand, we have long suffered from this, mainly in the
regression tests.

Masataka's idea may be a quick hack, but it looks useful too.

Regards;
---
Koichi Suzuki

>
>
>>
>> Regards.
>>
>>
>> On 14 February 2014 14:06, Andrei Martsinchyk
>> <and...@gm...> wrote:
>> >
>> > You are right, the temp objects are a problem.
>> > On the one hand, if we run a long query and there was an error on one
>> > node, we want to cancel it on the others to avoid unnecessary waiting.
>> > On the other hand, the query may be near its natural end, and the cancel
>> > may arrive late and hit the next query.
>> > Just throwing out ideas:
>> > - Make Cancel more selective so that it affects only a specific query.
>> > That means introducing an ID for each query, which must be known to the
>> > client, and a way to deliver it.
>> > - Introduce a procedure for changing the backend key. An old cancel won't
>> > affect such a backend.
>> > - Before starting a new query, check whether there is a pending cancel
>> > and remove it. "Cancelling a cancel" sounds ridiculous, but it may work
>> > if queries and cancels are issued synchronously from a single source.
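
The first idea in the list above (a cancel that names the query it targets)
can be sketched as follows. This is a toy model under stated assumptions: the
class `Session`, the per-query counter, and the method names are hypothetical
and do not correspond to any existing Postgres-XC or libpq interface.

```python
import itertools


class Session:
    """Toy backend session that tags each query with an ID (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.current_id = None      # ID of the running query, None when idle

    def start_query(self):
        # The ID would have to be reported back to the coordinator.
        self.current_id = next(self._ids)
        return self.current_id

    def finish_query(self):
        self.current_id = None

    def cancel(self, query_id):
        # Interrupt only if the named query is still the one running.
        if self.current_id == query_id:
            self.current_id = None
            return True
        return False
```

A late cancel that names an already finished query is then ignored instead of
hitting its successor. Changing the backend key between queries, the second
idea, would have a similar effect: the key carried by the old cancel no longer
matches.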
>> >
>> > On 14.02.2014 at 4:07, "Koichi Suzuki" <koi...@gm...> wrote:
>> >
>> >> I misunderstood the implication. Anyway, the additional wait is
>> >> separate from your suggestion.
>> >>
>> >> Disconnecting the connection as you suggested will bring another
>> >> problem, for example with TEMPORARY objects used in subsequent queries.
>> >> We do not support TEMPORARY objects, but I believe we should be
>> >> consistent on this for future releases.
>> >>
>> >> Thoughts?
>> >> ---
>> >> Koichi Suzuki
>> >>
>> >>
>> >> 2014-02-14 2:30 GMT+09:00 Andrei Martsinchyk
>> >> <and...@gm...>:
>> >> > Hello,
>> >> >
>> >> > Postgres establishes a separate connection to deliver the Cancel
>> >> > command to the target session.
>> >> > On a heavily loaded node this may take fairly long. A longer sleep
>> >> > would help, but it means a longer recovery after an error.
>> >> > A better solution is to remove the canceled connection from the pool
>> >> > and therefore not use it to handle subsequent queries.
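
The pool-based suggestion above can be illustrated with a small sketch. It is
only a model: `ConnectionPool`, `release`, and the `cancel_sent` flag are
hypothetical names, and connections are represented by plain strings rather
than real datanode connections.

```python
class ConnectionPool:
    """Toy pool that never reuses a connection that was sent a cancel."""

    def __init__(self, connections):
        self._idle = list(connections)
        self.discarded = []

    def acquire(self):
        # Hand out the most recently returned idle connection.
        return self._idle.pop()

    def release(self, conn, cancel_sent=False):
        if cancel_sent:
            # A cancel may still be in flight for this session, so the
            # connection is dropped instead of being returned to the pool.
            self.discarded.append(conn)
        else:
            self._idle.append(conn)

    def idle_count(self):
        return len(self._idle)
```

The cost of this approach is that any state held by the discarded session is
lost with it.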
>> >> >
>> >> >
>> >> >
>> >> > 2014-02-13 11:10 GMT+02:00 Koichi Suzuki <koi...@gm...>:
>> >> >>
>> >> >> I think this hits the point. I tested this patch several times and
>> >> >> it seems to work fine. The delay time (at present 10ms) is short
>> >> >> enough, and it is applied only when we need to cancel a statement.
>> >> >>
>> >> >> We should check this into master and all the STABLE branches,
>> >> >> replacing the magic number with a meaningful name.
>> >> >>
>> >> >> Any thoughts?
>> >> >> ---
>> >> >> Koichi Suzuki
>> >> >>
>> >> >>
>> >> >> 2014-01-24 18:25 GMT+09:00 Masataka Saito <pg...@gm...>:
>> >> >> > Hello,
>> >> >> >
>> >> >> > As I've been exasperated by the random failures, I want to pin
>> >> >> > down the cause of the issue.
>> >> >> >
>> >> >> > This issue is related to cancelling the failed query.
>> >> >> > When a datanode reports an error for a query, a coordinator sends
>> >> >> > a cancel request to the non-idle nodes, waits for those nodes to
>> >> >> > become ready, and requests the nodes to roll back the transaction.
>> >> >> >
>> >> >> > Where's the problem? Consider the following case.
>> >> >> > 1. Datanode A (PID 1) reports an error to coordinator A.
>> >> >> >    ([1] 'E' message)
>> >> >> > 2. Coordinator A receives [1] and reports an error to a frontend.
>> >> >> >    ([2] 'E' message)
>> >> >> > 3. Coordinator A starts the abort process and thinks datanode A
>> >> >> >    (PID 1) is not idle.
>> >> >> > 4. Coordinator A sends a cancel request about PID 1 to datanode A
>> >> >> >    (PID 2). ([3] cancel message)
>> >> >> > 5. Datanode A (PID 1) reports ready to coordinator A.
>> >> >> >    ([4] 'Z' message)
>> >> >> > 6. Coordinator A receives [4] and sends "ROLLBACK TRANSACTION"
>> >> >> >    immediately. ([5] 'Q' message)
>> >> >> > 7. Datanode A (PID 1) receives [5] and starts processing the query.
>> >> >> > 8. Datanode A (PID 2) receives [3].
>> >> >> > 9. Datanode A (PID 2) notifies PID 1 of [3].
>> >> >> > 10. Datanode A (PID 1) cancels processing of [5] and reports an
>> >> >> >     error to coordinator A. ([6] 'E' message)
>> >> >> > 11. Coordinator A receives [6] and reports an error to a frontend.
>> >> >> >     ([7] 'E' message)
>> >> >> >
>> >> >> > [7] produces unexpected output, and the test fails.
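
The interleaving in steps 1 to 11 can be replayed with a toy model. The sketch
below is not Postgres-XC code: the `Backend` class, its methods, and the
`replay` function are hypothetical, and the message letters are used only as
simplified labels ('E' error, 'Z' ready, 'C' command complete). The one
assumption that matters is that a cancel identifies the backend, not the
query, so it interrupts whatever is running when it arrives.

```python
class Backend:
    """Toy model of datanode A, PID 1 (hypothetical)."""

    def __init__(self):
        self.current = None   # query being processed, None when idle
        self.sent = []        # messages sent to the coordinator

    def start(self, query):
        self.current = query

    def fail(self):
        self.sent.append(("E", self.current))
        self.current = None

    def ready(self):
        self.sent.append(("Z", None))

    def finish(self):
        if self.current is not None:
            self.sent.append(("C", self.current))
            self.current = None

    def deliver_cancel(self):
        # The cancel does not name a query: it hits whatever is running.
        if self.current is not None:
            self.fail()


def replay(cancel_arrives_late):
    pid1 = Backend()
    pid1.start("failing query")
    pid1.fail()                           # step 1: [1] 'E'
    # step 4: cancel [3] is sent on a separate connection, now in flight
    if not cancel_arrives_late:
        pid1.deliver_cancel()             # nothing is running: no effect
    pid1.ready()                          # step 5: [4] 'Z'
    pid1.start("ROLLBACK TRANSACTION")    # steps 6-7: [5] 'Q'
    if cancel_arrives_late:
        pid1.deliver_cancel()             # steps 8-10: [6] 'E'
    pid1.finish()
    return pid1.sent
```

With `replay(False)` the rollback completes; with `replay(True)` the last
message is an 'E' for "ROLLBACK TRANSACTION", which corresponds to the
unexpected error [7] seen by the frontend.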
>> >> >> >
>> >> >> > In an extreme case, it could even happen that the query following
>> >> >> > [5] is cancelled by [3].
>> >> >> >
>> >> >> > As far as I know, there is no way to know when the cancel request
>> >> >> > gets processed, so I think we have no choice but to wait for an
>> >> >> > empirically chosen duration after cancelling, as in the attached
>> >> >> > patch.
>> >> >> >
>> >> >> > Does anyone have another cool idea to solve this issue?
>> >> >> >
>> >> >> > Regards.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Postgres-xc-developers mailing list
>> >> >> > Pos...@li...
>> >> >> >
>> >> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Andrei Martsinchyk
>> >> >
>> >> > StormDB - http://www.stormdb.com
>> >> > The Database Cloud
>> >> >
>> >
>> >
>> >
>> >
>
>
>
>
> --
> Andrei Martsinchyk
>
> StormDB - http://www.stormdb.com
> The Database Cloud
>