From: Amit K. <ami...@en...> - 2013-04-04 06:57:02
On 3 April 2013 15:10, Ashutosh Bapat <ash...@en...> wrote:

> Hi Amit,
>
> Given the magnitude of the change, I think we should break this work into smaller self-sufficient patches and commit them (either on master or in a separate branch for trigger work). This will allow us to review and commit small amounts of work and set them aside, rather than going over everything in every round.

I have created a new branch "rowtriggers" in postgres-xc.git.sourceforge.net/gitroot/postgres-xc/postgres-xc, where I have dumped incremental changes.

> On Wed, Apr 3, 2013 at 10:46 AM, Amit Khandekar <ami...@en...> wrote:
>
>> On 26 March 2013 15:53, Ashutosh Bapat <ash...@en...> wrote:
>>
>>> On Tue, Mar 26, 2013 at 8:56 AM, Amit Khandekar <ami...@en...> wrote:
>>>
>>>> On 4 March 2013 11:11, Amit Khandekar <ami...@en...> wrote:
>>>>
>>>>> On 1 March 2013 13:53, Nikhil Sontakke <ni...@st...> wrote:
>>>>>
>>>>> >> Issue: Whether we should fetch the whole row from the datanode (the OLD row), and not just the ctid, node_id and required columns, and store it at the coordinator for processing, OR whether we should fetch each row (OLD and NEW variants) while processing each row.
>>>>> >>
>>>>> >> Both of them have performance impacts - the first one has a disk impact for a large number of rows, whereas the second has a network impact for querying rows. Is it possible to do some analytical assessment as to which of them would be better? If you can come up with something concrete (maybe numbers or formulae) we will be able to judge better as to which one to pick up.
>>>>>
>>>>> Will check if we can come up with some sensible analysis or figures.
>>>>
>>>> I have done some analysis of both of these approaches here:
>>>> https://docs.google.com/document/d/10QPPq_go_wHqKqhmOFXjJAokfdLR8OaUyZVNDu47GWk/edit?usp=sharing
>>>>
>>>> In practical terms, we would anyway need to implement (B), because when the trigger has conditional execution (a WHEN clause) we *have* to fetch the rows beforehand, so there is no point in fetching all of them again at the end of the statement when we already have them locally. So it may be too ambitious to have both implementations, at least for this release.
>>>
>>> I agree here. We can certainly optimize for various cases later, but we should have something which gives all the functionality (albeit at lower performance for now).
>>>
>>>> So I am focussing on (B) right now. We have two options:
>>>>
>>>> 1. Store all rows in palloc'ed memory, save the HeapTuple pointers in the trigger queue, and directly access the OLD and NEW rows through these pointers when needed. Here we have no control over how much memory we use for the old and new records, and this might even hamper system performance, let alone XC performance.
>>>>
>>>> 2. The other option is to use a tuplestore. Here, we need to store the positions of the records in the tuplestore, so for a particular trigger event we fetch by position. From what I understand, a tuplestore can be advanced only sequentially in either direction. So when the read pointer is at position 6 and we need to fetch the record at position 10, we need to call tuplestore_advance() 4 times, and this call involves palloc/pfree overhead because it calls tuplestore_gettuple(). But the trigger records are not distributed that randomly: in fact, the set of trigger events for a particular event id is accessed in the same order in which it was queued. So for a particular event id, only the first access will require random access. tuplestore supports multiple read pointers, so maybe we can make use of that and access the first record using the closest read pointer.
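To make option 2 a bit more concrete, here is a minimal, untested sketch of the tuplestore idea. Nothing here is from the patch: the struct and function names are illustrative, and it uses only the default read pointer, rescanning from the start when it has to move backwards.

    #include "postgres.h"
    #include "miscadmin.h"
    #include "executor/tuptable.h"
    #include "utils/tuplestore.h"

    /* Per-statement store of fetched OLD/NEW rows (names are illustrative). */
    typedef struct TrigRowStore
    {
        Tuplestorestate *ts;        /* spills to disk beyond work_mem */
        int              nrows;     /* rows appended so far */
        int              readpos;   /* position the read pointer is at */
    } TrigRowStore;

    static void
    trig_rowstore_init(TrigRowStore *store)
    {
        /* randomAccess = true so the store can be rescanned */
        store->ts = tuplestore_begin_heap(true, false, work_mem);
        store->nrows = 0;
        store->readpos = 0;
    }

    /* Append a row; the returned position is what goes into the trigger queue. */
    static int
    trig_rowstore_add(TrigRowStore *store, TupleTableSlot *slot)
    {
        tuplestore_puttupleslot(store->ts, slot);
        return store->nrows++;
    }

    /* Fetch the row saved at 'pos' into 'slot'. */
    static bool
    trig_rowstore_fetch(TrigRowStore *store, int pos, TupleTableSlot *slot)
    {
        if (pos < store->readpos)
        {
            /* rare case: go back to the start and walk forward again */
            tuplestore_rescan(store->ts);
            store->readpos = 0;
        }
        /* usually a no-op, since events are replayed in queue order */
        while (store->readpos < pos)
        {
            if (!tuplestore_advance(store->ts, true))
                return false;
            store->readpos++;
        }
        if (!tuplestore_gettupleslot(store->ts, true, false, slot))
            return false;
        store->readpos++;
        return true;
    }

On top of this, one extra read pointer per event id could be allocated with tuplestore_alloc_read_pointer() so that the occasional "first access" does not have to rescan from the beginning.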
>>> Using palloc will be a problem if the size of the data fetched is more than what can fit in memory. Also, pallocing frequently is going to be a performance problem. Let's see how the tuplestore approach goes.

>> While I am working on the AFTER ROW optimization, here's a patch that has only BEFORE ROW trigger support, so that it can get you started with the first round of review. The regression results are not fully analyzed yet. Besides the AR trigger related changes, I have also stripped out the logic that decides whether to run a trigger on the datanode or the coordinator; that logic depends on both before and after triggers.

>>>>> > Or we can consider a hybrid approach of getting the rows in batches of 1000 or so, if possible. That way they get into coordinator memory in one shot and can be processed in batches. Obviously this should only be considered if it's not going to be a complicated implementation.

>>>>> It just occurred to me that it would not be that hard to optimize the row-fetching-by-ctid as shown below:

>>>>> 1. When it is time to fire the queued triggers at the statement/transaction end, initialize cursors - one cursor per datanode - which would do: SELECT remote_heap_fetch(table_name, '<ctidlist>'). We can form this ctidlist out of the trigger event list.

>>>>> 2. For each trigger event entry in the trigger queue, FETCH NEXT using the appropriate cursor name according to the datanode id to which the trigger entry belongs.
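Just to make that flow concrete, here is a rough sketch of how the coordinator side could build those per-datanode queries. Everything in it is illustrative: remote_heap_fetch() is the proposed (not yet existing) datanode function, the trig_fetch_%d cursor name is made up, and send_query_to_datanode() is only a placeholder for whatever remote-execution call we end up using.

    #include "postgres.h"
    #include "lib/stringinfo.h"

    /* Placeholder for the actual remote execution machinery. */
    extern void send_query_to_datanode(int nodeid, const char *query);

    /* Open one cursor per datanode over the ctids queued for that node. */
    static void
    open_trigger_fetch_cursor(int nodeid, const char *relname,
                              const char *ctid_list)
    {
        StringInfoData buf;

        initStringInfo(&buf);
        appendStringInfo(&buf,
                         "DECLARE trig_fetch_%d CURSOR FOR "
                         "SELECT remote_heap_fetch('%s', '%s')",
                         nodeid, relname, ctid_list);
        send_query_to_datanode(nodeid, buf.data);
        pfree(buf.data);
    }

    /* Called once per queued trigger event, using that event's node id. */
    static void
    fetch_next_trigger_row(int nodeid)
    {
        StringInfoData buf;

        initStringInfo(&buf);
        appendStringInfo(&buf, "FETCH NEXT FROM trig_fetch_%d", nodeid);
        send_query_to_datanode(nodeid, buf.data);
        pfree(buf.data);
    }

This relies on the ctid list for each datanode being built in the same order in which the events are later replayed, so that FETCH NEXT always returns the row for the current event.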
>>>>> >>> Currently we fetch all attributes in the SELECT subplans. I have created another patch to fetch only the required attributes, but have not merged that into this patch.

>>>>> > Do we have other places where we unnecessarily fetch all attributes? ISTM this should be fixed as a performance improvement first, ahead of everything else.

>>>>> I believe the DML subplan is the only remaining place where we fetch all attributes. And yes, this is a must-have for triggers; otherwise the other optimizations would be of no use.

>>>>> >>> 2. One important TODO for BEFORE triggers is this: just before invoking the trigger functions, PG row-locks the tuple (exclusively) in GetTupleForTrigger() and fetches the locked version from the table. This makes sure that while all the triggers for that table are executed, no one can update that particular row. In the patch, we haven't locked the row. We need to lock it either by executing:
>>>>> >>> 1. SELECT * FROM tab1 WHERE ctid = <ctid_val> FOR UPDATE, and then using the returned row as the OLD row, OR
>>>>> >>> 2. having the UPDATE subplan itself do SELECT FOR UPDATE, so that the row is already locked and we don't have to lock it again.
>>>>> >>> #2 is simple, though it might cause somewhat longer waits in general. With #1, although the locks would be acquired only when the particular row is actually updated, they would be released only at transaction end, so #1 might not be worth implementing. Also, #1 requires another explicit remote fetch for the lock-and-get-latest-version operation. I am more inclined towards #2.

>>>>> >> Option #2, however, has the problem of locking too many rows if there are coordinator quals in the subplans, IOW when the number of rows finally updated is smaller than the number of rows fetched from the datanode. It can cause unwanted deadlocks. Unless there is a way to release these extra locks, I am afraid this option will be a problem.

>>>>> True. Regardless of anything else - whether it is deadlocks or longer waits - we should not lock rows that are not going to be updated.

>>>>> There is a more general row-locking issue that we need to solve first: 3606317. I anticipate that solving it will also solve the trigger-specific lock issue. So for triggers this is a must-have, and I am going to address it as part of that bug (3606317).

>>>>> > Deadlocks? ISTM we can get more lock waits because of this, but I do not see deadlock scenarios.

>>>>> > With the FQS shipping work being done by Ashutosh, will we also ship major chunks of subplans to the datanodes? If yes, then row locking will only involve the required tuples (hopefully) from the coordinator's point of view.

>>>>> > Also, something radical: can we invent a new type of FOR [NODE] UPDATE lock to minimize the impact of such locking of rows on the datanodes?

>>>>> > Regards,
>>>>> > Nikhils

>>>>> >>> 3. The BEFORE trigger function can change the distribution column itself. We need to add a check at the end of the trigger executions.

>>>>> >> Good, you thought about that. Yes, we should check it.

>>>>> >>> 4. Fetching the OLD row for WHEN clause handling.

>>>>> >>> 5. Testing with a mix of shippable and non-shippable ROW triggers.

>>>>> >>> 6. Other types of triggers. INSTEAD OF triggers are anticipated to work without significant changes, but they are yet to be tested. INSERT/DELETE triggers: most of the infrastructure has been done while implementing UPDATE triggers, but some changes specific to INSERT and DELETE are yet to be done. Deferred triggers are to be tested.

>>>>> >>> 7. Regression analysis. There are some new failures. Will post another, cleaner version of the patch after regression analysis and fixing the various TODOs.

>>>>> >>> Comments welcome.
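For TODO item 3 above (a BEFORE ROW trigger changing the distribution column), the check could look roughly like the sketch below. It is only an illustration: the function name is made up, distattno stands for the distribution column's attribute number as obtained from the relation's locator info, and the error code and wording are not final.

    #include "postgres.h"
    #include "access/htup.h"
    #include "utils/datum.h"
    #include "utils/rel.h"

    /*
     * Illustrative only: after the BEFORE ROW triggers have run, verify that
     * the distribution column of 'newtuple' still matches 'oldtuple'.
     */
    static void
    check_distcol_unchanged(Relation rel, AttrNumber distattno,
                            HeapTuple oldtuple, HeapTuple newtuple)
    {
        TupleDesc   tupdesc = RelationGetDescr(rel);
        Form_pg_attribute att = tupdesc->attrs[distattno - 1];
        bool        oldisnull;
        bool        newisnull;
        Datum       oldval;
        Datum       newval;

        oldval = heap_getattr(oldtuple, distattno, tupdesc, &oldisnull);
        newval = heap_getattr(newtuple, distattno, tupdesc, &newisnull);

        if (oldisnull != newisnull ||
            (!oldisnull &&
             !datumIsEqual(oldval, newval, att->attbyval, att->attlen)))
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("BEFORE trigger may not modify the distribution column \"%s\"",
                            NameStr(att->attname))));
    }

Whether we should error out or silently revert the change is a separate decision; the sketch simply errors out.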