From: Ashutosh B. <ash...@en...> - 2013-04-04 14:18:09
Hi Amit,

Thanks for creating the branch and the commits. I will give my comments on each of the commits in separate mails. I am starting with 1dc081ebe097e63009bab219231e55679db6fae0. Is that the correct one?

On Thu, Apr 4, 2013 at 12:26 PM, Amit Khandekar <ami...@en...> wrote:

> On 3 April 2013 15:10, Ashutosh Bapat <ash...@en...> wrote:

>> Hi Amit,
>> Given the magnitude of the change, I think we should break this work into smaller self-sufficient patches and commit them (either on master or in a separate branch for the trigger work). This will allow us to review and commit small amounts of work and set them aside, rather than going over everything in every round.

> I have created a new branch "rowtriggers" in postgres-xc.git.sourceforge.net/gitroot/postgres-xc/postgres-xc, where I have dumped incremental changes.

>> On Wed, Apr 3, 2013 at 10:46 AM, Amit Khandekar <ami...@en...> wrote:

>>> On 26 March 2013 15:53, Ashutosh Bapat <ash...@en...> wrote:

>>>> On Tue, Mar 26, 2013 at 8:56 AM, Amit Khandekar <ami...@en...> wrote:

>>>>> On 4 March 2013 11:11, Amit Khandekar <ami...@en...> wrote:

>>>>>> On 1 March 2013 13:53, Nikhil Sontakke <ni...@st...> wrote:

>>>>>>>> Issue: Whether we should fetch the whole row from the datanode (the OLD row), not just the ctid, node_id and required columns, and store it at the coordinator for processing, OR whether we should fetch each row (OLD and NEW variants) while processing each row.

>>>>>>>> Both of them have performance impacts - the first one has a disk impact for a large number of rows, whereas the second has a network impact for querying rows. Is it possible to do some analytical assessment as to which of them would be better? If you can come up with something concrete (maybe numbers or formulae) we will be able to judge better which one to pick.

>>>>>> Will check if we can come up with some sensible analysis or figures.

>>>>> I have done some analysis on both of these approaches here:
>>>>> https://docs.google.com/document/d/10QPPq_go_wHqKqhmOFXjJAokfdLR8OaUyZVNDu47GWk/edit?usp=sharing

>>>>> In practical terms, we would anyway need to implement (B). The reason is that when the trigger has conditional execution (a WHEN clause) we *have* to fetch the rows beforehand, so there is no point in fetching all of them again at the end of the statement when we already have them locally. So it may be too ambitious to have both implementations, at least for this release.

>>>> I agree here. We can certainly optimize for various cases later, but we should have something which gives all the functionality (albeit at lower performance for now).

>>>>> So I am focusing on (B) right now. We have two options:

>>>>> 1. Store all rows in palloc'ed memory, save the HeapTuple pointers in the trigger queue, and directly access the OLD and NEW rows through these pointers when needed. Here we have no control over how much memory we use for the old and new records, and this might even hamper system performance, let alone XC performance.

>>>>> 2. The other option is to use a tuplestore. Here we need to store the positions of the records in the tuplestore, and for a particular trigger event fetch by that position. From what I understand, a tuplestore can be advanced only sequentially in either direction. So when the read pointer is at position 6 and we need to fetch the record at position 10, we need to call tuplestore_advance() 4 times, and this call involves palloc/pfree overhead because it calls tuplestore_gettuple(). But the trigger records are not distributed that randomly: the set of trigger events for a particular event id is accessed in the same order in which they were queued, so for a particular event id only the first access requires random access. tuplestore supports multiple read pointers, so maybe we can make use of that to access the first record using the closest read pointer.
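To make the tuplestore option (2) above concrete, here is a minimal sketch in the style of the backend code. The XcTrigRowStore struct and the xc_trig_* functions are invented names for illustration only, not actual patch code; the sketch assumes each queued trigger event just remembers the ordinal position of its row in a per-statement tuplestore and re-reads it through a dedicated read pointer when the event is fired.

/*
 * Illustrative sketch only (invented names): keep fetched OLD/NEW rows in a
 * tuplestore and re-read them by position when the queued events fire.
 */
#include "postgres.h"
#include "miscadmin.h"              /* work_mem */
#include "executor/tuptable.h"
#include "utils/tuplestore.h"

typedef struct XcTrigRowStore
{
    Tuplestorestate *ts;            /* spills to disk once work_mem is exceeded */
    int              readptr;       /* read pointer reserved for trigger firing */
    long             nstored;       /* number of rows appended so far */
    long             curpos;        /* position the read pointer currently points at */
} XcTrigRowStore;

static void
xc_trig_store_init(XcTrigRowStore *store)
{
    /* randomAccess = true so the store could also be rewound if ever needed */
    store->ts = tuplestore_begin_heap(true, false, work_mem);
    store->readptr = tuplestore_alloc_read_pointer(store->ts, 0);
    store->nstored = 0;
    store->curpos = 0;
}

/* Append a row; the returned position is what the trigger event queue remembers. */
static long
xc_trig_store_row(XcTrigRowStore *store, TupleTableSlot *slot)
{
    tuplestore_puttupleslot(store->ts, slot);
    return store->nstored++;
}

/*
 * Fetch the row stored at "pos" into "slot".  Because events of one statement
 * are fired in queue order, pos is normally equal to or slightly ahead of
 * curpos, so the advance loop is cheap; backward repositioning is omitted
 * here for brevity.
 */
static bool
xc_trig_fetch_row(XcTrigRowStore *store, long pos, TupleTableSlot *slot)
{
    tuplestore_select_read_pointer(store->ts, store->readptr);

    while (store->curpos < pos)
    {
        if (!tuplestore_advance(store->ts, true))
            return false;
        store->curpos++;
    }
    if (!tuplestore_gettupleslot(store->ts, true, false, slot))
        return false;
    store->curpos++;                /* gettupleslot advanced the read pointer */
    return true;
}

Since events are fired in queue order, the advance loop normally moves the read pointer forward by zero or one position, so the tuplestore_advance() overhead only shows up on the first, truly random access per event id.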
>>>> Using palloc will be a problem if the size of the data fetched is more than what fits in memory. Also, pallocing frequently is going to be a performance problem. Let's see how the tuplestore approach goes.

>>> While I am working on the AFTER ROW optimization, here's a patch that has only BEFORE ROW trigger support, so that it can get you started with a first round of review. The regression is not analyzed fully yet. Besides the AR trigger related changes, I have also stripped out the logic of whether to run the trigger on the datanode or the coordinator; that logic depends on both before and after triggers.

>>>>>>> Or we can consider a hybrid approach of getting the rows in batches of 1000 or so, if possible. That way they get into coordinator memory in one shot and can be processed in batches. Obviously this should only be considered if it's not going to be a complicated implementation.

>>>>>> It just occurred to me that it would not be that hard to optimize the row-fetching-by-ctid as shown below:
>>>>>> 1. When it is time to fire the queued triggers at the statement/transaction end, initialize cursors - one cursor per datanode - which would do: SELECT remote_heap_fetch(table_name, '<ctidlist>'); We can form this ctidlist out of the trigger event list.
>>>>>> 2. For each trigger event entry in the trigger queue, FETCH NEXT using the appropriate cursor name according to the datanode id to which the trigger entry belongs.
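A sketch of the per-datanode cursor idea in steps 1 and 2 above, limited to assembling the command strings; remote_heap_fetch() is the proposed (not yet existing) function from the description above, the trig_cur_<nodeid> cursor naming is invented, and dispatching the strings to the datanodes would go through the existing remote-query machinery. Real code would also need proper identifier and literal quoting.

/* Illustrative only: build the per-datanode DECLARE and FETCH statements. */
#include "postgres.h"
#include "lib/stringinfo.h"

/* DECLARE trig_cur_<node> CURSOR FOR SELECT remote_heap_fetch('<table>', '<ctid list>') */
static char *
xc_trig_build_declare(const char *relname, int nodeid, const char *ctidlist)
{
    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfo(&buf,
                     "DECLARE trig_cur_%d CURSOR FOR "
                     "SELECT remote_heap_fetch('%s', '%s')",
                     nodeid, relname, ctidlist);
    return buf.data;
}

/* FETCH issued once per queued trigger event belonging to that datanode. */
static char *
xc_trig_build_fetch(int nodeid)
{
    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfo(&buf, "FETCH NEXT FROM trig_cur_%d", nodeid);
    return buf.data;
}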
>>>>>>>>> Currently we fetch all attributes in the SELECT subplans. I have created another patch to fetch only the required attributes, but have not merged that into this patch.

>>>>>>> Do we have other places where we unnecessarily fetch all attributes? ISTM this should be fixed as a performance improvement first, ahead of everything else.

>>>>>> I believe the DML subplan is the only remaining place where we fetch all attributes. And yes, this is a must-have for triggers; otherwise the other optimizations would be of no use.

>>>>>>>>> 2. One important TODO for BEFORE triggers is this: just before invoking the trigger functions, in PG the tuple is row-locked (exclusive) by GetTupleForTrigger() and the locked version is fetched from the table. So it is made sure that while all the triggers for that table are executed, no one can update that particular row. In the patch, we haven't locked the row. We need to lock it by either:
>>>>>>>>> 1. executing SELECT * FROM tab1 WHERE ctid = <ctid_val> FOR UPDATE, and then using the returned row as the OLD row, OR
>>>>>>>>> 2. having the UPDATE subplan itself do SELECT ... FOR UPDATE, so that the row is already locked and we don't have to lock it again.
>>>>>>>>> #2 is simple, though it might cause somewhat longer waits in general. With #1, although the locks would be acquired only when the particular row is updated, they would be released only at transaction end, so #1 might not be worth implementing. Also, #1 requires another explicit remote fetch for the lock-and-get-latest-version operation. I am more inclined towards #2.

>>>>>>>> Option #2, however, has the problem of locking too many rows if there are coordinator quals in the subplans, IOW when the number of rows finally updated is smaller than the number of rows fetched from the datanode. It can cause unwanted deadlocks. Unless there is a way to release these extra locks, I am afraid this option will be a problem.

>>>>>> True. Regardless of anything else - whether it is deadlocks or longer waits - we should not lock rows that are not going to be updated.

>>>>>> There is a more general row-locking issue that we need to solve first: 3606317. I anticipate that solving it will also solve the trigger-specific lock issue. So for triggers this is a must-have, and I am going to address it as part of bug 3606317.

>>>>>>> Deadlocks? ISTM we can get more lock waits because of this, but I do not see deadlock scenarios.

>>>>>>> With the FQS shipping work being done by Ashutosh, will we also ship major chunks of subplans to the datanodes? If yes, then row locking will only involve the required tuples (hopefully) from the coordinator's point of view.

>>>>>>> Also, something radical: can we invent a new type of FOR [NODE] UPDATE lock to minimize the impact of such locking of rows on the datanodes?

>>>>>>> Regards,
>>>>>>> Nikhils

>>>>>>>>> 3. The BEFORE trigger function can change the distribution column itself. We need to add a check at the end of the trigger executions.

>>>>>>>> Good, you thought about that. Yes, we should check it.

>>>>>>>>> 4. Fetching the OLD row for WHEN clause handling.

>>>>>>>>> 5. Testing with a mix of shippable and non-shippable ROW triggers.

>>>>>>>>> 6. Other types of triggers. INSTEAD OF triggers are anticipated to work without significant changes, but they are yet to be tested. INSERT/DELETE triggers: most of the infrastructure was done while implementing UPDATE triggers, but some changes specific to INSERT and DELETE are yet to be done. Deferred triggers are yet to be tested.

>>>>>>>>> 7. Regression analysis. There are some new failures. Will post another fair version of the patch after regression analysis and fixing various TODOs.

>>>>>>>>> Comments welcome.
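Regarding TODO item 3 in the quoted list above (a BEFORE ROW trigger modifying the distribution column), here is a minimal sketch of what the post-trigger check could look like. The function name is invented for illustration, the distribution attribute number is assumed to come from the table's distribution metadata, and the use of datumIsEqual() assumes untoasted values; a real check might instead compare through the type's equality operator.

/*
 * Illustrative only: after the BEFORE ROW triggers have run, reject any
 * change to the distribution column, since the row can no longer be
 * re-routed to a different datanode at that point.
 */
#include "postgres.h"
#include "access/htup.h"
#include "utils/datum.h"
#include "utils/rel.h"

static void
xc_check_distcol_unchanged(Relation rel, AttrNumber distcol,
                           HeapTuple oldtuple, HeapTuple newtuple)
{
    TupleDesc   tupdesc = RelationGetDescr(rel);
    Form_pg_attribute att = tupdesc->attrs[distcol - 1];
    bool        oldisnull;
    bool        newisnull;
    Datum       oldval;
    Datum       newval;

    oldval = heap_getattr(oldtuple, distcol, tupdesc, &oldisnull);
    newval = heap_getattr(newtuple, distcol, tupdesc, &newisnull);

    if (oldisnull != newisnull ||
        (!oldisnull &&
         !datumIsEqual(oldval, newval, att->attbyval, att->attlen)))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("BEFORE trigger is not allowed to change the distribution column of \"%s\"",
                        RelationGetRelationName(rel))));
}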
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company