From: Mason S. <ms...@tr...> - 2014-02-12 10:55:39
On Wed, Feb 12, 2014 at 1:08 AM, 鈴木 幸市 <ko...@in...> wrote:

> On 2014/02/12 15:00, Ashutosh Bapat <ash...@en...> wrote:
>
>> On Tue, Feb 11, 2014 at 8:03 PM, Abbas Butt <abb...@en...> wrote:
>>
>>> The summary of the discussion so far:
>>>
>>> Approach A: (Suggested by Amit)
>>> In the scan plan, fetch ctid and node_id from all the datanodes.
>>> While scanning, the tuples need to be fetched in the same order,
>>> perhaps using ORDER BY 1, 2, 3, ...
>>> Use UPDATE ... WHERE ctid = ?, but use a nodeid-based method to
>>> generate the ExecNodes at execute time (enhance ExecNodes->en_expr
>>> evaluation so as to use the nodeid from the source plan, as against
>>> the distribution column that it currently uses for distributed tables).
>>> This method will not work as-is in the case of non-shippable row
>>> triggers, because a trigger needs to be fired only once per row, and we
>>> are going to execute the UPDATE for all of the ctids of a given row
>>> corresponding to all of the datanodes. So somehow we should fire the
>>> triggers only once.
>>> This method will also hurt performance, because currently we fetch *all*
>>> columns and not just ctid, so it's better to first do that optimization
>>> of fetching only the required columns (there's one pending patch
>>> submitted on the mailing list which fixes this).
>>>
>>> Approach B: (Suggested by many)
>>> If the replicated table does not have a primary or unique not-null key,
>>> then error out on a non-shippable update or delete; otherwise use the
>>> patch sent by Mason, after some testing and refactoring.
>>
>> This would break backward compatibility. Also, a table which is fairly
>> stable and doesn't have a primary key or unique key will need to be
>> distributed even though it's a perfect candidate for being a replicated
>> table. Also, we have to see if updating a primary or unique key would
>> cause a problem.
>
> Then we should keep using the same WHERE clause in the shipped statement
> too. As pointed out, using ctid in a replicated table is very dangerous.

I agree.
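To make Approach B concrete, here is a rough SQL sketch (the table and column names are purely illustrative, and the DISTRIBUTE BY clause follows the Postgres-XC CREATE TABLE extension) of why the unique not-null key matters: it gives every replica a stable way to identify the same logical row, which ctid does not.

```sql
-- Hypothetical replicated lookup table; names are illustrative only.
-- Under Approach B, the UNIQUE NOT NULL key is what makes a non-shippable
-- UPDATE/DELETE safe: the same key value identifies the same logical row
-- on every datanode, whereas ctid can differ from node to node.
CREATE TABLE currency (
    code  char(3) NOT NULL UNIQUE,  -- stable row identity on all datanodes
    name  text,
    rate  numeric
) DISTRIBUTE BY REPLICATION;

-- A non-shippable update can then target rows by key on each datanode:
UPDATE currency SET rate = rate * 1.01 WHERE code = 'USD';

-- Without such a key, Approach B has the coordinator error out rather
-- than risk updating mismatched ctids across replicas.
```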
I don't think it is unreasonable at all to require a primary key or unique index for replicated tables... normally one would want to do that anyway. If users don't have a primary key, they can just add a SERIAL column at creation time and use that.

As an alternative, all columns could be used as a fake primary key to try to find the particular row. We used that approach in GridSQL, but it does not seem so clean... I believe there is a check in there such that if multiple rows match the criteria, the operation fails, since the row is not uniquely identifiable. In hindsight, I wish we had not bothered.

>>> Approach C: (Suggested by Amit)
>>> Always have some kind of a hidden (or system) column for replicated
>>> tables. Its type can be a serial type, or an int column with default
>>> nextval('sequence_type'), so that it will always be evaluated on the
>>> coordinator; use this column as the primary key.
>>
>> This looks like a better approach, but it also means that inserts into
>> the replicated table have to be driven through the coordinator. This
>> might not be that stringent a condition, given that replicated tables
>> are expected to be fairly stable. Any replicated table inserted into
>> that often would run into performance problems anyway.
>
> I'm afraid it would take a long effort to fix all the consequences of
> this change. What do you think? As I noted, approach C has good points.
> The issue is how long it takes. With approach B, we can easily change
> this handling to approach C later. I'd like to have your opinion on this.

It seems unnecessary if the table already has a primary key or unique index. Anyway, approach C is the approach I originally took with GridSQL/Stado, adding something called xrowid, but we later disabled it by default. In hindsight I would have saved the trouble and not implemented it, to keep the code simpler and easier to maintain, and just left it up to the user to use a key.

To summarize, I would go with B.
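A minimal sketch of what Approach C's hidden column amounts to, approximated here with an ordinary visible column (a real implementation would make it system-maintained and invisible to users; the column and sequence names, and the DISTRIBUTE BY syntax following Postgres-XC's CREATE TABLE extension, are assumptions for illustration):

```sql
-- Approach C, approximated with a visible column. The sequence default is
-- evaluated on the coordinator, so every replica receives the same xrowid
-- value for a given inserted row, giving a stable cross-node row identity.
CREATE SEQUENCE currency_xrowid_seq;

CREATE TABLE currency (
    xrowid bigint NOT NULL PRIMARY KEY
           DEFAULT nextval('currency_xrowid_seq'),
    code   char(3),
    name   text
) DISTRIBUTE BY REPLICATION;

-- Non-shippable UPDATE/DELETE can then always target rows by xrowid:
DELETE FROM currency WHERE xrowid = 42;
```

The cost, as noted in the thread, is that all inserts must be driven through the coordinator so the sequence yields one value per logical row.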
>>> My vote is for approach B.
>>
>> Whatever approach is taken, we will have to stick to it in future
>> versions of XC. We cannot keep changing user-visible functionality with
>> every version. Until 1.2 we didn't have the restriction that replicated
>> tables should have a primary key. Introducing that requirement now means
>> users have to modify their applications. If we change it again in the
>> next version, we will break compatibility again. So, considering the
>> long-term benefit, we should go with C.

This is reasonable to require at this stage, since how it is done now is very dangerous. As previously mentioned, I have seen this cause problems with production data. I think people would gladly add a key in exchange for not having bad things happen to their data.

Also, I would not release any new version of Postgres-XC without this fix. If a 1.2 release is a ways away, there should be an intermediate 1.1.x release that fixes this soon. Without the patch I submitted, I would not recommend Postgres-XC to people who will be updating replicated tables; it is too dangerous.

-- 
Mason Sharp
TransLattice - http://www.translattice.com
Distributed and Clustered Database Solutions