| 
      
      
      From: Krzysztof N. <krz...@gm...> - 2015-07-09 15:54:03
      
     | 
| I'm moving this thread from general to bugs list.
I run into the "too many KnownAssignedXids" error on the coordinator slave
now - not sure if it's related with this patch or some other problems that
had been there before the patch was created as this error was supposed to
be fixed long time ago.
2015-07-09 13:30:53.074 CEST,,,5137,,559dc09d.1411,66,,2015-07-09 02:30:21
CEST,,0,LOG,00000,"recovery restart point at 0/294EF478","last completed
transaction was at log time 2015-07-09
13:30:51.700251+02",,,,,,,"CreateRestartPoint, xlog.c:8665",""
2015-07-09 14:00:52.179 CEST,,,5137,,559dc09d.1411,67,,2015-07-09 02:30:21
CEST,,0,LOG,00000,"restartpoint starting: time",,,,,,,,"LogCheckpointStart,
xlog.c:7885",""
2015-07-09 14:01:18.391 CEST,,,5137,,559dc09d.1411,68,,2015-07-09 02:30:21
CEST,,0,LOG,00000,"restartpoint complete: wrote 262 buffers (0.2%); 0
transaction log file(s) added, 0 removed, 0 recycled; write=26.151 s,
sync=0.032 s, total=26.212 s; sync files=132, longest=0.001 s,
average=0.000 s",,,,,,,,"LogCheckpointEnd, xlog.c:7957",""
2015-07-09 14:01:18.391 CEST,,,5137,,559dc09d.1411,69,,2015-07-09 02:30:21
CEST,,0,LOG,00000,"recovery restart point at 0/298F2638","last completed
transaction was at log time 2015-07-09
14:00:26.299532+02",,,,,,,"CreateRestartPoint, xlog.c:8665",""
2015-07-09 14:15:26.138 CEST,,,5093,,559dc09c.13e5,7,,2015-07-09 02:30:20
CEST,1/0,0,FATAL,XX000,"too many KnownAssignedXids",,,,,"xlog redo commit:
2015-07-09 14:15:26.138639+02",,,"KnownAssignedXidsAdd, procarray.c:3347",""
2015-07-09 14:15:26.143 CEST,,,5083,,559dc09c.13db,2,,2015-07-09 02:30:20
CEST,,0,LOG,00000,"startup process (PID 5093) exited with exit code
1",,,,,,,,"LogChildExit, postmaster.c:3133",""
2015-07-09 14:15:26.143 CEST,,,5083,,559dc09c.13db,3,,2015-07-09 02:30:20
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,"HandleChildCrash, postmaster.c:2912",""
2015-07-09 17:09:17.920 CEST,,,10527,,559e8e9d.291f,1,,2015-07-09 17:09:17
CEST,,0,LOG,00000,"database system was interrupted while in recovery at log
time 2015-07-09 14:00:20 CEST",,"If this has occurred more than once some
data might be corrupted and you might need to choose an earlier recovery
target.",,,,,,"StartupXLOG, xlog.c:6347",""
2015-07-09 17:09:17.920 CEST,,,10527,,559e8e9d.291f,2,,2015-07-09 17:09:17
CEST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG,
xlog.c:6417",""
2015-07-09 17:09:17.932 CEST,,,10527,,559e8e9d.291f,3,,2015-07-09 17:09:17
CEST,1/0,0,LOG,00000,"consistent recovery state reached at
0/298F2730",,,,,,,,"CheckRecoveryConsistency, xlog.c:7420",""
2015-07-09 17:09:17.932 CEST,,,10527,,559e8e9d.291f,4,,2015-07-09 17:09:17
CEST,1/0,0,LOG,00000,"redo starts at 0/298F2638",,,,,,,,"StartupXLOG,
xlog.c:6881",""
2015-07-09 17:09:17.933 CEST,,,10517,,559e8e9d.2915,1,,2015-07-09 17:09:17
CEST,,0,LOG,00000,"database system is ready to accept read only
connections",,,,,,,,"sigusr1_handler, postmaster.c:4533",""
2015-07-09 17:09:17.936 CEST,,,10527,,559e8e9d.291f,5,,2015-07-09 17:09:17
CEST,1/0,0,FATAL,XX000,"too many KnownAssignedXids",,,,,"xlog redo commit:
2015-07-09 14:15:26.138639+02",,,"KnownAssignedXidsAdd, procarray.c:3347",""
2015-07-09 17:09:17.937 CEST,,,10517,,559e8e9d.2915,2,,2015-07-09 17:09:17
CEST,,0,LOG,00000,"startup process (PID 10527) exited with exit code
1",,,,,,,,"LogChildExit, postmaster.c:3133",""
2015-07-09 17:09:17.937 CEST,,,10517,,559e8e9d.2915,3,,2015-07-09 17:09:17
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,"HandleChildCrash, postmaster.c:2912",""
2015-07-09 17:13:42.741 CEST,,,15899,,559e8fa6.3e1b,1,,2015-07-09 17:13:42
CEST,,0,LOG,00000,"database system was interrupted while in recovery at log
time 2015-07-09 14:00:20 CEST",,"If this has occurred more than once some
data might be corrupted and you might need to choose an earlier recovery
target.",,,,,,"StartupXLOG, xlog.c:6347",""
2015-07-09 17:13:42.741 CEST,,,15899,,559e8fa6.3e1b,2,,2015-07-09 17:13:42
CEST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG,
xlog.c:6417",""
2015-07-09 17:13:43.010 CEST,,,15899,,559e8fa6.3e1b,3,,2015-07-09 17:13:42
CEST,,0,LOG,00000,"restored log file ""00000001000000000000000A"" from
archive",,,,,,,,"RestoreArchivedFile, xlog.c:3277",""
2015-07-09 17:13:43.040 CEST,,,15899,,559e8fa6.3e1b,4,,2015-07-09 17:13:42
CEST,1/0,0,LOG,00000,"consistent recovery state reached at
0/298F2730",,,,,,,,"CheckRecoveryConsistency, xlog.c:7420",""
2015-07-09 17:13:43.040 CEST,,,15899,,559e8fa6.3e1b,5,,2015-07-09 17:13:42
CEST,1/0,0,LOG,00000,"redo starts at 0/298F2638",,,,,,,,"StartupXLOG,
xlog.c:6881",""
2015-07-09 17:13:43.042 CEST,,,15889,,559e8fa6.3e11,1,,2015-07-09 17:13:42
CEST,,0,LOG,00000,"database system is ready to accept read only
connections",,,,,,,,"sigusr1_handler, postmaster.c:4533",""
2015-07-09 17:13:43.044 CEST,,,15899,,559e8fa6.3e1b,6,,2015-07-09 17:13:42
CEST,1/0,0,FATAL,XX000,"too many KnownAssignedXids",,,,,"xlog redo commit:
2015-07-09 14:15:26.138639+02",,,"KnownAssignedXidsAdd, procarray.c:3347",""
2015-07-09 17:13:43.045 CEST,,,15889,,559e8fa6.3e11,2,,2015-07-09 17:13:42
CEST,,0,LOG,00000,"startup process (PID 15899) exited with exit code
1",,,,,,,,"LogChildExit, postmaster.c:3133",""
2015-07-09 17:13:43.045 CEST,,,15889,,559e8fa6.3e11,3,,2015-07-09 17:13:42
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,"HandleChildCrash, postmaster.c:2912",""
On Thu, Jul 9, 2015 at 1:09 PM, Krzysztof Nienartowicz <
krz...@un...> wrote:
> I found the solution:
> I realized partitioning [insert] triggers increment command id with each
> trigger invocation while the transaction is in the COPY block. This seems
> to confuse coordinator to datanode communication as command id messages got
> interleaved with COPY messages and whacks the COPY state machine.
> In XL there's additional call to propagate new command id that does not
> exist in X2, that snapmgr.c:*UpdateActiveSnapshotCommandId *switches on
> sending back new command ids to other nodes.
>
> Switching this off fixes the problem.
>
> Question: Is this command ID propagation really needed except for
> serializable transactions? Any other logic is affected? Is there a cleaner
> solution?
>
> Cheers,
> Krzysztof
>
> /*
>
>  * UpdateActiveSnapshotCommandId
>
>  *
>
>  * Update the current CID of the active snapshot.  This can only be applied
>
>  * to a snapshot that is not referenced elsewhere.
>
>  */
>
> *void*
>
> *UpdateActiveSnapshotCommandId*(*void*)
>
> {
>
> Assert(ActiveSnapshot != NULL);
>
> Assert(ActiveSnapshot->as_snap->active_count == 1);
>
> Assert(ActiveSnapshot->as_snap->regd_count == 0);
>
>
> ActiveSnapshot->as_snap->curcid = GetCurrentCommandId(false);
>
> *#ifdef* XCP
>
> /*
>
>  * Set flag so that updated command ID is sent to the datanodes before the
>
>  * next query. This ensures that the effects of previous statements are
>
>  * visible to the subsequent statements
>
>  */
>
> SetSendCommandId(true); //<--------------------- should be false to fix
> the COPY protocol bug.
>
> *#endif*
>
> }
>
> On Thu, Jul 2, 2015 at 4:12 PM, Krzysztof Nienartowicz <
> krz...@un...> wrote:
>
>> Hello,
>> While trying to ingest into the partitioned table via trigger [we patched
>> XL head source to be able to have insert triggers] we got error after
>> [sometimes varying] number of lines ingested, i.e.:
>> <sql>
>> -- simplified example with 3 datanodes, 3 coords, same coord used
>>
>> create or replace function x_t() returns trigger
>> as
>> $BODY$
>>
>> declare
>> begin
>>      IF ( NEW.id between  0  and 10) THEN
>>          INSERT INTO x1 VALUES (NEW.*);
>>      elsif(NEW.id between 11 and 100) then
>>          INSERT INTO x2 VALUES (NEW.*);
>>      else
>>          INSERT INTO x3 VALUES (NEW.*);
>>      end if;
>>      return null;
>> end
>> $BODY$ language plpgsql;
>>
>>
>> create table x (id int primary key, ii text) distribute by hash(id);
>> create table x1(check (id between 0 and 10)) inherits (x);
>> create table x2(check (id between 11 and 100)) inherits (x);
>> create table x3(check (id > 100)) inherits (x);
>>
>> CREATE TRIGGER x_t BEFORE INSERT ON x FOR EACH ROW EXECUTE PROCEDURE
>> x_t();
>>
>> -- populate tables via trigger with insert
>>
>> insert into x select i,rpad('X',300,'y') from generate_series(1,1000) i;
>> explain analyze select count(*) from x;
>> -- all good:
>> --
>> QUERY
>> PLAN
>>
>> --
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>> --  Aggregate  (cost=559.68..559.69 rows=1 width=0) (actual
>> time=8.808..8.809 rows=1 loops=1)
>> --   ->  Append  (cost=100.00..541.15 rows=7411 width=0) (actual
>> time=0.677..7.015 rows=1000 loops=1)
>> --         ->  Remote Subquery Scan on all
>> (datanode1,datanode2,datanode3)  (cost=100.00..100.00 rows=1 width=0)
>> (actual time=0.486..0.486 rows=0 loops=1)
>> --         ->  Remote Subquery Scan on all
>> (datanode1,datanode2,datanode3)  (cost=100.00..147.05 rows=2470 width=0)
>> (actual time=0.185..0.210 rows=10 loops=1)
>> --         ->  Remote Subquery Scan on all
>> (datanode1,datanode2,datanode3)  (cost=100.00..147.05 rows=2470 width=0)
>> (actual time=0.246..0.429 rows=90 loops=1)
>> --         ->  Remote Subquery Scan on all
>> (datanode1,datanode2,datanode3)  (cost=100.00..147.05 rows=2470 width=0)
>> (actual time=0.269..2.507 rows=900 loops=1)
>>
>>
>> </sql>
>> <bash>
>> # do dump
>> psql -p 20004 -h coord1 -c "COPY (select * from x) TO STDOUT DELIMITER AS
>> ';'" | tee  /tmp/x.dat | wc -l
>>
>> # all good, 1000 rows saved.
>> # now, truncate and try to reingest:
>> </bash>
>> <sql>
>> truncate table x;
>> </sql>
>>
>> <bash>
>> awk 'NR>0 && NR <= 1000 {print $0;}' /tmp/x.dat | psql  -h 20004 -c "set
>> client_min_messages=DEBUG5; COPY dpac_or5_s2_epsl_dex.x FROM STDIN
>> DELIMITER AS ';'"
>> DEBUG:  ProcessUtility
>> DEBUG:  global snapshot info: gxmin: 63682, gxmax: 63682, gxcnt: 0
>> DEBUG:  conn_count = 3, need_tran_block = true
>> LOG:  Connection error invalid string in message
>> LOG:  Connection error invalid string in message
>> LOG:  Connection error invalid string in message
>> ERROR:  Copy failed on a data node
>> CONTEXT:  COPY x, line 142:
>> "251;Xyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy..."
>> DEBUG:  Record transaction abort 63682
>>
>> # repeat the cmd, fails on different line number
>> ERROR:  Copy failed on a data node
>> CONTEXT:  COPY x, line 241:
>> "655;Xyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy..."
>>
>> #repeat the cmd, fails on the same line number
>> ERROR:  Copy failed on a data node
>> CONTEXT:  COPY x, line 142:
>> "251;Xyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy..."
>>
>> #repeat the cmd, fails on the same line number
>> ERROR:  Copy failed on a data node
>> CONTEXT:  COPY x, line 142:
>> "251;Xyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy..."
>>
>> # but when to limit number of rows it would work:
>>
>> awk 'NR>0 && NR <= 100 {print $0;}' /tmp/x.dat | psql -p20004  -c "set
>> client_min_messages=DEBUG5; COPY dpac_or5_s2_epsl_dex.x FROM STDIN
>> DELIMITER AS ';'"
>> DEBUG:  ProcessUtility
>> DEBUG:  global snapshot info: gxmin: 63723, gxmax: 63723, gxcnt: 0
>> DEBUG:  conn_count = 3, need_tran_block = true
>> DEBUG:  CommitTransactionCommand
>> DEBUG:  CommitTransaction
>> DEBUG:  name: unnamed; blockState:       STARTED; state: INPROGR,
>> xid/subid/cid: 63723/1/1 (used), nestlvl: 1, children:
>> DEBUG:  Record transaction commit 63723
>> DEBUG:  [re]setting xid = 0, old_value = 0
>>
>> # select gives back proper number of rows properly distributed by the
>> trigger
>> </bash>
>>
>> Clearly the messages passing rows from coordinator to datanodes are
>> truncated/corrupted somehow (copy.c:622). Of course we could try to make a
>> workaround to copy into the partitions directly, but would really prefer to
>> avoid this.
>>
>> Any hints are welcome how to fix this!
>>
>> Best regards,
>> Krzysztof
>>
>>
>>
>
 |