From: Lionel F. <lio...@gm...> - 2011-06-06 12:39:33
Hello,

Looking at the debug1 mode log on datanode3, I found some interesting points hereafter (vacuum on, max_prepared_transactions=5000):

(with normal inserts)
[....]
DEBUG: unset snapshot info
DEBUG: global snapshot info: gxmin: 101, gxmax: 101, gxcnt: 0
DEBUG: unset snapshot info
DEBUG: global snapshot info: gxmin: 101, gxmax: 101, gxcnt: 0
DEBUG: unset snapshot info
DEBUG: [re]setting xid = 0, old_value = 0
DEBUG: unset snapshot info
DEBUG: Received new gxid 102
DEBUG: [re]setting xid = 102, old_value = 0
DEBUG: TransactionId = 102
DEBUG: xid (102) does not follow ShmemVariableCache->nextXid (665)
DEBUG: Record transaction commit 101
DEBUG: Record transaction commit 102
DEBUG: [re]setting xid = 0, old_value = 0
DEBUG: unset snapshot info
DEBUG: Received new gxid 103
DEBUG: [re]setting xid = 103, old_value = 0
DEBUG: unset snapshot info
DEBUG: global snapshot info: gxmin: 103, gxmax: 103, gxcnt: 0
DEBUG: TransactionId = 103
DEBUG: xid (103) does not follow ShmemVariableCache->nextXid (665)
DEBUG: unset snapshot info
DEBUG: global snapshot info: gxmin: 103, gxmax: 103, gxcnt: 0

While inserting with distributed hashed keys:
[...]
DEBUG: [re]setting xid = 0, old_value = 0
DEBUG: unset snapshot info
DEBUG: Received new gxid 522
DEBUG: [re]setting xid = 522, old_value = 0
DEBUG: TransactionId = 522
DEBUG: xid (522) does not follow ShmemVariableCache->nextXid (665)
DEBUG: Record transaction commit 521
DEBUG: Record transaction commit 522
DEBUG: [re]setting xid = 0, old_value = 0
DEBUG: unset snapshot info
DEBUG: Received new gxid 524
DEBUG: [re]setting xid = 524, old_value = 0
ERROR: prepared transaction with identifier "T523" does not exist
STATEMENT: COMMIT PREPARED 'T523'
DEBUG: [re]setting xid = 0, old_value = 524
DEBUG: unset snapshot info
DEBUG: Received new gxid 526
DEBUG: [re]setting xid = 526, old_value = 0
ERROR: prepared transaction with identifier "T525" does not exist
STATEMENT: COMMIT PREPARED 'T525'
DEBUG: [re]setting xid = 0, old_value = 526
DEBUG: unset snapshot info
DEBUG: Received new gxid 528
DEBUG: [re]setting xid = 528, old_value = 0
ERROR: prepared transaction with identifier "T527" does not exist
[...]

No special info on the gtm node regarding the same transactions, though.

Hope this helps.

Regards,
Lionel F.

2011/6/2 Michael Paquier <mic...@gm...>
> The problem you are facing with the pooler may be related to this bug
> that was found recently:
>
> https://sourceforge.net/tracker/?func=detail&aid=3310399&group_id=311227&atid=1310232
>
> It looks like the datanode is not able to handle autovacuum commits
> efficiently. This may cause data consistency problems, making a node
> crash in the worst scenario.
>
> This could explain why you cannot begin a transaction correctly on
> nodes, connections to backends being closed by a crash or a consistency
> problem. Can you provide a backtrace or give hints about the problem you
> have? Some tips in node logs perhaps?
>
>
> On Wed, Jun 1, 2011 at 8:12 PM, Lionel Frachon <lio...@gm...> wrote:
>
>> Hello,
>>
>> I was forced to distribute data by replication and not by hash, as I'm
>> constantly getting "ERROR: Could not commit prepared transaction
>> implicitely" on tables other than Warehouse (w_id), using 10
>> warehouses (this error appears both on data loading, when using hash,
>> and when performing distributed queries).
>>
>> I used a slightly different setup:
>> - 1 GTM-only node
>> - 1 Coordinator-only node
>> - 3 Datanodes
>>
>> The Coordinator has 256MB RAM, the Datanodes 768MB. They did not at
>> any moment reach full usage of their dedicated RAM.
>>
>> However, running the benchmark for more than a few minutes (2 or 3)
>> leads to the following errors:
>>
>> --- Unexpected SQLException caught in NEW-ORDER Txn ---
>> Message: ERROR: Could not begin transaction on data nodes.
>> SQLState: XX000
>> ErrorCode: 0
>>
>> Then a bit later:
>>
>> --- Unexpected SQLException caught in NEW-ORDER Txn ---
>> Message: ERROR: Failed to get pooled connections
>> SQLState: 53000
>> ErrorCode: 0
>>
>> then (and I assume they are linked):
>>
>> --- Unexpected SQLException caught in NEW-ORDER Txn ---
>> Message: ERROR: Could not begin transaction on data nodes.
>> SQLState: XX000
>> ErrorCode: 0
>>
>> Additionally, the test ends with many:
>>
>> --- Unexpected SQLException caught in NEW-ORDER Txn ---
>> Message: This connection has been closed.
>> SQLState: 08003
>> ErrorCode: 0
>>
>> I'm using 10 terminals and 10 warehouses.
>>
>> Any clue for this error (and for distribution by hash, I understand
>> they're probably linked...)?
>>
>> Lionel F.
>>
>>
>> 2011/5/31 Lionel Frachon <lio...@gm...>:
>> > Hi,
>> >
>> > yes, persistent_datanode_connections is now set to off - it may not
>> > be related to the issues I have.
>> >
>> > What amount of memory do you have on your datanodes & coordinator?
>> >
>> > Here are my settings:
>> > datanode: shared_buffers = 512MB
>> > coordinator: 256MB (now; was 96MB)
>> >
>> > I still get, for some distributed tables (by hash):
>> > "ERROR: Could not commit prepared transaction implicitely"
>> >
>> > For the distribution syntax, yes, I found your webpage talking about
>> > regression tests.
>> >
>> >> You also have to know that it is important to set a limit of
>> >> connections on datanodes equal to the sum of max connections on all
>> >> coordinators. For example, if your cluster is using 2 coordinators
>> >> with 20 max connections each, you may have a maximum of 40
>> >> connections to datanodes.
>> >
>> > Ok, tweaking this today and launching the tests again...
>> >
>> >
>> > Lionel F.
>> >
>> >
>> > 2011/5/31 Michael Paquier <mic...@gm...>:
>> >>
>> >> On Mon, May 30, 2011 at 7:34 PM, Lionel Frachon <lio...@gm...>
>> >> wrote:
>> >>>
>> >>> Hi again,
>> >>>
>> >>> I turned off connection pooling on the coordinator (dunno why it
>> >>> stayed on), raised the shared_buffers of the coordinator, allowed
>> >>> 1000 connections, and the error disappeared.
>> >>
>> >> I am not really sure I get the meaning of this, but how did you turn
>> >> off the pooler on the coordinator?
>> >> Did you use the parameter persistent_connections?
>> >> Connection pooling from the coordinator is an automatic feature, and
>> >> you have to use it if you want to connect from a remote coordinator
>> >> to backend XC nodes.
>> >>
>> >> You also have to know that it is important to set a limit of
>> >> connections on datanodes equal to the sum of max connections on all
>> >> coordinators. For example, if your cluster is using 2 coordinators
>> >> with 20 max connections each, you may have a maximum of 40
>> >> connections to datanodes.
>> >> This uses a lot of shared buffers on a node, but typically this
>> >> maximum number of connections is never reached thanks to the
>> >> connection pooling.
>> >>
>> >> Please note also that the number of Coordinator <-> Coordinator
>> >> connections may increase if DDL is used from several coordinators.
>> >>
>> >>> However, all data is still going to one node (whatever I choose as
>> >>> the primary datanode), with 40 warehouses... any specific syntax to
>> >>> load-balance warehouses over nodes?
>> >>
>> >> CREATE TABLE foo (column_key type, other_column int) DISTRIBUTE BY
>> >> HASH(column_key);
>> >> --
>> >> Michael Paquier
>> >> http://michael.otacoo.com
>> >>
>> >
>
>
> --
> Michael Paquier
> http://michael.otacoo.com
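For anyone hitting the "prepared transaction with identifier ... does not exist" errors from the datanode3 log above: stock PostgreSQL (and therefore an XC datanode backend) exposes pending prepared transactions in the pg_prepared_xacts system view. A minimal check, assuming you can connect to the datanode directly with psql:

```sql
-- List transactions that were PREPAREd but not yet committed or rolled back.
-- If a GID (e.g. 'T523') is missing here, COMMIT PREPARED 'T523' fails with
-- "prepared transaction with identifier "T523" does not exist".
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;
```

In the logs above, the GIDs T523/T525/T527 never appear as prepared on datanode3, which would be consistent with the implicit PREPARE step failing or being skipped before the COMMIT PREPARED arrives.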
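Michael's connection-sizing rule (each datanode must allow at least the sum of the coordinators' max connections) can be sketched as postgresql.conf fragments. The numbers below are the hypothetical ones from his 2-coordinator example, not values taken from this cluster:

```
# postgresql.conf on each of the 2 coordinators
max_connections = 20

# postgresql.conf on each datanode:
# at least the sum over all coordinators (2 x 20 = 40)
max_connections = 40
```

max_prepared_transactions on the datanodes must likewise be high enough to cover concurrent implicit two-phase commits (it is set to 5000 in the log run above).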