From: Ashutosh B. <ash...@en...> - 2014-02-28 03:48:40
|
Hi David, It might be due to the large amount of data sent from the datanode to the connector. When you see message "connection to client lost" at datanode, it means that the connection to the coordinator was lost. In XC, coordinators act as client to the datanode. Further, no message in the coordinator log implies that there wasn't any segfault or error on the coordinator which can result in loosing client (to the datanode). One way to verify this is to check what happens for smaller amounts of the data. There is still some code in executor, which saves data from datanode in a linked list and because of large amount of data that process runs out of memory. You may find something in system logs, if that is true. Please do the following, Run explain verbose on the query which showed this behavior and in that output you will find what query is being sent to the datanode Reduce your data on the datanode such that, that particular query returns may be a few thousand rows to the coordinator. BTW, I have seen millions of rows being exchanged between the coordinator and datanode without problem. But still there is a case where large data would be a problem. Now, see if the query runs without problem. On Fri, Feb 28, 2014 at 6:23 AM, David E. Wheeler <da...@ju...>wrote: > PGXC Hakers, > > I have finally loaded up my testing PGXC four-node cluster with a nice > beefy database similar to a PostgreSQL database we use for long-running > reporting queries. I gathered up one of our slower-running queries (26.6m > run) and ran it on XC. Alas, after a while, it died with this error: > > psql:slow.sql:73: connection to server was lost > > The coordinator log was not much help: nothing was logged. So I trolled > through the logs on the data nodes. All four had these messages: > > > 2014-02-27 15:45:51 PST dwheeler 10.4.34.1(56968) 22213 530fc838.56c5 > LOG: could not send data to client: Connection reset by peer > > 2014-02-27 15:45:51 PST dwheeler 10.4.34.1(56968) 22213 530fc838.56c5 > STATEMENT: SELECT subscriber_cmd_id, rule_reason, rule_score, txn_uuid, > txn_timestamp_utc FROM ONLY subscriber_482900.transactions_rule tma WHERE > (subscriber_id = 482900) > > 2014-02-27 15:45:51 PST dwheeler 10.4.34.1(56968) 22213 530fc838.56c5 > FATAL: connection to client lost > > 2014-02-27 15:45:51 PST dwheeler 10.4.34.1(56968) 22213 530fc838.56c5 > STATEMENT: SELECT subscriber_cmd_id, rule_reason, rule_score, txn_uuid, > txn_timestamp_utc FROM ONLY subscriber_482900.transactions_rule tma WHERE > (subscriber_id = 482900) > > No reason given for the dropped connection. I ran the query on the > coordinator box, so psql should have connected via a socket rather than > TCP. Out of curiosity, I looked at the logs for the other three > coordinators. None had any error messages, either. > > So, no idea what's timing out; statement_timeout is set to 0. Here are the > settings from my coordinator's postgreql.conf: > > max_connections = 100 > shared_buffers = 32MB > log_destination = 'stderr' > logging_collector = on > log_directory = 'pg_log' > log_filename = 'postgresql-%a.log' > log_truncate_on_rotation = on > log_rotation_age = 1d > log_rotation_size = 0 > log_line_prefix = '< %m >' > log_timezone = 'US/Pacific' > datestyle = 'iso, mdy' > timezone = 'US/Pacific' > lc_messages = 'en_US.UTF-8' > lc_monetary = 'en_US.UTF-8' > lc_numeric = 'en_US.UTF-8' > lc_time = 'en_US.UTF-8' > default_text_search_config = 'pg_catalog.english' > pgxc_node_name = 'node1' > port = 5432 > listen_addresses = '*' > shared_buffers = 250MB > work_mem = 128MB > maintenance_work_mem = 128MB > effective_cache_size = 8GB > log_line_prefix = '%t %u %r %p %c ' > timezone = 'UTC' > gtm_host = 'node1.example.com' > > And from one of the data nodes (only the names differ on the others): > > max_connections = 100 > shared_buffers = 32MB > log_destination = 'stderr' > logging_collector = on > log_directory = 'pg_log' > log_filename = 'postgresql-%a.log' > log_truncate_on_rotation = on > log_rotation_age = 1d > log_rotation_size = 0 > log_line_prefix = '< %m >' > log_timezone = 'US/Pacific' > datestyle = 'iso, mdy' > timezone = 'US/Pacific' > lc_messages = 'en_US.UTF-8' > lc_monetary = 'en_US.UTF-8' > lc_numeric = 'en_US.UTF-8' > lc_time = 'en_US.UTF-8' > default_text_search_config = 'pg_catalog.english' > pgxc_node_name = 'node1' > port = 15432 > listen_addresses = '*' > shared_buffers = 750MB > work_mem = 128MB > maintenance_work_mem = 128MB > effective_cache_size = 23GB > log_line_prefix = '%t %u %r %p %c ' > timezone = 'UTC' > gtm_host = 'node1.iovationnp.com' > > Thoughts? What could be timing out? > > Thanks, > > David > > > > > ------------------------------------------------------------------------------ > Flow-based real-time traffic analytics software. Cisco certified tool. > Monitor traffic, SLAs, QoS, Medianet, WAAS etc. with NetFlow Analyzer > Customize your own dashboards, set traffic alerts and generate reports. > Network behavioral analysis & security monitoring. All-in-one tool. > > http://pubads.g.doubleclick.net/gampad/clk?id=126839071&iu=/4140/ostg.clktrk > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company |