From: David E. W. <da...@ju...> - 2014-02-28 21:17:16
On Feb 27, 2014, at 7:48 PM, Ashutosh Bapat <ash...@en...> wrote:

> It might be due to the large amount of data sent from the datanode to the
> coordinator. When you see the message "connection to client lost" at the
> datanode, it means that the connection to the coordinator was lost. In XC,
> coordinators act as clients to the datanodes. Further, no message in the
> coordinator log implies that there wasn't any segfault or error on the
> coordinator which could result in losing the client (to the datanode). One
> way to verify this is to check what happens for smaller amounts of data.
> There is still some code in the executor which saves data from the datanode
> in a linked list, and with a large amount of data that process can run out
> of memory. You may find something in the system logs if that is the case.

Ah ha. Now that I pay more attention to the statement in the log, I see what the problem is. It's doing a full table scan on a very large table. I think the planner is making a mistake. The query I'm running is far more complicated than that bit in the log. Really, the full query should be able to run on each node, with the results aggregated on the coordinator. I suspect I need to add some more JOIN clauses so that the planner knows better how to run the query on each node.

> Please do the following:
>
> Run EXPLAIN VERBOSE on the query which showed this behavior, and in that
> output you will find what query is being sent to the datanode.

So I did this, but even with the EXPLAIN VERBOSE I got the disconnect error. With plain EXPLAIN, too. The query should not run without ANALYZE, right? This is 1.1, BTW.
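In case it helps, the shape of what I ran was roughly this (the schema here is made up for illustration, not my real one):

    -- Hypothetical tables. EXPLAIN without ANALYZE should only plan the
    -- query, not execute it, so I wouldn't expect any result rows to flow
    -- from the datanodes to the coordinator for this:
    EXPLAIN VERBOSE
    SELECT u.region, count(*)
      FROM users u
      JOIN events e ON e.user_id = u.id
     GROUP BY u.region;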
> Reduce your data on the datanode such that that particular query returns
> maybe a few thousand rows to the coordinator. BTW, I have seen millions of
> rows being exchanged between the coordinator and datanode without problem,
> but there is still a case where large data would be a problem.
>
> Now, see if the query runs without problem.

I updated my query to make sure that I was joining on partitioned columns, thinking that would get the queries to run more on the datanodes, but it made no difference. I still got an error for a table scan on a very large table. :-(
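Concretely, by "joining on partitioned columns" I mean something like this (again, a made-up schema, not my real one):

    -- Hypothetical schema: both tables hash-distributed on the join key.
    -- My understanding is that a join on the common distribution column
    -- can, in principle, run locally on each datanode, with only the
    -- aggregated results sent back to the coordinator:
    CREATE TABLE users (
        id     bigint PRIMARY KEY,
        region text
    ) DISTRIBUTE BY HASH (id);

    CREATE TABLE events (
        user_id bigint,
        ts      timestamptz
    ) DISTRIBUTE BY HASH (user_id);

    SELECT u.region, count(*)
      FROM users u
      JOIN events e ON e.user_id = u.id
     GROUP BY u.region;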
David