From: Koichi S. <koi...@gm...> - 2013-03-26 05:41:18
Understood the situation. Bulk row transfer between coordinator and datanode is another piece of infrastructure we need for sure. This will fit a 10G network (we need to use giant packets to use its bandwidth).

Regards;
----------
Koichi Suzuki


2013/3/26 Ashutosh Bapat <ash...@en...>:
>
>
> On Tue, Mar 26, 2013 at 10:19 AM, Koichi Suzuki <koi...@gm...> wrote:
>>
>> One thing we should think about for option 1 is:
>>
>> When the result is huge, applications have to wait a long time until they
>> get the first row. Because this option may need disk writes, total
>> resource consumption will be larger.
>>
>
> Yes, I am aware of this fact. Please read the next paragraph and you will
> see that the current situation is no better.
>
>>
>> I'm wondering if we can use a "cursor" at the database so that we can read
>> each tape more simply, I mean, leave each query node open and read the
>> next row from any query node.
>>
>
> We do that right now. But because of such a simulated cursor (it's not a
> cursor per se; we just fetch the required result from the connection as the
> demand arises while merging runs), we observe the following:
>
> If the plan has multiple RemoteQuery nodes (as there will be in the case of
> a merge join), we assign the same connection to these nodes. Before this
> assignment, the result from the previous connection is materialised at the
> coordinator. This means that when we get a huge result from the datanode,
> it will be materialised at a higher cost than materialising it on tape,
> since this materialisation happens in a linked list, which is not
> optimized. We need to share a connection between more than one RemoteQuery
> node because the same transaction cannot work on two connections to the
> same server. It is not only a performance problem: the code has become ugly
> because of this approach. At various places in the executor we have special
> handling for sorting, which needs to be maintained.
>
> Instead, if we materialise all the results on tape and then proceed with
> step D5 of Knuth's algorithm for polyphase merge sort, the code will be
> much simpler and we won't lose much performance. In fact, we might be able
> to leverage fetching bulk data on the connection, which can be materialised
> on tape in bulk.
>
>>
>> Regards;
>> ----------
>> Koichi Suzuki
>>
>>
>> 2013/3/25 Ashutosh Bapat <ash...@en...>:
>> > Hi All,
>> > I am working on using remote sorting for merge joins. The idea is that,
>> > while using a merge join at the coordinator, we get the data sorted from
>> > the datanodes: for replicated relations, we can get all the rows sorted,
>> > and for distributed tables we have to get sorted runs which can be
>> > merged at the coordinator. For a merge join the sorted inner relation
>> > needs to be randomly accessible. For replicated relations this can be
>> > achieved by materialising the result. But for distributed relations, we
>> > do not materialise the sorted result at the coordinator; we compute the
>> > sorted result by merging the sorted results from the individual nodes on
>> > the fly. For distributed relations, the connections to the datanodes
>> > themselves are used as logical tapes (which provide the sorted runs).
>> > The final result is computed on the fly by choosing the smallest or
>> > greatest row (as required) from the connections.
>> >
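A minimal, self-contained sketch of the on-the-fly merge described above, with in-memory arrays standing in for the sorted streams that the datanode connections deliver; the Run type and helper names are illustrative only, not the Postgres-XC executor API:

/*
 * Each "run" stands in for one datanode connection that already delivers
 * its rows in sorted order.  The coordinator produces the next output row
 * by picking the smallest head row across all runs, so the globally sorted
 * result is never materialised anywhere.
 */
#include <stdio.h>

typedef struct Run
{
    const int  *rows;   /* sorted rows from one datanode */
    int         nrows;
    int         pos;    /* next row to hand out */
} Run;

/* Return the index of the run with the smallest head row, or -1 if all are exhausted. */
static int
pick_smallest(Run *runs, int nruns)
{
    int best = -1;

    for (int i = 0; i < nruns; i++)
    {
        if (runs[i].pos >= runs[i].nrows)
            continue;                       /* this run is exhausted */
        if (best < 0 ||
            runs[i].rows[runs[i].pos] < runs[best].rows[runs[best].pos])
            best = i;
    }
    return best;
}

int
main(void)
{
    /* Three sorted runs, as three datanodes of a distributed table would return them. */
    const int   n1[] = {1, 4, 7};
    const int   n2[] = {2, 5, 8};
    const int   n3[] = {3, 6, 9};
    Run         runs[] = {{n1, 3, 0}, {n2, 3, 0}, {n3, 3, 0}};
    int         best;

    /* Emit the globally sorted result row by row. */
    while ((best = pick_smallest(runs, 3)) >= 0)
        printf("%d\n", runs[best].rows[runs[best].pos++]);

    return 0;
}

Each call to pick_smallest() is a linear scan over the run heads, which is exactly the "choose the smallest row from the connections" step; with many datanodes a binary heap would be the natural replacement.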
>> > For a Sort node the materialised result can reside in memory (if it
>> > fits there) or on one of the logical tapes used for the merge sort. So,
>> > in order to provide random access to the sorted result, we need to
>> > materialise the result either in memory or on a logical tape. In-memory
>> > materialisation is not easily possible since, in the case of distributed
>> > relations, we have already resorted to a tape-based sort, and to
>> > materialise the result on tape, there is no logical tape available in
>> > the current algorithm. To make it work, there are the following possible
>> > ways (see the sketch at the end of this message):
>> >
>> > 1. When random access is required, materialise the sorted runs from the
>> > individual nodes onto tapes (one tape for each node) and then merge them
>> > onto one extra tape, which can be used for materialisation.
>> > 2. Use a mix of connections and logical tapes in the same tape set.
>> > Merge the sorted runs from the connections onto a logical tape in the
>> > same logical tape set.
>> >
>> > While the second one looks attractive from a performance perspective (it
>> > saves writing to and reading from the tape), it would make the merge
>> > code ugly by mixing tape kinds. The read calls for a connection and a
>> > logical tape are different, and we would need both on the logical tape
>> > where the final result is materialized. So I am thinking of going with
>> > 1; in fact, to have the same code handle remote sort, we could use 1 in
>> > all cases (whether or not materialization is required).
>> >
>> > Had the original authors of the remote sort code thought about this
>> > materialization? Anything they can share on this topic? Any comment?
>> > --
>> > Best Wishes,
>> > Ashutosh Bapat
>> > EnterpriseDB Corporation
>> > The Enterprise Postgres Company
>> >
>> > _______________________________________________
>> > Postgres-xc-developers mailing list
>> > Pos...@li...
>> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Enterprise Postgres Company
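A minimal sketch of option 1 above, with temporary files standing in for logical tapes: each datanode's sorted run is first materialised on its own tape, the tapes are merged onto one extra tape, and that final tape can be rewound and rescanned, which is the random access a merge join needs. The file-based tapes and helper names are illustrative only, not the actual tuplesort/logical tape code:

/*
 * Option 1: materialise each connection's sorted run on its own tape,
 * merge the tapes onto one extra tape, then rescan the merged tape.
 */
#include <stdio.h>

/* Merge nruns sorted tapes of ints onto the result tape. */
static void
merge_tapes(FILE **tapes, int nruns, FILE *result)
{
    int     head[16];       /* current head row of each tape (demo limit) */
    int     live[16];       /* 1 while the tape still has rows */

    for (int i = 0; i < nruns; i++)
        live[i] = (fscanf(tapes[i], "%d", &head[i]) == 1);

    for (;;)
    {
        int best = -1;

        for (int i = 0; i < nruns; i++)
            if (live[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;          /* every tape exhausted */

        fprintf(result, "%d\n", head[best]);
        live[best] = (fscanf(tapes[best], "%d", &head[best]) == 1);
    }
}

int
main(void)
{
    /* Pretend these are the sorted runs fetched in bulk from three datanodes. */
    const int   runs[3][3] = {{1, 4, 7}, {2, 5, 8}, {3, 6, 9}};
    FILE       *tapes[3];
    FILE       *result = tmpfile();

    /* Step 1: materialise each run on its own tape. */
    for (int i = 0; i < 3; i++)
    {
        tapes[i] = tmpfile();
        for (int j = 0; j < 3; j++)
            fprintf(tapes[i], "%d\n", runs[i][j]);
        rewind(tapes[i]);
    }

    /* Step 2: merge the tapes onto one extra tape. */
    merge_tapes(tapes, 3, result);

    /* Step 3: the merged tape supports rescans; read it twice to show that. */
    for (int pass = 0; pass < 2; pass++)
    {
        int v;

        rewind(result);
        while (fscanf(result, "%d", &v) == 1)
            printf("%d ", v);
        printf("\n");
    }
    return 0;
}

The extra write and read of the final tape is the cost option 1 pays; in exchange, the merge code only ever deals with one kind of tape, which is the simplification argued for in the thread above.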