From: Koichi S. <koi...@gm...> - 2013-04-01 05:29:42
I understand that materialising everything makes the code clearer and the
implementation simpler and better structured. What do you mean by x%
improvement? Does 90% improvement mean the total duration is 10% of the
original?
----------
Koichi Suzuki

2013/3/29 Ashutosh Bapat <ash...@en...>
> Hi All,
> I measured the scale-up for both approaches: a. using datanode connections
> as tapes (the existing one), b. materialising the result on tapes before
> merging (the approach I proposed). For 1M rows and 5 coordinators, I have
> found that approach (a) gives 90% improvement whereas approach (b) gives
> 50% improvement. Although the difference is significant, I feel that
> approach (b) is much cleaner than approach (a), doesn't have a large
> footprint compared to the PG code, and takes care of all the cases, like
> 1. materialising the sorted result and 2. handling any number of datanode
> connections without memory overrun. It's possible to improve it further if
> we avoid materialisation of the datanode result in a tuplestore.
>
> Patch attached for reference.
>
> On Tue, Mar 26, 2013 at 10:38 AM, Ashutosh Bapat
> <ash...@en...> wrote:
>>
>> On Tue, Mar 26, 2013 at 10:19 AM, Koichi Suzuki <koi...@gm...> wrote:
>>
>>> One thing we should think about for option 1 is:
>>>
>>> When the result is huge, applications have to wait a long time until
>>> they get the first row. Because this option may need disk writes, total
>>> resource consumption will be larger.
>>
>> Yes, I am aware of this fact. Please read the next paragraph and you will
>> see that the current situation is no better.
>>
>>> I'm wondering if we can use a "cursor" at the database so that we can
>>> read each tape more simply; I mean, leave each query node open and read
>>> the next row from any query node.
>>
>> We do that right now. But because of such a simulated cursor (it's not a
>> cursor per se; we just fetch the required result from the connection as
>> the demand arises while merging runs), we observe the following things.
>>
>> If the plan has multiple RemoteQuery nodes (as there will be in case of a
>> merge join), we assign the same connection to these nodes. Before this
>> assignment, the result from the previous connection is materialised at
>> the coordinator. This means that when we get a huge result from the
>> datanode, it will be materialised, which costs more than materialising it
>> on tape, since this materialisation happens in a linked list, which is
>> not optimised. We need to share a connection between more than one
>> RemoteQuery node because the same transaction cannot work on two
>> connections to the same server. Not only performance: the code has become
>> ugly because of this approach. At various places in the executor we have
>> special handling for sorting, which needs to be maintained.
>>
>> Instead, if we materialise all the results on tape and then proceed with
>> step D5 in Knuth's algorithm for polyphase merge sort, the code will be
>> much simpler and we won't lose much performance. In fact, we might be
>> able to leverage fetching bulk data on the connection, which can be
>> materialised on tape in bulk.
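To make that concrete, here is a minimal standalone sketch of the
materialise-everything-then-merge idea (approach (b) above): drain each
sorted run onto its own local tape, then merge the tapes onto one extra
output tape that can be rewound for random access. Everything in it is
simulated for illustration (the int rows, the tmpfile() tapes, the
fixed-size runs); it is not the actual XC logical-tape or tuplestore code.

/*
 * Sketch of approach (b): drain each (simulated) datanode run onto its
 * own local tape, then merge the tapes onto one output tape.  The output
 * tape supports random access via fseek(), which is what the merge-join
 * inner side needs.
 */
#include <stdio.h>

#define NCONN 3
#define NROWS 4

static const int conn_rows[NCONN][NROWS] = {
    {1, 4, 7, 10},      /* sorted run "streamed" by datanode 0 */
    {2, 5, 8, 11},      /* datanode 1 */
    {3, 6, 9, 12}       /* datanode 2 */
};

int main(void)
{
    FILE *tape[NCONN];
    FILE *out = tmpfile();      /* the extra tape holding the merged result */
    int   head[NCONN];
    int   live[NCONN];
    int   val;
    int   i, j;

    /* Step 1: drain each "connection" onto its own local tape. */
    for (i = 0; i < NCONN; i++)
    {
        tape[i] = tmpfile();
        for (j = 0; j < NROWS; j++)
            fwrite(&conn_rows[i][j], sizeof(int), 1, tape[i]);
        rewind(tape[i]);
        live[i] = (fread(&head[i], sizeof(int), 1, tape[i]) == 1);
    }

    /* Step 2: merge the tapes onto the output tape (smallest head wins). */
    for (;;)
    {
        int best = -1;

        for (i = 0; i < NCONN; i++)
            if (live[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;
        fwrite(&head[best], sizeof(int), 1, out);
        live[best] = (fread(&head[best], sizeof(int), 1, tape[best]) == 1);
    }

    /* Step 3: the merged result is now randomly accessible, e.g. row 5. */
    fseek(out, 5L * sizeof(int), SEEK_SET);
    if (fread(&val, sizeof(int), 1, out) == 1)
        printf("row 5 of the merged result = %d\n", val);

    for (i = 0; i < NCONN; i++)
        fclose(tape[i]);
    fclose(out);
    return 0;
}

The price is one extra write and read of every row, but the merged result
now lives on a rewindable tape, so random access no longer depends on the
connections being re-readable.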
>>> Regards;
>>> ----------
>>> Koichi Suzuki
>>>
>>> 2013/3/25 Ashutosh Bapat <ash...@en...>:
>>> > Hi All,
>>> > I am working on using remote sorting for merge joins. The idea is:
>>> > while using a merge join at the coordinator, get the data sorted from
>>> > the datanodes; for replicated relations we can get all the rows
>>> > sorted, and for distributed tables we have to get sorted runs which
>>> > can be merged at the coordinator. For a merge join the sorted inner
>>> > relation needs to be randomly accessible. For replicated relations
>>> > this can be achieved by materialising the result. But for distributed
>>> > relations we do not materialise the sorted result at the coordinator;
>>> > we compute the sorted result by merging the sorted results from the
>>> > individual nodes on the fly. For distributed relations, the
>>> > connections to the datanodes themselves are used as logical tapes
>>> > (which provide the sorted runs). The final result is computed on the
>>> > fly by choosing the smallest or greatest row (as required) from the
>>> > connections.
>>> >
>>> > For a Sort node the materialised result can reside in memory (if it
>>> > fits there) or on one of the logical tapes used for the merge sort.
>>> > So, in order to provide random access to the sorted result, we need
>>> > to materialise the result either in memory or on a logical tape.
>>> > In-memory materialisation is not easily possible since, in the case
>>> > of distributed relations, we have already resorted to a tape-based
>>> > sort, and to materialise the result on tape there is no logical tape
>>> > available in the current algorithm. To make it work, there are the
>>> > following possible ways:
>>> >
>>> > 1. When random access is required, materialise the sorted runs from
>>> > the individual nodes onto tapes (one tape for each node) and then
>>> > merge them onto one extra tape, which can be used for
>>> > materialisation.
>>> > 2. Use a mix of connections and logical tapes in the same tape set.
>>> > Merge the sorted runs from the connections onto a logical tape in
>>> > the same logical tape set.
>>> >
>>> > While the second one looks attractive from a performance perspective
>>> > (it saves writing to and reading from the tape), it would make the
>>> > merge code ugly by using mixed tapes. The read calls for a connection
>>> > and for a logical tape are different, and we will need both on the
>>> > logical tape where the final result is materialised. So I am thinking
>>> > of going with 1; in fact, to have the same code handle remote sort,
>>> > use 1 in all cases (whether or not materialisation is required).
>>> >
>>> > Had the original authors of the remote sort code thought about this
>>> > materialisation? Anything they can share on this topic?
>>> > Any comment?
>>> > --
>>> > Best Wishes,
>>> > Ashutosh Bapat
>>> > EnterpriseDB Corporation
>>> > The Enterprise Postgres Company
>>
>> --
>> Best Wishes,
>> Ashutosh Bapat
>> EnterpriseDB Corporation
>> The Enterprise Postgres Company
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Enterprise Postgres Company
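For comparison, here is a minimal standalone sketch of the existing
approach (a) described in the original mail, where the connections
themselves act as the logical tapes and the merge pulls the next row from
a connection only when its current head has been consumed. SimConnection
and fetch_next_row() are stand-ins invented for the sketch, not XC APIs.

/*
 * Sketch of approach (a): merge on the fly directly from the (simulated)
 * datanode connections, without writing anything locally.
 */
#include <stdio.h>
#include <stdbool.h>

#define NCONN 3

typedef struct SimConnection
{
    const int *rows;    /* sorted run the datanode would stream back */
    int        nrows;
    int        pos;
} SimConnection;

/* Pull the next row off the "connection"; false when it is exhausted. */
static bool
fetch_next_row(SimConnection *conn, int *row)
{
    if (conn->pos >= conn->nrows)
        return false;
    *row = conn->rows[conn->pos++];
    return true;
}

int main(void)
{
    static const int run0[] = {1, 4, 7, 10};
    static const int run1[] = {2, 5, 8, 11};
    static const int run2[] = {3, 6, 9, 12};
    SimConnection conn[NCONN] = {
        {run0, 4, 0}, {run1, 4, 0}, {run2, 4, 0}
    };
    int  head[NCONN];
    bool live[NCONN];
    int  i;

    for (i = 0; i < NCONN; i++)
        live[i] = fetch_next_row(&conn[i], &head[i]);

    /*
     * Merge on the fly: emit the smallest head, then refill only from the
     * connection that supplied it.  No local tape is ever written.
     */
    for (;;)
    {
        int best = -1;

        for (i = 0; i < NCONN; i++)
            if (live[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;
        printf("%d ", head[best]);
        live[best] = fetch_next_row(&conn[best], &head[best]);
    }
    printf("\n");
    return 0;
}

Nothing is written locally, so the merged stream cannot be rewound; that is
exactly why random access for the merge-join inner relation forces the
extra materialisation discussed in the thread.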