From: Abbas B. <abb...@en...> - 2013-04-18 03:32:39
On Thu, Apr 18, 2013 at 1:07 AM, Abbas Butt <abb...@en...> wrote:

> Hi,
> Here is the review of the patch.
>
> Overall the patch is good to go. I have reviewed the code and found some
> minor errors, which I corrected and have attached the revised patch with
> this mail.
>
> I have tested both cases, when the sort happens in memory and when it
> happens on disk, and found both working.
>
> I agree that the approach used in the patch is cleaner and has a smaller
> footprint.
>
> I have corrected some whitespace errors and an unintentional change in
> function set_dbcleanup_callback:
>
> git apply /home/edb/Desktop/MergeSort/xc_sort.patch
> /home/edb/Desktop/MergeSort/xc_sort.patch:539: trailing whitespace.
>     void *fparams;
> /home/edb/Desktop/MergeSort/xc_sort.patch:1012: trailing whitespace.
>
> /home/edb/Desktop/MergeSort/xc_sort.patch:1018: trailing whitespace.
>
> /home/edb/Desktop/MergeSort/xc_sort.patch:1087: trailing whitespace.
>     /*
> /home/edb/Desktop/MergeSort/xc_sort.patch:1228: trailing whitespace.
>     size_t len, Oid msgnode_oid,
> warning: 5 lines add whitespace errors.
>
> I am leaving a query running for tonight which will sort 10M rows of a
> distributed table and return the top 100 of them. I will report its
> outcome tomorrow morning.

It worked. Here is the test case (a runnable SQL sketch of it appears at the
end of this mail):

1. create table test1 (id integer primary key, padding text);
2. Load 10M rows
3. select id from test1 order by 1 limit 100

> Best Regards
>
>
> On Mon, Apr 1, 2013 at 11:02 AM, Koichi Suzuki <koi...@gm...> wrote:
>
>> Thanks. Then a 90% improvement means about 53% of the original duration,
>> while 50% means 67% of it. The number of queries completed in a given
>> duration is 190 vs. 150, a difference of 40.
>>
>> Considering the resources needed, it may be okay to begin with
>> materialization.
>>
>> Any other inputs?
>> ----------
>> Koichi Suzuki
>>
>>
>> 2013/4/1 Ashutosh Bapat <ash...@en...>
>>
>>> On Mon, Apr 1, 2013 at 10:59 AM, Koichi Suzuki <koi...@gm...> wrote:
>>>
>>>> I understand that materializing everything makes the code clearer and
>>>> the implementation simpler and better structured.
>>>>
>>>> What do you mean by x% improvement? Does 90% improvement mean the
>>>> total duration is 10% of the original?
>>>>
>>> x% improvement means the duration reduces to 100/(100+x) of the
>>> non-pushdown scenario. In simpler words, we see (100+x) queries being
>>> completed by the pushdown approach in the same time in which the
>>> non-pushdown approach completes 100 queries.
>>>
>>>> ----------
>>>> Koichi Suzuki
>>>>
>>>>
>>>> 2013/3/29 Ashutosh Bapat <ash...@en...>
>>>>
>>>>> Hi All,
>>>>> I measured the scale-up for both approaches: (a) using datanode
>>>>> connections as tapes (the existing one) and (b) materialising the
>>>>> results on tapes before merging (the approach I proposed). For 1M rows
>>>>> and 5 coordinators I found that approach (a) gives a 90% improvement
>>>>> whereas approach (b) gives 50%. Although the difference is significant,
>>>>> I feel that approach (b) is much cleaner than approach (a), does not
>>>>> have a large footprint compared to the PG code, and takes care of all
>>>>> the cases, i.e. 1. materialising the sorted result, and 2. handling any
>>>>> number of datanode connections without memory overrun. It is possible
>>>>> to improve it further if we avoid materialising the datanode results in
>>>>> a tuplestore.
>>>>>
>>>>> Patch attached for reference.
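For readers outside the thread, here is a minimal, self-contained sketch of
what approach (b), materialising the per-datanode sorted runs on tapes and
then merging them, boils down to. This is not the attached patch: the
datanode connections, tuples and the tuplesort/logtape machinery are stood
in for by int arrays and tmpfile() "tapes", and every name below is
hypothetical.

/*
 * Approach (b), schematically: write each datanode's already-sorted run to
 * its own tape, then k-way merge the tapes into one globally sorted stream.
 */
#include <stdio.h>
#include <stddef.h>

#define NNODES 3

/* Pretend these are the sorted runs arriving from three datanodes. */
static const int run0[] = {1, 4, 7, 10};
static const int run1[] = {2, 5, 8};
static const int run2[] = {3, 6, 9, 11, 12};
static const int *runs[NNODES] = {run0, run1, run2};
static const size_t runlens[NNODES] = {4, 3, 5};

int
main(void)
{
    FILE   *tape[NNODES];
    int     head[NNODES];   /* current front value of each tape */
    int     valid[NNODES];  /* does head[i] hold a value? */
    int     i;

    /* Step 1: materialise each connection's sorted run onto its own tape. */
    for (i = 0; i < NNODES; i++)
    {
        tape[i] = tmpfile();
        if (tape[i] == NULL)
        {
            perror("tmpfile");
            return 1;
        }
        fwrite(runs[i], sizeof(int), runlens[i], tape[i]);
        rewind(tape[i]);
    }

    /* Prime the merge by reading the first value from every tape. */
    for (i = 0; i < NNODES; i++)
        valid[i] = (fread(&head[i], sizeof(int), 1, tape[i]) == 1);

    /* Step 2: k-way merge -- repeatedly emit the smallest head value. */
    for (;;)
    {
        int best = -1;

        for (i = 0; i < NNODES; i++)
            if (valid[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;              /* all tapes exhausted */

        printf("%d\n", head[best]);
        valid[best] = (fread(&head[best], sizeof(int), 1, tape[best]) == 1);
    }

    for (i = 0; i < NNODES; i++)
        fclose(tape[i]);
    return 0;
}

Because every run sits on a rewindable tape, the runs (and the merged
result) can be re-read, which is what makes the random access discussed
later in this thread straightforward under this approach.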
>>>>>
>>>>> On Tue, Mar 26, 2013 at 10:38 AM, Ashutosh Bapat <ash...@en...> wrote:
>>>>>
>>>>>> On Tue, Mar 26, 2013 at 10:19 AM, Koichi Suzuki <koi...@gm...> wrote:
>>>>>>
>>>>>>> One thing we should think about for option 1 is:
>>>>>>>
>>>>>>> when the result set is huge, applications have to wait a long time
>>>>>>> until they get the first row. Because this option may need disk
>>>>>>> writes, total resource consumption will be larger.
>>>>>>>
>>>>>> Yes, I am aware of this fact. Please read the next paragraph and you
>>>>>> will see that the current situation is no better.
>>>>>>
>>>>>>> I'm wondering if we can use a "cursor" at the database so that we can
>>>>>>> read each tape more simply; I mean, leave each query node open and
>>>>>>> read the next row from any query node.
>>>>>>>
>>>>>> We do that right now. But because of such a simulated cursor (it's not
>>>>>> a cursor per se; we just fetch the required result from the connection
>>>>>> as the demand arises while merging runs), we observe the following.
>>>>>>
>>>>>> If the plan has multiple RemoteQuery nodes (as there will be in the
>>>>>> case of a merge join), we assign the same connection to these nodes.
>>>>>> Before this assignment, the result from the previous connection is
>>>>>> materialised at the coordinator. This means that when we get a huge
>>>>>> result from the datanode, it is materialised at the coordinator, which
>>>>>> costs more than materialising it on tape, because this materialisation
>>>>>> happens in a linked list, which is not optimized. We need to share a
>>>>>> connection between more than one RemoteQuery node because the same
>>>>>> transaction cannot work on two connections to the same server. It is
>>>>>> not only a performance issue: the code has become ugly because of this
>>>>>> approach. At various places in the executor we have special handling
>>>>>> for sorting, which needs to be maintained.
>>>>>>
>>>>>> Instead, if we materialise all the results on tape and then proceed
>>>>>> with step D5 of Knuth's algorithm for polyphase merge sort, the code
>>>>>> will be much simpler and we won't lose much performance. In fact, we
>>>>>> might be able to leverage fetching bulk data on the connection, which
>>>>>> can be materialised on tape in bulk.
>>>>>>
>>>>>>> Regards;
>>>>>>> ----------
>>>>>>> Koichi Suzuki
>>>>>>>
>>>>>>>
>>>>>>> 2013/3/25 Ashutosh Bapat <ash...@en...>:
>>>>>>> > Hi All,
>>>>>>> > I am working on using remote sorting for merge joins. The idea is
>>>>>>> > that while using a merge join at the coordinator, we get the data
>>>>>>> > sorted from the datanodes; for replicated relations we can get all
>>>>>>> > the rows sorted, and for distributed tables we have to get sorted
>>>>>>> > runs which can be merged at the coordinator. For a merge join the
>>>>>>> > sorted inner relation needs to be randomly accessible. For
>>>>>>> > replicated relations this can be achieved by materialising the
>>>>>>> > result. But for distributed relations we do not materialise the
>>>>>>> > sorted result at the coordinator; we compute the sorted result by
>>>>>>> > merging the sorted results from the individual nodes on the fly.
>>>>>>> > For distributed relations, the connections to the datanodes
>>>>>>> > themselves are used as logical tapes (which provide the sorted
>>>>>>> > runs). The final result is computed on the fly by choosing the
>>>>>>> > smallest or greatest row (as required) from the connections.
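As an aside for readers skimming the thread: the scheme described in the
quoted paragraph above, treating the open datanode connections themselves as
the sorted runs and merging them on the fly, reduces to something like the
sketch below. It is only an illustration, not Postgres-XC code: the
connections are faked as cursors over in-memory arrays and every name is
hypothetical.

/*
 * "Connections as logical tapes": nothing is written to disk; the merge
 * simply pulls the next row from whichever connection currently holds the
 * smallest value (the simulated cursor mentioned above).
 */
#include <stdio.h>
#include <stddef.h>

typedef struct DatanodeConn
{
    const int  *rows;       /* sorted rows this "datanode" will return */
    size_t      nrows;
    size_t      next;       /* cursor: next row to fetch on demand */
} DatanodeConn;

/* Fetch the next row from a connection; returns 0 when it is exhausted. */
static int
fetch_next(DatanodeConn *conn, int *row)
{
    if (conn->next >= conn->nrows)
        return 0;
    *row = conn->rows[conn->next++];
    return 1;
}

int
main(void)
{
    static const int r0[] = {1, 4, 7};
    static const int r1[] = {2, 5, 8, 9};
    DatanodeConn conns[2] = {{r0, 3, 0}, {r1, 4, 0}};
    int         head[2];
    int         valid[2];
    int         i;

    for (i = 0; i < 2; i++)
        valid[i] = fetch_next(&conns[i], &head[i]);

    /* Emit rows in global order by always taking the smallest head row. */
    for (;;)
    {
        int best = -1;

        for (i = 0; i < 2; i++)
            if (valid[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;
        printf("%d\n", head[best]);
        valid[best] = fetch_next(&conns[best], &head[best]);
    }
    return 0;
}

The output is produced in order without any intermediate storage, but it can
only be read once, which is exactly why random access (needed for the inner
side of a merge join) forces some form of materialisation, as the rest of
the quoted mail discusses.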
>>>>>>> >
>>>>>>> > For a Sort node the materialised result can reside in memory (if
>>>>>>> > it fits there) or on one of the logical tapes used for the merge
>>>>>>> > sort. So, in order to provide random access to the sorted result,
>>>>>>> > we need to materialise the result either in memory or on a logical
>>>>>>> > tape. In-memory materialisation is not easily possible since we
>>>>>>> > have already resorted to tape-based sort in the case of distributed
>>>>>>> > relations, and to materialise the result on tape there is no
>>>>>>> > logical tape available in the current algorithm. To make it work,
>>>>>>> > there are the following possible ways:
>>>>>>> >
>>>>>>> > 1. When random access is required, materialise the sorted runs from
>>>>>>> > the individual nodes onto tapes (one tape for each node) and then
>>>>>>> > merge them onto one extra tape, which can be used for
>>>>>>> > materialisation.
>>>>>>> > 2. Use a mix of connections and logical tapes in the same tape set.
>>>>>>> > Merge the sorted runs from the connections onto a logical tape in
>>>>>>> > the same logical tape set.
>>>>>>> >
>>>>>>> > While the second option looks attractive from a performance
>>>>>>> > perspective (it saves writing to and reading from the tape), it
>>>>>>> > would make the merge code ugly by mixing tape types. The read calls
>>>>>>> > for a connection and for a logical tape are different, and we would
>>>>>>> > need both on the logical tape where the final result is
>>>>>>> > materialized. So I am thinking of going with 1; in fact, to have
>>>>>>> > the same code handle remote sort, use 1 in all cases (whether or
>>>>>>> > not materialization is required).
>>>>>>> >
>>>>>>> > Had the original authors of the remote sort code thought about this
>>>>>>> > materialization? Anything they can share on this topic?
>>>>>>> > Any comments?
>>>>>>> > --
>>>>>>> > Best Wishes,
>>>>>>> > Ashutosh Bapat
>>>>>>> > EnterpriseDB Corporation
>>>>>>> > The Enterprise Postgres Company
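P.S. The overnight test reported at the top of this mail, written out as SQL
for reference. The data-loading statement here is only illustrative (the
steps above just say "Load 10M rows"); generate_series() with some filler
text is one way to do it, and the table is distributed according to the
cluster's defaults.

-- 1. The table
create table test1 (id integer primary key, padding text);

-- 2. Load 10M rows (assumed method; any bulk load of 10M rows will do)
insert into test1
select i, repeat('x', 100)   -- arbitrary filler for the padding column
from generate_series(1, 10000000) as i;

-- 3. The query left running overnight: sort 10M rows, return the top 100
select id from test1 order by 1 limit 100;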
--
Abbas
Architect
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: 92-334-5100153

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb