Re: [Postgres-xc-developers] Using remote sorting for merge-join

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Did you measure the performance?

On Thu, Apr 18, 2013 at 9:02 AM, Abbas Butt <abb...@en...>wrote:

>
>
> On Thu, Apr 18, 2013 at 1:07 AM, Abbas Butt <abb...@en...>wrote:
>
>> Hi,
>> Here is the review of the patch.
>>
>> Overall the patch is good to go. I have reviewed the code and found some
>> minor errors, which I corrected and have attached the revised patch with
>> the mail.
>>
>> I have tested both the cases when the sort happens in memory and when it
>> happens using disk and found both working.
>>
>> I agree that the approach used in the patch is cleaner and has smaller
>> footprint.
>>
>> I have corrected some white space errors and an unintentional change in
>> function set_dbcleanup_callback
>>     git apply /home/edb/Desktop/MergeSort/xc_sort.patch
>>     /home/edb/Desktop/MergeSort/xc_sort.patch:539: trailing whitespace.
>>         void *fparams;
>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1012: trailing whitespace.
>>
>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1018: trailing whitespace.
>>
>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1087: trailing whitespace.
>>         /*
>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1228: trailing whitespace.
>>                           size_t len, Oid msgnode_oid,
>>     warning: 5 lines add whitespace errors.
>>
>> I am leaving a query running for tonight which would sort 10M rows of a
>> distributed table and would return top 100 of them. I would report its
>> outcome tomorrow morning.
>>
>
> It worked, here is the test case
>
> 1. create table test1 (id integer primary key , padding text);
> 2. Load 10M rows
> 3. select id from test1 order by 1 limit 100
>
>
>
>
>>
>> Best Regards
>>
>>
>> On Mon, Apr 1, 2013 at 11:02 AM, Koichi Suzuki <koi...@gm...
>> > wrote:
>>
>>> Thanks.  Then 90% improvement means about 53% of the duration, while 50%
>>> means 67% of it.   Number of queries in a given duration is 190 vs. 150,
>>> difference is 40.
>>>
>>> Considering the needed resource, it may be okay to begin with
>>> materialization.
>>>
>>> Any other inputs?
>>> ----------
>>> Koichi Suzuki
>>>
>>>
>>> 2013/4/1 Ashutosh Bapat <ash...@en...>
>>>
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 10:59 AM, Koichi Suzuki <
>>>> koi...@gm...> wrote:
>>>>
>>>>> I understand materialize everything makes code clearer and
>>>>> implementation becomes simpler and better structured.
>>>>>
>>>>> What do you mean by x% improvement?   Does 90% improvement mean the
>>>>> total duration is 10% of the original?
>>>>>
>>>> x% improvement means, duration reduces to 100/(100+x) as compared to
>>>> the non-pushdown scenario. Or in simpler words, we see (100+x) queries
>>>> being completed by pushdown approach in the same time in which nonpushdown
>>>> approach completes 100 queries.
>>>>
>>>>> ----------
>>>>> Koichi Suzuki
>>>>>
>>>>>
>>>>> 2013/3/29 Ashutosh Bapat <ash...@en...>
>>>>>
>>>>>> Hi All,
>>>>>> I measured the scale up for both approaches - a. using datanode
>>>>>> connections as tapes (existing one) b. materialising result on tapes before
>>>>>> merging (the approach I proposed). For 1M rows, 5 coordinators I have found
>>>>>> that approach (a) gives 90% improvement whereas approach (b) gives 50%
>>>>>> improvement. Although the difference is significant, I feel that approach
>>>>>> (b) is much cleaner than approach (a) and doesn't have large footprint
>>>>>> compared to PG code and it takes care of all the cases like 1.
>>>>>> materialising sorted result, 2. takes care of any number of datanode
>>>>>> connections without memory overrun. It's possible to improve it further if
>>>>>> we avoid materialisation of datanode result in tuplestore.
>>>>>>
>>>>>> Patch attached for reference.
>>>>>>
>>>>>> On Tue, Mar 26, 2013 at 10:38 AM, Ashutosh Bapat <
>>>>>> ash...@en...> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 26, 2013 at 10:19 AM, Koichi Suzuki <
>>>>>>> koi...@gm...> wrote:
>>>>>>>
>>>>>>>> On thing we should think for option 1 is:
>>>>>>>>
>>>>>>>> When a number of the result is huge, applications has to wait long
>>>>>>>> time until they get the first row.  Because this option may need
>>>>>>>> disk
>>>>>>>> write, total resource consumption will be larger.
>>>>>>>>
>>>>>>>>
>>>>>>> Yes, I am aware of this fact. Please read the next paragraph and you
>>>>>>> will see that the current situation is no better.
>>>>>>>
>>>>>>>
>>>>>>>> I'm wondering if we can use "cursor" at database so that we can read
>>>>>>>> each tape more simply, I mean, to leave each query node open and
>>>>>>>> read
>>>>>>>> next row from any query node.
>>>>>>>>
>>>>>>>>
>>>>>>> We do that right now. But because of such a simulated cursor (it's
>>>>>>> not cursor per say, but we just fetch the required result from connection
>>>>>>> as the demand arises in merging runs), we observer following things
>>>>>>>
>>>>>>> If the plan has multiple remote query nodes (as there will be in
>>>>>>> case of merge join), we assign the same connection to these nodes. Before
>>>>>>> this assignment, the result from the previous connection is materialised at
>>>>>>> the coordinator. This means that, when we will get huge result from the
>>>>>>> datanode, it will be materialised (which will have the more cost as
>>>>>>> materialising it on tape, as this materialisation happens in a linked list,
>>>>>>> which is not optimized). We need to share connection between more than one
>>>>>>> RemoteQuery node because same transaction can not work on two connections
>>>>>>> to same server. Not only performance, but the code has become ugly because
>>>>>>> of this approach. At various places in executor, we have special handling
>>>>>>> for sorting, which needs to be maintained.
>>>>>>>
>>>>>>> Instead if we materialise all the result on tape and then proceed
>>>>>>> with step D5 in Knuth's algorithm for polyphase merge sort, the code will
>>>>>>> be much simpler and we won't loose much performance. In fact, we might be
>>>>>>> able to leverage fetching bulk data on connection which can be materialised
>>>>>>> on tape in bulk.
>>>>>>>
>>>>>>>
>>>>>>>> Regards;
>>>>>>>> ----------
>>>>>>>> Koichi Suzuki
>>>>>>>>
>>>>>>>>
>>>>>>>> 2013/3/25 Ashutosh Bapat <ash...@en...>:
>>>>>>>> > Hi All,
>>>>>>>> > I am working on using remote sorting for merge joins. The idea is
>>>>>>>> while
>>>>>>>> > using merge join at the coordinator, get the data sorted from the
>>>>>>>> datanodes;
>>>>>>>> > for replicated relations, we can get all the rows sorted and for
>>>>>>>> distributed
>>>>>>>> > tables we have to get sorted runs which can be merged at the
>>>>>>>> coordinator.
>>>>>>>> > For merge join the sorted inner relation needs to be randomly
>>>>>>>> accessible.
>>>>>>>> > For replicated relations this can be achieved by materialising
>>>>>>>> the result.
>>>>>>>> > But for distributed relations, we do not materialise the sorted
>>>>>>>> result at
>>>>>>>> > coordinator but compute the sorted result by merging the sorted
>>>>>>>> results from
>>>>>>>> > individual nodes on the fly. For distributed relations, the
>>>>>>>> connection to
>>>>>>>> > the datanodes themselves are used as logical tapes (which provide
>>>>>>>> the sorted
>>>>>>>> > runs). The final result is computed on the fly by choosing the
>>>>>>>> smallest or
>>>>>>>> > greatest row (as required) from the connections.
>>>>>>>> >
>>>>>>>> > For a Sort node the materialised result can reside in memory (if
>>>>>>>> it fits
>>>>>>>> > there) or on one of the logical tapes used for merge sort. So, in
>>>>>>>> order to
>>>>>>>> > provide random access to the sorted result, we need to
>>>>>>>> materialise the
>>>>>>>> > result either in the memory or on the logical tape. In-memory
>>>>>>>> > materialisation is not easily possible since we have already
>>>>>>>> resorted for
>>>>>>>> > tape based sort, in case of distributed relations and to
>>>>>>>> materialise the
>>>>>>>> > result on tape, there is no logical tape available in current
>>>>>>>> algorithm. To
>>>>>>>> > make it work, there are following possible ways
>>>>>>>> >
>>>>>>>> > 1. When random access is required, materialise the sorted runs
>>>>>>>> from
>>>>>>>> > individual nodes onto tapes (one tape for each node) and then
>>>>>>>> merge them on
>>>>>>>> > one extra tape, which can be used for materialisation.
>>>>>>>> > 2. Use a mix of connections and logical tape in the same tape
>>>>>>>> set. Merge the
>>>>>>>> > sorted runs from connections on a logical tape in the same
>>>>>>>> logical tape set.
>>>>>>>> >
>>>>>>>> > While the second one looks attractive from performance
>>>>>>>> perspective (it saves
>>>>>>>> > writing and reading from the tape), it would make the merge code
>>>>>>>> ugly by
>>>>>>>> > using mixed tapes. The read calls for connection and logical tape
>>>>>>>> are
>>>>>>>> > different and we will need both on the logical tape where the
>>>>>>>> final result
>>>>>>>> > is materialized. So, I am thinking of going with 1, in fact, to
>>>>>>>> have same
>>>>>>>> > code to handle remote sort, use 1 in all cases (whether or not
>>>>>>>> > materialization is required).
>>>>>>>> >
>>>>>>>> > Had original authors of remote sort code thought about this
>>>>>>>> materialization?
>>>>>>>> > Anything they can share on this topic?
>>>>>>>> > Any comment?
>>>>>>>> > --
>>>>>>>> > Best Wishes,
>>>>>>>> > Ashutosh Bapat
>>>>>>>> > EntepriseDB Corporation
>>>>>>>> > The Enterprise Postgres Company
>>>>>>>> >
>>>>>>>> >
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> > Everyone hates slow websites. So do we.
>>>>>>>> > Make your web apps faster with AppDynamics
>>>>>>>> > Download AppDynamics Lite for free today:
>>>>>>>> > http://p.sf.net/sfu/appdyn_d2d_mar
>>>>>>>> > _______________________________________________
>>>>>>>> > Postgres-xc-developers mailing list
>>>>>>>> > Pos...@li...
>>>>>>>> >
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Wishes,
>>>>>>> Ashutosh Bapat
>>>>>>> EntepriseDB Corporation
>>>>>>> The Enterprise Postgres Company
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Wishes,
>>>>>> Ashutosh Bapat
>>>>>> EntepriseDB Corporation
>>>>>> The Enterprise Postgres Company
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Wishes,
>>>> Ashutosh Bapat
>>>> EntepriseDB Corporation
>>>> The Enterprise Postgres Company
>>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Own the Future-Intel&reg; Level Up Game Demo Contest 2013
>>> Rise to greatness in Intel's independent game demo contest.
>>> Compete for recognition, cash, and the chance to get your game
>>> on Steam. $5K grand prize plus 10 genre and skill prizes.
>>> Submit your demo by 6/6/13. http://p.sf.net/sfu/intel_levelupd2d
>>> _______________________________________________
>>> Postgres-xc-developers mailing list
>>> Pos...@li...
>>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>>>
>>>
>>
>>
>> --
>> --
>> Abbas
>> Architect
>> EnterpriseDB Corporation
>> The Enterprise PostgreSQL Company
>>
>> Phone: 92-334-5100153
>>
>> Website: www.enterprisedb.com
>> EnterpriseDB Blog: http://blogs.enterprisedb.com/
>> Follow us on Twitter: http://www.twitter.com/enterprisedb
>>
>> This e-mail message (and any attachment) is intended for the use of
>> the individual or entity to whom it is addressed. This message
>> contains information from EnterpriseDB Corporation that may be
>> privileged, confidential, or exempt from disclosure under applicable
>> law. If you are not the intended recipient or authorized to receive
>> this for the intended recipient, any use, dissemination, distribution,
>> retention, archiving, or copying of this communication is strictly
>> prohibited. If you have received this e-mail in error, please notify
>> the sender immediately by reply e-mail and delete this message.
>>
>
>
>
> --
> --
> Abbas
> Architect
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
> Phone: 92-334-5100153
>
> Website: www.enterprisedb.com
> EnterpriseDB Blog: http://blogs.enterprisedb.com/
> Follow us on Twitter: http://www.twitter.com/enterprisedb
>
> This e-mail message (and any attachment) is intended for the use of
> the individual or entity to whom it is addressed. This message
> contains information from EnterpriseDB Corporation that may be
> privileged, confidential, or exempt from disclosure under applicable
> law. If you are not the intended recipient or authorized to receive
> this for the intended recipient, any use, dissemination, distribution,
> retention, archiving, or copying of this communication is strictly
> prohibited. If you have received this e-mail in error, please notify
> the sender immediately by reply e-mail and delete this message.
>

-- 
Best Wishes,
Ashutosh Bapat
EntepriseDB Corporation
The Enterprise Postgres Company