From: Ashutosh B. <ash...@en...> - 2013-04-18 11:38:52
Did you do it on a true cluster or by running all the instances on the same machine? The latter would degrade the performance.

On Thu, Apr 18, 2013 at 4:38 PM, Abbas Butt <abb...@en...> wrote:

> On Thu, Apr 18, 2013 at 8:43 AM, Ashutosh Bapat <ash...@en...> wrote:
>
>> Did you measure the performance?
>
> I tried, but I was getting very strange numbers. It took some hours but reported
>
> Time: 365649.353 ms
>
> which comes out to be some 6 minutes; I am not sure why.
>
>> On Thu, Apr 18, 2013 at 9:02 AM, Abbas Butt <abb...@en...> wrote:
>>
>>> On Thu, Apr 18, 2013 at 1:07 AM, Abbas Butt <abb...@en...> wrote:
>>>
>>>> Hi,
>>>> Here is the review of the patch.
>>>>
>>>> Overall the patch is good to go. I have reviewed the code and found some minor errors, which I corrected and have attached the revised patch with this mail.
>>>>
>>>> I have tested both the cases, when the sort happens in memory and when it happens using disk, and found both working.
>>>>
>>>> I agree that the approach used in the patch is cleaner and has a smaller footprint.
>>>>
>>>> I have corrected some whitespace errors and an unintentional change in function set_dbcleanup_callback:
>>>>
>>>> git apply /home/edb/Desktop/MergeSort/xc_sort.patch
>>>> /home/edb/Desktop/MergeSort/xc_sort.patch:539: trailing whitespace.
>>>> void *fparams;
>>>> /home/edb/Desktop/MergeSort/xc_sort.patch:1012: trailing whitespace.
>>>> /home/edb/Desktop/MergeSort/xc_sort.patch:1018: trailing whitespace.
>>>> /home/edb/Desktop/MergeSort/xc_sort.patch:1087: trailing whitespace.
>>>> /*
>>>> /home/edb/Desktop/MergeSort/xc_sort.patch:1228: trailing whitespace.
>>>> size_t len, Oid msgnode_oid,
>>>> warning: 5 lines add whitespace errors.
>>>>
>>>> I am leaving a query running for tonight which would sort 10M rows of a distributed table and return the top 100 of them. I will report its outcome tomorrow morning.
>>>
>>> It worked. Here is the test case:
>>>
>>> 1. create table test1 (id integer primary key, padding text);
>>> 2. Load 10M rows
>>> 3. select id from test1 order by 1 limit 100
>>>
>>>> Best Regards
>>>>
>>>> On Mon, Apr 1, 2013 at 11:02 AM, Koichi Suzuki <koi...@gm...> wrote:
>>>>
>>>>> Thanks. Then 90% improvement means about 53% of the duration, while 50% means 67% of it. The number of queries in a given duration is 190 vs. 150; the difference is 40.
>>>>>
>>>>> Considering the needed resources, it may be okay to begin with materialization.
>>>>>
>>>>> Any other inputs?
>>>>>
>>>>> 2013/4/1 Ashutosh Bapat <ash...@en...>:
>>>>>
>>>>>> On Mon, Apr 1, 2013 at 10:59 AM, Koichi Suzuki <koi...@gm...> wrote:
>>>>>>
>>>>>>> I understand that materializing everything makes the code clearer and the implementation simpler and better structured.
>>>>>>>
>>>>>>> What do you mean by x% improvement? Does 90% improvement mean the total duration is 10% of the original?
>>>>>>
>>>>>> x% improvement means the duration reduces to 100/(100+x) of the non-pushdown scenario. Or, in simpler words, we see (100+x) queries being completed by the pushdown approach in the same time in which the non-pushdown approach completes 100 queries.
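Worked out, this relation is what produces the 53% and 67% figures Koichi quotes above:

$$ t_{\text{pushdown}} = \frac{100}{100+x}\; t_{\text{non-pushdown}}, \qquad x = 90 \Rightarrow \tfrac{100}{190} \approx 0.53, \qquad x = 50 \Rightarrow \tfrac{100}{150} \approx 0.67 $$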
>>>>>>> 2013/3/29 Ashutosh Bapat <ash...@en...>:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>> I measured the scale-up for both approaches: (a) using datanode connections as tapes (the existing one) and (b) materialising the result on tapes before merging (the approach I proposed). For 1M rows and 5 coordinators I have found that approach (a) gives 90% improvement whereas approach (b) gives 50% improvement. Although the difference is significant, I feel that approach (b) is much cleaner than approach (a) and doesn't have a large footprint compared to the PG code, and it takes care of all the cases, like 1. materialising the sorted result, and 2. handling any number of datanode connections without memory overrun. It is possible to improve it further if we avoid materialisation of the datanode result in a tuplestore.
>>>>>>>>
>>>>>>>> Patch attached for reference.
>>>>>>>>
>>>>>>>> On Tue, Mar 26, 2013 at 10:38 AM, Ashutosh Bapat <ash...@en...> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Mar 26, 2013 at 10:19 AM, Koichi Suzuki <koi...@gm...> wrote:
>>>>>>>>>
>>>>>>>>>> One thing we should think about for option 1 is: when the result is huge, applications have to wait a long time until they get the first row. Because this option may need disk writes, the total resource consumption will be larger.
>>>>>>>>>
>>>>>>>>> Yes, I am aware of this fact. Please read the next paragraph and you will see that the current situation is no better.
>>>>>>>>>
>>>>>>>>>> I'm wondering if we can use a "cursor" at the database so that we can read each tape more simply, I mean, leave each query node open and read the next row from any query node.
>>>>>>>>>
>>>>>>>>> We do that right now. But because of such a simulated cursor (it is not a cursor per se; we just fetch the required result from the connection as the demand arises while merging runs), we observe the following.
>>>>>>>>>
>>>>>>>>> If the plan has multiple RemoteQuery nodes (as there will be in case of a merge join), we assign the same connection to these nodes. Before this assignment, the result from the previous connection is materialised at the coordinator. This means that when we get a huge result from the datanode, it is materialised at the coordinator (which costs at least as much as materialising it on tape, since this materialisation happens in a linked list, which is not optimized). We need to share a connection between more than one RemoteQuery node because the same transaction cannot work on two connections to the same server. It is not only a performance problem: the code has become ugly because of this approach. At various places in the executor we have special handling for sorting, which needs to be maintained.
>>>>>>>>>
>>>>>>>>> Instead, if we materialise all the results on tape and then proceed with step D5 of Knuth's algorithm for polyphase merge sort, the code will be much simpler and we won't lose much performance. In fact, we might be able to leverage fetching bulk data on a connection, which can then be materialised on tape in bulk.
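For readers following the design discussion: the on-the-fly merge that the existing approach (a) performs, repeatedly picking the smallest head row among the sorted streams returned by the datanode connections, can be sketched as below. This is only an illustrative, self-contained example: the names (Source, sift_down) are made up and plain integer arrays stand in for the per-datanode sorted runs; it is not the Postgres-XC or tuplesort code itself.

    /*
     * Illustrative sketch only: a heap-based k-way merge over pre-sorted
     * integer arrays, standing in for the sorted streams coming back from
     * datanode connections.  Names and data are invented for the example.
     */
    #include <stdio.h>

    typedef struct Source {
        const int *rows;   /* sorted run from one "datanode" */
        int        nrows;
        int        pos;    /* next unread row */
    } Source;

    /* heap of source indexes, ordered by the next row of each source */
    static void sift_down(Source *src, int *heap, int n, int i)
    {
        for (;;)
        {
            int smallest = i, l = 2 * i + 1, r = 2 * i + 2;

            if (l < n && src[heap[l]].rows[src[heap[l]].pos] <
                         src[heap[smallest]].rows[src[heap[smallest]].pos])
                smallest = l;
            if (r < n && src[heap[r]].rows[src[heap[r]].pos] <
                         src[heap[smallest]].rows[src[heap[smallest]].pos])
                smallest = r;
            if (smallest == i)
                break;
            { int t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t; }
            i = smallest;
        }
    }

    int main(void)
    {
        /* three pre-sorted runs, as if returned by three datanodes */
        int r1[] = {1, 4, 9}, r2[] = {2, 3, 10}, r3[] = {5, 6, 7};
        Source src[3] = {{r1, 3, 0}, {r2, 3, 0}, {r3, 3, 0}};
        int heap[3] = {0, 1, 2};
        int nheap = 3;

        /* build the heap over the head rows */
        for (int i = nheap / 2 - 1; i >= 0; i--)
            sift_down(src, heap, nheap, i);

        /* repeatedly emit the smallest head row, as the on-the-fly merge does */
        while (nheap > 0)
        {
            Source *s = &src[heap[0]];

            printf("%d\n", s->rows[s->pos++]);
            if (s->pos == s->nrows)          /* run exhausted: drop it */
                heap[0] = heap[--nheap];
            sift_down(src, heap, nheap, 0);
        }
        return 0;
    }

With k datanode connections the heap makes each output row cost O(log k) comparisons, which is why the streaming approach can hand back the first rows without waiting for the whole result.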
>>>>>>>>>> 2013/3/25 Ashutosh Bapat <ash...@en...>:
>>>>>>>>>>
>>>>>>>>>> > Hi All,
>>>>>>>>>> > I am working on using remote sorting for merge joins. The idea is: while using a merge join at the coordinator, get the data sorted from the datanodes; for replicated relations we can get all the rows sorted, and for distributed tables we have to get sorted runs which can be merged at the coordinator. For a merge join the sorted inner relation needs to be randomly accessible. For replicated relations this can be achieved by materialising the result. But for distributed relations we do not materialise the sorted result at the coordinator; we compute the sorted result by merging the sorted results from the individual nodes on the fly. For distributed relations, the connections to the datanodes themselves are used as logical tapes (which provide the sorted runs). The final result is computed on the fly by choosing the smallest or greatest row (as required) from the connections.
>>>>>>>>>> >
>>>>>>>>>> > For a Sort node the materialised result can reside in memory (if it fits there) or on one of the logical tapes used for the merge sort. So, in order to provide random access to the sorted result, we need to materialise the result either in memory or on a logical tape. In-memory materialisation is not easily possible, since for distributed relations we have already resorted to tape-based sort, and to materialise the result on tape there is no logical tape available in the current algorithm. To make it work, there are the following possible ways:
>>>>>>>>>> >
>>>>>>>>>> > 1. When random access is required, materialise the sorted runs from the individual nodes onto tapes (one tape for each node) and then merge them onto one extra tape, which can be used for materialisation.
>>>>>>>>>> > 2. Use a mix of connections and logical tapes in the same tape set. Merge the sorted runs from the connections onto a logical tape in the same logical tape set.
>>>>>>>>>> >
>>>>>>>>>> > While the second one looks attractive from a performance perspective (it saves writing to and reading from the tape), it would make the merge code ugly by using mixed tapes. The read calls for a connection and for a logical tape are different, and we would need both on the logical tape where the final result is materialised. So I am thinking of going with 1; in fact, to have the same code handle remote sort, use 1 in all cases (whether or not materialisation is required).
>>>>>>>>>> >
>>>>>>>>>> > Had the original authors of the remote sort code thought about this materialisation? Anything they can share on this topic? Any comment?
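Option 1 above, spooling each node's sorted run onto its own tape and then merging the tapes onto one extra tape that the merge join can rewind, can likewise be sketched in a few lines. Again this is only an illustrative example under simplifying assumptions (integer rows, tmpfile()-backed "tapes", a plain linear scan instead of the polyphase-merge machinery); it is not the actual implementation.

    /*
     * Illustrative sketch only: materialise each sorted run on its own
     * temporary file (one "tape" per datanode), merge the tapes onto one
     * extra tape, and rewind that tape for random access to the sorted
     * result, as the inner side of a merge join requires.
     */
    #include <stdio.h>

    #define NRUNS 3

    int main(void)
    {
        int runs[NRUNS][3] = {{1, 4, 9}, {2, 3, 10}, {5, 6, 7}};
        FILE *tape[NRUNS], *result = tmpfile();
        int head[NRUNS];     /* current head row of each tape */
        int live[NRUNS];     /* 1 while the tape still has rows */

        /* step 1: spool each sorted run onto its own tape */
        for (int i = 0; i < NRUNS; i++)
        {
            tape[i] = tmpfile();
            fwrite(runs[i], sizeof(int), 3, tape[i]);
            rewind(tape[i]);
            live[i] = (fread(&head[i], sizeof(int), 1, tape[i]) == 1);
        }

        /* step 2: merge the tapes onto one extra tape (a linear scan here;
         * the real code would use the polyphase-merge machinery) */
        for (;;)
        {
            int best = -1;
            for (int i = 0; i < NRUNS; i++)
                if (live[i] && (best < 0 || head[i] < head[best]))
                    best = i;
            if (best < 0)
                break;
            fwrite(&head[best], sizeof(int), 1, result);
            live[best] = (fread(&head[best], sizeof(int), 1, tape[best]) == 1);
        }

        /* step 3: the result tape now supports random access; rewind and
         * re-read it, which is what a merge join's inner side needs */
        rewind(result);
        int v;
        while (fread(&v, sizeof(int), 1, result) == 1)
            printf("%d\n", v);
        return 0;
    }

The cost of this option is the extra write and read of the whole result before the first row can be returned, which is exactly the trade-off discussed in the thread above.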
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company