Re: [Postgres-xc-developers] Using remote sorting for merge-join

Then that performance is expected. There is too much I/O on the same machine
and too much context switching.

On Thu, Apr 18, 2013 at 7:35 PM, Abbas Butt <abb...@en...> wrote:

> All instances on the same machine.
>
>
> On Thu, Apr 18, 2013 at 4:38 PM, Ashutosh Bapat <
> ash...@en...> wrote:
>
>> Did you do it on a true cluster or by running all instances on the same
>> machine? The latter would degrade the performance.
>>
>>
>> On Thu, Apr 18, 2013 at 4:38 PM, Abbas Butt <abb...@en...> wrote:
>>
>>>
>>>
>>> On Thu, Apr 18, 2013 at 8:43 AM, Ashutosh Bapat <
>>> ash...@en...> wrote:
>>>
>>>> Did you measure the performance?
>>>>
>>>
>>> I tried, but I got very strange numbers. The query ran for some hours but
>>> reported
>>>
>>> Time: 365649.353 ms
>>>
>>> which comes out to be some 6 minutes. I am not sure why.
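As a quick check of the conversion mentioned above, the reported figure can be turned into minutes. This is only arithmetic on the number quoted in the mail; it says nothing about why the wall-clock time and the reported time differ.

```python
# Reported timing from the mail, in milliseconds.
reported_ms = 365649.353

# 1 minute = 60 seconds = 60,000 milliseconds.
reported_minutes = reported_ms / 60_000

print(f"{reported_minutes:.2f} minutes")
```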
>>>
>>>
>>>>
>>>>
>>>> On Thu, Apr 18, 2013 at 9:02 AM, Abbas Butt <
>>>> abb...@en...> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 18, 2013 at 1:07 AM, Abbas Butt <
>>>>> abb...@en...> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Here is the review of the patch.
>>>>>>
>>>>>> Overall the patch is good to go. I have reviewed the code and found
>>>>>> some minor errors, which I corrected and have attached the revised patch
>>>>>> with the mail.
>>>>>>
>>>>>> I have tested both the cases when the sort happens in memory and when
>>>>>> it happens using disk and found both working.
>>>>>>
>>>>>> I agree that the approach used in the patch is cleaner and has a
>>>>>> smaller footprint.
>>>>>>
>>>>>> I have corrected some white space errors and an unintentional change
>>>>>> in function set_dbcleanup_callback
>>>>>>     git apply /home/edb/Desktop/MergeSort/xc_sort.patch
>>>>>>     /home/edb/Desktop/MergeSort/xc_sort.patch:539: trailing whitespace.
>>>>>>         void *fparams;
>>>>>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1012: trailing whitespace.
>>>>>>
>>>>>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1018: trailing whitespace.
>>>>>>
>>>>>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1087: trailing whitespace.
>>>>>>         /*
>>>>>>     /home/edb/Desktop/MergeSort/xc_sort.patch:1228: trailing whitespace.
>>>>>>                           size_t len, Oid msgnode_oid,
>>>>>>     warning: 5 lines add whitespace errors.
>>>>>>
>>>>>> I am leaving a query running tonight that sorts 10M rows of
>>>>>> a distributed table and returns the top 100 of them. I will report its
>>>>>> outcome tomorrow morning.
>>>>>>
>>>>>
>>>>> It worked. Here is the test case:
>>>>>
>>>>> 1. create table test1 (id integer primary key , padding text);
>>>>> 2. Load 10M rows
>>>>> 3. select id from test1 order by 1 limit 100
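The shape of this test (ORDER BY with a small LIMIT over a distributed table) can be illustrated with a small sketch. It assumes each datanode returns its rows already sorted and that the coordinator merges those runs lazily, stopping after the first 100 rows. The function name and the modulo-based row placement are hypothetical and chosen only for illustration; this is not the Postgres-XC executor code.

```python
import heapq
from itertools import islice


def top_n_from_sorted_runs(runs, n):
    """Return the first n rows of the merged order.

    heapq.merge is lazy, so only as many rows as needed are pulled
    from each sorted run before islice stops the iteration.
    """
    return list(islice(heapq.merge(*runs), n))


# Hypothetical layout: 4 datanodes, ids 0..999 placed by id modulo 4,
# each datanode returning its own ids in sorted order.
runs = [list(range(node, 1000, 4)) for node in range(4)]
top = top_n_from_sorted_runs(runs, 100)
```

With a lazy merge like this, the cost of the final step depends on the LIMIT rather than on the table size, provided the per-node sorted runs are available.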
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 1, 2013 at 11:02 AM, Koichi Suzuki <
>>>>>> koi...@gm...> wrote:
>>>>>>
>>>>>>> Thanks.  Then a 90% improvement means about 53% of the duration, while
>>>>>>> 50% means 67% of it.  The number of queries in a given duration is 190 vs.
>>>>>>> 150, a difference of 40.
>>>>>>>
>>>>>>> Considering the needed resource, it may be okay to begin with
>>>>>>> materialization.
>>>>>>>
>>>>>>> Any other inputs?
>>>>>>> ----------
>>>>>>> Koichi Suzuki
>>>>>>>
>>>>>>>
>>>>>>> 2013/4/1 Ashutosh Bapat <ash...@en...>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 10:59 AM, Koichi Suzuki <
>>>>>>>> koi...@gm...> wrote:
>>>>>>>>
>>>>>>>>> I understand that materializing everything makes the code clearer and
>>>>>>>>> the implementation simpler and better structured.
>>>>>>>>>
>>>>>>>>> What do you mean by x% improvement?   Does 90% improvement mean
>>>>>>>>> the total duration is 10% of the original?
>>>>>>>>>
>>>>>>>> x% improvement means the duration reduces to 100/(100+x) of the
>>>>>>>> non-pushdown duration. In simpler words, the pushdown approach completes
>>>>>>>> (100+x) queries in the same time in which the non-pushdown approach
>>>>>>>> completes 100 queries.
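The definition above can be checked with a short calculation. The helper names below are made up for this sketch; the formula 100/(100+x) and the 90% and 50% figures come from the thread.

```python
def duration_fraction(improvement_pct):
    """Fraction of the non-pushdown duration left after an x% improvement."""
    return 100 / (100 + improvement_pct)


def queries_in_same_time(improvement_pct, baseline_queries=100):
    """Queries completed in the time the baseline completes baseline_queries."""
    return baseline_queries * (100 + improvement_pct) / 100


for pct in (90, 50):
    print(pct, round(duration_fraction(pct) * 100), queries_in_same_time(pct))
```

So a 90% improvement leaves roughly 53% of the original duration, and a 50% improvement leaves roughly 67%, which is 190 versus 150 queries in the time the baseline takes for 100.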
>>>>>>>>
>>>>>>>>> ----------
>>>>>>>>> Koichi Suzuki
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2013/3/29 Ashutosh Bapat <ash...@en...>
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>> I measured the scale-up for both approaches: (a) using datanode
>>>>>>>>>> connections as tapes (the existing one), and (b) materialising the result
>>>>>>>>>> on tapes before merging (the approach I proposed). For 1M rows and 5
>>>>>>>>>> coordinators, I found that approach (a) gives a 90% improvement whereas
>>>>>>>>>> approach (b) gives a 50% improvement. Although the difference is
>>>>>>>>>> significant, I feel that approach (b) is much cleaner than approach (a)
>>>>>>>>>> and doesn't have a large footprint compared to the PG code. It also takes
>>>>>>>>>> care of all the cases, such as 1. materialising the sorted result, and
>>>>>>>>>> 2. handling any number of datanode connections without memory overrun.
>>>>>>>>>> It's possible to improve it further if we avoid materialising the
>>>>>>>>>> datanode result in a tuplestore.
>>>>>>>>>>
>>>>>>>>>> Patch attached for reference.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 26, 2013 at 10:38 AM, Ashutosh Bapat <
>>>>>>>>>> ash...@en...> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 26, 2013 at 10:19 AM, Koichi Suzuki <
>>>>>>>>>>> koi...@gm...> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> One thing we should think about for option 1 is:
>>>>>>>>>>>>
>>>>>>>>>>>> When the result is huge, applications have to wait a long time
>>>>>>>>>>>> until they get the first row.  Because this option may need disk
>>>>>>>>>>>> writes, total resource consumption will be larger.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Yes, I am aware of this fact. Please read the next paragraph and
>>>>>>>>>>> you will see that the current situation is no better.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I'm wondering if we can use a "cursor" at the database so that we can
>>>>>>>>>>>> read each tape more simply. I mean, leave each query node open and
>>>>>>>>>>>> read the next row from any query node.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> We do that right now. But because of such a simulated cursor
>>>>>>>>>>> (it's not a cursor per se; we just fetch the required result from the
>>>>>>>>>>> connection as the demand arises in merging runs), we observe the
>>>>>>>>>>> following things.
>>>>>>>>>>>
>>>>>>>>>>> If the plan has multiple remote query nodes (as there will be in
>>>>>>>>>>> the case of a merge join), we assign the same connection to these nodes.
>>>>>>>>>>> Before this assignment, the result from the previous connection is
>>>>>>>>>>> materialised at the coordinator. This means that when we get a huge
>>>>>>>>>>> result from the datanode, it will be materialised anyway (at a higher
>>>>>>>>>>> cost than materialising it on tape, as this materialisation happens in a
>>>>>>>>>>> linked list, which is not optimized). We need to share a connection
>>>>>>>>>>> between more than one RemoteQuery node because the same transaction
>>>>>>>>>>> cannot work on two connections to the same server. Not only does
>>>>>>>>>>> performance suffer, but the code has also become ugly because of this
>>>>>>>>>>> approach. At various places in the executor, we have special handling
>>>>>>>>>>> for sorting, which needs to be maintained.
>>>>>>>>>>>
>>>>>>>>>>> Instead, if we materialise all the results on tape and then
>>>>>>>>>>> proceed with step D5 in Knuth's algorithm for polyphase merge sort, the
>>>>>>>>>>> code will be much simpler and we won't lose much performance. In fact, we
>>>>>>>>>>> might be able to leverage fetching bulk data on the connection, which can
>>>>>>>>>>> be materialised on tape in bulk.
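The materialise-then-merge idea described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patch or the tuplesort code: temporary files stand in for logical tapes, rows are plain integers, all function names are hypothetical, and a single-pass k-way merge via heapq.merge replaces the merge phases of Knuth's polyphase algorithm.

```python
import heapq
import tempfile


def materialise_run(rows):
    """Write one datanode's already-sorted rows to a temp file (a stand-in for a tape)."""
    tape = tempfile.TemporaryFile(mode="w+")
    for row in rows:
        tape.write(f"{row}\n")
    tape.seek(0)  # rewind so the tape can be read from the start
    return tape


def read_tape(tape):
    """Yield integer rows back from a materialised tape, in stored order."""
    for line in tape:
        yield int(line)


def merge_materialised_runs(runs):
    """Materialise every run first, then k-way merge the tapes."""
    tapes = [materialise_run(run) for run in runs]
    try:
        # heapq.merge repeatedly picks the smallest head among the sorted inputs.
        return list(heapq.merge(*(read_tape(t) for t in tapes)))
    finally:
        for t in tapes:
            t.close()


# Three hypothetical datanodes, each returning a sorted run of ids.
datanode_runs = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]
merged = merge_materialised_runs(datanode_runs)
```

Because every run is fully written out before the merge starts, the connections are free for other remote query nodes during the merge, at the price of the extra write and read of each run.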
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Regards;
>>>>>>>>>>>> ----------
>>>>>>>>>>>> Koichi Suzuki
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2013/3/25 Ashutosh Bapat <ash...@en...>:
>>>>>>>>>>>> > Hi All,
>>>>>>>>>>>> > I am working on using remote sorting for merge joins. The idea is,
>>>>>>>>>>>> > while using merge join at the coordinator, to get the data sorted from
>>>>>>>>>>>> > the datanodes; for replicated relations, we can get all the rows sorted,
>>>>>>>>>>>> > and for distributed tables we have to get sorted runs which can be
>>>>>>>>>>>> > merged at the coordinator. For merge join, the sorted inner relation
>>>>>>>>>>>> > needs to be randomly accessible. For replicated relations this can be
>>>>>>>>>>>> > achieved by materialising the result. But for distributed relations, we
>>>>>>>>>>>> > do not materialise the sorted result at the coordinator but compute it
>>>>>>>>>>>> > by merging the sorted results from individual nodes on the fly. For
>>>>>>>>>>>> > distributed relations, the connections to the datanodes themselves are
>>>>>>>>>>>> > used as logical tapes (which provide the sorted runs). The final result
>>>>>>>>>>>> > is computed on the fly by choosing the smallest or greatest row (as
>>>>>>>>>>>> > required) from the connections.
>>>>>>>>>>>> >
>>>>>>>>>>>> > For a Sort node the materialised result can reside in memory (if it
>>>>>>>>>>>> > fits there) or on one of the logical tapes used for merge sort. So, in
>>>>>>>>>>>> > order to provide random access to the sorted result, we need to
>>>>>>>>>>>> > materialise the result either in memory or on a logical tape.
>>>>>>>>>>>> > In-memory materialisation is not easily possible, since in the case of
>>>>>>>>>>>> > distributed relations we have already resorted to a tape-based sort,
>>>>>>>>>>>> > and to materialise the result on tape, there is no logical tape
>>>>>>>>>>>> > available in the current algorithm. To make it work, there are the
>>>>>>>>>>>> > following possible ways:
>>>>>>>>>>>> >
>>>>>>>>>>>> > 1. When random access is required, materialise the sorted runs from
>>>>>>>>>>>> > individual nodes onto tapes (one tape for each node) and then merge
>>>>>>>>>>>> > them on one extra tape, which can be used for materialisation.
>>>>>>>>>>>> > 2. Use a mix of connections and logical tape in the same tape set.
>>>>>>>>>>>> > Merge the sorted runs from connections on a logical tape in the same
>>>>>>>>>>>> > logical tape set.
>>>>>>>>>>>> >
>>>>>>>>>>>> > While the second one looks attractive from a performance perspective
>>>>>>>>>>>> > (it saves writing to and reading from the tape), it would make the
>>>>>>>>>>>> > merge code ugly by using mixed tapes. The read calls for a connection
>>>>>>>>>>>> > and a logical tape are different, and we will need both on the logical
>>>>>>>>>>>> > tape where the final result is materialized. So I am thinking of going
>>>>>>>>>>>> > with 1; in fact, to have the same code handle remote sort, use 1 in all
>>>>>>>>>>>> > cases (whether or not materialization is required).
>>>>>>>>>>>> >
>>>>>>>>>>>> > Did the original authors of the remote sort code think about this
>>>>>>>>>>>> > materialization? Anything they can share on this topic?
>>>>>>>>>>>> > Any comments?
>>>>>>>>>>>> > --
>>>>>>>>>>>> > Best Wishes,
>>>>>>>>>>>> > Ashutosh Bapat
>>>>>>>>>>>> > EnterpriseDB Corporation
>>>>>>>>>>>> > The Enterprise Postgres Company
>>>>>> --
>>>>>> --
>>>>>> Abbas
>>>>>> Architect
>>>>>> EnterpriseDB Corporation
>>>>>> The Enterprise PostgreSQL Company
>>>>>>
>>>>>> Phone: 92-334-5100153
>>>>>>
>>>>>> Website: www.enterprisedb.com
>>>>>> EnterpriseDB Blog: http://blogs.enterprisedb.com/
>>>>>> Follow us on Twitter: http://www.twitter.com/enterprisedb
>>>>>>
>>>>>> This e-mail message (and any attachment) is intended for the use of
>>>>>> the individual or entity to whom it is addressed. This message
>>>>>> contains information from EnterpriseDB Corporation that may be
>>>>>> privileged, confidential, or exempt from disclosure under applicable
>>>>>> law. If you are not the intended recipient or authorized to receive
>>>>>> this for the intended recipient, any use, dissemination, distribution,
>>>>>> retention, archiving, or copying of this communication is strictly
>>>>>> prohibited. If you have received this e-mail in error, please notify
>>>>>> the sender immediately by reply e-mail and delete this message.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company