From: Jim B. <ba...@ne...> - 2014-11-07 20:19:06
Yes, I am getting the exact same ordering with and without DISTINCT.

Thanks!
Jim

> On Nov 7, 2014, at 3:09 PM, Bryan Thompson <br...@sy...> wrote:
>
> Jim,
>
> Did that file fix the issue for you?
>
> Thanks,
> Bryan
>
> ----
> Bryan Thompson
> Chief Scientist & Founder
> SYSTAP, LLC
> 4501 Tower Road
> Greensboro, NC 27410
> br...@sy...
> http://bigdata.com
> http://mapgraph.io
> CONFIDENTIALITY NOTICE: This email and its contents and attachments are for the sole use of the intended recipient(s) and are confidential or proprietary to SYSTAP. Any unauthorized review, use, disclosure, dissemination or copying of this email or its contents or attachments is prohibited. If you have received this communication in error, please notify the sender by reply email and permanently delete all copies of the email and its contents and attachments.
>
> On Fri, Nov 7, 2014 at 10:00 AM, Jim Balhoff <ba...@ne...> wrote:
> I can give this a try. It will be good experience since I am not really familiar with the Bigdata source code, so it may take me a little while.
>
> Thanks,
> Jim
>
> > On Nov 7, 2014, at 8:44 AM, Bryan Thompson <br...@sy...> wrote:
> >
> > Jim,
> >
> > Can you put together a unit test for this so we can avoid regressions? It would need to have a sufficiently large data set to allow the problem to be demonstrated. You would need to run both queries and compare the resulting ordering. The data would have to be something that could be committed into SVN, so with appropriate data rights and not too large. But still large enough.
> >
> > Thanks,
> > Bryan
> >
> > On Fri, Nov 7, 2014 at 8:37 AM, Bryan Thompson <br...@sy...> wrote:
> > Jim.
> >
> > Ok. I was able to pull together the output of both queries into a single worksheet and then compare the rows and mark the rows that were not EQUALS and as such had a different ordering.
> >
> > I have created a ticket for this. See http://trac.bigdata.com/ticket/1044.
> >
> > I would appreciate it if you could have gone a little further with this and reduced the problem to something that clearly highlighted the problem. I had to spend quite a bit of time trying to figure out why you were seeing a problem in the output data. I could not spot any problem myself until I put the data sets side-by-side in Excel and even then I had to automate the comparison and then FILTER (in Excel) to find the rows where the output differed.
> >
> > I think that I know the root cause. I will update the ticket shortly and attach a file that you can test on your end for a fix.
> >
> > Thanks,
> > Bryan
> >
> > On Thu, Nov 6, 2014 at 9:14 PM, Jim Balhoff <ba...@ne...> wrote:
> > I just realized my message may have been misleading. By "results are the same", I mean that the problem is still apparent. When using SELECT DISTINCT, ORDER BY does not work correctly and produces a different ordering compared to SELECT.
> >
> > > On Nov 6, 2014, at 12:22 PM, Jim Balhoff <ba...@ne...> wrote:
> > >
> > > I updated the query to use the simple variable in ORDER BY, and the results are the same.
> > >
> > > Here is the exact query (with or without DISTINCT) for the linked results:
> > >
> > > PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > > PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > > PREFIX owl: <http://www.w3.org/2002/07/owl#>
> > >
> > > SELECT DISTINCT ?term ?string_label
> > > WHERE
> > > {
> > >   ?term rdf:type owl:Class .
> > >   ?term rdfs:label ?term_label .
> > >   BIND (STR(?term_label) AS ?string_label)
> > > }
> > > ORDER BY ?string_label
> > >
> > > Results (same number of rows either way):
> > > SELECT DISTINCT:
> > > explain: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/with_distinct_explain.html
> > > result: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/with_distinct_result.csv
> > >
> > > SELECT:
> > > explain: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/no_distinct_explain.html
> > > result: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/no_distinct_result.csv
> > >
> > > Thanks,
> > > Jim
> > >
> > >> On Nov 6, 2014, at 12:01 PM, Bryan Thompson <br...@sy...> wrote:
> > >>
> > >> What happens if you replace that last line with:
> > >>
> > >> ORDER BY ?string_label
> > >>
> > >> rather than
> > >>
> > >> ORDER BY STR(?string_label)
> > >>
> > >> Remember, it is assuming that the ORDER BY is using simple variables.
> > >>
> > >> Bryan
> > >>
> > >> On Thu, Nov 6, 2014 at 11:58 AM, Jim Balhoff <ba...@ne...> wrote:
> > >> Here is the exact query (with or without DISTINCT) for the linked results:
> > >>
> > >> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > >> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > >> PREFIX owl: <http://www.w3.org/2002/07/owl#>
> > >>
> > >> SELECT DISTINCT ?term ?string_label
> > >> WHERE
> > >> {
> > >>   ?term rdf:type owl:Class .
> > >>   ?term rdfs:label ?term_label .
> > >>   BIND (STR(?term_label) AS ?string_label)
> > >> }
> > >> ORDER BY STR(?string_label)
> > >>
> > >> Results (same number of rows either way):
> > >> SELECT DISTINCT:
> > >> explain: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/with_distinct_explain.html
> > >> result: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/with_distinct_result.csv
> > >>
> > >> SELECT:
> > >> explain: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/no_distinct_explain.html
> > >> result: https://dl.dropboxusercontent.com/u/6704325/bigdata/2014-11-6/no_distinct_result.csv
> > >>
> > >> You can diff the two results files to see the out-of-order blocks.
> > >>
> > >> I suppose it does look like the DISTINCT query plan has ORDER BY applied before DISTINCT, if I am reading it right.
> > >>
> > >> Thanks,
> > >> Jim
> > >>
> > >>> On Nov 6, 2014, at 10:10 AM, Bryan Thompson <br...@sy...> wrote:
> > >>>
> > >>> Jim,
> > >>>
> > >>> Ticket 502 is about support for expressions (other than simple variables) in ORDER BY.
> > >>>
> > >>> If there is an issue with DISTINCT + ORDER BY then this would be a new ticket.
> > >>>
> > >>> Just post the EXPLAIN (attach to the email) for the moment. I want to see how this is being generated. We should then check the specification and make sure that the correct behavior is DISTINCT followed by ORDER BY with any limit applied after the ORDER BY.
> > >>> I can then check the code for how we are handling this.
> > >>>
> > >>> The relevant logic is in AST2BOpUtility at line 451. You can see that it is already attempting to handle this and that there was a historical ticket for this issue (#563).
> > >>>
> > >>>     /*
> > >>>      * Note: The DISTINCT operators also enforce the projection.
> > >>>      *
> > >>>      * Note: REDUCED allows, but does not require, either complete or
> > >>>      * partial filtering of duplicates. It is part of what openrdf does
> > >>>      * for a DESCRIBE query.
> > >>>      *
> > >>>      * Note: We do not currently have special operator for REDUCED. One
> > >>>      * could be created using chunk wise DISTINCT. Note that REDUCED may
> > >>>      * not change the order in which the solutions appear (but we are
> > >>>      * evaluating it before ORDER BY so that is Ok.)
> > >>>      *
> > >>>      * TODO If there is an ORDER BY and a DISTINCT then the sort can be
> > >>>      * used to impose the distinct without the overhead of a hash index
> > >>>      * by filtering out the duplicate solutions after the sort.
> > >>>      */
> > >>>
> > >>>     // When true, DISTINCT must preserve ORDER BY ordering.
> > >>>     final boolean preserveOrder;
> > >>>
> > >>>     if (orderBy != null && !orderBy.isEmpty()) {
> > >>>
> > >>>         /*
> > >>>          * Note: ORDER BY before DISTINCT, so DISTINCT must preserve
> > >>>          * order.
> > >>>          *
> > >>>          * @see https://sourceforge.net/apps/trac/bigdata/ticket/563
> > >>>          * (ORDER BY + DISTINCT)
> > >>>          */
> > >>>         preserveOrder = true;
> > >>>
> > >>>         left = addOrderBy(left, queryBase, orderBy, ctx);
> > >>>
> > >>>     } else {
> > >>>
> > >>>         preserveOrder = false;
> > >>>
> > >>>     }
> > >>>
> > >>>     if (projection.isDistinct() || projection.isReduced()) {
> > >>>
> > >>>         left = addDistinct(left, queryBase, preserveOrder, ctx);
> > >>>
> > >>>     }
> > >>>
> > >>> } else {
> > >>>
> > >>>     /*
> > >>>      * TODO Under what circumstances can the projection be [null]?
> > >>>      */
> > >>>     if (orderBy != null && !orderBy.isEmpty()) {
> > >>>
> > >>>         left = addOrderBy(left, queryBase, orderBy, ctx);
> > >>>
> > >>>     }
> > >>>
> > >>> }
> > >>>
> > >>> Bryan
> > >>>
> > >>> On Thu, Nov 6, 2014 at 10:03 AM, Jim Balhoff <ba...@ne...> wrote:
> > >>> Hi Bryan,
> > >>>
> > >>> Just to clarify, would you like me to attach the info to ticket 502, or continue posting to the developer list?
> > >>>
> > >>> Thanks,
> > >>> Jim
> > >>>
> > >>>> On Nov 6, 2014, at 8:28 AM, Bryan Thompson <br...@sy...> wrote:
> > >>>>
> > >>>> The ticket for allowing aggregates in ORDER BY is:
> > >>>>
> > >>>> - http://trac.bigdata.com/ticket/502 (Allow aggregates in ORDER BY clause)
> > >>>>
> > >>>> Can you attach the EXPLAIN of the query with and without DISTINCT? The issue may be that the DISTINCT is being applied after the ORDER BY. I seem to remember some issue historically with operations being performed before/after the ORDER BY, but I do not have any distinct recollection of a problematic interaction between DISTINCT and ORDER BY.
> > >>>>
> > >>>> Bryan
> > >>>>
> > >>>> On Wed, Nov 5, 2014 at 6:14 PM, Jim Balhoff <ba...@ne...> wrote:
> > >>>>> On Nov 5, 2014, at 5:46 PM, Jeremy J Carroll <jj...@sy...> wrote:
> > >>>>>
> > >>>>>> On Nov 5, 2014, at 1:02 PM, Bryan Thompson <br...@sy...> wrote:
> > >>>>>>
> > >>>>>> There could be an issue with ORDER BY operating on an anonymous and non-projected variable. Try declaring and binding a variable for STR(?label) inside of the query and then using that variable in the ORDER BY clause.
> > >>>>>
> > >>>>> Yes I tend to find the results of ORDER BY are more what I expect if I do not include an expression in the ORDER BY but simply variables. I BIND any expression before the ORDER BY.
> > >>>>>
> > >>>>> I believe there is a trac item for this, but since the workaround is easy, I have never seen it as high priority
> > >>>>>
> > >>>>
> > >>>> As suggested I tried binding a variable as `BIND (STR(?term_label) AS ?string_label)` and using that to sort. Still incorrect ordering. But, I tried removing DISTINCT, and then the ordering is correct. Even going back to the anonymous `ORDER BY STR(?term_label)`, ordering is still correct if I remove DISTINCT. For this specific query DISTINCT is not needed, but I do need it for my application. Is there a reason to not expect DISTINCT to work correctly with ORDER BY?
> > >>>>
> > >>>> Thanks both of you for all of your help,
> > >>>> Jim
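
For readers following the thread, here is a minimal sketch of the two ideas discussed above: a DISTINCT step that preserves the order produced by an earlier sort, and a row-by-row comparison of two result sets like the one done in Excel. It is written in Python for brevity and is not the Bigdata implementation. The function names, the sample rows, and the assumption that each result file is a CSV with a header row are all hypothetical.

```python
import csv
import io


def distinct_preserving_order(rows):
    """Drop duplicate rows, keeping the first occurrence in input order."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def ordering_mismatches(rows_a, rows_b):
    """Return (index, row_a, row_b) for each position where the lists differ."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(rows_a, rows_b))
        if a != b
    ]


def read_rows(csv_text):
    """Parse CSV text into a list of tuples, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumed header line, e.g. "term,string_label"
    return [tuple(row) for row in reader]


# Hypothetical (term, string_label) rows; not data from the thread.
unsorted_rows = [("t3", "cell"), ("t1", "axon"), ("t3", "cell"), ("t2", "bone")]

# ORDER BY first, then a DISTINCT that keeps the sorted order intact.
sorted_rows = sorted(unsorted_rows, key=lambda r: r[1])
distinct_rows = distinct_preserving_order(sorted_rows)

# A hand-made out-of-order result, standing in for the faulty output.
out_of_order = [("t2", "bone"), ("t1", "axon"), ("t3", "cell")]

print(distinct_rows)
print(ordering_mismatches(distinct_rows, out_of_order))
```

To compare two real result files, the same `ordering_mismatches` call could be given the output of `read_rows` for each file's text. An empty list would mean the two orderings agree at every position.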