From: Justin B. <jgb...@gm...> - 2008-03-25 19:32:17
|
One of the benefits of haskelldb is the ability to decompose a query into component parts and put them together in different ways. I believe I have a found a bug in the library that makes this hard to do. This literate email demonstrates the bug. The full source is given at the end, but I want to talk about the cause here. Imagine two queries that project one aggregate column each: tbl1 = do x <- table ... restrict ... project $ ... id1 << count ( ... ) tbl2 = do x <- table ... restrict .. project $ .. id2 << count ( ... ) When each query is run individually, the SQL produced is as expected (i.e., "select xx as id1 from (select count(...) as xx from ..)"). When the queries are combined by hand: cmb1 = do x1 <- table .. x2 <- table .. restrict ( x1 ! ...) restrict (x2 ! ...) project $ id1 << count ( ... ) # id2 << count (...) The SQL produced is also as expected: SELECT id13 as id1, id23 as id2 FROM (SELECT COUNT(id11) as id13, COUNT(id22) as id23 FROM (SELECT id2 as id22 FROM sample_table as T1 WHERE id1 < 0) as T1, (SELECT id1 as id11 FROM sample_table as T1 WHERE id1 > 0) as T2) as T1 The aggregates are at the same level, which is very important. However, if the individual queries are reused: cmb2 = do x1 <- tbl1 x2 <- tbl2 project $ id1 << x1 ! id1 # id2 << x2 ! id 2 the SQL produced is NOT correct. It will execute but the aggregates are at different levels: SELECT id12 as id1, id24 as id2 FROM (SELECT COUNT(id23) as id24, id12 FROM (SELECT id2 as id23 FROM sample_table as T1 WHERE id1 < 0) as T1, (SELECT COUNT(id11) as id12 FROM (SELECT id1 as id11 FROM sample_table as T1) as T1 WHERE id11 > 0) as T2 GROUP BY id12) as T1 When this query is executed, it will return NO rows if "select ID2 as ID23 .." returns no rows. In the correct query, 0 is returned in both columns (as expected with aggregates). That looks like a bug to me (though possibly its a quirk in postgres). The same SQL is produced by this, which is definitely counter-intuitive: cmb3 = do t1 <- do { x <- table ...; restrict ...; project $ id1 << count (x ! id1) } t2 <- do { x <- table ...; restrict ...; project $ id2 << count (x ! id2) } project $ id1 << t1 ! id1 # id2 << t2 ! id2 The bug occurs when the t2 value is constructed. The "table" operation causes a "Binary Times" value to be placed in the PrimQuery being built, BEFORE the aggregate projection is added. I would expect the query for t2 to be built in isolation first, and then added via Times to the existing PrimQuery. This structure can be seen by loading this email into GHCi and entering cmb1Qry, cmb2Qry, and cmb3Qry at the prompt. tbl1Qry and tbl2Qry also show the individual queries in isolation. tbl1SQL, tbl2SQL, etc. are provided which show the SQL generated by each different query. There really isn't a workaround for this bug, except to redefine queries. I am not sure how to fix it as I tried many different approaches. Any suggestions? Justin \begin{code} import Database.HaskellDB import Database.HaskellDB.DBLayout tbl1 :: Query (Rel (RecCons Id1 (Expr Int) RecNil)) tbl1 = do t1 <- table sample_table restrict (t1 ! id1 .>. constant 0) project $ id1 << count (t1 ! id1) tbl2 :: Query (Rel (RecCons Id2 (Expr Int) RecNil)) tbl2 = do t1 <- table sample_table restrict (t1 ! id1 .<. constant 0) project $ id2 << count (t1 ! id2) cmb1 :: Query (Rel (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil))) cmb1 = do t1 <- table sample_table t2 <- table sample_table restrict (t1 ! id1 .>. constant 0) restrict (t2 ! id1 .<. constant 0) project $ id1 << count(t1 ! id1) # id2 << count(t2 ! id2) cmb2 :: Query (Rel (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil))) cmb2 = do t1 <- tbl1 t2 <- tbl2 project $ id1 << t1 ! id1 # id2 << t2 ! id2 cmb3 :: Query (Rel (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil))) cmb3 = do t1 <- do { x <- table sample_table; restrict (x ! id1 .>. constant 0); project $ id1 << count (x ! id1) } t2 <- do { x <- table sample_table; restrict (x ! id1 .<. constant 0); project $ id2 << count (x ! id2) } project $ id1 << t1 ! id1 # id2 << t2 ! id2 tbl1SQL = putStrLn $ showSql tbl1 tbl2SQL = putStrLn $ showSql tbl2 cmb1SQL = putStrLn $ showSql cmb1 cmb2SQL = putStrLn $ showSql cmb2 cmb3SQL = putStrLn $ showSql cmb3 tbl1Qry = putStrLn $ showQuery tbl1 tbl2Qry = putStrLn $ showQuery tbl2 cmb1Qry = putStrLn $ showQuery cmb1 cmb2Qry = putStrLn $ showQuery cmb2 cmb3Qry = putStrLn $ showQuery cmb3 data Id1 = Id1 data Id2 = Id2 instance FieldTag Id1 where fieldName _ = "id1" instance FieldTag Id2 where fieldName _ = "id2" id1 :: Attr Id1 Int id1 = mkAttr Id1 id2 :: Attr Id2 Int id2 = mkAttr Id2 sample_table :: Table (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil)) sample_table = baseTable "sample_table" $ hdbMakeEntry Id1 # hdbMakeEntry Id2 \end{code} |
From: Bjorn B. <bj...@br...> - 2008-03-25 21:08:07
|
On Tue, Mar 25, 2008 at 8:32 PM, Justin Bailey <jgb...@gm...> wrote: > One of the benefits of haskelldb is the ability to decompose a query > into component parts and put them together in different ways. I > believe I have a found a bug in the library that makes this hard to > do. This literate email demonstrates the bug. The full source is given > at the end, but I want to talk about the cause here. Imagine two > queries that project one aggregate column each: > > tbl1 = do > x <- table ... > restrict ... > project $ ... id1 << count ( ... ) > > tbl2 = do > x <- table ... > restrict .. > project $ .. id2 << count ( ... ) > > When each query is run individually, the SQL produced is as expected > (i.e., "select xx as id1 from (select count(...) as xx from ..)"). > When the queries are combined by hand: > > cmb1 = do > x1 <- table .. > x2 <- table .. > restrict ( x1 ! ...) > restrict (x2 ! ...) > project $ id1 << count ( ... ) # id2 << count (...) > > The SQL produced is also as expected: > > SELECT id13 as id1, > id23 as id2 > FROM (SELECT COUNT(id11) as id13, > COUNT(id22) as id23 > FROM (SELECT id2 as id22 > FROM sample_table as T1 > WHERE id1 < 0) as T1, > (SELECT id1 as id11 > FROM sample_table as T1 > WHERE id1 > 0) as T2) as T1 > > The aggregates are at the same level, which is very important. > However, if the individual queries are reused: > > cmb2 = do > x1 <- tbl1 > x2 <- tbl2 > project $ id1 << x1 ! id1 # id2 << x2 ! id 2 > > the SQL produced is NOT correct. It will execute but the aggregates > are at different levels: > > SELECT id12 as id1, > id24 as id2 > FROM (SELECT COUNT(id23) as id24, > id12 > FROM (SELECT id2 as id23 > FROM sample_table as T1 > WHERE id1 < 0) as T1, > (SELECT COUNT(id11) as id12 > FROM (SELECT id1 as id11 > FROM sample_table as T1) as T1 > WHERE id11 > 0) as T2 > GROUP BY id12) as T1 > > When this query is executed, it will return NO rows if "select ID2 as > ID23 .." returns no rows. In the correct query, 0 is returned in both > columns (as expected with aggregates). That looks like a bug to me > (though possibly its a quirk in postgres). > > The same SQL is produced by this, which is definitely counter-intuitive: > > cmb3 = do > t1 <- do { x <- table ...; restrict ...; project $ id1 << count (x ! id1) } > t2 <- do { x <- table ...; restrict ...; project $ id2 << count (x ! id2) } > project $ id1 << t1 ! id1 # > id2 << t2 ! id2 > > The bug occurs when the t2 value is constructed. The "table" operation > causes a "Binary Times" value to be placed in the PrimQuery being > built, BEFORE the aggregate projection is added. I would expect the > query for t2 to be built in isolation first, and then added via Times > to the existing PrimQuery. > > This structure can be seen by loading this email into GHCi and > entering cmb1Qry, cmb2Qry, and cmb3Qry at the prompt. tbl1Qry and > tbl2Qry also show the individual queries in isolation. tbl1SQL, > tbl2SQL, etc. are provided which show the SQL generated by each > different query. > > There really isn't a workaround for this bug, except to redefine > queries. I am not sure how to fix it as I tried many different > approaches. Any suggestions? > > Justin Hi Justin, this is a very good observation, and a good argument for why using a Monad for constructing queries is a bad idea. Heffalump has proposed an alternative functional API which should make it easier to avoid problems like this. This is your counter-intuitive example: do do { table1; aggregate1 } do { table2; aggregate2 } rest But Monads should be associative (see http://www.haskell.org/haskellwiki/Monad_Laws for a little bit of explanation), so this should be equivalent to: do table1 aggregate1 table2 aggregate2 rest Which gives you the result that you didn't want. The conclusion is that with the current API, you can't compose aggregate queries the way you want. Queries in the monadic interface should be read line by line, applying the operations to the "current relation". This is not necessarily intuitive, especially not in Haskell. I think that Query was originally made a Monad just to take advantage of do-notation. I'm thinking that it should be possible to get around this problem by adding an explict cartesian product operator to use instead of >>= / do. But then we are well on our way to a non-monadic interface. /Björn > \begin{code} > > import Database.HaskellDB > import Database.HaskellDB.DBLayout > > tbl1 :: Query (Rel (RecCons Id1 (Expr Int) RecNil)) > tbl1 = do > t1 <- table sample_table > restrict (t1 ! id1 .>. constant 0) > project $ id1 << count (t1 ! id1) > > tbl2 :: Query (Rel (RecCons Id2 (Expr Int) RecNil)) > tbl2 = do > t1 <- table sample_table > restrict (t1 ! id1 .<. constant 0) > project $ id2 << count (t1 ! id2) > > cmb1 :: Query (Rel (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil))) > cmb1 = do > t1 <- table sample_table > t2 <- table sample_table > restrict (t1 ! id1 .>. constant 0) > restrict (t2 ! id1 .<. constant 0) > project $ id1 << count(t1 ! id1) # > id2 << count(t2 ! id2) > > cmb2 :: Query (Rel (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil))) > cmb2 = do > t1 <- tbl1 > t2 <- tbl2 > project $ id1 << t1 ! id1 # > id2 << t2 ! id2 > > cmb3 :: Query (Rel (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil))) > cmb3 = do > t1 <- do { x <- table sample_table; restrict (x ! id1 .>. constant > 0); project $ id1 << count (x ! id1) } > t2 <- do { x <- table sample_table; restrict (x ! id1 .<. constant > 0); project $ id2 << count (x ! id2) } > project $ id1 << t1 ! id1 # > id2 << t2 ! id2 > > tbl1SQL = putStrLn $ showSql tbl1 > tbl2SQL = putStrLn $ showSql tbl2 > > cmb1SQL = putStrLn $ showSql cmb1 > cmb2SQL = putStrLn $ showSql cmb2 > cmb3SQL = putStrLn $ showSql cmb3 > > tbl1Qry = putStrLn $ showQuery tbl1 > tbl2Qry = putStrLn $ showQuery tbl2 > > cmb1Qry = putStrLn $ showQuery cmb1 > cmb2Qry = putStrLn $ showQuery cmb2 > cmb3Qry = putStrLn $ showQuery cmb3 > > data Id1 = Id1 > data Id2 = Id2 > > instance FieldTag Id1 where fieldName _ = "id1" > instance FieldTag Id2 where fieldName _ = "id2" > > id1 :: Attr Id1 Int > id1 = mkAttr Id1 > > id2 :: Attr Id2 Int > id2 = mkAttr Id2 > > sample_table :: Table (RecCons Id1 (Expr Int) (RecCons Id2 (Expr Int) RecNil)) > sample_table = baseTable "sample_table" $ > hdbMakeEntry Id1 # > hdbMakeEntry Id2 > > \end{code} |
From: Justin B. <jgb...@gm...> - 2008-03-25 21:26:13
|
On Tue, Mar 25, 2008 at 2:08 PM, Bjorn Bringert <bj...@br...> wrote: > Hi Justin, > > this is a very good observation, and a good argument for why using a > Monad for constructing queries is a bad idea. Heffalump has proposed > an alternative functional API which should make it easier to avoid > problems like this. Do you have a link to that? I would be interested in looking at it. > But Monads should be associative (see > http://www.haskell.org/haskellwiki/Monad_Laws for a little bit of > explanation), so this should be equivalent to: > > do > table1 > aggregate1 > table2 > aggregate2 > rest > > Which gives you the result that you didn't want. The conclusion is > that with the current API, you can't compose aggregate queries the way > you want. Queries in the monadic interface should be read line by A very good point. Aggregrates are weird in SQL anyways so I'm not surprised this shows up as a bug. > I'm thinking that it should be possible to get around this problem by > adding an explict cartesian product operator to use instead of >>= / > do. But then we are well on our way to a non-monadic interface. Can you think of a way to inject a primitive query into an existing relation? I tried to write a "seedQuery" function but it gave me an assertion failure in optimize: seedQuery :: (PrimQuery, Rel r) -> Query (Rel r) seedQuery (prim, rel) = Query (\(x, qry) -> (rel, (x, prim))) The signature for seedQuery follows from runQueryRel in Query.hs, so you would use it as: do x <- seedQuery . runQueryRel $ { do ... } y <- seedQuery . runQueryRel $ { do ... } rest If this function worked, a query could be built in isolation and the problem would be avoided (and maybe even still preserving associativity?). Justin |
From: Justin B. <jgb...@gm...> - 2008-03-26 23:40:31
|
On Tue, Mar 25, 2008 at 2:08 PM, Bjorn Bringert <bj...@br...> wrote: > I'm thinking that it should be possible to get around this problem by > adding an explict cartesian product operator to use instead of >>= / > do. But then we are well on our way to a non-monadic interface. How does this function look: prod :: Query (Rel r) -> Query (Rel r) prod (Query qs) = Query make where make (currentAlias, currentQry) = -- Take the query to add and run it first, using the current alias as -- a seed. let (Rel otherAlias otherScheme,(newestAlias, otherQuery)) = qs (currentAlias,Empty) -- Effectively renames all columns in otherQuery to make them unique in this -- query. assoc = zip (map (fresh newestAlias) otherScheme) (map (AttrExpr . fresh otherAlias) otherScheme) -- Produce a query which is a cross product of the other query and the current query. in (Rel newestAlias otherScheme, (newestAlias + 1, times (Project assoc otherQuery) currentQry)) I modeled it after the binrel function. The long variable names reflect me trying to figure out how the state (i.e. alias) was threaded around. Continuing the example above, I can define: cmb2x = do t1 <- prod tbl1 t2 <- prod tbl2 project $ id1 << t1 ! id1 # id2 << t2 ! id2 Evaluating cmb2x gives this SQL: SELECT id13 as id1, id26 as id2 FROM (SELECT id25 as id26 FROM (SELECT COUNT(id24) as id25 FROM (SELECT id2 as id24 FROM sample_table as T1 WHERE id1 < 0) as T1) as T1) as T1, (SELECT id12 as id13 FROM (SELECT COUNT(id11) as id12 FROM (SELECT id1 as id11 FROM sample_table as T1 WHERE id1 > 0) as T1) as T1) as T2 which is far too nested. However, a one line change in mergeProject in Optimize.hs: safe :: Assoc -> Bool -- merge projections that are all aggregates or no safe assoc = not (any (isAggregate.snd) assoc) || all (isAggregate . snd) assoc aggregates gives SQL like this: SELECT id13 as id1, id26 as id2 FROM (SELECT COUNT(id2) as id26 FROM sample_table as T1 WHERE id1 < 0) as T1, (SELECT COUNT(id1) as id13 FROM sample_table as T1 WHERE id1 > 0) as T2 Which is just what I want. Justin |