[Sqlalchemy-tickets] Issue #3650: limit/offset used with a subqueryload creates unpredictable resul
Brought to you by:
zzzeek
From: jvanasco <iss...@bi...> - 2016-02-15 23:58:21
|
New issue 3650: limit/offset used with a subqueryload creates unpredictable results https://bitbucket.org/zzzeek/sqlalchemy/issues/3650/limit-offset-used-with-a-subqueryload jvanasco: This affects `subqueryload`, not `joinedload`. It is related to https://bitbucket.org/zzzeek/sqlalchemy/issues/3064/warn-when-limit-offset-used-w-o-order-by Assume a simple relationship of Foo to Bar: class Bar(Base): __tablename__ = 'bar' id = Column(Integer, primary_key=True) class Foo(Base): __tablename__ = 'foo' id = Column(Integer, primary_key=True) id_bar = Column(Integer, ForeignKey("bar.id"), nullable=False) bar = sqlalchemy.orm.relationship("Bar") The following query will yield unpredictable results for the 'bar' relationship, even when an id_bar is present: foos = s.query(Foo)\ .options(subqueryload('bar'))\ .limit(20)\ .offset(10)\ .all() The issue is that the subqueryload logic generates a selection like such (reformatted for emphasis): SELECT bar.* FROM (SELECT DISTINCT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset ) as anon_1 JOIN bar on bar.id = anon_1.foo_id_bar ORDER BY anon_1.foo_id_bar; Because that inner select for the subquery is performing a DISTINCT, the LIMIT and OFFSET apply to the 'distinct' -- not overall -- list. one example scenario that will fail: the offset is greater than the total number of possible values. if every foo is assigned an id_bar of only 2 values (1, 2), paginating an offset of 2 of greater will always return NULL and no `bar` relation will be set. # 1. make a bunch of Bars. Let's do 20! for i in range(0, 20): b = Bar(id=i) s.add(b) s.flush() s.commit() # 2, okay, now make 100 foos... but only use 2 `bars`. for i in range(0, 100): if i < 50: id_bar = 1 else: id_bar = 2 f = Foo(id=i, id_bar=id_bar) s.add(f) f.id s.flush() s.commit() # run the above select , and see the results: for f in foos: print "--" print "examing foo.%s" % f.id print f.id_bar print f.bar # you'll notice that `bar` will be missing even when id_bar is set another example scenario is when you have a lot of repetition within the dataset. e.g. if there are fewer unique BARs than FOOs, you'll encounter errors when you paginate closer to the end. i haven't been able to recreate a reliable dataset for this, but it happens in my production dataset when I paginate 80% through. Then to complicate further, this sort of becomes the other issue -- there is no implicit `order_by` on the inner query unless an `order_by` is explicitly applied to the outer query. `order_by(Foo.id.asc())` will create this inner query: (SELECT DISTINCT foo.id_bar AS foo_id_bar, foo.id AS foo_id FROM foo ORDER BY foo.id ASC LIMIT ? OFFSET ?) however, no order_by will result in the inner query (SELECT DISTINCT foo.id_bar AS foo_id_bar FROM foo LIMIT ? OFFSET ?) there is no reason for the db to guarantee these results will be for the same ids returned in the previous query (the non-subqueryload 'master' query that sqlalchemy just issued). The other ticket talked about a more specific use-case with depth, but I'll address the larger and less detailed concept - in the absence of an explicit order_by, sqlalchemy appears to issue an implicit order_by on the primary key -- however does not extend that implicit order_by to the inner query. the fixes I suggest are: 1. migrate this into 2 queries -- an inner that paginates and an outer that applies distinct (or group_by, which can be much faster for this in certain scenarios) - SELECT DISTINCT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset + SELECT DISTINCT foo_id_bar FROM (SELECT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset) AS foo 2. in the absence of no order_by issued with a subquery, just apply the same implicit primary-key ordering to the inner query that was used on the outer query. |