[Sqlalchemy-tickets] Issue #3650: limit/offset used with a subqueryload creates unpredictable resul

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

New issue 3650: limit/offset used with a subqueryload creates unpredictable results
https://bitbucket.org/zzzeek/sqlalchemy/issues/3650/limit-offset-used-with-a-subqueryload

jvanasco:

This affects `subqueryload`, not `joinedload`.

It is related to https://bitbucket.org/zzzeek/sqlalchemy/issues/3064/warn-when-limit-offset-used-w-o-order-by

Assume a simple relationship of Foo to Bar:

    class Bar(Base):
        __tablename__ = 'bar'
        id = Column(Integer, primary_key=True)

    class Foo(Base):
        __tablename__ = 'foo'
        id = Column(Integer, primary_key=True)
        id_bar = Column(Integer, ForeignKey("bar.id"), nullable=False)

        bar = sqlalchemy.orm.relationship("Bar")

The following query will yield unpredictable results for the 'bar' relationship, even when an id_bar is present:

    foos = s.query(Foo)\
            .options(subqueryload('bar'))\
            .limit(20)\
            .offset(10)\
            .all()

The issue is that the subqueryload logic generates a selection like such (reformatted for emphasis):

         SELECT bar.* FROM
     (SELECT DISTINCT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset )
          as anon_1 JOIN bar on bar.id = anon_1.foo_id_bar ORDER BY anon_1.foo_id_bar;

Because that inner select for the subquery is performing a DISTINCT, the LIMIT and OFFSET apply to the 'distinct' -- not overall -- list.

one example scenario that will fail: the offset is greater than the total number of possible values.  if every foo is assigned an id_bar of only 2 values (1, 2), paginating an offset of 2 of greater will always return NULL and no `bar` relation will be set.  

    # 1. make a bunch of Bars.  Let's do 20!
    for i in range(0, 20):
        b = Bar(id=i)
        s.add(b)
        s.flush()
        s.commit()

    # 2, okay, now make 100 foos... but only use 2 `bars`.
    for i in range(0, 100):
        if i < 50:
            id_bar = 1
        else:
            id_bar = 2
        f = Foo(id=i, id_bar=id_bar)
        s.add(f)
        f.id
        s.flush()
        s.commit()

    # run the above select , and see the results:
    for f in foos:
        print "--"
        print "examing foo.%s" % f.id
        print f.id_bar
        print f.bar

    # you'll notice that `bar` will be missing even when id_bar is set

another example scenario is when you have a lot of repetition within the dataset.  e.g. if there are fewer unique BARs than FOOs, you'll encounter errors when you paginate closer to the end.    i haven't been able to recreate a reliable dataset for this, but it happens in my production dataset when I paginate 80% through.

Then to complicate further, this sort of becomes the other issue -- there is no implicit `order_by` on the inner query unless an `order_by` is explicitly applied to the outer query.  

`order_by(Foo.id.asc())` will create this inner query:

 	(SELECT DISTINCT foo.id_bar AS foo_id_bar, foo.id AS foo_id 
	 FROM foo ORDER BY foo.id ASC
	 LIMIT ? OFFSET ?)

however, no order_by will result in the inner query

	(SELECT DISTINCT foo.id_bar AS foo_id_bar 
	 FROM foo
	 LIMIT ? OFFSET ?)

there is no reason for the db to guarantee these results will be for the same ids returned in the previous query (the non-subqueryload 'master' query that sqlalchemy just issued).

The other ticket talked about a more specific use-case with depth, but I'll address the larger and less detailed concept - in the absence of an explicit order_by, sqlalchemy appears to issue an implicit order_by on the primary key -- however does not extend that implicit order_by to the inner query.

the fixes I suggest are:

1. migrate this into 2 queries -- an inner that paginates and an outer that applies distinct (or group_by, which can be much faster for this in certain scenarios)
    - SELECT DISTINCT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset
    + SELECT DISTINCT foo_id_bar FROM (SELECT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset) AS foo

2. in the absence of no order_by issued with a subquery, just apply the same implicit primary-key ordering to the inner query that was used on the outer query.

[Sqlalchemy-tickets] Issue #3650: limit/offset used with a subqueryload creates unpredictable resul

[Sqlalchemy-tickets] Issue #3650: limit/offset used with a subqueryload creates unpredictable results (zzzeek/sqlalchemy)