From: Erin S. <eri...@gm...> - 2006-11-15 04:47:09
On 11/14/06, Tim Hochberg <tim...@ie...> wrote:

SNIP

> > Interesting results, Tim. From Pierre's results we saw that fromiter
> > is the fastest way to get data into arrays. With your results we see
> > there is a difference between iterating over the cursor and doing a
> > fetchall() as well. Surprisingly, running the cursor is faster.
> >
> > This must come not from the data retrieval rate but from creating
> > the copies in memory.
>
> I imagine that is correct. In particular, skipping the making of the
> list avoids the creation of 1e6 Python floats, which is going to
> result in a lot of memory allocation.
>
> > But just in case, I think there is one more thing to check. I
> > haven't used sqlite, but with other databases I have used there is
> > often a large variance in times from one select to the next. Can you
> > repeat these tests with a timeit().repeat and give the minimum?
>
> Sure. Here are two sets of numbers. The first is for repeat(3, 1) and
> the second for repeat(3, 3).
>
> retrieve1 [0.91198546183942375, 0.9042411814909439, 0.90411518782415001]
> retrieve2 [0.98355349632425515, 0.95424502276127754, 0.94714328217692412]
> retrieve3 [1.2227562441595268, 1.2195848913758596, 1.2206193803961156]
> retrieve4 [8.4344040932576547, 8.3556245276983532, 8.3568341786456131]
>
> retrieve1 [2.7317457945074026, 2.7274656415829384, 2.7250913174719109]
> retrieve2 [2.8857103346933783, 2.8379299603720582, 2.8386803350705136]
> retrieve3 [3.6870535221655203, 3.8980253076857565, 3.7002303365371887]
> retrieve4 [25.138646950939304, 25.06737169109482, 25.052789390830412]
>
> These timings are pretty consistent with the previous runs, except
> that the difference between retrieve1 and retrieve2 has disappeared.
> In fact, all of the runs that produce lists have gotten faster by
> about the same amount. Odd! A little digging reveals that timeit turns
> off garbage collection to make things more repeatable. Turning gc back
> on yields the following numbers for repeat(3, 1):
>
> retrieve1 [0.92517736192728406, 0.92109667569481601, 0.92390960303614023]
> retrieve2 [1.3018456256311914, 1.2277141368525903, 1.2929785768861706]
> retrieve3 [1.5309831277438946, 1.4998853206203577, 1.5601200711263488]
> retrieve4 [8.6400394463542227, 8.7022300320292061, 8.6807761880350682]
>
> So there we are, back to our original numbers. This also reveals that
> the majority of the time difference between retrieve1 and retrieve2
> *is* memory related. However, it's the deallocation (or, more
> precisely, the garbage collection) of all those floats that is the
> killer. Here's what the timeit routines looked like:
>
>     import timeit
>
>     if __name__ == "__main__":
>         for name in ['retrieve1', 'retrieve2', 'retrieve3', 'retrieve4']:
>             print name, timeit.Timer(
>                 "%s(conn)" % name,
>                 "gc.enable(); "
>                 "from scratch import sqlite3, %s, setup; "
>                 "conn = sqlite3.connect(':memory:'); "
>                 "setup(conn)" % name).repeat(3, 1)
>
> > As an aside, your database is running on a local disk, right, so the
> > overhead of retrieving data is minimized here? For my tests I think
> > I am data-retrieval limited, because I get exactly the same time for
> > the equivalent of retrieve1 and retrieve2.
>
> As Keith pointed out, I'm keeping the database in memory (although
> there's a very good chance some of it is actually swapped to disk), so
> it's probably relatively fast. On the other hand, if you are using
> timeit to make your measurements, you could be running into the (lack
> of) garbage collection issue I mention above.
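(The scratch.py module being timed never appears in the thread, so for
context, a rough sketch of the two endpoints being compared might look
like the following. The table name "data", the column "x", and the body
of setup() are guesses, not Tim's actual code; only the
fromiter-over-cursor vs. array-over-fetchall() contrast is taken from
the discussion.)

    import sqlite3
    import numpy

    def setup(conn, n=1000000):
        # Build a one-column table of n floats to select from.
        c = conn.cursor()
        c.execute("create table data (x real)")
        c.executemany("insert into data values (?)",
                      ((float(i),) for i in xrange(n)))
        conn.commit()

    def retrieve_cursor(conn):
        # Iterate over the cursor directly: numpy.fromiter consumes one
        # row at a time, so no list of 1e6 tuples is ever built.
        c = conn.cursor()
        c.execute("select x from data")
        return numpy.fromiter((row[0] for row in c), dtype=float)

    def retrieve_fetchall(conn):
        # fetchall() materializes every row as a Python tuple of floats
        # before numpy.array sees any of them -- the memory-hungry path.
        c = conn.cursor()
        c.execute("select x from data")
        return numpy.array(c.fetchall(), dtype=float).ravel()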
I checked, and for my real situation I am totally limited by the time
to retrieve the data. From these tests I think this will probably be
true even if the data is on a local disk.

I think these experiments show that iterating over the cursor is the
best approach: it is better from a memory point of view and is probably
also the fastest. We should still resolve the slowness of the array()
function when converting lists of tuples, however. I will file a ticket
if no one else has.

Erin
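(For reference, a minimal, self-contained demonstration of the
array()-from-a-list-of-tuples slowdown, of the kind such a ticket might
include; this is a sketch, not the actual ticket. Following Tim's
observation above, gc.enable() is put in the setup string so that
garbage collection costs are counted.)

    import timeit

    setup = ("gc.enable(); import numpy; "
             "tups = [(float(i),) for i in xrange(100000)]")
    # array() over a list of 1-tuples vs. fromiter over the same data:
    print timeit.Timer("numpy.array(tups)", setup).repeat(3, 1)
    print timeit.Timer("numpy.fromiter((t[0] for t in tups), float)",
                       setup).repeat(3, 1)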