From: Erin S. <eri...@gm...> - 2006-11-15 04:47:09
On 11/14/06, Tim Hochberg <tim...@ie...> wrote:

SNIP

> > Interesting results, Tim. From Pierre's results we saw that fromiter
> > is the fastest way to get data into arrays. With your results we see
> > there is a difference between iterating over the cursor and doing a
> > fetchall() as well. Surprisingly, running the cursor is faster.
> >
> > This must come not from the data retrieval rate but from creating
> > the copies in memory.
>
> I imagine that is correct. In particular, skipping the making of the
> list avoids the creation of 1e6 Python floats, which is going to
> result in a lot of memory allocation.
>
> > But just in case, I think there is one more thing to check. I
> > haven't used sqlite, but with other databases I have used there is
> > often a large variance in times from one select to the next. Can you
> > repeat these tests with a timeit().repeat and give the minimum?
>
> Sure. Here are two sets of numbers. The first is for repeat(3, 1) and
> the second for repeat(3, 3).
>
> retrieve1 [0.91198546183942375, 0.9042411814909439, 0.90411518782415001]
> retrieve2 [0.98355349632425515, 0.95424502276127754, 0.94714328217692412]
> retrieve3 [1.2227562441595268, 1.2195848913758596, 1.2206193803961156]
> retrieve4 [8.4344040932576547, 8.3556245276983532, 8.3568341786456131]
>
> retrieve1 [2.7317457945074026, 2.7274656415829384, 2.7250913174719109]
> retrieve2 [2.8857103346933783, 2.8379299603720582, 2.8386803350705136]
> retrieve3 [3.6870535221655203, 3.8980253076857565, 3.7002303365371887]
> retrieve4 [25.138646950939304, 25.06737169109482, 25.052789390830412]
>
> These timings are pretty consistent with the previous runs, except
> that the difference between retrieve1 and retrieve2 has disappeared.
> In fact, all of the runs that produce lists have gotten faster by
> about the same amount. Odd! A little digging reveals that timeit turns
> off garbage collection to make things more repeatable. Turning gc back
> on yields the following numbers for repeat(3, 1):
>
> retrieve1 [0.92517736192728406, 0.92109667569481601, 0.92390960303614023]
> retrieve2 [1.3018456256311914, 1.2277141368525903, 1.2929785768861706]
> retrieve3 [1.5309831277438946, 1.4998853206203577, 1.5601200711263488]
> retrieve4 [8.6400394463542227, 8.7022300320292061, 8.6807761880350682]
>
> So there we are, back to our original numbers. This also reveals that
> the majority of the time difference between retrieve1 and retrieve2
> *is* memory related. However, it's the deallocation (or, more
> precisely, the garbage collection) of all those floats that is the
> killer. Here's what the timeit routines looked like:
>
>     import timeit
>
>     if __name__ == "__main__":
>         for name in ['retrieve1', 'retrieve2', 'retrieve3', 'retrieve4']:
>             print name, timeit.Timer(
>                 "%s(conn)" % name,
>                 "gc.enable(); "
>                 "from scratch import sqlite3, %s, setup; "
>                 "conn = sqlite3.connect(':memory:'); "
>                 "setup(conn)" % name).repeat(3, 1)
>
> > As an aside, your database is running on a local disk, right, so the
> > overhead of retrieving data is minimized here? For my tests I think
> > I am data-retrieval limited, because I get exactly the same time for
> > the equivalent of retrieve1 and retrieve2.
>
> As Keith pointed out, I'm keeping the database in memory (although
> there's a very good chance some of it is actually swapped to disk), so
> it's probably relatively fast. On the other hand, if you are using
> timeit to make your measurements, you could be running into the (lack
> of) garbage collection issue I mention above.
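(The scratch.py module being timed never appears in the thread, so for
context, a rough sketch of the two endpoints being compared might look
like the following. The table name "data", the column "x", and the body
of setup() are guesses, not Tim's actual code; only the
fromiter-over-cursor vs. array-over-fetchall() contrast is taken from
the discussion.)

    import sqlite3
    import numpy

    def setup(conn, n=1000000):
        # Build a one-column table of n floats to select from.
        c = conn.cursor()
        c.execute("create table data (x real)")
        c.executemany("insert into data values (?)",
                      ((float(i),) for i in xrange(n)))
        conn.commit()

    def retrieve_cursor(conn):
        # Iterate over the cursor directly: numpy.fromiter consumes one
        # row at a time, so no list of 1e6 tuples is ever built.
        c = conn.cursor()
        c.execute("select x from data")
        return numpy.fromiter((row[0] for row in c), dtype=float)

    def retrieve_fetchall(conn):
        # fetchall() materializes every row as a Python tuple of floats
        # before numpy.array sees any of them -- the memory-hungry path.
        c = conn.cursor()
        c.execute("select x from data")
        return numpy.array(c.fetchall(), dtype=float).ravel()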
I checked, and for my real situation I am totally limited by the time
to retrieve the data. From these tests I think this will probably be
true even if the data is on a local disk.

I think these experiments show that iterating over the cursor is the
best approach: it is better from a memory point of view and is probably
also the fastest. We should still resolve the slowness of the array()
function when converting lists of tuples, however. I will file a ticket
if no one else has.

Erin
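(For reference, a minimal, self-contained demonstration of the
array()-from-a-list-of-tuples slowdown, of the kind such a ticket might
include; this is a sketch, not the actual ticket. Following Tim's
observation above, gc.enable() is put in the setup string so that
garbage collection costs are counted.)

    import timeit

    setup = ("gc.enable(); import numpy; "
             "tups = [(float(i),) for i in xrange(100000)]")
    # array() over a list of 1-tuples vs. fromiter over the same data:
    print timeit.Timer("numpy.array(tups)", setup).repeat(3, 1)
    print timeit.Timer("numpy.fromiter((t[0] for t in tups), float)",
                       setup).repeat(3, 1)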