I've been using (and recommend) Pandas (http://pandas.pydata.org/) along with this book: http://shop.oreilly.com/product/0636920023784.do (Python for Data Analysis, by Wes McKinney)

Good luck,

On Fri, Nov 16, 2012 at 11:02 AM, Jon Wilson <jsw@fnal.gov> wrote:
Hi all,
I am trying to find the best way to make histograms from large data
sets.  Up to now, I've been loading entire columns into in-memory
numpy arrays and making histograms from those.  However, I'm currently
working on a handful of datasets where this is prohibitively memory
intensive (an out-of-memory kernel panic on a shared machine that you
have to open a ticket to have rebooted makes you a little gun-shy), so
I am now exploring other options.
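One general pattern that sidesteps the memory problem is to fix the bin edges up front and accumulate np.histogram counts chunk by chunk, so no single chunk has to fit the whole column. Here's a minimal sketch (the random chunks just stand in for whatever source feeds you arrays; names are illustrative, not from the original mail):

```python
import numpy as np

def chunked_histogram(chunks, bins):
    """Accumulate one histogram over an iterable of 1-D arrays.

    The bin edges must be fixed up front so that counts from
    different chunks line up and can simply be summed.
    """
    counts = np.zeros(len(bins) - 1, dtype=np.int64)
    for chunk in chunks:
        c, _ = np.histogram(chunk, bins=bins)
        counts += c
    return counts

# Toy stand-in for a large on-disk column: ten 100k-element pieces.
rng = np.random.default_rng(0)
edges = np.linspace(0.0, 1.0, 51)  # 50 fixed-width bins
pieces = (rng.random(100_000) for _ in range(10))
counts = chunked_histogram(pieces, edges)
```

Because each chunk contributes an independent count vector over the same edges, the sum is exactly the histogram you would get from the concatenated data.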

I know that the Column object is rather nicely set up to act, in some
circumstances, like a numpy ndarray.  So my first thought was to try
creating the histogram from the Column object directly.  This is,
however, about 1000x slower than loading the column into memory and
histogramming the in-memory array.  Please see my test notebook at:

For such a small table, loading into memory is not an issue.  For larger
tables, though, it is a problem, and I had hoped that PyTables was
optimized so that histogramming directly from disk would be no
slower than loading into memory and histogramming.  Is there some other
way of accessing the column (or Array or CArray) data that will make
faster histograms?
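One middle ground worth trying: keep the data on disk, but read the column in large slices rather than row by row. Table.read(start, stop, field=...) does one bulk read per slice and returns a plain numpy array, so you avoid the per-row Python overhead that makes element-wise Column access so slow. A sketch (the file path, node path, and column name are made up for illustration):

```python
import numpy as np
import tables

def histogram_column(filename, table_path, colname, bins, chunksize=1_000_000):
    """Histogram one column of a PyTables Table in fixed-size slices.

    Each Table.read(start, stop, field=...) call is a single bulk read
    returning a numpy array, so at most `chunksize` rows of the column
    are in memory at any time.
    """
    counts = np.zeros(len(bins) - 1, dtype=np.int64)
    with tables.open_file(filename, mode="r") as f:
        table = f.get_node(table_path)
        for start in range(0, table.nrows, chunksize):
            block = table.read(start, start + chunksize, field=colname)
            c, _ = np.histogram(block, bins=bins)
            counts += c
    return counts
```

Tuning chunksize against the table's on-disk chunkshape may matter for throughput, but even a rough slice size should be far closer to the in-memory speed than per-element Column access.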

Pytables-users mailing list

David C. Wilson
(612) 460-1329