From: Antonio V. <ant...@ti...> - 2013-07-18 08:27:51
Hi Pushkar,

On 18/07/2013 08:45, Pushkar Raj Pande wrote:
> Both loadtxt and genfromtxt read the entire data into memory, which is not
> desirable. Is there a way to achieve streaming writes?

OK, probably fromfile [1] can help you cook something that works without
loading the entire file into memory (and without too many iterations over
the file).

In any case, I strongly recommend that you do not perform read/write cycles
on single lines; rather, define a reasonable data block size (number of
rows) and process the file in chunks (a rough sketch is appended after the
quoted thread at the end of this message).

If you find a reasonably simple solution, it would be nice to include it in
our documentation as an example or a "recipe" [2].

[1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html#numpy.fromfile
[2] http://pytables.github.io/latest/cookbook/index.html

best regards

antonio

> Thanks,
> Pushkar
>
>
> On Wed, Jul 17, 2013 at 7:04 PM, Pushkar Raj Pande <top...@gm...> wrote:
>
>> Thanks Antonio and Anthony. I will give this a try.
>>
>> -Pushkar
>>
>>
>> On Wed, Jul 17, 2013 at 2:59 PM, <pyt...@li...> wrote:
>>
>>> Date: Wed, 17 Jul 2013 16:59:16 -0500
>>> From: Anthony Scopatz <sc...@gm...>
>>> Subject: Re: [Pytables-users] Pytables bulk loading data
>>> To: Discussion list for PyTables <pyt...@li...>
>>> Message-ID: <CAP...@ma...>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> Hi Pushkar,
>>>
>>> I agree with Antonio. You should load your data with NumPy functions and
>>> then write back out to PyTables. This is the fastest way to do things.
>>>
>>> Be Well
>>> Anthony
>>>
>>>
>>> On Wed, Jul 17, 2013 at 2:12 PM, Antonio Valentino <ant...@ti...> wrote:
>>>
>>>> Hi Pushkar,
>>>>
>>>> On 17/07/2013 19:28, Pushkar Raj Pande wrote:
>>>>> Hi all,
>>>>>
>>>>> I am trying to figure out the best way to bulk load data into pytables.
>>>>> This question may have already been answered, but I couldn't find what
>>>>> I was looking for.
>>>>>
>>>>> The source data is in the form of CSV, which may require parsing, type
>>>>> checking and setting default values if a field doesn't conform to the
>>>>> type of the column. There are over 100 columns in a record. Doing this
>>>>> in a loop in Python for each row is very slow compared to just fetching
>>>>> the rows from one pytables file and writing them to another. The
>>>>> difference is almost a factor of ~50.
>>>>>
>>>>> I believe that if I load the data using a C procedure that does the
>>>>> parsing and builds the records to write in pytables, I can get close to
>>>>> the speed of just copying the rows from one pytables file to another.
>>>>> But maybe there is something simpler and better that already exists.
>>>>> Can someone please advise? If it is a C procedure that I should write,
>>>>> can someone point me to some examples or snippets that I can refer to
>>>>> in order to put this together?
>>>>>
>>>>> Thanks,
>>>>> Pushkar
>>>>>
>>>>
>>>> numpy has some tools for loading data from CSV files, like loadtxt [1],
>>>> genfromtxt [2] and other variants.
>>>>
>>>> Is none of them OK for you?
>>>>
>>>> [1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt
>>>> [2] http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt
>>>>
>>>> cheers
>>>>
>>>> --
>>>> Antonio Valentino

--
Antonio Valentino
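
P.S. A minimal sketch of the chunked approach, assuming the PyTables 3.x API.
The file names, the two-column dtype and the chunk size below are only
placeholders; substitute your real ~100-column record layout and defaults.

import itertools

import numpy as np
import tables as tb

# Stand-in record layout; replace with the real ~100-column description.
dtype = np.dtype([("x", "f8"), ("y", "i4")])
CHUNK_ROWS = 100000  # rows per block; tune to your memory budget

with tb.open_file("out.h5", mode="w") as h5, open("data.csv") as csvfile:
    table = h5.create_table("/", "records", description=dtype)
    while True:
        # Pull at most CHUNK_ROWS lines so only one block lives in memory.
        lines = list(itertools.islice(csvfile, CHUNK_ROWS))
        if not lines:
            break
        # genfromtxt parses and type-checks one block at a time;
        # filling_values supplies a default for empty fields.
        block = np.genfromtxt(lines, dtype=dtype, delimiter=",",
                              filling_values=0)
        # atleast_1d guards against a final block containing a single row.
        table.append(np.atleast_1d(block))
    table.flush()

Appending whole blocks like this keeps the per-row Python work out of the
inner loop, which is most likely where the factor of ~50 reported above
comes from.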