From: Antonio V. <ant...@ti...> - 2013-07-18 08:27:51
Hi Pushkar,

On 18/07/2013 08:45, Pushkar Raj Pande wrote:
> Both loadtxt and genfromtxt read the entire data into memory, which is not
> desirable. Is there a way to achieve streaming writes?

OK, probably fromfile [1] can help you cook something that works without
loading the entire file into memory (and without too many iterations over
the file).

In any case, I strongly recommend that you do not perform read/write cycles
on single lines; rather, define a reasonable data block size (number of
rows) and process the file in chunks (a rough sketch is appended after the
quoted thread at the end of this message).

If you find a reasonably simple solution, it would be nice to include it in
our documentation as an example or a "recipe" [2].

[1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html#numpy.fromfile
[2] http://pytables.github.io/latest/cookbook/index.html

best regards

antonio

> Thanks,
> Pushkar
>
>
> On Wed, Jul 17, 2013 at 7:04 PM, Pushkar Raj Pande <top...@gm...> wrote:
>
>> Thanks Antonio and Anthony. I will give this a try.
>>
>> -Pushkar
>>
>>
>> On Wed, Jul 17, 2013 at 2:59 PM, <pyt...@li...> wrote:
>>
>>> Date: Wed, 17 Jul 2013 16:59:16 -0500
>>> From: Anthony Scopatz <sc...@gm...>
>>> Subject: Re: [Pytables-users] Pytables bulk loading data
>>> To: Discussion list for PyTables <pyt...@li...>
>>> Message-ID: <CAP...@ma...>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> Hi Pushkar,
>>>
>>> I agree with Antonio. You should load your data with NumPy functions and
>>> then write back out to PyTables. This is the fastest way to do things.
>>>
>>> Be Well
>>> Anthony
>>>
>>>
>>> On Wed, Jul 17, 2013 at 2:12 PM, Antonio Valentino <ant...@ti...> wrote:
>>>
>>>> Hi Pushkar,
>>>>
>>>> On 17/07/2013 19:28, Pushkar Raj Pande wrote:
>>>>> Hi all,
>>>>>
>>>>> I am trying to figure out the best way to bulk load data into pytables.
>>>>> This question may have already been answered, but I couldn't find what
>>>>> I was looking for.
>>>>>
>>>>> The source data is in the form of CSV, which may require parsing, type
>>>>> checking and setting default values if a field doesn't conform to the
>>>>> type of the column. There are over 100 columns in a record. Doing this
>>>>> in a loop in Python for each row is very slow compared to just fetching
>>>>> the rows from one pytables file and writing them to another. The
>>>>> difference is almost a factor of ~50.
>>>>>
>>>>> I believe that if I load the data using a C procedure that does the
>>>>> parsing and builds the records to write in pytables, I can get close to
>>>>> the speed of just copying the rows from one pytables file to another.
>>>>> But maybe there is something simpler and better that already exists.
>>>>> Can someone please advise? If it is a C procedure that I should write,
>>>>> can someone point me to some examples or snippets that I can refer to
>>>>> in order to put this together?
>>>>>
>>>>> Thanks,
>>>>> Pushkar
>>>>>
>>>>
>>>> numpy has some tools for loading data from CSV files, like loadtxt [1],
>>>> genfromtxt [2] and other variants.
>>>>
>>>> Is none of them OK for you?
>>>>
>>>> [1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt
>>>> [2] http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt
>>>>
>>>> cheers
>>>>
>>>> --
>>>> Antonio Valentino

--
Antonio Valentino
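
P.S. A minimal sketch of the chunked approach, assuming the PyTables 3.x API.
The file names, the two-column dtype and the chunk size below are only
placeholders; substitute your real ~100-column record layout and defaults.

import itertools

import numpy as np
import tables as tb

# Stand-in record layout; replace with the real ~100-column description.
dtype = np.dtype([("x", "f8"), ("y", "i4")])
CHUNK_ROWS = 100000  # rows per block; tune to your memory budget

with tb.open_file("out.h5", mode="w") as h5, open("data.csv") as csvfile:
    table = h5.create_table("/", "records", description=dtype)
    while True:
        # Pull at most CHUNK_ROWS lines so only one block lives in memory.
        lines = list(itertools.islice(csvfile, CHUNK_ROWS))
        if not lines:
            break
        # genfromtxt parses and type-checks one block at a time;
        # filling_values supplies a default for empty fields.
        block = np.genfromtxt(lines, dtype=dtype, delimiter=",",
                              filling_values=0)
        # atleast_1d guards against a final block containing a single row.
        table.append(np.atleast_1d(block))
    table.flush()

Appending whole blocks like this keeps the per-row Python work out of the
inner loop, which is most likely where the factor of ~50 reported above
comes from.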