From: Mathieu D. <dub...@ya...> - 2013-07-11 19:49:33
Hello,

I wanted to use PyTables in conjunction with multiprocessing for some embarrassingly parallel tasks. However, it seems that this is not possible. In the following (very stupid) example, X is a CArray of shape (100, 10) stored in the file test.hdf5:

    import random
    import multiprocessing
    import tables

    # Reload the data
    h5file = tables.openFile('test.hdf5', mode='r')
    X = h5file.root.X
    n_features = X.shape[1]

    # Use multiprocessing to perform a simple computation (column average)
    def f(X):
        name = multiprocessing.current_process().name
        column = random.randint(0, n_features - 1)
        print '%s uses column %i' % (name, column)
        return X[:, column].mean()

    p = multiprocessing.Pool(2)
    col_mean = p.map(f, [X, X, X])

When executing it, I get the following error:

    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 504, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
        put(task)
    PicklingError: Can't pickle <type 'weakref'>: attribute lookup __builtin__.weakref failed

I have googled for weakref and pickle but can't find a solution. Any help?

By the way, I have noticed that by slicing a CArray, I get a NumPy array (I created the HDF5 file with NumPy). Therefore, everything is copied into memory. Is there a way to avoid that?

Mathieu
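[Editor's note: the error arises because the arguments to Pool.map must be pickled to cross the process boundary, and an open PyTables node holds unpicklable weakrefs. A common workaround is to pass only picklable arguments (a filename plus a column index) and have each worker open the file itself. Below is a minimal, self-contained sketch of that pattern; the worker function `col_mean`, the temporary text file, and the toy data are all illustrative stand-ins for the HDF5 store, so with PyTables you would open the file inside the worker instead of reading a text file.]

    # Sketch of the workaround: ship picklable (path, column) tuples to the
    # workers instead of the open HDF5 node. A plain text file stands in for
    # test.hdf5 so the sketch runs on its own; a real version would open the
    # HDF5 file inside col_mean.
    import multiprocessing
    import os
    import tempfile

    def col_mean(args):
        path, col = args          # only picklable objects cross the boundary
        with open(path) as fh:    # each worker opens its own handle
            rows = [[float(v) for v in line.split()] for line in fh]
        values = [row[col] for row in rows]
        return sum(values) / len(values)

    if __name__ == '__main__':
        # toy data: 4 rows x 3 columns
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'w') as fh:
            for row in ([1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]):
                fh.write(' '.join(map(str, row)) + '\n')
        pool = multiprocessing.Pool(2)
        means = pool.map(col_mean, [(path, c) for c in range(3)])
        pool.close()
        pool.join()
        print(means)  # one average per column
        os.remove(path)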