From: Owen M. <owe...@bc...> - 2012-10-08 15:19:26
Hi Anthony,

On 8 October 2012 15:54, Anthony Scopatz <sc...@gm...> wrote:
> Hmmm, are you actually copying the data (f.root.data[:]) or are you
> simply passing a reference as arguments (f.root.data)?

I call f.root.data.read() on any arrays to load them into the process target args dictionary. I had assumed this returns a copy of the data. The documentation doesn't specify which, or even whether there is any difference from __getitem__.

> So if you are opening a file in the master process and then
> writing/creating/flushing from the workers this may cause a problem.
> Multiprocessing creates a fork of the original process, so you are relying on
> the file handle from the master process not to accidentally change somehow.
> Can you try to open the files in the workers rather than the master? I
> hope that this clears up the issue.

I am not accessing the master file from the worker processes. At least not by design, though as you say some kind of strange behaviour could be arising from Linux's copy-on-fork semantics. In principle, each process has its own file and there is no sharing of files between processes.

> Basically, I am advocating a more conservative approach where all data
> that is read or written in a worker must come from that worker, rather
> than being generated by the master. If you are *still* experiencing
> these problems, then we know we have a real problem.

I'm being about as conservative as I can with my system. Unless read() returns a reference into the master file, there should be absolutely no sharing between processes. And even if my args dictionary did contain a reference to the in-memory HDF5 file, how could reading it possibly trigger a call to openFile?

Can you clarify the semantics of read() vs. __getitem__()? Thanks.

Regards,
Owen
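
P.S. In case it helps, here is a stripped-down sketch of the pattern I'm describing. The file names, the /data node, and the trivial worker computation are made up for illustration; the point is that the master calls read() to pull the array into memory (which I'm assuming returns a copy, equivalent to [:]), closes its file, and each worker opens and writes only its own output file:

    import tables
    from multiprocessing import Process

    def worker(args):
        # Each worker creates and writes its own file; nothing from the
        # master's file handle is touched here.
        data = args['data']                      # plain NumPy array copied at fork
        f = tables.openFile(args['out_path'], 'w')
        try:
            f.createArray(f.root, 'result', data * 2)   # placeholder computation
        finally:
            f.close()

    if __name__ == '__main__':
        # Master loads the array fully into memory before forking.
        f = tables.openFile('input.h5', 'r')     # hypothetical input file with /data
        data = f.root.data.read()                # assumed to return an in-memory copy
        also_data = f.root.data[:]               # assumed to be equivalent to read()
        f.close()

        procs = []
        for i in range(4):
            args = {'data': data, 'out_path': 'out_%d.h5' % i}
            p = Process(target=worker, args=(args,))
            p.start()
            procs.append(p)
        for p in procs:
            p.join()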