Re: [Pytables-users] pytables or pyroot?

On Thu, Jul 30, 2009 at 9:00 AM, Francesc Alted<fa...@py...> wrote:
> On Thursday 30 July 2009 14:32:42 you wrote:
>> Yes, but I can't figure out how to have multiple apache processes
>> connect to the same Queue object.
>
> Ah, correct, you are using different *processes*, not threads.  Well, I think
> that some kind of communications package must be used then.  There are plenty
> of options, but Pyro [1] or Ice [2] (I recently was told about this) seem to
> be powerful and easy enough to program.  If you want more performance, you may
> want to use MPI via mpi4py [3], but I don't really think you are going to need
> this.
>
> [1] http://pyro.sourceforge.net/
> [2] http://www.zeroc.com/icepy.html
> [3] http://mpi4py.scipy.org/

Insert obligatory warning against over-engineering here: simple
problems should have simple solutions.

Simplest: a single-threaded, single-process Python server that
directly handles the HTTP input and writes to PyTables. Concurrent
requests just have to wait. Downside: a slow client can tie up the
server for a long time. Multithreading or multiprocessing (both in the
Python standard library) can help, but if slow clients are a real
problem, try:

Also pretty simple: a lightweight mod_python or fcgi script, written
with, say, Django/CherryPy/web.py, that buffers the data in some
temporary place until the writer gets to it. The buffer could be in
memory, in a file, or even in a conventional relational database. Then
the PyTables writer process just needs to know where that data is.
Files are easy: when you're done writing a file, move it into an
"incoming" directory; the PyTables writer can then just poll
"incoming" for a file, process it, move it out of the way, and repeat.
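
To make the "incoming" directory idea concrete, here is a rough sketch
using only the standard library. The names submit, drain_once and
run_writer, the directory layout, and the handle callback are all made
up for illustration; handle stands in for whatever actually appends
the data to your PyTables file.

```python
import os
import tempfile
import time


def submit(payload, staging_dir, incoming_dir):
    """Web-side step: write the payload completely, then move it into incoming_dir."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(incoming_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".dat")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    final_path = os.path.join(incoming_dir, os.path.basename(tmp_path))
    # The writer never sees a half-written file, because the file only
    # appears in incoming_dir once it is complete.
    os.replace(tmp_path, final_path)
    return final_path


def drain_once(incoming_dir, done_dir, handle):
    """Writer-side step: process every file currently sitting in incoming_dir."""
    os.makedirs(incoming_dir, exist_ok=True)
    os.makedirs(done_dir, exist_ok=True)
    count = 0
    for name in sorted(os.listdir(incoming_dir)):
        path = os.path.join(incoming_dir, name)
        with open(path, "rb") as f:
            handle(f.read())  # hypothetical: append the data to PyTables here
        # Only reached if handle() did not raise, so a failed file stays
        # in incoming_dir and is retried on the next pass.
        os.replace(path, os.path.join(done_dir, name))
        count += 1
    return count


def run_writer(incoming_dir, done_dir, handle, interval=1.0):
    """Poll forever; this is the only process that touches the HDF5 file."""
    while True:
        if drain_once(incoming_dir, done_dir, handle) == 0:
            time.sleep(interval)
```

This assumes the staging and incoming directories are on the same
filesystem, so that the move is a plain rename. Each web process calls
submit(); a single separate process calls run_writer().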

If you're concerned about being able to report failure, you have to
consider all the possible points of failure. The first solution has
very simple failure reporting: "I wrote this to PyTables" or "I
didn't". The second is a two-stage process, where all the client can
report is "I passed this on to the writer process". But if your buffer
is somewhere persistent and reliable (like disk), that's perhaps a
_better_ report: even if the PyTables db gets corrupted somehow, you
still have the data at least until you clean out the old stuff (which
you can do after backing up the HDF5 file, for example).
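
For the first (single-process) option, the "I wrote this" or "I
didn't" report maps naturally onto an HTTP status code. A minimal
sketch with the standard library's http.server follows; make_handler,
serve, the port, and the store callback are assumptions for
illustration, with store standing in for the code that writes to
PyTables and raises an exception on failure.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(store):
    """Build a request handler around store, a hypothetical callable that
    writes one payload and raises an exception if the write fails."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            try:
                store(body)
            except Exception:
                status, reply = 500, b"not written\n"
            else:
                status, reply = 200, b"written\n"
            self.send_response(status)
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return Handler


def serve(store, host="127.0.0.1", port=8000):
    # HTTPServer handles one request at a time, so writes are serialized
    # and concurrent clients simply wait their turn.
    HTTPServer((host, port), make_handler(store)).serve_forever()
```

The client gets 200 only after store() has returned, which is the
simple one-stage report described above. In the two-stage design the
same handler could call something that drops the payload into the
buffer instead, and 200 would then mean "passed on to the writer".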

>> Not innovative as opposed to Pro, but OPSI itself seems to be
>> innovative. And fast, apparently. Oh well, certainly don't want you to
>> stop working on pytables, :-)
>
> Thank you :-)

OPSI, on my brief look at it, seemed to be optimized for write-once,
read-many. There are many other scenarios possible; for example, we
have one scenario that requires checking if an item is already stored
before writing it. The re-indexing that OPSI would require would hurt
performance, though there may be ways around that. The point: if you
know your problem well, you can probably make a more efficient
implementation of just about anything than commercial general-purpose
products.

Regards,
-Ken