From: <bc...@wo...> - 2000-12-10 19:31:49
[Anton Gluck]
>I'd like to solicit your response on how useful JPython/Jython is when
>analyzing large data sets.
>
>I am working on a project for which a data analyzing algorithm was written
>in Java, and wrote a JPython 1.5.2 script to test the algorithm. When
>analyzing large data sets (flat text files) of more than 20, 30 MB we
>originally faced two problems: This was slow, and we ran out of memory
>even if the JVM is allocated the full 256 MB RAM. After rewriting the
>JPython code the out-of-memory problem went away, but presumably the limit
>was just pushed a little further.
>
>In short, this is what the script does: It reads the flat file and builds
>Java objects holding data records (really just arrays of doubles). The
>records are built and handed to the analyzing algorithm one at a time.
>Java classes analyze the data, and create objects describing the results.
>The script then examines the results and writes out some diagnostics. In
>order to do this, the data set is read twice more (reading it once and
>holding the data in memory had led to the out-of-memory problem).

I'll assume a data record object like this:

    class DataRec {
        public double[] data;
    }

>So my question is: Is JPython a useful tool for this kind of testing? Do
>we need to be concerned about memory leaks when handling large data sets?

There are no known leaks in the handling of normal objects. (The known
leaks in jython are around class loading, generated event adapters and
threads.)

You will still have to deal with possible deallocation of global data
structures when they are no longer needed. These are standard Python
issues, not specific to jython.

Jython and CPython differ in memory consumption. Exact measurement is
almost impossible, but here is some of what I know:

- Each java object carries an additional ~8-16 bytes of overhead,
  depending on the JVM version and the object type. An array instance
  also has this overhead.

- A java object in jython is wrapped in an instance of PyJavaInstance.
  So if you store the data records in a jython list or jython dict,
  there is a 36-44 byte overhead for each data record. If you store the
  data records in a java.util.Vector instead, you save this memory
  overhead, but you force jython to create a new wrapper each time a
  data record is retrieved from the Vector.

(The numbers above are so dependent on the JVM as to be practically
useless, but I'm assuming 4 bytes for an object reference.)

>We also have a Python script that tests the C++ version of the algorithm,
>and it runs considerably faster on the same machine (by about half). Would
>you say that this is solely because of the difference between Java and
>C++, or are there additional factors influencing the speed of execution?

I would say the java solution is holding up quite well if those are your
measurements. The ratio is highly dependent on the JVM and JIT, and on
how much of the time is spent in the Java/C++ algorithm versus how much
is spent in the script.

One possible performance improvement is to let the data record subclass
PyObject. That eliminates the PyJavaInstance overhead completely:
PyObject subclasses can be stored in python lists and dicts without
creating a wrapper first.

If the script also looks at the double data, each access to a double
will create a new temporary object (a PyFloat), which may explain some
of the difference.
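Something along these lines might do it (untested, and the class and
method names are my own invention; the exact org.python.core entry
points vary a little between jython releases):

    import org.python.core.PyFloat;
    import org.python.core.PyObject;

    // A data record that extends PyObject, so jython can keep it in
    // python lists and dicts directly, without allocating a
    // PyJavaInstance wrapper per record.
    public class PyDataRec extends PyObject {
        public double[] data;

        public PyDataRec(double[] data) {
            this.data = data;
        }

        // Convenience accessor for the script. Note that each call
        // still allocates one temporary PyFloat, as mentioned above.
        public PyFloat get(int i) {
            return new PyFloat(data[i]);
        }
    }

The script can then build these records and append them to a plain
python list without paying for a wrapper on every store and retrieval.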
>Considering the out-of-memory problem: Is this a general Python problem,
>or is it more pronounced in JPython?

I don't know what per-object overhead CPython has, but I would guess
that the overhead is bigger in Jython.

>I realize that these are difficult questions to answer because of their
>generality. But we would like to make an educated decision for future
>projects - whether to stay with JPython/Jython (and Python) or to look
>for alternatives.

Performance enhancement is always difficult, because it is always
possible to improve performance or memory use a little more. At some
point the gain is just too small to be worth the work.

I suggest you try to let your data record subclass PyObject. That
should give a relatively large memory saving. If that isn't enough, cut
your losses and use an environment where you have better control over
memory allocation and memory use.

regards,
finn