From: Anton G. <gl...@mi...> - 2000-12-10 04:55:07
Hi,

I'd like to solicit your opinions on how useful JPython/Jython is for analyzing large data sets. I am working on a project for which a data-analysis algorithm was written in Java, and I wrote a JPython 1.5.2 script to test the algorithm. When analyzing large data sets (flat text files of more than 20 or 30 MB) we originally faced two problems: the script was slow, and we ran out of memory even when the JVM was allocated the full 256 MB of RAM. After rewriting the JPython code the out-of-memory problem went away, but presumably the limit was only pushed a little further out.

In short, this is what the script does (a rough sketch of the main loop follows at the end of this message): It reads the flat file and builds Java objects holding data records (really just arrays of doubles). The records are built and handed to the analyzing algorithm one at a time. Java classes analyze the data and create objects describing the results. The script then examines the results and writes out some diagnostics. To do this, the data set is read twice more (reading it once and holding all the data in memory had led to the out-of-memory problem).

So my questions are: Is JPython a useful tool for this kind of testing? Do we need to be concerned about memory leaks when handling large data sets? We also have a Python script that tests the C++ version of the algorithm, and it runs considerably faster on the same machine (in roughly half the time). Would you say that this is solely due to the difference between Java and C++, or are there additional factors influencing the speed of execution? Regarding the out-of-memory problem: Is this a general Python problem, or is it more pronounced in JPython?

I realize that these are difficult questions to answer because of their generality, but we would like to make an educated decision for future projects: whether to stay with JPython/Jython (and Python) or to look for alternatives.

Thank you for any responses.

Anton
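P.S. For reference, the main loop of the test script looks roughly like the sketch below. The names Analyzer and DataRecord are placeholders for our actual Java classes, and the per-line parsing is simplified.

    # Rough sketch of the test script's streaming loop (JPython 1.5.2 syntax).
    # Analyzer and DataRecord are placeholder names, not our real Java API.
    import string
    from analysis import Analyzer, DataRecord   # hypothetical Java package

    def run(filename):
        analyzer = Analyzer()
        f = open(filename)
        line = f.readline()
        while line:
            # one record per line: whitespace-separated doubles
            values = map(float, string.split(line))
            record = DataRecord(values)      # Java object wrapping the doubles
            analyzer.process(record)         # record is handed over, not kept
            line = f.readline()
        f.close()
        return analyzer.getResults()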