From: <bc...@wo...> - 2000-12-10 19:31:49
[Anton Gluck]
>I'd like to solicit your response on how useful JPython/Jython is when
>analyzing large data sets.
>
>I am working on a project for which a data analyzing algorithm was written
>in Java, and wrote a JPython 1.5.2 script to test the algorithm. When
>analyzing large data sets (flat text files) of more than 20, 30 MB we
>originally faced two problems: This was slow, and we ran out of memory
>even if the JVM is allocated the full 256 MB RAM. After rewriting the
>JPython code the out-of-memory problem went away, but presumably the limit
>was just pushed a little further.
>
>In short, this is what the script does: It reads the flat file and builds
>Java objects holding data records (really just arrays of doubles). The
>records are built and handed to the analyzing algorithm one at a time.
>Java classes analyze the data, and create objects describing the results.
>The script then examines the results and writes out some diagnostics. In
>order to do this, the data set is read twice more (reading it once and
>holding the data in memory had led to the out-of-memory problem).

I'll assume a data record object like this:

    class DataRec {
        public double[] data;
    }

>So my question is: Is JPython a useful tool for this kind of testing? Do
>we need to be concerned about memory leaks when handling large data sets?

There are no known leaks in the handling of normal objects. (The known
leaks in jython are around class loading, generated event adapters and
threads.)

You will still have to deal with possible deallocation of global data
structures when they are no longer needed. These are standard Python
issues, not specific to jython.

Jython and CPython differ in memory consumption. Exact measurement is
almost impossible, but here is some of what I know:

- Each java object carries an additional ~8-16 bytes of overhead,
  depending on the JVM version and the object type. An array instance
  also has this overhead.

- A java object in jython is wrapped in an instance of PyJavaInstance.
  So if you store the data records in a jython list or jython dict,
  there is a 36-44 byte overhead for each data record. If you store the
  data records in a java.util.Vector instead, you save this memory
  overhead, but you force jython to create a new wrapper each time a
  data record is retrieved from the Vector.

(The numbers above are so dependent on the JVM as to be practically
useless, but I'm assuming 4 bytes for an object reference.)

>We also have a Python script that tests the C++ version of the algorithm,
>and it runs considerably faster on the same machine (by about half). Would
>you say that this is solely because of the difference between Java and
>C++, or are there additional factors influencing the speed of execution?

I would say the java solution is holding up quite well if those are your
measurements. The ratio is highly dependent on the JVM and JIT, and on
how much of the time is spent in the Java/C++ algorithm versus how much
is spent in the script.

One possible performance improvement is to let the data record subclass
PyObject. That eliminates the PyJavaInstance overhead completely:
PyObject subclasses can be stored in python lists and dicts without
creating a wrapper first.

If the script also looks at the double data, each access to a double
will create a new temporary object (a PyFloat), which may explain some
of the difference.
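Something along these lines might do it (untested, and the class and
method names are my own invention; the exact org.python.core entry
points vary a little between jython releases):

    import org.python.core.PyFloat;
    import org.python.core.PyObject;

    // A data record that extends PyObject, so jython can keep it in
    // python lists and dicts directly, without allocating a
    // PyJavaInstance wrapper per record.
    public class PyDataRec extends PyObject {
        public double[] data;

        public PyDataRec(double[] data) {
            this.data = data;
        }

        // Convenience accessor for the script. Note that each call
        // still allocates one temporary PyFloat, as mentioned above.
        public PyFloat get(int i) {
            return new PyFloat(data[i]);
        }
    }

The script can then build these records and append them to a plain
python list without paying for a wrapper on every store and retrieval.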
>Considering the out-of-memory problem: Is this a general Python problem,
>or is it more pronounced in JPython?

I don't know what per-object overhead CPython has, but I would guess
that the overhead is bigger in Jython.

>I realize that these are difficult questions to answer because of their
>generality. But we would like to make an educated decision for future
>projects - whether to stay with JPython/Jython (and Python) or to look
>for alternatives.

Performance enhancement is always difficult, because it is always
possible to improve performance or memory use a little more. At some
point the gain is just too small to be worth the work.

I suggest you try to let your data record subclass PyObject. That
should give a relatively large memory saving. If that isn't enough, cut
your losses and use an environment where you have better control over
memory allocation and memory use.

regards,
finn