From: Anton G. <gl...@mi...> - 2000-12-10 04:55:07
|
Hi, I'd like to solicit your responses on how useful JPython/Jython is for analyzing large data sets. I am working on a project for which a data-analysis algorithm was written in Java, and I wrote a JPython 1.5.2 script to test the algorithm. When analyzing large data sets (flat text files) of more than 20-30 MB we originally faced two problems: it was slow, and we ran out of memory even when the JVM was allocated the full 256 MB of RAM. After rewriting the JPython code the out-of-memory problem went away, but presumably the limit was just pushed a little further out.

In short, this is what the script does: it reads the flat file and builds Java objects holding data records (really just arrays of doubles). The records are built and handed to the analyzing algorithm one at a time. Java classes analyze the data and create objects describing the results. The script then examines the results and writes out some diagnostics. In order to do this, the data set is read twice more (reading it once and holding all the data in memory had led to the out-of-memory problem).

So my questions are: Is JPython a useful tool for this kind of testing? Do we need to be concerned about memory leaks when handling large data sets?

We also have a Python script that tests the C++ version of the algorithm, and it runs considerably faster on the same machine (by about half). Would you say that this is solely because of the difference between Java and C++, or are there additional factors influencing the speed of execution? Considering the out-of-memory problem: Is this a general Python problem, or is it more pronounced in JPython?

I realize that these are difficult questions to answer because of their generality, but we would like to make an educated decision for future projects: whether to stay with JPython/Jython (and Python) or to look for alternatives. Thank you for any responses.

Anton |
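The one-record-at-a-time pipeline Anton describes might look roughly like the following CPython sketch. All names here are hypothetical (`analyze` stands in for the Java algorithm, and whitespace-separated doubles per line is an assumed file layout):

```python
from array import array

def records(path):
    # Stream the flat text file one record at a time instead of holding
    # the whole data set in memory; each line is assumed to contain
    # whitespace-separated doubles (a hypothetical layout).
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                # array('d') stores raw doubles, much like a Java double[]
                yield array('d', (float(x) for x in fields))

def run_pass(path, analyze):
    # Hand each record to the analyzer as it is built, so only the
    # current record (plus the accumulated results) stays alive.
    return [analyze(rec) for rec in records(path)]
```

Re-reading the file for each pass, as the script does, trades extra I/O for a flat memory profile: only one record is resident at a time.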
From: George H. <ghe...@cf...> - 2000-12-10 17:27:52
|
Anton Gluck wrote:
> I'd like to solicit your response on how useful JPython/Jython is when
> analyzing large data sets.

I don't feel particularly qualified to comment on the usefulness of Jython for your particular problem; however, there are the usual concerns to take into account when using a scripting language. Generally, both execution time and memory requirements are somewhat higher for scripting languages.

> I am working on a project for which a data analyzing algorithm was written
> in Java, and wrote a JPython 1.5.2 script to test the algorithm. When
> analyzing large data sets (flat text files) of more than 20, 30 MB we
> originally faced two problems: This was slow, and we ran out of memory
> even if the JVM is allocated the full 256 MB RAM.

It is possible to allocate more than the available RAM to the JVM. This will probably result in higher paging rates for your program during execution, but the degree of impact depends on the locality of reference within your program. Some of the slowness you are currently experiencing is probably due to paging.

> In short, this is what the script does: It reads the flat file and builds
> Java objects holding data records (really just arrays of doubles).

Reducing the number of objects your program works with is always a good idea, and using native types (e.g. doubles) is useful from a performance perspective. It sounds as if you are already taking this advice, but it never hurts to re-examine your design. Also, there may be an opportunity to reduce the amount of data worked on at one time. That is, it may be possible to work on the data in chunks instead of all at once.

> We also have a Python script that tests the C++ version of the algorithm,
> and it runs considerably faster on the same machine (by about half). Would
> you say that this is solely because of the difference between Java and
> C++, or are there additional factors influencing the speed of execution?

My comment above about paging may be applicable.
I won't weigh in on the Java/C++ debate. Good luck, George |
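George's suggestion to work on the data in chunks rather than all at once can be sketched as follows. This is a generic CPython sketch, not the thread's actual script; `process_chunk` is a hypothetical stand-in for the analysis step:

```python
from itertools import islice

def chunks(iterable, size):
    # Yield successive lists of at most `size` items, so only one
    # chunk of the data set is resident in memory at a time.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            break
        yield chunk

def process_file(path, process_chunk, size=1000):
    # Feed the (hypothetical) analysis step one chunk of lines at a time.
    with open(path) as f:
        for chunk in chunks(f, size):
            process_chunk(chunk)
```

Peak memory is then bounded by the chunk size rather than the file size, which directly addresses the out-of-memory symptom described above.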
From: <bc...@wo...> - 2000-12-10 19:31:49
|
[Anton Gluck]
> I'd like to solicit your response on how useful JPython/Jython is when
> analyzing large data sets.
>
> I am working on a project for which a data analyzing algorithm was written
> in Java, and wrote a JPython 1.5.2 script to test the algorithm. When
> analyzing large data sets (flat text files) of more than 20, 30 MB we
> originally faced two problems: This was slow, and we ran out of memory
> even if the JVM is allocated the full 256 MB RAM. After rewriting the
> JPython code the out-of-memory problem went away, but presumably the limit
> was just pushed a little further.
>
> In short, this is what the script does: It reads the flat file and builds
> Java objects holding data records (really just arrays of doubles). The
> records are built and handed to the analyzing algorithm one at a time.
> Java classes analyze the data, and create objects describing the results.
> The script then examines the results and writes out some diagnostics. In
> order to do this, the data set is read twice more (reading it once and
> holding the data in memory had led to the out-of-memory problem).

I'll assume a data record object like this:

    class DataRec {
        public double[] data;
    }

> So my question is: Is JPython a useful tool for this kind of testing? Do
> we need to be concerned about memory leaks when handling large data sets?

There are no known leaks in the handling of normal objects. (The known leaks in Jython are around class loading, generated event adapters, and threads.) You will still have to deal with possible deallocation of global data structures when they are no longer needed; these are standard Python issues, not specifically related to Jython.

Jython and CPython differ in memory consumption. Exact measurement is almost impossible, but here is some of what I know:

- Each Java object carries an additional ~8-16 bytes of overhead, depending on the JVM version and object type. An array instance also has this overhead.
- A Java object in Jython will be wrapped with an instance of PyJavaInstance. So if you store the data records in a Jython list or Jython dict, there is a 36-44 byte overhead for each data record. If you store the data records in a java.util.Vector instead you save this memory overhead, but you force Jython to create a new wrapper each time a data record is retrieved from the Vector. (The numbers above are so dependent on the JVM as to be practically useless, but I'm assuming 4 bytes for an object reference.)

> We also have a Python script that tests the C++ version of the algorithm,
> and it runs considerably faster on the same machine (by about half). Would
> you say that this is solely because of the difference between Java and
> C++, or are there additional factors influencing the speed of execution?

I would say the Java solution is holding up quite well if those are your measurements. It is highly dependent on the JVM and JIT. The ratio also depends on how much of the time is spent in the Java/C++ algorithm and how much in the script.

One possible performance improvement would be to let the data record subclass PyObject. That way the PyJavaInstance overhead is eliminated entirely: PyObject subclasses can be stored in Python lists and dicts without creating a wrapper first. If the script also looks at the double data, each access to a double will create a new temporary object (a PyFloat), which may explain some of the difference.

> Considering the out-of-memory problem: Is this a general Python problem,
> or is it more pronounced in JPython?

I don't know what overhead CPython has for each object, but I would guess that the overhead is bigger in Jython.

> I realize that these are difficult questions to answer because of their
> generality. But we would like to make an educated decision for future
> projects - whether to stay with JPython/Jython (and Python) or to look for
> alternatives.
Performance enhancement is always difficult, because it is always possible to improve performance or memory use a little more; at some point the gain is too small to be worth the work. I suggest you try letting your data record subclass PyObject. That should give a relatively large memory saving. If that isn't enough, cut your losses and use an environment where you have better control over memory allocation and memory use.

regards, finn |
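Finn's byte counts are Jython-specific, but the underlying boxing effect he describes (a wrapper object per stored value, a temporary PyFloat per double accessed) has a direct CPython analogue that is easy to measure. The sketch below compares a list of boxed floats with the stdlib `array` module's packed storage; the exact byte counts vary by interpreter and platform, so only the relative difference is meaningful:

```python
import sys
from array import array

n = 10000
boxed = [float(i) for i in range(n)]   # one boxed float object per element
packed = array('d', range(n))          # raw 8-byte doubles stored inline

# The list holds only references; every float object adds its own header
# on top of the 8 data bytes. The array stores the doubles contiguously.
list_total = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
array_total = sys.getsizeof(packed)
```

On typical interpreters `array_total` comes out several times smaller than `list_total`, which mirrors Finn's advice to keep the doubles in native arrays and avoid per-element wrapper objects.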
From: Anton G. <gl...@mi...> - 2000-12-14 07:05:22
|
Many thanks to the people who responded to my post. You helped clear up the situation with memory management and speed considerations a lot. Anton |
From: Ben H. <be...@in...> - 2000-12-11 01:26:02
|
Hi Anton,

Anton Gluck wrote:
> I am working on a project for which a data analyzing algorithm was written
> in Java, and wrote a JPython 1.5.2 script to test the algorithm. When
> analyzing large data sets (flat text files) of more than 20, 30 MB we
> originally faced two problems: This was slow, and we ran out of memory
> even if the JVM is allocated the full 256 MB RAM. After rewriting the
> JPython code the out-of-memory problem went away, but presumably the limit
> was just pushed a little further.

I read an interesting article recently about the differences between virtual machine versions in memory usage behavior. Earlier, non-HotSpot VMs throw OutOfMemory much earlier and more readily than HotSpot. HotSpot appears to utilize the memory allocated to the JVM process more efficiently, so that more of it (90%+) can be allocated to objects without throwing OutOfMemory, compared to 60-70% on older VMs. The article was on a public Sun web server, which I got to from the Bug Parade while investigating garbage collection.

So, one piece of advice would be to upgrade your VM version if possible.

Ben

-- Ben Hutchison Software Engineer-Market Predictor Webmind Australia http://www.webmind.com/productspredictor.html |