The main issue since Version 0.500 was speed, speed and speed.
This mattered all the more once I realized that the MPI version had contained a serious error until now: only a quarter of the points were actually transferred. Every point consists of 4 components (x, y, z and the homogeneous coordinate w), but only #particles * sizeof(particle) was sent.
I didn't notice it because the virgo file was pretty random, so the absence of 75% of the particles was not noticeable.
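As a minimal sketch of the size calculation involved, assuming the coordinates are kept as a flat array of floats with four values per particle (the names below are hypothetical and not taken from the Hubble sources):

```c
#include <stddef.h>

/* Assumption: each particle carries x, y, z and w as four floats. */
enum { COMPONENTS_PER_PARTICLE = 4 };

/* Number of float values that must be transferred for n particles. */
static size_t particle_value_count(size_t n)
{
    return n * COMPONENTS_PER_PARTICLE;
}

/* Number of bytes that must be transferred for n particles. */
static size_t particle_byte_count(size_t n)
{
    return particle_value_count(n) * sizeof(float);
}
```

Passing only n as the element count of a float buffer would transfer a quarter of the data, which matches the symptom described above.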
To give you an idea of the speed-up and of how far we have come: I am now working with files of 27 million particles and displaying them at 3.5 fps on a 768 screen with 8 CPUs, and with 6.5 million particles at 2.5 fps on a 640 screen in the serial version, compared with ~5 fps with 1.3 million particles on a 400 screen.
So what was done since Version 0.500?
I wrote a new version of XForm, which does the transform and the projection, and I also unrolled the loop; this results in a speed-up of 10% in comparison to Tiziano's SSE version. XForm now works on single vectors and not on the whole vector matrix, so that only 4 elements have to be buffered instead of 4 * #particles.
Additionally, the densities are stored as unsigned short int instead of 32-bit floats. This results in a reasonable speed-up, because the MPI version is heavily interconnect-bound. This is also the reason why the quick mode only increases the frame rate by about 2-3 fps. In the serial version it always increases the frame rate to ~100 fps, even with millions of particles. I also did some inner-loop optimization of the reduction pipeline already described in the Version 0.500 round-up.
I also integrated all of the features of the serial version into the parallel one. The difficulty here is that all processors run highly asynchronously, so the question is how to transmit user interactions consistently to all processors. I solved it by sending a user-interaction array at the same time as the rot-matrix, so that no Barriers or Waits have to be used.
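To illustrate the two ideas, here is a minimal sketch of a per-point transform with the matrix multiply written out by hand, plus a 16-bit density quantization. It is not the actual XForm code: the struct layout, the row-major matrix, the function names and the scaling to [0, max_density] are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: one point = x, y, z and the homogeneous coordinate w. */
typedef struct { float x, y, z, w; } Point4;

/* Transform one point by a row-major 4x4 matrix and apply the perspective
 * divide. The rows are unrolled by hand; the z row is not needed for the
 * 2D screen position and is skipped. */
static void xform_point(const float m[16], const Point4 *in,
                        float *sx, float *sy)
{
    const float x = in->x, y = in->y, z = in->z, w = in->w;
    const float tx = m[0]  * x + m[1]  * y + m[2]  * z + m[3]  * w;
    const float ty = m[4]  * x + m[5]  * y + m[6]  * z + m[7]  * w;
    const float tw = m[12] * x + m[13] * y + m[14] * z + m[15] * w;
    const float inv = (tw != 0.0f) ? 1.0f / tw : 1.0f;

    *sx = tx * inv;
    *sy = ty * inv;
}

/* Quantize a density in [0, max_density] to 16 bits, halving the number
 * of bytes per value compared with a 32-bit float. */
static uint16_t quantize_density(float d, float max_density)
{
    if (max_density <= 0.0f || d <= 0.0f)
        return 0;
    if (d >= max_density)
        return UINT16_MAX;
    return (uint16_t)(d / max_density * 65535.0f + 0.5f);
}
```

Because each call touches only one point, no intermediate array of transformed coordinates is required; the caller can consume the screen position immediately.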
In the whole code there is only one Barrier.
The bad or good thing is that in every iteration 5-10 MCycles are lost due to stalling caused by the Barrier.
So I will integrate either ColorMapImage or Project into the reduction loop.
I will also start to work on SSE and MMX versions of various functions. SSE and MMX because the new cluster, to be delivered in a few months, will be an Opteron cluster. Since I dislike 3DNow!, this is the golden path for me, as SSE2 is nonexistent in the current Athlon system.
Up to now the following stuff was done:
In the serial version I only made some slight changes. I added options for point antialiasing and for switching the shading model between flat and smooth shading, although no real improvement in image quality could be observed. I also added an option for enabling dithering.
I removed the "nearest particle" plot mode and the z-Coord-Array. I also took care of the memory consumption: allocating only if necessary, as late as possible, and releasing memory as early as possible. Together with the removal of the z-Coord-Array this results in a noticeable speed-up. On the workstation (2.x GHz Xeon) used here, Hubble achieves 7.5 fps at an increased resolution of 576*576 pixels, with all optimizations turned on (-O3, SSE, profiling, ...).
But my main effort went into the MPI version, and first of all into the MPI_support.c file. My first attempt was to use the collective MPI reduce, but this failed. We believe this is due to an error in the Scali MPI implementation we are using. So I built a tree reduction myself, but the speed-up was poor to nothing. A more precise analysis revealed that this was caused by the latency of the very big chunks that have to be transmitted. I therefore ended up with the current solution: packetizing the ValueMap chunks and transmitting them in a pipeline to hide the cost of the reduce operation (here: max).
I also contributed a file called tree.c, which constructs a treeTable representing the underlying binary tree. If the number of nodes used is not a power of 2 and the tree is slightly unbalanced, the delays of the corresponding nodes in the pipeline are also calculated. After that, the function TR_bugGen (in tree.c) constructs what we call "bugs", which are the wiring of each node to its two inputs and two outputs. The resulting bugTable is then scattered, so that each node knows from which nodes to receive its buffered inputs, to whom to send the merged result, and to whom to forward its local chunk of the ValueMap.
This all works for any arbitrarily chosen number of nodes and results in a leap in performance, which is now bounded only by the 100 MBit connection to the display machine. But we will replace this in the next few days.
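The following is a minimal, MPI-free sketch of the two building blocks described above: a table that wires ranks into a binary tree, and a max-merge applied packet by packet. It is not the code from tree.c. The heap-ordered numbering, the struct and function names, and the use of float values are assumptions made for illustration, and the pipeline delays for unbalanced trees are not modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for one row of a tree table. */
typedef struct {
    int parent;    /* rank that receives this node's merged output, -1 at the root */
    int child[2];  /* ranks this node receives packets from, -1 if absent */
} TreeNode;

/* Fill table[0..n-1] with a heap-ordered binary tree rooted at rank 0.
 * Works for any n >= 1, including values that are not powers of 2. */
static void build_tree(TreeNode *table, int n)
{
    for (int i = 0; i < n; ++i) {
        const int left = 2 * i + 1;
        const int right = 2 * i + 2;
        table[i].parent   = (i == 0) ? -1 : (i - 1) / 2;
        table[i].child[0] = (left < n) ? left : -1;
        table[i].child[1] = (right < n) ? right : -1;
    }
}

/* Merge one received packet into the local packet using max. */
static void max_merge(float *local, const float *received, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        if (received[i] > local[i])
            local[i] = received[i];
}

/* Number of packets needed to cover `total` values (packet_len must be > 0). */
static size_t packet_count(size_t total, size_t packet_len)
{
    return (total + packet_len - 1) / packet_len;
}
```

With small packets, a node can merge packet k while packet k+1 is still in flight, which is what lets the transmission overlap the reduction.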
What's still not working and what I plan to do in the near future:
Add support for partially filled packets by padding them with zeros; at present, partially filled packets are discarded. Another problem is what Tiziano (dangermaus ;-)) called the "centering problem". My current workaround is to hardcode the maximum resolution and to open windows at that resolution. This is done because the new coordinates of the window after resizing are known only at the master node, so resizing works only in the serial version. One has to find a way for the master node to distribute them to all nodes in a synchronized manner. I have an idea how this could be done, but only an idea, so stay tuned!
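The padding of a partially filled packet could look roughly like the sketch below. It assumes float values that are never negative, so that zero is neutral under the max reduction; the function name and signature are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copy packet number `index` out of `map` into `packet` (packet_len values).
 * If the map ends before the packet is full, the remainder is set to zero.
 * Returns the number of real values copied. */
static size_t fill_packet(const float *map, size_t total, size_t index,
                          size_t packet_len, float *packet)
{
    const size_t start = index * packet_len;
    const size_t avail = (start < total) ? total - start : 0;
    const size_t n = (avail < packet_len) ? avail : packet_len;

    if (n > 0)
        memcpy(packet, map + start, n * sizeof *packet);
    for (size_t i = n; i < packet_len; ++i)
        packet[i] = 0.0f;  /* zero padding: neutral for max on non-negative data */
    return n;
}
```

The receiver can then treat every packet as full-sized, and only the final copy back into the map needs to honor the returned count.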
I also want to create an SSE version of my reduction operation, max-merge. In addition, I want to make extensive use of prefetching, which could result in another major speed-up of the serial version, but perhaps not of the parallel version, which should be bounded by the transmission of the packets. In my pipelined version the reduction should be almost completely overlapped by the transmission.
I will perhaps distribute my first code version around Christmas.
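A possible shape for an SSE max-merge is sketched below, assuming the values are 32-bit floats. It uses the standard `_mm_max_ps` intrinsic on four values at a time and falls back to a scalar loop for the tail, or entirely when SSE is not available at compile time. The function name is hypothetical, and prefetching is left out.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* local[i] = max(local[i], received[i]) for i in [0, len). */
static void max_merge_sse(float *local, const float *received, size_t len)
{
    size_t i = 0;

#if defined(__SSE__)
    /* Four floats per iteration; unaligned loads keep the sketch simple. */
    for (; i + 4 <= len; i += 4) {
        const __m128 a = _mm_loadu_ps(local + i);
        const __m128 b = _mm_loadu_ps(received + i);
        _mm_storeu_ps(local + i, _mm_max_ps(a, b));
    }
#endif

    /* Scalar tail, and full fallback on targets without SSE. */
    for (; i < len; ++i)
        if (received[i] > local[i])
            local[i] = received[i];
}
```

If the buffers are known to be 16-byte aligned, the aligned load and store intrinsics could be used instead.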
Hi, I'm Tom Kuehne,
and I'm working at the Institute of Theoretical Physics at the University of Zurich on an MPI version of Hubble.
I also try to contribute to the serial version through small improvements and code documentation.
Hubble in a bottle is a scientific visualization tool for particles (in particular stars) that runs on Linux machines and in particular on Linux Beowulf clusters.
Release 0.400 fixes a bug that prevented the rendering of models with fewer than 50000 particles if the density file was missing. Quicker mode was also broken for these files in the previous 0.350 release.
0.400 introduces color bias and contrast. Clicking on the color bar at the bottom of the model with the left or right mouse button changes the color bar accordingly. This feature is alpha and could undergo further development.
The most important change concerns the "How to compile" section: instructions on how to install the GLUT library were added.
A new section "Get the latest development version" explains how to download from CVS and the reasons why packages are behind the latest development stage.
The latest version of the documentation is online; the packages will contain the old documents for a while.
It would be cool if someone could send binaries for Debian Linux.
Currently, I am working on contrast and bias. The routine is already there, but now I have to transform mouse movements into routine calls... Hopefully, I'll get it right.
Two maintenance releases (0.352, 0.353) are planned for next week.
The new hubble homepage should come as well, if Ron solves the problems with his ISP ;-)
Thanks to Antonio and Keith for their revision of the project's documents.
To get the latest news on hubble development, subscribe to the project mailing list (click on the Lists tab).