Hello,
I noticed a very strange behaviour when running JDFTx in some virtual machines.
I have two Debian 8 environments:
Environment A:
VirtualBox machine configured with 8 cores, 20 GB RAM (and some swap).
Environment B:
VirtualBox machine configured with 24 cores, 94 GB RAM (and no swap).
Environment B is running on a dedicated 64-core machine with 128 GB RAM (and no swap).
What I noticed is this: if I run this code in env A it takes ~2-3 hours to complete, while in env B it needs 3 days to get the results.
I also noticed that in env A jdftx runs with ~8 threads (one per core, which I think is fine), while in env B it creates ~55 threads!
Has anyone else noticed this strange behaviour?
All other parameters are the same.
Regards
Michele Renda
We don't have that much experience running jdftx in virtual machines, so I don't know for sure why it runs so much slower in environment B. Here is an idea to try:
You can use the -c flag to manually set the number of threads in jdftx. If the slowdown is due to too many threads (larger than what the OS can run simultaneously), this should fix it.
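For example, a minimal invocation might look like this (the input/output file names are placeholders; only the -c flag is the point here):

```shell
# Cap jdftx at 8 worker threads, overriding the automatic detection
# (myCalc.in / myCalc.out are placeholder file names):
jdftx -c 8 -i myCalc.in -o myCalc.out
```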
However, if the problem is due to general overhead in virtual environments, I don't know how it might be easily fixed. Maybe Shankar has suggestions.
Hi Michele,
Could you please attach the log files of the two runs, along with the output of 'cat /proc/cpuinfo' on the two VMs as well as their host machines?
(I'm looking for issues such as hyperthreaded cores assigned to separate physical cores in the VM.)
Best,
Shankar
Hello,
Sorry for the late answer, but I wanted to be sure I did not make any obvious mistake, so I performed another test.
The results are these:
Environment A: Duration: 0-0:54:04.73
Environment B: Duration: 0-8:53:03.48
I just noticed that the only difference is the version of JDFTx (but I don't think it is relevant):
Environment A: JDFTx 0.99.alpha (svn revision 1178)
Environment B: JDFTx 0.99.alpha (svn revision 1183)
Attached are the input file and the generated output.
Best regards
Michele Renda
PS: During the run in environment B I noticed ~50 threads (with only 24 physical cores).
Last edit: Michele Renda 2015-05-13
As a counter-test I ran JDFTx in environment B but with a reduced number of cores (only 8), and the run took just:
Environment B (reduced): Duration: 0-2:07:14.17
It seems that fewer cores are better for performance!
When decreasing the thread count increases performance, it feels like it might be a cache coherence issue. I know we've had those before. Maybe the fact that you are running it in a virtual machine triggers something we haven't encountered before. Any ideas, Shankar?
If you need, I can arrange credentials so you can run your tests on both machines.
I should add that env A runs on an Intel server with 2x Xeon E5-2620 v3, while env B runs on a 4x AMD Opteron 6272 machine (completely dedicated).
I am also curious: how is the number of threads calculated (if not specified with the -c flag)?
For the (automatic) thread number detection, you can look at core/AutoThreadCount.cpp: on line 31, jdftx reads /proc/cpuinfo to get the necessary information.
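As a rough illustration of that kind of detection (this is a hypothetical helper, not the actual jdftx code): counting distinct (physical id, core id) pairs in /proc/cpuinfo gives physical cores, so hyperthreaded logical CPUs sharing one core are counted only once.

```shell
# Count distinct (physical id, core id) pairs in /proc/cpuinfo-style
# text on stdin; hyperthread siblings collapse to one entry.
count_physical_cores() {
    grep -E '^(physical id|core id)' | paste - - | sort -u | wc -l
}

# On a Linux machine:
if [ -r /proc/cpuinfo ]; then
    count_physical_cores < /proc/cpuinfo
fi
```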
Hi Michele,
It seems like JDFTx detects the cores correctly, since the two output files have the right number in the header.
The fact that you see ~55 threads in the second case seems to indicate that one of the libraries JDFTx is linked to is launching extra threads. The most likely offender is BLAS/LAPACK. What version of that are you linking to?
JDFTx assumes that the standard BLAS is compiled in single-threaded mode; there is no clean way to automatically figure this out at compile time. This is true for the default ATLAS library on most distributions, but there might be variations.
In the case of MKL, JDFTx is aware of the threaded layer in MKL and will use that instead whenever possible.
Could you please provide some info about your BLAS installation on the two VMs?
Thanks,
Shankar
PS: I don't think this is because of cache coherence issues. It is more likely a core over-commitment problem.
Hello,
The library I use is the default one provided by Debian 8 Jessie: libatlas-base-dev, version 3.10.2. I will check tomorrow how that library was compiled.
Thank you
Indeed, I think the problem is the threads launched by BLAS/LAPACK. Attached is a graph of the thread count for the first 20 minutes of the run on an 8-core virtual machine (with 5 s resolution).
I will check the Debian library to understand what is happening.
Regards
Michele
Last edit: Michele Renda 2015-05-14
Hi Michele,
It does seem that the default ATLAS on Debian might be compiled with threading (in contrast to the default ATLAS on Ubuntu which is not).
https://lists.debian.org/debian-science/2011/11/msg00029.html
Unfortunately I can't think of an easy way to detect and handle this at run time (else I'd patch it!).
I could, in principle, add a compile-time switch to handle threaded ATLAS the same way as I do MKL. But this would lead to a significant performance degradation compared to using a single-threaded ATLAS. This is because MKL has functions that allow adjusting the number of threads dynamically, so that JDFTx can decide when to thread the linear algebra calls and when to call single-threaded linear algebra functions from each thread. ATLAS does not allow dynamically controlling threads, AFAIK.
So the best options seem to be building ATLAS locally with threads disabled at compile time, or to install MKL (recommended if you have access to it).
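For the first option, a build might be sketched as follows; the -t 0 configure flag (disable threading) and the other options are taken from ATLAS's install notes, so please double-check them against your ATLAS version before relying on this:

```shell
# Hypothetical local build of ATLAS 3.10.2 with threading disabled:
tar xjf atlas3.10.2.tar.bz2
mkdir ATLAS/build_serial
cd ATLAS/build_serial
../configure -t 0 --shared --prefix="$HOME/atlas-serial"  # -t 0: no threads
make build     # compile the tuned library
make check     # run the sanity tests
make install   # install under --prefix
```

You would then point JDFTx's CMake configuration at the libraries under $HOME/atlas-serial instead of the system ATLAS.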
Best,
Shankar
Hello,
Thank you very much. I will prepare a complete guide on how to compile JDFTx and ATLAS (with threading disabled) on Debian 8.
Best regards
Michele
Thanks a bunch, Michele! We'll gladly incorporate it into our upcoming manual / guide.
Cheers,
Shankar