
Running time on different machines

2015-05-11
2015-05-15
  • Michele Renda

    Michele Renda - 2015-05-11

    Hello

    I noticed a very strange behaviour when running JDFTx in some virtual machines.

    I have two Debian 8 environments:

    Environment A:
    a VirtualBox machine configured with 8 cores, 20 GB RAM (and some swap).

    Environment B:
    a VirtualBox machine configured with 24 cores, 94 GB RAM (and no swap).

    Environment B is running on a dedicated 64-core machine with 128 GB RAM (and no swap).

    What I noticed is this: if I run this input in env A it takes ~2-3 hours to complete, while in env B it needs 3 days to get the results.

    lattice              \
         1 -0.5000     0 \
         0  0.8660     0 \
         0     0       50
    
    latt-scale 4.6865 4.6865 12.66
    
    ion-species /opt/GBRV/$ID_lda_v1.uspp
    
    coords-type lattice
    
    # Middle layer
    ion n   +0.0000 +0.0000 +0.0100  1  
    ion b   +0.3333 +0.6667 +0.0100  1   
    ion b   +0.0000 +0.0000 -0.0100  1  
    ion n   +0.3333 +0.6667 -0.0100  1   
    
    # Upper layer
    ion c   +0.6667 +0.3333 +0.0300  1  
    ion c   +0.3333 +0.6667 +0.0300  1 
    ion c   +0.0000 +0.0000 +0.0500  1  
    ion c   +0.3333 +0.6667 +0.0500  1  
    
    # Lower layer  
    ion c   +0.0000 +0.0000 -0.0300  1  
    ion c   +0.6667 +0.3333 -0.0300  1  
    ion c   +0.0000 +0.0000 -0.0500  1  
    ion c   +0.3333 +0.6667 -0.0500  1  
    
    dump-name sandwich_2b.$VAR
    dump End Ecomponents
    
    kpoint-folding 6 6 1
    
    density-of-states \
        Total \
    

    I noticed that in env A jdftx runs with ~8 threads (one per core, which I think is fine), while in env B it creates ~55 threads!

    Has anyone else noticed this strange behaviour?

    All other parameters are the same.

    Regards
    Michele Renda

     
  • Deniz Gunceler

    Deniz Gunceler - 2015-05-11

    We don't have that much experience running jdftx in virtual machines, so I don't know for sure why it runs so much slower on environment B. Here is an idea to try:

    You can use the -c flag to manually set the number of threads in jdftx. If the slowdown is due to too many threads (larger than what the OS can run simultaneously), this should fix it.
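    For example, a minimal sketch (the core count here is read from /proc/cpuinfo; the jdftx invocation is left commented out since the input and output file names are placeholders):

    ```shell
    # Count the logical CPUs the OS reports (this includes hyperthreads):
    NCPU=$(grep -c '^processor' /proc/cpuinfo)
    echo "logical CPUs reported: $NCPU"
    # Then pin jdftx to an explicit thread count with -c, e.g.:
    # jdftx -c "$NCPU" -i input.in -o output.out
    ```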

    However, if the problem is due to general overhead in virtual environments, I don't know how it might be easily fixed. Maybe Shankar has suggestions.

     
  • Ravishankar Sundararaman

    Hi Michele,

    Could you please attach the log files of the two runs, along with the output of 'cat /proc/cpuinfo' on the two VMs as well as their host machines?

    (I'm looking for issues such as hyperthreaded cores assigned to separate physical cores in the VM.)

    Best,
    Shankar

     
  • Michele Renda

    Michele Renda - 2015-05-13

    Hello,

    Sorry for the late answer, but I wanted to be sure I had not made any obvious mistake, so I performed another test.

    The results are these:

    Environment A: Duration: 0-0:54:04.73
    Environment B: Duration: 0-8:53:03.48

    I just noticed that the only difference is the version of JDFTx (but I do not think it matters):

    Environment A: JDFTx 0.99.alpha (svn revision 1178)
    Environment B: JDFTx 0.99.alpha (svn revision 1183)

    Attached there are the input file and the generated output.

    Best regards
    Michele Renda

    PS. During the run in environment B I noticed ~50 threads (with only 24 physical cores).

     

    Last edit: Michele Renda 2015-05-13
  • Michele Renda

    Michele Renda - 2015-05-13

    As a counter-test I ran JDFTx in environment B but with a reduced number of cores (only 8), and the run took just:
    Environment B (reduced): Duration: 0-2:07:14.17

    It seems that fewer cores are better for performance!

     
  • Deniz Gunceler

    Deniz Gunceler - 2015-05-13

    When decreasing threads increases performance, it feels like it might be a cache coherence issue. I know we've had those before. Maybe the fact that you are running it in a virtual machine triggers something we haven't encountered before. Any ideas Shankar?

     
  • Michele Renda

    Michele Renda - 2015-05-13

    If you need, I can arrange to get you credentials to be able to execute your tests in both machines.

    I have to add that env A runs on an Intel server with 2x Xeon E5-2620 v3, while env B runs on a 4x AMD Opteron 6272 machine (completely dedicated).

    I am just curious: how is the number of threads calculated in jdftx (if not specified with the -c flag)?

     
  • Deniz Gunceler

    Deniz Gunceler - 2015-05-13

    For the (automatic) thread number detection, you can look at core/AutoThreadCount.cpp . On line 31, jdftx looks at /proc/cpuinfo to get the necessary information.
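    As an illustrative sketch only (this is not the actual JDFTx code), detection along those lines might look like:

    ```cpp
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    // Count "processor" entries in /proc/cpuinfo, the same source of
    // information that jdftx's AutoThreadCount consults (sketch only).
    int detectThreadCount() {
        std::ifstream cpuinfo("/proc/cpuinfo");
        std::string line;
        int count = 0;
        while (std::getline(cpuinfo, line))
            if (line.rfind("processor", 0) == 0) ++count;
        // Fall back to the standard library's estimate if the file is unreadable:
        if (count <= 0) count = std::thread::hardware_concurrency();
        return count;
    }

    int main() {
        std::cout << "auto-detected threads: " << detectThreadCount() << "\n";
        return 0;
    }
    ```

    Note that on hyperthreaded hosts /proc/cpuinfo reports logical CPUs, which can exceed the physical core count.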

     
  • Ravishankar Sundararaman

    Hi Michele,

    It seems like JDFTx detects the cores correctly, since the two output files have the right number in the header.

    The fact that you see ~55 threads in the second case seems to indicate that one of the libraries JDFTx is linked to is launching extra threads. The most likely offender is BLAS/LAPACK. What version of that are you linking to?

    JDFTx assumes that the standard BLAS is compiled in single-threaded mode; there is no clean way to automatically figure this out at compile time. This is true for the default ATLAS library on most distributions, but there might be variations.

    In the case of MKL, JDFTx is aware of the threaded layer in MKL and will use that instead whenever possible.

    Could you please provide some info about your BLAS installation on the two VMs?
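    For instance, a couple of ways to inspect this on Debian (the path to the jdftx binary below is a placeholder):

    ```shell
    # List installed ATLAS/BLAS/LAPACK packages (Debian/Ubuntu):
    # dpkg -l | grep -Ei 'atlas|blas|lapack'
    # Check which BLAS shared objects a binary actually loads;
    # replace the path with that of your jdftx build:
    # ldd /path/to/jdftx | grep -Ei 'atlas|blas|lapack'
    # As a runnable demonstration, ldd works on any dynamic executable:
    ldd /bin/ls | head -n 3
    ```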

    Thanks,
    Shankar

     
  • Ravishankar Sundararaman

    PS: I don't think this is because of cache coherence issues. It is more likely a core over-commitment problem.

     
  • Michele Renda

    Michele Renda - 2015-05-13

    Hello,

    The library I use is the default one provided by Debian 8 Jessie: libatlas-base-dev, version 3.10.2. I will check tomorrow how that library was compiled.

    Thank you

     
  • Michele Renda

    Michele Renda - 2015-05-14

    Indeed I think the problem is the threads launched by BLAS/LAPACK. Attached is a graph of the thread count during the first 20 min of a run on an 8-core virtual machine (5 s resolution).

    I will check the Debian library to understand what is happening.

    Regards
    Michele

     

    Last edit: Michele Renda 2015-05-14
  • Ravishankar Sundararaman

    Hi Michele,

    It does seem that the default ATLAS on Debian might be compiled with threading (in contrast to the default ATLAS on Ubuntu which is not).

    https://lists.debian.org/debian-science/2011/11/msg00029.html

    Unfortunately I can't think of an easy way to detect and handle this at run time (else I'd patch it!).

    I could, in principle, add a compile-time switch to handle threaded ATLAS the same way as I do MKL. But this would lead to a significant performance degradation compared to using a single-threaded ATLAS. This is because MKL has functions that allow adjusting the number of threads dynamically, so JDFTx can decide when to thread the linear algebra calls and when to call single-threaded linear algebra functions from each thread. ATLAS does not allow dynamically controlling threads, AFAIK.

    So the best options seem to be building ATLAS locally with threading disabled at compile time, or installing MKL (recommended if you have access to it).

    Best,
    Shankar

     
  • Michele Renda

    Michele Renda - 2015-05-15

    Hello,

    Thank you very much. I will prepare a complete guide on how to compile JDFTx and ATLAS (with threading disabled) on Debian 8.

    Best regards
    Michele

     
  • Ravishankar Sundararaman

    Thanks a bunch, Michele! We'll gladly incorporate it into our upcoming manual / guide.

    Cheers,
    Shankar