
A question about CPU usage of OSMPS, and how to enable parallel computing

Forum: Technical
Creator: Allen Wang
Created: 2016-10-25
Updated: 2016-11-05
  • Allen Wang

    Allen Wang - 2016-10-25

    While running a big simulation of a spin-1/2 interacting fermion system, I noticed from the Ubuntu system monitor that only one core of my 8-core CPU is actively working. Is this a problem with my computer, or is OSMPS designed to use only a single CPU core? My thought is that if I could somehow utilize all eight cores, the work could be done roughly eight times faster, right?
    So could anyone teach me (or point me to a reference on) how to enable parallel computing (multiprocessing?) with OSMPS? Sorry, I'm quite a novice in computer science :)

     
  • Daniel Jaschke

    Daniel Jaschke - 2016-10-25

    Hello Allen,

    yes, indeed you can use multiple processors on your machine, and it is not a problem with your computer. I assume that in your python script you are calling the "mps.runMPS" function, which is serial by default, after a prior call to "mps.WriteFiles".

    The standard way to parallelize the simulation is via Fortran MPI (Message Passing Interface).
    1) You need to install the corresponding package on your Ubuntu system; according to my setup that should be "libopenmpi-dev".

    If you are running the v_1.0.tar.gz downloaded from the openMPS SourceForge site:
    2a) You have to modify BuildOSMPS.py: compared to your current setting, the compiler should change to FC="mpif90" and the parallel flag should be set to True. If you have trouble there, let me know.
    3a) You have to set up your script with templates as shown in the "ParallelHCDipolar" example. This will generate a file ending in sbatch. From the sbatch file you can extract what you want to call from the command line. This should be something like "mpirun -n 8 Execute_MPSParallelMain YOUR_WRITE_DIRECTORY/...".

    If you have a newer version from subversion (svn):
    2b) You should be able to specify the command line option --os=unixmpi when calling BuildOSMPS.py.
    3b) Just replace "mps.WriteFiles" with "mps.WriteMPSParallelFiles" and comment out "mps.runMPS" (see the sketch below). The function will write all input files and generate a submission file for a cluster without the need to define a template. You have to provide some dummy cluster setup (example see below). You can again extract the necessary command line call from the submission file for the cluster.
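
    A minimal sketch of the replacement described in 2b)/3b), assuming the variable names of the HCDipolar example (params, Operators, H, PostProcess) and that MPSPyLib is imported as mps; the exact argument lists may differ between versions:

    import MPSPyLib as mps

    # ... build Operators, H, params and PostProcess as in the HCDipolar example ...

    # Serial workflow (default): write the input files, then run locally on one core.
    # MainFiles = mps.WriteFiles(params, Operators, H, PostProcess=PostProcess)
    # mps.runMPS(MainFiles)

    # Parallel workflow (3b): write the input files plus a cluster submission file
    # instead; the simulation itself is then launched from the command line, e.g.
    # "mpirun -n 8 Execute_MPSParallelMain YOUR_WRITE_DIRECTORY/...".
    comp_info = {'time' : '143:59:59',      # dummy wall time, only relevant on a cluster
                 'nodes' : ['000'],         # dummy node list, ignored for a local run
                 'ThisFileName' : 'FILENAME.py',
                 'myusername' : 'allenwang'}
    MainFiles = mps.WriteMPSParallelFiles(params, Operators, H, comp_info,
                                          PostProcess=PostProcess)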

    The speed-up in your case would be a factor of 7, since one of the eight cores is only used to distribute jobs.

    Let me know if there are any further questions or you run into any problems.

    Kind regards,

    Daniel Jaschke

    P.S. For the recent svn versions the HCDipolar example needs to be adapted: instead of calling WriteMPSParallelFiles, it should call WriteMPSParallelTemplate.
    P.P.S. Here is an example of the dictionary for the cluster setup. Since you run from the command line, you can use it as a placeholder/dummy:

    MainFiles = mps.WriteMPSParallelFiles(params, Operators, H,
                                          {'time' : '143:59:59',
                                           'nodes' : ['000'],
                                           'ThisFileName' : 'FILENAME.py',
                                           'myusername' : 'allenwang'},
                                          PostProcess=PostProcess)

     
    • Allen Wang

      Allen Wang - 2016-10-28

      Hi,
      Thanks for the reply. I have MPI built successfully, but ran into some trouble when running ParallelHCDipolar.py.
      1. In the python file there is

             comp_info = {'computer' : 'mio',
                          'nodes' : '4',
                          'ThisFileName' : os.path.abspath(__file__)}

         Must 'computer' be set to 'mio'? What does it mean? What does 'nodes' mean? And what does os.path.abspath(__file__) mean? Should __file__ be changed to the absolute path of ParallelHCDipolar.py?

      2. My understanding is that when running python ParallelHCDipolar.py, the program only writes the Fortran inputs into TMP/; to run the main simulation we have to call
         mpirun -n 8 Execute_MPSParallelMain TMP/HCDipolar. Is this correct?

      3. When I run python ParallelHCDipolar.py, I get
      /software/OSMPS_4NRM/MPSPyLib/MPO.py:1118: RuntimeWarning: overflow encountered in power
        val=val+p[2*n]*(p[2*n+1]**(x-1))
      /software/OSMPS_4NRM/MPSPyLib/MPO.py:1118: RuntimeWarning: overflow encountered in multiply
        val=val+p[2*n]*(p[2*n+1]**(x-1))
      ('resid', 1.4624604586908328e-10)
      ('Number of exponentials used to fit infinite function', 5)
      -1.0
      mu\sum_i nbtotal_i
      -1.0
      t\sum_i bdagger_i b_{i+1} + -1.0t\sum_i b_i bdagger_{i+1}
      0.000294304368234
      U\sum_{j<i} nbtotal_j*0.928454528782^{i-j-1}*nbtotal_i + 0.630813321853U\sum_{j<i} nbtotal_j*0.0309917562456^{i-j-1}*nbtotal_i + 0.297622551337U\sum_{j<i} nbtotal_j*0.22389189711^{i-j-1}*nbtotal_i + 0.00701784373998U\sum_{j<i} nbtotal_j*0.771871803326^{i-j-1}*nbtotal_i + 0.0642519786466U\sum_{j<i} nbtotal_j*0.515541942488^{i-j-1}*nbtotal_i

      Is the overflow error fatal?

       

      Last edit: Allen Wang 2016-10-28
  • Daniel Jaschke

    Daniel Jaschke - 2016-10-28

    Hello Allen,

    1) The intention of this example was not to run MPI locally but on a high-performance computing cluster, so you feed in some information for that setup which does not matter for your local MPI run. None of this information, i.e. "computer", "nodes", etc., will affect you as long as it does not throw an error. Let me know if it does.
    2) Yes, after running the python file without errors, you can execute "mpirun -n 8 Execute_MPSParallelMain TMP/HCDipolar". (You can also try "mpirun -n 9" and see if it is faster. Then you have more tasks than cores, but one task only distributes jobs and should not need much CPU.)
    3) Overflow warning: HCDipolar has long-range interactions which are fitted with exponentials so that we can use them in MPS. The overflow here should come from the fitting procedure trying too large a parameter. The quality of the fit can be seen from the residual, which in your example output is around 1e-10, which is good. So I see no problems arising from that. (An illustrative sketch of such a fit follows below.)
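
    Not the OSMPS routine itself, but a minimal numpy/scipy sketch of what "fitting a long-range interaction with a sum of exponentials" means; every name here is made up for the illustration, and the 1/r^3 tail stands in for the dipolar interaction:

    import numpy as np
    from scipy.optimize import curve_fit

    def exp_sum(r, *p):
        # p = (a_1, x_1, a_2, x_2, ...): returns sum_k a_k * x_k**(r - 1),
        # which is the decaying form an MPO can represent exactly.
        val = np.zeros_like(r)
        for kk in range(0, len(p), 2):
            val = val + p[kk] * p[kk + 1]**(r - 1.0)
        return val

    rr = np.arange(1.0, 201.0)     # distances on the lattice
    target = 1.0 / rr**3           # dipolar 1/r^3 tail to be approximated
    n_exp = 5                      # number of exponentials, as in the output above

    # crude initial guesses; amplitudes stay non-negative and decay rates are
    # bounded by 1, so this sketch cannot run into the overflow seen above
    p0 = []
    for kk in range(n_exp):
        p0 += [1.0 / n_exp, 0.3 + 0.6 * kk / n_exp]
    popt, _ = curve_fit(exp_sum, rr, target, p0=p0,
                        bounds=(0.0, [np.inf, 1.0] * n_exp))

    print('resid', np.max(np.abs(exp_sum(rr, *popt) - target)))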

    Let me know if that works out for you.

    Kind regards,

    Daniel

    P.S. For the sake of completeness, a description of the variables you were asking about: "mio" is the name of our cluster, and by going deeper into the python modules one could implement support for another cluster. "nodes" is how many nodes on the cluster you request. Variables starting with two underscores are usually internal python variables, so __file__ contains the name of the script itself and does not have to be changed.
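
    For example, a two-line script shows the behavior (assuming it is saved as, say, show_path.py):

    import os

    # __file__ holds the path of the script being executed; os.path.abspath
    # turns it into an absolute path, e.g. /home/allen/show_path.py
    print(os.path.abspath(__file__))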

     
    • Allen Wang

      Allen Wang - 2016-11-02

      Thanks. One more question:
      What if I want to parallelize over the Abelian quantum number (the particle number N in a number-conserving model)?
      My model has 50 fermionic lattice sites, so I want to find the ground states with 1 to 49 particles; how do I parallelize over N?
      I tried this:
      parameters_template = {
          ...
          'Abelian_generators' : ['nftotal'],
          #'Abelian_quantum_numbers' : [Nlist],
          ...
      }
      Nlist = range(1, 50)
      iterparams = ['Abelian_quantum_numbers']
      iters = [Nlist]
      but it says
      Exception: Abelian_quantum_numbers must be supplied when Abelian_generators are supplied!

       
  • Daniel Jaschke

    Daniel Jaschke - 2016-11-02

    Hello Allen,

    could you please tell me which version you are running? Your question is getting specific, so it simplifies things if I can run exactly the same version of openMPS as you do.

    (a) You went to the SourceForge page of openMPS and used the green download button. That alone would already be enough information for me.
    (b) You installed subversion (svn) and used the command "svn checkout svn://svn.code.sf.net/p/openmps/code/ openmps-code". Then please send me the revision number from "svn info", i.e. the line "Revision: ??".

    I was trying to get it working anyway, but I ran into different errors. The iters variable should be a nested list, e.g. for 1 to 10 particles (the outer list contains the different parameters you want to parallelize over, the inner lists contain all the values for one parameter; and the quantum numbers themselves have to be a list):
    iters = [[[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]]

    What you currently have, iters = [Nlist], is a nested list of a different kind:
    iters = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]

    But this alone will not resolve the problem completely, so please post which version you are working with. Thank you very much.

    Kind regards,

    Daniel

    P.S. Quick help to get the nested list of the first type:

    qiters = []
    for ii in range(1, 50):
        qiters.append([ii])
    iters = [qiters]
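
    Equivalently, as a one-line comprehension (just a compact rewrite of the loop above):

    # one quantum-number list [ii] per simulation, for 1 to 49 particles
    iters = [[[ii] for ii in range(1, 50)]]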

     
    • Allen Wang

      Allen Wang - 2016-11-02

      Hi, I ran the command as you said:
      svn checkout svn://svn.code.sf.net/p/openmps/code/ openmps-code
      and I got
      Checked out revision 68.
      Is this what you mean?

       
  • Daniel Jaschke

    Daniel Jaschke - 2016-11-02

    Hello Allen,

    yes, that helps. I made the following changes in the ParallelHCDipolar file to get it working with the number of particles specified; a consolidated sketch of the changed fragment follows the list. I attached the file as well so you can check it easily.

    1) We do not need the chemical potential anymore; it just adds a constant. Therefore, you can comment out the line:
    #H.AddMPOTerm(Operators,'site','nbtotal',hparam='mu',weight=-1.0)

    2) For the finite size system I set the system size to 10 (instead of two for the unit cell of iMPS)
    L = 10

    3) Set up the number of particles as the iteration parameter. You can delete the lines from "mumin=0.5" to "iters=[muiter,titer]" and add instead:
    qiter = []
    for ii in range(1, 11):
        qiter.append([ii])
    iterparams = ['Abelian_quantum_numbers']
    iters = [qiter]

    4) Delete/comment the line specifying the iMPS simulation
    # 'simtype' : 'Infinite',

    5) We do not iterate over the tunneling parameter t anymore, so we have to add it to the dictionary template:
    't' : 0.5,
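
    Putting the five changes together, the affected fragment of the script might look as follows; this is only a sketch, and the Operators, H, and the remaining entries of the parameter dictionary stay exactly as in the original ParallelHCDipolar example:

    # 1) chemical potential term removed, it only adds a constant
    # H.AddMPOTerm(Operators, 'site', 'nbtotal', hparam='mu', weight=-1.0)

    # 2) finite system of 10 sites instead of the two-site iMPS unit cell
    L = 10

    # 3) iterate over the total particle number via the Abelian quantum number
    qiter = []
    for ii in range(1, 11):
        qiter.append([ii])
    iterparams = ['Abelian_quantum_numbers']
    iters = [qiter]

    # 4) 'simtype' : 'Infinite' removed from the parameter template (finite MPS)
    # 5) fixed tunneling added to the parameter template, since we no longer
    #    iterate over it:  't' : 0.5,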

    Then I can run it with revision 68 (you might want to recompile the Fortran libraries with "python BuildOSMPS.py --local='./' --os=unixmpi --clean" to be sure you have r68). I did not spot any mistake besides the nested list in your setup, but you can attach your file if you want me to check it.

    Best regards,

    Daniel

     
    • Allen Wang

      Allen Wang - 2016-11-02

      Thanks again. A strange problem occurred: when I run mpirun, I find that the TMP files are suddenly deleted!
      The following is all that I do:
      python ParallelHCDipolar.py
      mpirun -n 8 Execute_MPSParallelMain TMP/HCDipolar11
      it says
      At line 169 of file /home/lagrenge/software/OSMPS_4NRM/MPSFortLib/Mods/ParallelOps.f90 (unit = 16)
      Fortran runtime error: Cannot open file ' TMP/HCDipolar11Abelian_quantum_numbers[3]Main.nml': No such file or directory

      Screenshots are attached.

       
      • Allen Wang

        Allen Wang - 2016-11-03

        And when I run the v1.0 version downloaded from SourceForge, the same problem occurs:
        At line 169 of file /home/lagrenge/bin/OpenSourceMPS_v1.0/MPSFortLib/Mods/ParallelOps.f90 (unit = 16)
        Fortran runtime error: Cannot open file ' TMP/HCDipolart0.0179591836735mu0.5Main.nml': No such file or directory


        Primary job terminated normally, but 1 process returned
        a non-zero exit code.. Per user-direction, the job has been aborted.

        mpirun detected that one or more processes exited with non-zero status, thus causing
        the job to be terminated. The first process to do so was:

        Process name: [[30106,1],2]
        Exit code: 2
        My two computers give similar errors.

         
  • Daniel Jaschke

    Daniel Jaschke - 2016-11-03

    Hello Allen,

    I'll try to get revision 68 working for you, since you already know how to use svn. The error you describe looks like you are working with revision 67 (I ran revision 67 as a test). Let's make a quick check that svn didn't mess things up:

    1) Please open the file "MPSFortLib/Mods/ParallelOps.f90" and tell me whether line 169 is "OPEN(UNIT=jpunit,FILE=nmlname)" or "!Send rank to the master to let it know you are finished". The first corresponds to revision 67, the latter to revision 68.
    2) If you have revision 68, please recompile with "python BuildOSMPS.py --local='./' --os=unixmpi --clean" to be sure that the executable is based on revision 68, and try our test example again.
    3) If you have revision 67, run "svn update" again and let me know if you spot any error messages or warnings. Use the ParallelOps.f90 file to check whether it has been updated.

    I hope that 1) and 2) will resolve the problem.

    Best regards,

    Daniel

     
    • Allen Wang

      Allen Wang - 2016-11-04

      Could you tell me where to get revision 68? It seems that my version is not even r67, since I've checked line 1040 of MPSPyLib/tools.py; mine is
      1040 hstr += str(p['MPDO+'])
      1041
      1042 if(hasattr(HamiltonianMPO, '__len__')):
      when I run "svn update" I get:
      ~/bin/OSMPS_4NRM$ svn update
      Skipped '.'
      svn: E155007: None of the targets are working copies
      My email: lagrenge@rice.edu
      It would be much simpler if you can send me that zip file of r68, thanks :)

       
  • Daniel Jaschke

    Daniel Jaschke - 2016-11-04

    Hello Allen,

    voilà, revision 68 is attached. Let me know if that finally works.

    Best regards,

    Daniel

     
    • Allen Wang

      Allen Wang - 2016-11-04

      No, even the unmodified example file ParallelHCDipolar doesn't work. No parallel job has ever run to completion on either of my two computers so far: they normally run for a while, then some files in TMP/ are deleted and the whole program stops. Thanks for your patience anyway.

      mpirun -n 5 Execute_MPSParallelMain TMP/HCDipolar

      At line 169 of file /home/lagrenge/bin/OSMPS_4NRM/MPSFortLib/Mods/ParallelOps.f90 (unit = 16)
      Fortran runtime error: Cannot open file ' TMP/HCDipolart0.01mu0.5Main.nml': No such file or directory


      Primary job terminated normally, but 1 process returned
      a non-zero exit code.. Per user-direction, the job has been aborted.

      mpirun detected that one or more processes exited with non-zero status, thus causing
      the job to be terminated. The first process to do so was:

      Process name: [[18142,1],1]
      Exit code: 2

       
  • Daniel Jaschke

    Daniel Jaschke - 2016-11-05

    Hello Allen,

    I would say there are only two options left:

    a) Try calling "mpirun -n 5 ./Execute_MPSParallelMain TMP/HCDipolar" with a dot-slash in front of Execute_MPSParallelMain. In that case the culprit would be a global installation which is still the old version and always throws the same error no matter what you do locally. If it is this, I'm sorry that I was blind to it for quite some time.

    b) Otherwise, we should maybe schedule a Skype call if that is possible. I don't think the forum will lead us to a solution.

    Kind regards,

    Daniel

     
    • Allen Wang

      Allen Wang - 2016-11-05

      Cool! That final suggestion a) worked. Thanks a lot!

       
