Re: [Gfs-users] problem with communications of nodes in parallel simulations
From: Pascal R. <ra...@lm...> - 2011-03-26 16:26:18
Hi Xiaodong,

yes I can, but it is not exactly the subject of this list. I had submitted my problem to the Rocks list but never got a solution; in the end I upgraded SGE from 6.2u2 to 6.2u5, and that fixed it, because 6.2u2 is buggy with InfiniBand even though some people still use it. Here is what my problem was:

> Hi,
>
> I configured a cluster with Rocks 5.2, SGE 6.2u2, OFED 1.4 and OpenMPI 1.4.2.
> I am using two networks, standard Gigabit and InfiniBand, and I have a problem
> with SGE: it is not possible to use InfiniBand under SGE without the tcp protocol.
>
> If I run a job directly with mpirun from one node and exclude openib:
> mpirun --mca btl ^openib -n 32 -machinefile /home/ray/hostfile $(pwd)/hello &
> everything is OK, but it is not the fastest option.
>
> Now I do the same and exclude tcp:
> mpirun --mca btl ^tcp -n 32 -machinefile /home/ray/hostfile $(pwd)/hello &
> everything is OK and very efficient; this is what I want.
>
> Now I go back to the frontend and submit with SGE via qsub, excluding openib:
> everything is OK, but it is not the fastest option.
>
> Now I do the same, submitting with SGE via qsub and excluding tcp:
> the cores cannot communicate and I get the error below. I know that is not
> the right explanation; something is missing under SGE. Do I need to activate
> something for the sdp protocol? Is it normal that, by default, the btl layer
> of OpenMPI always activates tcp under SGE but not with a direct mpirun?

I should point out that the problem only occurred when using more than one node. A sketch of the kind of SGE job script this amounts to is included at the end of this message, below the quoted thread.

best,
Pascal

On Sat, 26 Mar 2011 11:05:06 -0400 (EDT)
"Chen, Xiaodong" <xch...@ma...> wrote:

> Pascal,
>
> Could you please tell us more about your experience with batch-system problems when you use multiple nodes? Thanks.
>
> Xiaodong
>
> Sent from my iPhone
>
> On Mar 26, 2011, 9:11 AM, Pascal Ray <ra...@lm...> wrote:
>
> > Hi Armin,
> >
> > what exactly is your problem?
> > did you try removing these two lines at the beginning of your parallel gfs file?
> > # when editing this file it is recommended to comment out the following line
> > GfsDeferredCompilation
> >
> > Also, in a previous message I saw that you forgot to split your file before
> > using the -b or -p options of gerris, for example:
> >
> > gerris2D -s 2 one_box.gfs > s2_one_box.gfs
> > this splits the file twice and you will get 16 boxes.
> >
> > gerris -b 4 s2_one_box.gfs > b4s2_one_box.gfs
> > this distributes your 16 boxes over 4 processors.
> >
> > Last, what does your cluster run: Linux + Rocks?
> > your network, InfiniBand?
> > your MPI library, OpenMPI?
> > your batch system, SGE?
> > From my experience many problems come from the batch system, especially when
> > you want to use more than one node.
> >
> > Best,
> > Pascal
> >
> > On Sat, 26 Mar 2011 11:55:01 +0100
> > Armin Ghadjardjazi <ar...@gm...> wrote:
> >
> >> Hi Everyone,
> >>
> >> has anyone come up with a solution to the hang-up of parallel
> >> simulations on multiple nodes? Our people in the IT division have not
> >> been able to find one yet.
> >>
> >> Any suggestions are welcome.
> >>
> >> thanks,
> >> Armin
> >
> > --
> > Pascal Ray <ra...@lm...>
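For reference, here is a minimal sketch of the kind of SGE job script this amounts to. It assumes Open MPI was built with gridengine (SGE) support and that a parallel environment named "orte" exists; the PE name, job name and gfs file name are examples only and should be adapted to your cluster and case.

#!/bin/bash
# hypothetical example: adapt the PE name, slot count and gfs file to your site
#$ -N gerris_parallel
#$ -cwd
#$ -j y
#$ -pe orte 32
# explicitly select the InfiniBand, shared-memory and self BTLs instead of
# excluding tcp, so the run can never fall back to Gigabit
mpirun --mca btl openib,sm,self -np $NSLOTS gerris2D b4s2_one_box.gfs

With gridengine support enabled, mpirun picks up the node allocation directly from SGE, so no -machinefile is needed under qsub.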
--
Pascal Ray <ra...@lm...>