[swash-users] Upper bound error in mpi mode
                
      From:  <mj...@nh...> - 2018-05-10 13:43:50
      
     
Hi everyone,

I'm building a SWASH model to simulate wave propagation along the Varna coast, Bulgaria. My computational grid is 1900 m x 1230 m with a spatial resolution of 1 m. The offshore depth is about 8 m. The model runs normally in serial mode and in OpenMPI mode, but when I try to run it in MPICH2 mode, it aborts with segmentation faults:

    mpirun -np 2 ./swash.exe
    SWASH is preparing computation
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image              PC                Routine            Line        Source
    swash.exe0         0000000000793C54  Unknown            Unknown     Unknown
    libpthread-2.17.s  00002AD5D048C5E0  Unknown            Unknown     Unknown
    swash.exe0         000000000070E591  swexchg_           3152        swanparll.f
    swash.exe0         0000000000447C4E  swashcheckprep_    752         SwashCheckPrep.f90
    swash.exe0         00000000004048B0  swashmain_         151         SwashMain.f90
    swash.exe0         000000000040466D  MAIN__             63          Swash.f90
    swash.exe0         00000000004044DE  Unknown            Unknown     Unknown
    libc-2.17.so       00002AD5D0BC0C05  __libc_start_main  Unknown     Unknown
    swash.exe0         00000000004043E9  Unknown            Unknown     Unknown
    forrtl: severe (174): SIGSEGV, segmentation fault occurred
    Image              PC                Routine            Line        Source
    swash.exe0         0000000000793C54  Unknown            Unknown     Unknown
    libpthread-2.17.s  00002B81E4F245E0  Unknown            Unknown     Unknown
    swash.exe0         000000000070E591  swexchg_           3152        swanparll.f
    swash.exe0         0000000000447C4E  swashcheckprep_    752         SwashCheckPrep.f90
    swash.exe0         00000000004048B0  swashmain_         151         SwashMain.f90
    swash.exe0         000000000040466D  MAIN__             63          Swash.f90
    swash.exe0         00000000004044DE  Unknown            Unknown     Unknown
    libc-2.17.so       00002B81E5658C05  __libc_start_main  Unknown     Unknown
    swash.exe0         00000000004043E9  Unknown            Unknown     Unknown
    rank 1 in job 6 localhost.localdomain_43660 caused collective abort of all ranks
      exit status of rank 1: return code 174
    rank 0 in job 6 localhost.localdomain_43660 caused collective abort of all ranks
      exit status of rank 0: return code 174

I thought it might be a stack overflow, so I tried two things. The first was setting "ulimit -s unlimited" in the environment; the second was adding "-heap-arrays" to FFLAG. Neither helped.

Then I added "-check all" to FFLAG, and the run now stops with the following error:

    forrtl: severe (408): fort: (2): Subscript #1 of the array IBLKAD has value 6372 which is greater than the upper bound of 6371
    Image              PC                Routine            Line        Source
    libifcoremt.so.5   00002B8870C2C949  for_emit_diagnost  Unknown     Unknown
    swash.exe          00000000029AAC1D  swexchg_           3152        swanparll.f
    swash.exe          000000000056D4E5  swashcheckprep_    752         SwashCheckPrep.f90
    swash.exe          0000000000404B9B  swashmain_         151         SwashMain.f90
    swash.exe          0000000000404687  MAIN__             63          Swash.f90
    swash.exe          0000000000403DDE  Unknown            Unknown     Unknown
    libc-2.17.so       00002B8872B19C05  __libc_start_main  Unknown     Unknown
    swash.exe          0000000000403CE9  Unknown            Unknown     Unknown

This shows that the array IBLKAD is indexed one element past its upper bound in routine swexchg (swanparll.f, line 3152), which is called from SwashCheckPrep.f90 at line 752. I'm a beginner with SWASH and have never studied the code in detail, so I don't know why this happens only in parallel mode. If you have any ideas or have encountered a similar problem, please help me!

Thanks!

------------------------------------------------------
Mengjie Xiong
Engineer
Nanjing Hydraulic Research Institute