From: Benjamin S. K. <be...@cf...> - 2003-09-10 22:55:57
[I posted this to the devel-list so we would have it in the archives]

How are these machines connected? Is it 100 Mbit ethernet, or something
faster?

A 480 MB mesh is pretty big by my standards. It would be interesting to
see how long it takes meshtool to simply _read_ that mesh, and see how
that compares to the time it took to read the mesh in that parallel
simulation. My guess is that all 15 processors were slamming a
fileserver at the same time, requiring 15*480 MB = 7.2 GB of data to be
transferred.

Really, this poor performance is due to a lack of imagination on my
part. I was thinking of something like 50 MB as a big mesh. I will
re-work the read() method so that processor 0 is the only one that
opens and reads the file. It will then broadcast the data to all the
other processors. This should be _much_ faster than the current
implementation (a rough sketch of the idea is at the end of this
message). If you could, though, see how long it takes meshtool to
simply read the mesh.

Another option would be to remove all the non-local elements on each
processor. This could be a viable option for reducing memory overhead
when AMR is not required. Once the mesh is parallelized it won't be
such an issue.

...which brings me to your question: What are the future plans?

1.) Obviously, parallelizing the mesh is the biggest thing to allow
scalability to _many_ (i.e. hundreds of) processors. This will require
a fair amount of work, and I'm putting it off for now.

2.) Other than that, I think many major features are already
implemented. There are some small improvements that can be made, like
moving new points to a user-supplied geometry during AMR, robust
smoothers, more shape functions, etc... A lot of good stuff is in there
now. Maybe some of it could be optimized.

3.) Increase the user base. I think that the library is pretty stable
now. More users will help make it better, and they will request more
features. We should improve the web page to include links to
presentations & other material that uses the library. You know, "eye
candy" -- stuff to get people excited by saying "yeah, this _can_ be
used for big stuff."

4.) Whatever others want (within reason) :-)

Let me know what you think.
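Roughly, I am thinking of something like this for the read() re-work --
only a sketch to show the idea; the function name and the raw-buffer
handling are made up here, not what will actually get committed:

   #include <mpi.h>
   #include <fstream>
   #include <string>
   #include <vector>

   // Rank 0 reads the whole file into memory; everyone else gets the
   // bytes via MPI_Bcast, so the file system sees 1 reader instead of 15.
   std::vector<char> read_on_rank_0 (const std::string& name, MPI_Comm comm)
   {
     int rank = 0;
     MPI_Comm_rank (comm, &rank);

     int size = 0;
     std::vector<char> buffer;

     if (rank == 0)
       {
         std::ifstream in (name.c_str(), std::ios::binary);
         in.seekg (0, std::ios::end);
         size = static_cast<int>(in.tellg());
         buffer.resize (size);
         in.seekg (0, std::ios::beg);
         in.read (&buffer[0], size);
       }

     // First tell everyone how big the file is, then ship the bytes.
     MPI_Bcast (&size, 1, MPI_INT, 0, comm);
     buffer.resize (size);
     if (size > 0)
       MPI_Bcast (&buffer[0], size, MPI_CHAR, 0, comm);

     return buffer; // every processor now parses this, but from memory
   }

The parsing would then happen identically on every processor, just from
an in-memory buffer instead of from 15 concurrent reads of the file.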
-Ben

Daniel Dreyer <d.d...@tu...> wrote:

> Hello there,
>
> In case of interest: find attached a log of a job I finished last
> weekend, solving a linear complex-valued system with approx 1e6 dof
> on 15 procs (workstations P4 1.4 GHz, each 1 GB RAM; they were
> starting to swap 200 MB+).
>
> First I used 16, but the disk of the 16th was _so_ slow (reading the
> mesh) that it forced all other procs to wait at the next communication
> barrier, SystemBase/FrequencySystem::init(). I was wondering _all_ the
> time what the heck took so long to initialize _two_ PETSc vectors...
>
> I am truly impressed by the performance of the assemble routine.
> You guys did a _great_ job with libMesh.
>
> Ben: thanks for the profiling hint again, but even from this log
> (only proc 0) it appears that it's not worth spending time on
> re-working the InfFE for speed. They are pretty much OK, I'd say.
>
> Instead, the Mesh::read() (a Universal file, size 480 MB -- all Tet4)
> took _very_ long. I reworked the importer a bit so that it does
> not need a tmp-file any more. However, it still needs a std::map for
> the foreign node ids. But it got faster, especially with larger files,
> 20%+ or so. -- Just committed it a few minutes ago. I am satisfied
> so far with this importer. In case anybody wants a faster one, she
> should use the binary formats. UNV is designed more for a comfy,
> rather idiot-proof interface.
>
> libMesh has come a long way. What are your future plans?
>
> Regards
>    Daniel
>
> |
> | Daniel Dreyer, SM (MIT)
> | Mechanics and Ocean Engineering
> | TUHH, Germany
>
> ------------------------------------------------------------------------
>
> time mpirun -machinefile /profile/cdd/mumcluster/15g1_active_machines -np 15 thea -i t627240_m6.in --inf-elem-order 8 --inf-elem-family j20 -v -ksp_type tfqmr -log_summary | tee --append nlog_np-15_f-j20_o-8_c-t627240_m6.in_s-tfqmr.log
> .
> --------------------------------------------------------------------------
> Petsc Version 2.1.6, Patch 1, Released Aug 06, 2003
>    The PETSc Team
>    pet...@mc...
>    http://www.mcs.anl.gov/petsc/
> See docs/copyright.html for copyright information
> See docs/changes/index.html for recent updates.
> See docs/troubleshooting.html for problems.
> See docs/manualpages/index.html for help.
> Libraries linked from /mum/code/bibs_gcc-3.2.3/petsc-2.1.6/lib/libO_complex/cluster
> --------------------------------------------------------------------------
>  Input file name = t627240_m6.in
>  Parsing input file.
>  Finished parsing input file.
>
>  Options from command line override input file data.
>
>  Settings:
>    Mesh file                = torpedo_627240.unv
>    Dimension                = 3
>    Boundary condition file  = bcs_torpedo_627240_m6.unv
>    FE polynomial family     = Lagrange
>    FE polynomial order      = 1
>    InfFE polynomial family  = Jacobi(2,0)
>    InfFE polynomial order   = 8
>    Build infinite elements:
>      Mesh x-symmetric       = 0
>      Mesh y-symmetric       = 0
>      Mesh z-symmetric       = 0
>    Fluid density            = 1026
>    Wave speed               = 1500
>    Solution mode            = one large matrix
>    Frequencies              = 2064
>    Equation systems file    = c_ieo8_ief12.sys_t627240_m6.dat
>    Equation systems mode    = BINARY
>    Output GMV file          = c_ieo8_ief12.res_t627240_m6.gmv
>
>  Mesh Information:
>   mesh_dimension()=3
>   spatial_dimension()=3
>   n_nodes()=627240
>   n_elem()=3508416
>   n_local_elem()=3508416
>   n_subdomains()=1
>   n_processors()=15
>   processor_id()=0
>
>  MeshData Information:
>   object activated.
>   Element associated data initialized.
>    n_val_per_elem()=0
>    n_elem_data()=0
>   Node associated data initialized.
>    n_val_per_node()=3
>    n_node_data()=30930
>
>  Reducing memory usage of MeshData object.
>
>  Origin for Infinite Elements:
>   determined x-coordinate
>   determined y-coordinate
>   determined z-coordinate
>   coordinates: 0.00000 0.00000 0.00000
>
>  Verbose mode disabled in non-debug mode.
>
>  Mesh Information:
>   mesh_dimension()=3
>   spatial_dimension()=3
>   n_nodes()=681850
>   n_elem()=3617632
>   n_local_elem()=242699
>   n_subdomains()=1
>   n_processors()=15
>   processor_id()=0
>
> Using PETSc as solver interface.
>
>  EquationSystems
>   n_systems()=1
>    System "Helmholtz"
>     Type "Frequency"
>     Variables="p"
>     Finite Element Types="0", "12"
>     Infinite Element Mapping="0"
>     Approximation Orders="1", "8"
>     n_dofs()=1064120
>     n_local_dofs()=77590
>     n_additional_vectors()=0
>     n_additional_matrices()=0
>     n_parameters()=7
>     Parameters:
>      "current frequency"=2064
>      "density"=1026
>      "frequency 0000"=2064
>      "linear solver maximum iterations"=50000
>      "linear solver tolerance"=1e-10
>      "n_frequencies"=1
>      "speed"=1500
>
>  Solution phase for 1 frequencies:
>
>    Solving for frequency No. 0.
>
>    RHS Assembly:
>      prescribed normal velocity b.c.
>        on 4609 faces
>
>  ----------------------------------------------------------------------------
> | Processor id:   0
> | Num Processors: 15
> | Time:           Sun Sep 7 22:41:59 2003
> | OS:             Linux
> | HostName:       amazonas
> | OS Release      2.4.10-4GB-SMP
> | OS Version:     #1 SMP Sam Feb 16 20:03:19 CET 2002
> | Machine:        i686
> | Username:       cdd
>  ----------------------------------------------------------------------------
>  ----------------------------------------------------------------------------
> | Helmholtz::assemble_large Performance: Alive time=116.976, Active time=109.275
>  ----------------------------------------------------------------------------
> | Event                     nCalls   Total     Avg        Percent of   |
> |                                    Time      Time       Active Time  |
> |-----------------------------------------------------------------------|
> |                                                                       |
> | fe_integration            234499   9.1792    0.000039    8.40         |
> | fe_matrix_insertion       234499   2.0325    0.000009    1.86         |
> | inf_fe_integration        8200     93.9242   0.011454   85.95         |
> | inf_fe_matrix_insertion   8200     1.4653    0.000179    1.34         |
> | init                      1        0.2013    0.201329    0.18         |
> | rhs                       234499   2.4722    0.000011    2.26         |
>  ----------------------------------------------------------------------------
> | Totals:                   719898   109.2747             100.00        |
>  ----------------------------------------------------------------------------
>
>    Iterations :                   2690
>    Final preconditioned residual: 5.94419e-05
>
>    Finished solution for f = 2064.00.
>
>    Writing equation systems to file c_ieo8_ief12.sys_t627240_m6.dat.
>
>    Writing GMV data to file c_ieo8_ief12.res_t627240_m6.gmv.
>
> ************************************************************************************************************************
> ***               WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document          ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /daten/cdd/cluster_t_240803_1/thea on a linux named amazonas with 15 processors, by cdd Sun Sep 7 23:22:54 2003
> Using Petsc Version 2.1.6, Patch 1, Released Aug 06, 2003
>
>                        Max        Max/Min   Avg        Total
> Time (sec):            3.503e+03  1.10353   3.206e+03
> Objects:               0.000e+00  0.00000   0.000e+00
> Flops:                 2.765e+11  1.30121   2.439e+11  3.659e+12
> Flops/sec:             8.699e+07  1.30635   7.608e+07  1.141e+09
> MPI Messages:          3.769e+04  3.50000   2.728e+04  4.092e+05
> MPI Message Lengths:   5.523e+08  2.00500   1.756e+04  7.185e+09
> MPI Reductions:        5.394e+02  1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:  ----- Time ------    ----- Flops -----   --- Messages ---   -- Message Lengths --   -- Reductions --
>                       Avg      %Total      Avg      %Total    counts   %Total     Avg        %Total      counts  %Total
>  0: Main Stage:     3.2058e+03 100.0%    3.6592e+12 100.0%   4.092e+05 100.0%   1.756e+04   100.0%      8.091e+03 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops/sec: Max - maximum over all processors
>                        Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation.
>           Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>       ##########################################################
>       #                                                        #
>       #                       WARNING!!!                       #
>       #                                                        #
>       #   This code was run without the PreLoadBegin()         #
>       #   macros. To get timing results we always recommend    #
>       #   preloading. otherwise timing numbers may be          #
>       #   meaningless.                                         #
>       ##########################################################
>
> Event              Count       Time (sec)        Flops/sec       --- Global ---     --- Stage ---      Total
>                   Max Ratio   Max       Ratio   Max      Ratio  Mess    Avg len Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecDot            5380 1.0  4.2393e+02   2.6  2.01e+07  2.8  0.0e+00 0.0e+00 5.4e+03  10  1   0  0 66   10  1   0  0 66     108
> VecNorm           2691 1.0  3.9828e+02   2.9  1.17e+07  3.0  0.0e+00 0.0e+00 2.7e+03  10  1   0  0 33   10  1   0  0 33      58
> VecCopy              4 1.0  7.9661e-03   1.3  0.00e+00  0.0  0.0e+00 0.0e+00 0.0e+00   0  0   0  0  0    0  0   0  0  0       0
> VecSet            5387 1.0  7.6299e+00   1.2  0.00e+00  0.0  0.0e+00 0.0e+00 0.0e+00   0  0   0  0  0    0  0   0  0  0       0
> VecAXPY          10760 1.0  2.2394e+01   1.3  3.63e+08  1.3  0.0e+00 0.0e+00 0.0e+00   1  3   0  0  0    1  3   0  0  0    4090
> VecAYPX           5380 1.0  1.1660e+01   1.3  3.55e+08  1.2  0.0e+00 0.0e+00 0.0e+00   0  1   0  0  0    0  1   0  0  0    3928
> VecWAXPY         10758 1.0  3.2496e+01   1.4  2.39e+08  1.3  0.0e+00 0.0e+00 0.0e+00   1  2   0  0  0    1  2   0  0  0    2466
> VecAssemblyBegin     2 1.0  6.2421e-02   1.6  0.00e+00  0.0  7.6e+01 1.8e+03 6.0e+00   0  0   0  0  0    0  0   0  0  0       0
> VecAssemblyEnd       2 1.0  8.2004e-05   6.3  0.00e+00  0.0  0.0e+00 0.0e+00 0.0e+00   0  0   0  0  0    0  0   0  0  0       0
> VecScatterBegin   5381 1.0  2.7273e+02  65.6  0.00e+00  0.0  4.1e+05 1.7e+04 0.0e+00   3  0 100 99  0    3  0 100 99  0       0
> VecScatterEnd     5381 1.0  3.3427e+02  27.7  0.00e+00  0.0  0.0e+00 0.0e+00 0.0e+00   3  0   0  0  0    3  0   0  0  0       0
> MatMult           5381 1.0  1.1103e+03   1.5  1.72e+08  1.8  4.1e+05 1.7e+04 0.0e+00  26 46 100 99  0   26 46 100 99  0    1532
> MatSolve          5381 1.0  6.0269e+02   1.5  3.03e+08  1.5  0.0e+00 0.0e+00 0.0e+00  15 46   0  0  0   15 46   0  0  0    2766
> MatLUFactorNum       1 1.0  1.8581e+00   1.7  3.51e+08  1.9  0.0e+00 0.0e+00 0.0e+00   0  0   0  0  0    0  0   0  0  0    2619
> MatILUFactorSym      1 1.0  4.5264e+00  35.1  0.00e+00  0.0  0.0e+00 0.0e+00 1.0e+00   0  0   0  0  0    0  0   0  0  0       0
> MatAssemblyBegin     1 1.0  1.1639e+02 353.6  0.00e+00  0.0  7.6e+01 9.3e+05 2.0e+00   2  0   0  1  0    2  0   0  1  0       0
> MatAssemblyEnd       1 1.0  3.5230e+00   1.4  0.00e+00  0.0  7.6e+01 5.7e+03 7.0e+00   0  0   0  0  0    0  0   0  0  0       0
> MatGetOrdering       1 1.0  2.1650e-02   2.7  0.00e+00  0.0  0.0e+00 0.0e+00 4.0e+00   0  0   0  0  0    0  0   0  0  0       0
> MatZeroEntries       3 1.0  3.9576e+00  19.2  0.00e+00  0.0  0.0e+00 0.0e+00 0.0e+00   0  0   0  0  0    0  0   0  0  0       0
> PCSetUp              2 1.0  6.8812e+00   5.6  3.11e+08  6.6  0.0e+00 0.0e+00 5.0e+00   0  0   0  0  0    0  0   0  0  0     707
> PCSetUpOnBlocks      1 1.0  6.4398e+00   5.3  3.11e+08  6.3  0.0e+00 0.0e+00 5.0e+00   0  0   0  0  0    0  0   0  0  0     756
> PCApply           5381 1.0  6.1027e+02   1.5  2.97e+08  1.5  0.0e+00 0.0e+00 0.0e+00  15 46   0  0  0   15 46   0  0  0    2732
> SLESSetup            2 1.0  7.0445e+00   5.7  3.06e+08  6.5  0.0e+00 0.0e+00 5.0e+00   0  0   0  0  0    0  0   0  0  0     691
> SLESSolve            1 1.0  2.0283e+03   1.0  1.37e+08  1.3  4.1e+05 1.7e+04 8.1e+03  63 100 100 99 100  63 100 100 99 100 1802
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type      Creations   Destructions   Memory      Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
>      Index Set       7            7         1373084     0
>            Map      15           15            1920     0
>            Vec      16           16        14903936     0
>    Vec Scatter       1            1          221036     0
>         Matrix       4            4       122404068     6.24484e+07
>  Krylov Solver       2            2              88     1.11768e+07
> Preconditioner       2            2             184     1.21596e+08
>           SLES       2            2               0     1.32773e+08
> ========================================================================================================================
> Average time to get PetscTime(): 3.00157e-06
> Average time for MPI_Barrier(): 0.000647009
> Average time for zero size MPI_Send(): 0.000117798
> Compiled with FORTRAN kernels
> Compiled with double precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void *) 4
> Libraries compiled on Mit Aug 13 01:56:38 CEST 2003 on moldau
> Machine characteristics: Linux moldau 2.4.10-4GB-SMP #1 SMP Fri Jan 25 23:09:50 CET 2002 i686 unknown
> Using PETSc directory: /mum/code/bibs_gcc-3.2.3/petsc-2.1.6
> Using PETSc arch: cluster
> -----------------------------------------
> Using C compiler: /mum/code/compiler/gcc-3.2.3/bin/gcc-3.2.3 -fPIC -O2 -felide-constructors -DNDEBUG -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6 -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6/bmake/cluster -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6/include -I/mum/code/bibs_gcc2.95.3/mpich/include -DPETSC_HAVE_BLOCKSOLVE -DPETSC_HAVE_X11 -DPETSC_USE_DEBUG -DPETSC_USE_LOG -DPETSC_USE_BOPT_O -DPETSC_USE_COMPLEX -D__SDIR__='. '
> C Compiler version:
> gcc-3.2.3 (GCC) 3.2.3
> Copyright (C) 2002 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
> C++ Compiler version:
> g++-3.2.3 (GCC) 3.2.3
> Copyright (C) 2002 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
> Using Fortran compiler: /mum/code/compiler/gcc-3.2.3/bin/g77-3.2.3 -Wno-globals -O -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6 -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6/bmake/cluster -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6/include -I/mum/code/bibs_gcc2.95.3/mpich/include -DPETSC_HAVE_BLOCKSOLVE -DPETSC_HAVE_X11 -DPETSC_USE_DEBUG -DPETSC_USE_LOG -DPETSC_USE_BOPT_O -DPETSC_USE_COMPLEX
> Fortran Compiler version:
> GNU Fortran (GCC 3.2.3) 3.2.3 20030422 (release)
> -----------------------------------------
> Using PETSc flags: -DPETSC_USE_DEBUG -DPETSC_USE_LOG -DPETSC_USE_BOPT_O -DPETSC_USE_COMPLEX -DPETSC_HAVE_BLOCKSOLVE -DPETSC_HAVE_X11
> -----------------------------------------
> Using configuration flags:
> -----------------------------------------
> Using include paths: -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6 -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6/bmake/cluster -I/mum/code/bibs_gcc-3.2.3/petsc-2.1.6/include -I/mum/code/bibs_gcc2.95.3/mpich/include
> ------------------------------------------
> Using C linker: /mum/code/compiler/gcc-3.2.3/bin/g++-3.2.3 -O2 -felide-constructors -DNDEBUG /mum/code/bibs_gcc-3.2.3/petsc-2.1.6/lib/libO_complex/cluster
> Using Fortran linker: /mum/code/compiler/gcc-3.2.3/bin/g++-3.2.3 -O /mum/code/bibs_gcc-3.2.3/petsc-2.1.6/lib/libO_complex/cluster
> Using libraries: -L/mum/code/bibs_gcc-3.2.3/petsc-2.1.6/lib/libO_complex/cluster -lpetscgsolver -lpetscgrid -lpetscmesh -lpetscts -lpetscsnes -lpetscsles -lpetscdm -lpetscmat -lpetscvec -lpetsc -L/mum/code/bibs_gcc2.95.3/BlockSolve95/lib/libO_complex/cluster -lBS95 -L/usr/X11R6/lib -lX11 -Wl,-Bstatic -L/mum/code/lapackblas/mkl/lib/32 -lmkl_lapack -lmkl_p4 -lguide -lpthread -L/mum/code/bibs_gcc2.95.3/mpich/lib -lmpich -ldl -L/mum/code/compiler/gcc-3.2.3/lib -lstdc++ -lfrtbegin -lc -lg2c -lm
>
>  ----------------------------------------------------------------------------
> | libMesh Performance: Alive time=3527.7, Active time=3241.48
>  ----------------------------------------------------------------------------
> | Event                           nCalls    Total      Avg           Percent of  |
> |                                           Time       Time          Active Time |
> |---------------------------------------------------------------------------------|
> |                                                                                 |
> | DofMap                                                                          |
> |   compute_sparsity()            1         44.6747    44.674699     1.38         |
> |   distribute_dofs()             1         5.4627     5.462732      0.17         |
> |   dof_indices()                 7483594   47.8082    0.000006      1.47         |
> |   reinit()                      1         6.6090     6.608989      0.20         |
> |                                                                                 |
> | FE                                                                              |
> |   compute_face_map()            4609      0.0933     0.000020      0.00         |
> |   compute_map()                 247308    2.6123     0.000011      0.08         |
> |   compute_shape_functions()     239108    1.7971     0.000008      0.06         |
> |   init_face_shape_functions()   4609      0.0779     0.000017      0.00         |
> |   init_shape_functions()        4611      0.2630     0.000057      0.01         |
> |   inverse_map()                 18436     0.1951     0.000011      0.01         |
> |                                                                                 |
> | FrequencySystem                                                                 |
> |   assemble()                    1         48.8794    48.879438     1.51         |
> |   init()                        1         28.1253    28.125289     0.87         |
> |   linear_equation_solve()       1         2091.2339  2091.233891   64.51        |
> |   user_pre_solve()              1         116.9776   116.977644    3.61         |
> |                                                                                 |
> | InfFE                                                                           |
> |   combine_base_radial()         8200      1.6606     0.000203      0.05         |
> |   compute_shape_functions()     8200      1.4818     0.000181      0.05         |
> |   init_radial_shape_functions() 1         0.0187     0.018681      0.00         |
> |   init_shape_functions()        1         0.0026     0.002573      0.00         |
> |                                                                                 |
> | Mesh                                                                            |
> |   read()                        1         470.1255   470.125486    14.50        |
> |                                                                                 |
> | MeshBase                                                                        |
> |   add_elem()                    109216    0.6331     0.000006      0.02         |
> |   add_point()                   54610     0.3291     0.000006      0.01         |
> |   build_inf_elem()              1         14.3651    14.365079     0.44         |
> |   find_neighbors()              2         309.5482   154.774119    9.55         |
> |   renumber_nodes_and_elem()     1         0.7866     0.786616      0.02         |
> |                                                                                 |
> | MeshData                                                                        |
> |   read()                        1         2.5792     2.579232      0.08         |
> |                                                                                 |
> | MetisPartitioner                                                                |
> |   partition()                   1         45.1431    45.143107     1.39         |
> |                                                                                 |
> | SystemBase                                                                      |
> |   assemble()                    1         0.0000     0.000006      0.00         |
>  ----------------------------------------------------------------------------
> | Totals:                         8182518   3241.4833                100.00       |
>  ----------------------------------------------------------------------------
>
> .
>  Startup time: 22:24:05
>  Stop time  :  23:22:57
> .
From: Daniel D. <d.d...@tu...> - 2003-09-11 12:19:30
On Wed, 10 Sep 2003, Benjamin S. Kirk wrote:

> [I posted this to the devel-list so we would have it in the archives]
>
> How are these machines connected? Is it 100 Mbit ethernet, or something
> faster?

100 Mbit ethernet, with some eight or so Allied Telesyn switches,
already 3-4 years old. So: this network was __NOT__ designed for
parallel computation. All in all, what do you think of the PETSc
performance? -- Currently, I don't really know where I could (and
_should_) tweak...

> A 480 MB mesh is pretty big by my standards. It would be interesting to
> see how long it takes meshtool to simply _read_ that mesh, and see how
> that compares to the time it took to read the mesh in that parallel
> simulation. My guess is that all 15 processors were slamming a

I already checked on that: libmesh/examples/ex1, with METHOD=opt, on
one of the better workstations (SuSE Linux 8.0, gcc 3.2.3):

   n_nodes   n_elem    read [sec]       write [sec]   find_neighbors()   size on HD [MB]
                       UNV      XDR     XDR           [sec]              UNV     XDR
   28k       28k       4.5      .28     .24           .6                 7       1.5
   220k      225k      37.      2.1     1.8           6.4                57      12.2
   627k      4*10^6    331.     16.9    13.1          171.               481     79

> fileserver at the same time, requiring 15*480 MB = 7.2 GB of data to be
> transferred.

No kiddin', I copied _all_ the data to the local harddrive. The _poor_
performance is because the _old_ UNV importer was doing
read/write/read/write all the time:

   // read in the UNV file, "buffer" it in /tmp
   while (!eof(source_file))
   {
      read 1 to 2 lines of ASCII (i.e. 1 node / 1 element)
      some string manipulation
      write 1 to 2 lines of ASCII to /tmp/xyz
   }

   // read the converted file from /tmp
   while (!eof(tmp_file))
   {
      read 1 to 2 lines of ASCII from /tmp/xyz
      Elem::build() or Node::build()
   }

Yes, I can almost see how you want to rip my head off... This was
_badly_ coded. -- But I already changed it (the commits as of Tuesday
or so).

The size of 500 MB for the mesh is because:

- UNV is ASCII (the XDR file with _identical_ information is 79 MB),
- there are over 4*10^6 Tet4, each taking two lines.

This is not a typical mesh; I simply ran meshtool refine twice on it
to crank up the number of elements ;-)

> Really, this poor performance is due to a lack of imagination on my
> part. I was thinking of something like 50 MB as a big mesh. I will
> re-work the read() method so that processor 0 is the only one that opens
> and reads the file. It will then broadcast the data to all the other
> processors. This should be _much_ faster than the current
> implementation. If you could, though, see how long it takes meshtool to
> simply read the mesh.

Please see the table above. As already said, all 15 procs were reading
from the local disk, so this is no server congestion problem.

If you look in detail at the MeshUnvInterface, you will see that there
is a std::map for the foreign node ids. Perhaps this is also slowing
things down. But we really need it; you cannot be sure of the node
numbering in UNV (see the small sketch below). -- However, I am not
really concerned with the performance of reading a mesh. Processing
480 MB is quite a thing, and this is _definitely_ not a typical mesh:
Tet4, where Tet10 would have been far better. And also, the UNV format
is not designed to be streamlined for efficient I/O; it is more of a
"Product Lifecycle Management" file format. In case _anybody_ wants
performance, use the _extremely_ efficient libMesh format XDR (see
again the read/write statistics above).
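To make explicit what that map is doing -- just an illustrative sketch,
the names are made up and this is not the actual MeshUnvInterface code:

   #include <map>

   typedef unsigned int node_id;

   // UNV node labels need not be contiguous or ordered, so the importer
   // keeps a map from the foreign (UNV) label to its own 0-based id.
   std::map<node_id, node_id> unv_to_local;
   node_id next_id = 0;

   // Called for every record in the node section:
   void register_node (const node_id unv_label)
   {
     unv_to_local[unv_label] = next_id++;
   }

   // Called for every node label in an element's connectivity:
   node_id translate (const node_id unv_label)
   {
     return unv_to_local[unv_label];  // O(log n) per look-up
   }

With 3.5 million Tet4 that is some 14 million look-ups for the element
section alone, so it surely costs something -- but it is what makes
arbitrary UNV numbering safe.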
> Another option would be to remove all the non-local elements on each
> processor. This could be a viable option for reducing memory overhead
> when AMR is not required. Once the mesh is parallelized it won't be
> such an issue.

Actually, I have been thinking about this, too. This method should
_only_ be available with disabled AMR. See my comments on this below.

> ...which brings me to your question: What are the future plans?

Currently I am coding a simulator for infinite elements in the time
domain, doing some performance comparisons (_small_ mesh) on HP-UX, and
trying to finish my thesis within Sept/Oct??? -- Because I will start
working in November; will let you know...

> 1.) Obviously, parallelizing the mesh is the biggest thing to allow
> scalability to _many_ (i.e. hundreds of) processors. This will require
> a fair amount of work, and I'm putting it off for now.

So, you don't want to go for that? -- I can definitely understand, it
is _hard_. Just getting find_neighbors() to work correctly is
difficult; you would either need some ghost elements, or be
communicating all the time...

If you don't want to parallelize the AMR mesh soon, then it would be
really viable to do one of these three things:

1. Proc 0 (or all) reads the mesh and distributes it. All procs > 0
   have _only_ their local elements, while proc 0 has _all_ of them and
   does not take part in the actual matrix solution. ---
   Straightforward to implement.

2. Proc 0 (or all) reads the mesh and distributes it. All procs have
   their local elements; even proc 0 has only its local share. --
   Would require more work than #1, but would make more sense.

3. All procs read the mesh and partition it. Then each proc writes the
   non-local elements to a tmp file on its local hard drive, so in case
   it needs information on non-local elements, it simply reads it from
   file. --- Would only make sense on distributed machines.

> 2.) Other than that, I think many major features are already
> implemented. There are some small improvements that can be made, like
> moving new points to a user-supplied geometry during AMR, robust

Good point. tetgen would also go hand-in-hand with this feature.
Actually, with employment coming along, I have to admit that I probably
won't find much time to include tetgen... Sorry.

> smoothers, more shape functions, etc... A lot of good stuff is in there

Steffen has quite _some_ shape functions in the pipe; I expect them to
be ready within 1 to 1.5 months. :-)

> now. Maybe some of it could be optimized.
>
> 3.) Increase the user base. I think that the library is pretty stable
> now. More users will help make it better, and they will request more
> features. We should improve the web page to include links to
> presentations & other material that uses the library. You know, "eye
> candy" -- stuff to get people excited by saying "yeah, this _can_ be
> used for big stuff."

Perfect agreement. The web page is nice so far, but there should be
links to already-computed things, and also to the examples section...

> 4.) Whatever others want (within reason) :-)

Here are some of the aspects that I consider relevant:

- a libmesh-users e-mail list ;-)

- re-working FE::reinit() to also have a working Jacobian for dim-1
  elements (the standard definition is sketched right below). This
  would enable computing plate bending or so in 3D. I thought one of my
  colleagues would work on that, but he will still have other things to
  do, for quite some time...
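  For reference, the construction I mean is the standard one, nothing
  libMesh-specific is assumed here: for a d-dimensional element mapped
  into R^n (d < n) by x(xi), build the n x d Jacobian matrix
  A = dx/dxi at each quadrature point and use

     J = sqrt( det( A^T A ) )

  as the scaling of the quadrature weights. For a 2D element in 3D this
  reduces to |x_,xi x x_,eta|, the length of the cross product of the
  two tangent vectors.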
- enabling the solution of eigenproblems. I don't know what happened to
  the SLEPc effort, but for now two additional classes would be
  sufficient, something like:

     template <typename T>
     class EigenSolverInterface
       : public ReferenceCountedObject<EigenSolverInterface<T> >
     { };

     class EigenSystem : public SystemBase
     { };

  For solving real-valued eigenproblems, I think I also accidentally
  found a quite interesting library (when trying to see what Python is)
  for the rather new Jacobi-Davidson (JD) method:

     http://people.web.psi.ch/geus/pyfemax/pysparse.html

  It only has a Python interface, but the results they compute look
  _good_. As far as I know, JD _is_ fast and good. So if we somehow get
  a working interface to this solver, we would be _well_ off w.r.t.
  EigenSystems. Perhaps we could also distribute this as the default
  eigensolver, just like Laspack is included. Eigenproblems would also
  attract quite some more researchers. ;-)

- including tetgen ;-)

Ok, good to see you back on development! Don't get me wrong, but it was
getting boring to just commit "all the time", with no one there
complaining about the weird stuff I did ;-)

Daniel
From: Steffen P. <ste...@tu...> - 2003-09-13 10:40:12
> [I posted this to the devel-list so we would have it in the archives]
>
> How are these machines connected? Is it 100 Mbit ethernet, or
> something faster?
>
> A 480 MB mesh is pretty big by my standards. It would be interesting
> to see how long it takes meshtool to simply _read_ that mesh, and see
> how that compares to the time it took to read the mesh in that
> parallel simulation. My guess is that all 15 processors were slamming
> a fileserver at the same time, requiring 15*480 MB = 7.2 GB of data
> to be transferred.
>
> Really, this poor performance is due to a lack of imagination on my
> part. I was thinking of something like 50 MB as a big mesh. I will
> re-work the read() method so that processor 0 is the only one that
> opens and reads the file. It will then broadcast the data to all the
> other processors. This should be _much_ faster than the current
> implementation. If you could, though, see how long it takes meshtool
> to simply read the mesh.
>
> Another option would be to remove all the non-local elements on each
> processor. This could be a viable option for reducing memory overhead
> when AMR is not required. Once the mesh is parallelized it won't be
> such an issue.
>
> ...which brings me to your question: What are the future plans?
>
> 1.) Obviously, parallelizing the mesh is the biggest thing to allow
> scalability to _many_ (i.e. hundreds of) processors. This will
> require a fair amount of work, and I'm putting it off for now.
>
> 2.) Other than that, I think many major features are already
> implemented. There are some small improvements that can be made, like
> moving new points to a user-supplied geometry during AMR, robust
> smoothers, more shape functions, etc... A lot of good stuff is in
> there now. Maybe some of it could be optimized.
>
> 3.) Increase the user base. I think that the library is pretty stable
> now. More users will help make it better, and they will request more
> features. We should improve the web page to include links to
> presentations & other material that uses the library. You know, "eye
> candy" -- stuff to get people excited by saying "yeah, this _can_ be
> used for big stuff."

I like the documentation, but I think the main page could take a little
upgrading. Some pictures and presentations sound good, and perhaps we
could add some brief documentation on the available examples and
applications, too.

> 4.) Whatever others want (within reason) :-)

Some things I would like to add in the near future (Daniel has already
mentioned them): currently we are working on the Bernstein polynomials
(not only Bernstein -- we'll also do some other things, but so far
Bernstein seems most promising). When we're done with the new shapes
(hopefully in a few weeks), I would like to add the SLEPc interface.
If that is fine by you, I would then introduce something like an
EigenSolver class and derive the SLEPc interface from that class (a
rough sketch follows below). I would probably shift some things from
SystemBase to SteadySystem, so as not to have unnecessary stuff in the
EigenSystem. I'm sure the eigensolver is quite interesting for some
(new) users.

Once the SLEPc interface is implemented, other eigensolvers (focusing
on the effective solution of quadratic eigenvalue problems) could
possibly follow (e.g. the library Daniel has mentioned). Again, I'm
sure this will be interesting for quite some users.
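Just to make that concrete, something along these lines -- only a first
sketch; the names and the solve() signature are what I imagine, nothing
final:

   // libMesh's matrix interface, forward-declared here for brevity.
   template <typename T> class SparseMatrix;

   // Abstract base class: derived classes wrap a particular package.
   template <typename T>
   class EigenSolver
   {
   public:
     virtual ~EigenSolver () {}

     // Compute the nev lowest eigenpairs of A x = lambda x.
     virtual void solve (SparseMatrix<T>& matrix_A,
                         const unsigned int nev,
                         const double tol,
                         const unsigned int max_its) = 0;
   };

   // The SLEPc-backed implementation.
   template <typename T>
   class SlepcEigenSolver : public EigenSolver<T>
   {
   public:
     virtual void solve (SparseMatrix<T>& matrix_A,
                         const unsigned int nev,
                         const double tol,
                         const unsigned int max_its)
     {
       // ...would create and run a SLEPc EPS object on matrix_A...
     }
   };

Generalized or quadratic problems could later get their own solve()
overloads that also take the mass (and damping) matrices.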
There are several more things I would like to do, e.g. some
improvements and extensions regarding the ifems, and perhaps some sort
of optimization algorithm (here, tetgen or any other mesh generator
could be used for shape optimization problems), but I'm not quite sure
when I'll find time for that.

Steffen