From: Yujie <rec...@gm...> - 2010-01-27 15:19:32
|
I also ran the same code (the same version of libMesh, the same code I wrote, the same number of CPUs, and the same version of PETSc) on another cluster. The environment is Intel 32-bit, RedHat Enterprise, GCC 3.2, and MPICH 1.2.7p1. The timing table is below. The total time is about 88 seconds (versus 3895 seconds in the previous case), and you can see that "find_global_indices()" took very little time. I don't know where the problem is. Could you give me some help? Thanks a lot.

Regards,
Yujie

-------------------------------------------------------------------------------------------------------------
| libMesh Performance: Alive time=94.0735, Active time=88.2455                                               |
-------------------------------------------------------------------------------------------------------------
| Event                            nCalls   Total Time   Avg Time     Total Time   Avg Time     % of Active Time |
|                                           w/o Sub      w/o Sub      With Sub     With Sub     w/o S   With S   |
|-------------------------------------------------------------------------------------------------------------|
|
| DofMap
|   add_neighbors_to_send_list()   3        0.2863       0.095427     0.3427       0.114217     0.32    0.39  |
|   build_constraint_matrix()      33576    0.3030       0.000009     0.3030       0.000009     0.34    0.34  |
|   cnstrn_elem_mat_vec()          33576    0.2127       0.000006     0.2127       0.000006     0.24    0.24  |
|   compute_sparsity()             3        2.8864       0.962134     3.7640       1.254653     3.27    4.27  |
|   create_dof_constraints()       3        0.4083       0.136117     0.8024       0.267480     0.46    0.91  |
|   distribute_dofs()              3        0.6631       0.221018     1.4217       0.473890     0.75    1.61  |
|   dof_indices()                  426112   3.7301       0.000009     3.7301       0.000009     4.23    4.23  |
|   enforce_constraints_exactly()  2        0.0124       0.006191     0.0124       0.006191     0.01    0.01  |
|   old_dof_indices()              67152    0.5640       0.000008     0.5640       0.000008     0.64    0.64  |
|   prepare_send_list()            3        0.0102       0.003389     0.0102       0.003389     0.01    0.01  |
|   reinit()                       3        0.7196       0.239869     0.7196       0.239869     0.82    0.82  |
|
| FE
|   compute_affine_map()           162423   6.5415       0.000040     6.5415       0.000040     7.41    7.41  |
|   compute_face_map()             64735    3.1745       0.000049     3.1745       0.000049     3.60    3.60  |
|   compute_shape_functions()      162423   7.4944       0.000046     7.4944       0.000046     8.49    8.49  |
|   init_face_shape_functions()    54151    1.2186       0.000023     1.2186       0.000023     1.38    1.38  |
|   init_shape_functions()         115651   6.6277       0.000057     6.6277       0.000057     7.51    7.51  |
|   inverse_map()                  521467   7.4752       0.000014     7.4752       0.000014     8.47    8.47  |
|
| GMVIO
|   write_nodal_data()             1        0.5900       0.590033     0.5900       0.590033     0.67    0.67  |
|
| JumpErrorEstimator
|   estimate_error()               2        7.4065       3.703241     30.4493      15.224661    8.39    34.51 |
|
| LocationMap
|   find()                         50456    0.2714       0.000005     0.2714       0.000005     0.31    0.31  |
|   init()                         4        0.0948       0.023701     0.0948       0.023701     0.11    0.11  |
|
| Mesh
|   contract()                     2        0.1155       0.057756     0.1519       0.075933     0.13    0.17  |
|   find_neighbors()               3        4.9775       1.659166     4.9798       1.659928     5.64    5.64  |
|   read()                         1        3.4427       3.442706     3.4427       3.442706     3.90    3.90  |
|   renumber_nodes_and_elem()      8        0.1274       0.015928     0.1274       0.015928     0.14    0.14  |
|
| MeshCommunication
|   broadcast_bcs()                1        0.0030       0.002954     0.0092       0.009157     0.00    0.01  |
|   broadcast_mesh()               1        0.1153       0.115250     0.1212       0.121190     0.13    0.14  |
|   compute_hilbert_indices()      4        1.1428       0.285697     1.1428       0.285697     1.30    1.30  |
|   find_global_indices()          4        0.3963       0.099081     1.7243       0.431078     0.45    1.95  |
|   parallel_sort()                4        0.1184       0.029595     0.1558       0.038955     0.13    0.18  |
|
| MeshRefinement
|   _coarsen_elements()            4        0.0447       0.011166     0.0481       0.012035     0.05    0.05  |
|   _refine_elements()             4        0.5717       0.142932     1.3206       0.330145     0.65    1.50  |
|   add_point()                    50456    0.3634       0.000007     0.6900       0.000014     0.41    0.78  |
|   make_coarsening_compatible()   11       0.7757       0.070514     0.7757       0.070514     0.88    0.88  |
|   make_refinement_compatible()   11       0.1117       0.010153     0.1164       0.010585     0.13    0.13  |
|
| MetisPartitioner
|   partition()                    3        1.9354       0.645147     3.3210       1.106993     2.19    3.76  |
|
| Parallel
|   allgather()                    16       0.0124       0.000772     0.0124       0.000772     0.01    0.01  |
|   broadcast()                    13       0.0119       0.000912     0.0119       0.000912     0.01    0.01  |
|   gather()                       3        0.0001       0.000044     0.0001       0.000044     0.00    0.00  |
|   max()                          267      0.0468       0.000175     0.0468       0.000175     0.05    0.05  |
|   min()                          467      0.5519       0.001182     0.5519       0.001182     0.63    0.63  |
|   probe()                        26       0.0155       0.000595     0.0155       0.000595     0.02    0.02  |
|   receive()                      26       0.0095       0.000365     0.0250       0.000962     0.01    0.03  |
|   send()                         26       0.0052       0.000201     0.0052       0.000201     0.01    0.01  |
|   send_receive()                 34       0.0041       0.000121     0.0350       0.001030     0.00    0.04  |
|   sum()                          20       0.0889       0.004443     0.0889       0.004443     0.10    0.10  |
|   wait()                         26       0.0005       0.000021     0.0005       0.000021     0.00    0.00  |
|
| Partitioner
|   set_node_processor_ids()       3        0.5467       0.182219     0.5731       0.191044     0.62    0.65  |
|   set_parent_processor_ids()     3        0.0542       0.018076     0.0542       0.018076     0.06    0.06  |
|
| PetscLinearSolver
|   solve()                        3        8.0743       2.691427     8.0764       2.692117     9.15    9.15  |
|
| ProjectVector
|   operator()                     2        0.5417       0.270829     1.2560       0.627999     0.61    1.42  |
|
| System
|   assemble()                     3        13.0582      4.352722     26.0303      8.676766     14.80   29.50 |
|   project_vector()               2        0.2916       0.145808     1.8486       0.924310     0.33    2.09  |
-------------------------------------------------------------------------------------------------------------
| Totals:                          1743206  88.2455                                             100.00        |
-------------------------------------------------------------------------------------------------------------

On Tue, Jan 26, 2010 at 2:22 PM, Yujie <rec...@gm...> wrote:

> Dear Libmesh developers,
>
> In previous emails, I met a problem likely about data communication between
> nodes. However, when I run the code on the master node with 2 CPUs (so there
> is no data communication between nodes), the problem is still there. The
> following is the timing table. You can see that "find_global_indices()"
> took a very long time.
>
> I am using AMD x86_64, RedHat Enterprise, GCC 4.0 and MPICH 1.2.7p1. Could you
> give me some advice? Thanks a lot.
> -------------------------------------------------------------------------------------------------------------
> | libMesh Performance: Alive time=3921.43, Active time=3895.64                                               |
> -------------------------------------------------------------------------------------------------------------
> | Event                            nCalls   Total Time   Avg Time     Total Time   Avg Time     % of Active Time |
> |                                           w/o Sub      w/o Sub      With Sub     With Sub     w/o S   With S   |
> |-------------------------------------------------------------------------------------------------------------|
> |
> | DofMap
> |   add_neighbors_to_send_list()   3        1.8758       0.625277     2.0094       0.669790     0.05    0.05  |
> |   build_constraint_matrix()      36160    1.1394       0.000032     1.1394       0.000032     0.03    0.03  |
> |   cnstrn_elem_mat_vec()          36160    1.0369       0.000029     1.0369       0.000029     0.03    0.03  |
> |   compute_sparsity()             3        66.9598      22.319928    69.6673      23.222438    1.72    1.79  |
> |   create_dof_constraints()       3        1.9375       0.645828     2.7731       0.924369     0.05    0.07  |
> |   distribute_dofs()              3        5.9749       1.991641     11.1858      3.728586     0.15    0.29  |
> |   dof_indices()                  451121   10.4046      0.000023     10.4046      0.000023     0.27    0.27  |
> |   enforce_constraints_exactly()  2        0.0860       0.043013     0.0860       0.043013     0.00    0.00  |
> |   old_dof_indices()              72320    1.6502       0.000023     1.6502       0.000023     0.04    0.04  |
> |   prepare_send_list()            3        1.3888       0.462939     1.3888       0.462939     0.04    0.04  |
> |   reinit()                       3        4.4227       1.474239     4.4227       1.474239     0.11    0.11  |
> |
> | FE
> |   compute_affine_map()           166087   28.4779      0.000171     28.4779      0.000171     0.73    0.73  |
> |   compute_face_map()             65290    12.9293      0.000198     12.9293      0.000198     0.33    0.33  |
> |   compute_shape_functions()      166087   53.0771      0.000320     53.0771      0.000320     1.36    1.36  |
> |   init_face_shape_functions()    54525    6.8743       0.000126     6.8743       0.000126     0.18    0.18  |
> |   init_shape_functions()         116731   41.2603      0.000353     41.2603      0.000353     1.06    1.06  |
> |   inverse_map()                  528671   15.4390      0.000029     15.4390      0.000029     0.40    0.40  |
> |
> | GMVIO
> |   write_nodal_data()             1        2.0390       2.038986     2.0390       2.038986     0.05    0.05  |
> |
> | JumpErrorEstimator
> |   estimate_error()               2        20.5754      10.287681    126.0162     63.008090    0.53    3.23  |
> |
> | LocationMap
> |   find()                         69104    0.9536       0.000014     0.9536       0.000014     0.02    0.02  |
> |   init()                         4        0.4922       0.123059     0.4922       0.123059     0.01    0.01  |
> |
> | Mesh
> |   contract()                     2        0.6653       0.332647     1.1356       0.567810     0.02    0.03  |
> |   find_neighbors()               3        30.7003      10.233423    30.8006      10.266880    0.79    0.79  |
> |   read()                         1        5.7922       5.792197     5.7922       5.792197     0.15    0.15  |
> |   renumber_nodes_and_elem()      8        1.9422       0.242779     1.9422       0.242779     0.05    0.05  |
> |
> | MeshCommunication
> |   broadcast_bcs()                1        0.0604       0.060440     0.0743       0.074266     0.00    0.00  |
> |   broadcast_mesh()               1        1.0069       1.006910     1.0271       1.027131     0.03    0.03  |
> |   compute_hilbert_indices()      4        4.1264       1.031604     4.1264       1.031604     0.11    0.11  |
> |   find_global_indices()          4        3172.3789    793.094713   3412.5255    853.131373   81.43   87.60 |
> |   parallel_sort()                4        158.8466     39.711649    161.0895     40.272380    4.08    4.14  |
> |
> | MeshRefinement
> |   _coarsen_elements()            4        0.4822       0.120546     0.4828       0.120710     0.01    0.01  |
> |   _refine_elements()             4        2.7347       0.683675     5.5430       1.385758     0.07    0.14  |
> |   add_point()                    69104    1.3920       0.000020     2.5913       0.000037     0.04    0.07  |
> |   make_coarsening_compatible()   12       7.9254       0.660450     7.9254       0.660450     0.20    0.20  |
> |   make_refinement_compatible()   12       1.3526       0.112718     1.3618       0.113480     0.03    0.03  |
> |
> | MetisPartitioner
> |   partition()                    3        9.6725       3.224183     2854.3422    951.447412   0.25    73.27 |
> |
> | Parallel
> |   allgather()                    16       0.4336       0.027100     0.4336       0.027100     0.01    0.01  |
> |   broadcast()                    13       0.0327       0.002513     0.0327       0.002513     0.00    0.00  |
> |   gather()                       3        0.0007       0.000229     0.0007       0.000229     0.00    0.00  |
> |   max()                          275      0.5796       0.002108     0.5796       0.002108     0.01    0.01  |
> |   min()                          482      38.8301      0.080560     38.8301      0.080560     1.00    1.00  |
> |   probe()                        26       56.6142      2.177470     56.6142      2.177470     1.45    1.45  |
> |   receive()                      26       0.0334       0.001284     56.6479      2.178767     0.00    1.45  |
> |   send()                         26       18.5027      0.711642     18.5027      0.711642     0.47    0.47  |
> |   send_receive()                 34       0.0077       0.000225     75.1605      2.210604     0.00    1.93  |
> |   sum()                          20       2.8493       0.142467     2.8493       0.142467     0.07    0.07  |
> |   wait()                         26       0.0016       0.000061     0.0016       0.000061     0.00    0.00  |
> |
> | Partitioner
> |   set_node_processor_ids()       3        3.7781       1.259356     4.5113       1.503763     0.10    0.12  |
> |   set_parent_processor_ids()     3        0.5565       0.185501     0.5565       0.185501     0.01    0.01  |
> |
> | PetscLinearSolver
> |   solve()                        3        27.1758      9.058608     27.1827      9.060906     0.70    0.70  |
> |
> | ProjectVector
> |   operator()                     2        2.2219       1.110940     4.1955       2.097752     0.06    0.11  |
> |
> | System
> |   assemble()                     3        57.2052      19.068412    125.2392     41.746413    1.47    3.21  |
> |   project_vector()               2        8.7408       4.370384     13.9501      6.975064     0.22    0.36  |
> -------------------------------------------------------------------------------------------------------------
> | Totals:                          1832413  3895.6373                                           100.00        |
> -------------------------------------------------------------------------------------------------------------
>
> Regards,
> Yujie
|
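The tables in this thread are libMesh PerfLog output. For reference, application code can time its own sections with the same facility; the sketch below is illustrative only, and the header path, namespace, and the push()/pop()/print_log() calls are assumptions modeled on the libMesh examples, so check perf_log.h in your installation for the exact interface.

// Rough sketch (not from this thread): timing user-defined sections with
// libMesh's PerfLog, the facility that produced the tables above.  The
// header path, namespace, and push()/pop()/print_log() calls are assumed
// from the libMesh examples and may differ between versions.
#include "libmesh/perf_log.h"

using namespace libMesh;

void assemble_and_solve ()
{
  PerfLog perf_log ("My application");

  perf_log.push ("assembly");
  // ... element loop, matrix and vector assembly ...
  perf_log.pop ("assembly");

  perf_log.push ("solve");
  // ... linear solve ...
  perf_log.pop ("solve");

  // A summary table like the ones above can then be printed,
  // e.g. via perf_log.print_log(), or when the object is destroyed.
}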
From: Roy S. <roy...@ic...> - 2010-01-27 16:29:47
|
On Wed, 27 Jan 2010, Yujie wrote:

> When I sent the following email to the libmesh mailing list, I met one
> problem because of the size of the email. Could you give me some
> advice regarding this problem? Thanks a lot.

It looks like it made it through eventually; just a little late.

I'm not sure if you'll get an answer, though. Ben is the one
responsible for find_global_indices, and he's swamped with other
things right now. It does a parallel sort, which can be very
sensitive to MPI implementation.

It only gets used for I/O and the cost should scale more slowly than
solves, though; for large implicit 2D/3D problems it shouldn't be an
issue even on inefficient MPI implementations.
---
Roy
|
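To make it concrete why a parallel sort stresses the MPI layer, here is a rough, self-contained sketch of a splitter-based parallel sort (sample sort) in MPI. It is not libMesh's actual Parallel::Sort; the splitter selection and all the names are illustrative, but the communication pattern, an allgather of samples followed by an all-to-all redistribution of keys, is the kind of collective traffic whose cost can vary widely between MPI implementations.

// Rough sketch of a splitter-based parallel sort (sample sort).  This is
// NOT libMesh's Parallel::Sort; the names and the splitter selection are
// illustrative only.  Build with an MPI C++ compiler, e.g. mpicxx.
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank = 0, nprocs = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Stand-ins for the Hilbert keys of locally owned nodes/elements.
  std::vector<unsigned long> keys(100000);
  for (std::size_t i = 0; i < keys.size(); ++i)
    keys[i] = (2654435761UL * i + 97UL * rank) % 10000000UL;

  // 1. Sort locally.
  std::sort(keys.begin(), keys.end());

  // 2. Pick evenly spaced local samples, allgather them, and choose
  //    global splitters from the combined sample set.
  std::vector<unsigned long> samples(nprocs);
  for (int p = 0; p < nprocs; ++p)
    samples[p] = keys[(keys.size() * p) / nprocs];

  std::vector<unsigned long> all_samples(static_cast<std::size_t>(nprocs) * nprocs);
  MPI_Allgather(samples.data(), nprocs, MPI_UNSIGNED_LONG,
                all_samples.data(), nprocs, MPI_UNSIGNED_LONG, MPI_COMM_WORLD);
  std::sort(all_samples.begin(), all_samples.end());

  std::vector<unsigned long> splitters(nprocs - 1);
  for (int p = 1; p < nprocs; ++p)
    splitters[p - 1] = all_samples[static_cast<std::size_t>(p) * nprocs];

  // 3. Count how many (already sorted, hence contiguous) local keys go to
  //    each destination rank, exchange the counts, then the keys.
  std::vector<int> sendcounts(nprocs, 0);
  for (std::size_t i = 0; i < keys.size(); ++i)
    sendcounts[std::upper_bound(splitters.begin(), splitters.end(), keys[i])
               - splitters.begin()]++;

  std::vector<int> recvcounts(nprocs, 0);
  MPI_Alltoall(sendcounts.data(), 1, MPI_INT,
               recvcounts.data(), 1, MPI_INT, MPI_COMM_WORLD);

  std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0);
  for (int p = 1; p < nprocs; ++p)
    {
      sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];
      rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }

  std::vector<unsigned long> sorted(rdispls[nprocs - 1] + recvcounts[nprocs - 1]);
  MPI_Alltoallv(keys.data(),   sendcounts.data(), sdispls.data(), MPI_UNSIGNED_LONG,
                sorted.data(), recvcounts.data(), rdispls.data(), MPI_UNSIGNED_LONG,
                MPI_COMM_WORLD);

  // 4. A final local sort leaves the keys globally ordered across ranks.
  std::sort(sorted.begin(), sorted.end());

  std::printf("rank %d holds %lu sorted keys\n", rank, (unsigned long)sorted.size());

  MPI_Finalize();
  return 0;
}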
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2010-01-27 16:43:05
|
>> When I sent the following email to the libmesh mailing list, I met one
>> problem because of the size of the email. Could you give me some
>> advice regarding this problem? Thanks a lot.
>
> It looks like it made it through eventually; just a little late.

I had to approve it based on size, and it was originally sent late US time
so I didn't get to it until this morning. This is the second approval I've
had to make in 24 hours; I'll see if there is ...

> I'm not sure if you'll get an answer, though. Ben is the one
> responsible for find_global_indices, and he's swamped with other
> things right now. It does a parallel sort, which can be very
> sensitive to MPI implementation.

> It only gets used for I/O and the cost should scale more slowly than
> solves, though; for large implicit 2D/3D problems it shouldn't be an
> issue even on inefficient MPI implementations.

Yes, this issue is bizarre indeed. The code does not even do that much
communication there... You might want to compile with METHOD=pro and run it
through gprof - that will give you finer granularity as to what the issue
may actually be.

Can you confirm that the problem doesn't exist on one processor? What are
the details of the mesh you are using??

-Ben
|
From: Roy S. <roy...@ic...> - 2010-01-27 16:49:06
|
On Wed, 27 Jan 2010, Kirk, Benjamin (JSC-EG311) wrote:

>> It only gets used for I/O and the cost should scale more slowly than
>> solves, though; for large implicit 2D/3D problems it shouldn't be an
>> issue even on inefficient MPI implementations.
>
> Yes, this issue is bizarre indeed. The code does not even do that much
> communication there... You might want to compile with METHOD=pro and run it
> through gprof - that will give you finer granularity as to what the issue
> may actually be.
>
> Can you confirm that the problem doesn't exist on one processor? What are
> the details of the mesh you are using??

You know, if we want to try repeating this ourselves, I believe Paul
saw a relatively long find_global_indices() execution time by simply
ultra-refining (~500K elements) a 1D mesh. I'd assumed that the key
word there was "relatively" (we're not at all optimized for 1D, but 1D
is still pretty fast to assemble and solve), but perhaps that case
triggered a real problem.
---
Roy
|
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2010-01-27 17:00:33
|
>> Can you confirm that the problem doesn't exist on one processor? What are
>> the details of the mesh you are using??
>
> You know, if we want to try repeating this ourselves, I believe Paul
> saw a relatively long find_global_indices() execution time by simply
> ultra-refining (~500K elements) a 1D mesh. I'd assumed that the key
> word there was "relatively" (we're not at all optimized for 1D, but 1D
> is still pretty fast to assemble and solve), but perhaps that case
> triggered a real problem.

Was this on one processor?

I think Paul set a record for a 1D problem mesh density!!

-Ben
|
From: Paul T. B. <ptb...@gm...> - 2010-01-27 17:05:48
|
On Wed, Jan 27, 2010 at 11:00 AM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:

> >> Can you confirm that the problem doesn't exist on one processor? What are
> >> the details of the mesh you are using??
> >
> > You know, if we want to try repeating this ourselves, I believe Paul
> > saw a relatively long find_global_indices() execution time by simply
> > ultra-refining (~500K elements) a 1D mesh. I'd assumed that the key
> > word there was "relatively" (we're not at all optimized for 1D, but 1D
> > is still pretty fast to assemble and solve), but perhaps that case
> > triggered a real problem.
>
> Was this on one processor?

1 and 2 on my Mac. I was surprised the runtime for 2 procs was longer than 1
(on some of my smaller meshes), so I was increasing the problem size to try
to rule out communication overhead. I haven't investigated further yet. If
there's anything interesting to report, I'll open up a new thread.

> I think Paul set a record for a 1D problem mesh
> density!!

It got to machine precision though. :D

Paul
|
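For anyone who wants to try reproducing that test, a minimal sketch along the following lines should produce a ~500K-element 1D mesh. The element counts are just an example, and the header paths and constructor signatures shown here follow recent libMesh releases (MeshTools::Generation::build_line, MeshRefinement, LibMeshInit), so they may need adjusting for older versions.

// Minimal sketch: build and uniformly refine a 1D mesh into the ~500K
// element range.  The header paths and the Mesh/LibMeshInit constructors
// shown here follow recent libMesh releases and may need adjusting for
// older versions.
#include "libmesh/libmesh.h"
#include "libmesh/mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/mesh_refinement.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);

  Mesh mesh (init.comm());

  // 512 coarse EDGE2 elements on [0,1], refined 10 times uniformly:
  // 512 * 2^10 = 524288 active elements.
  MeshTools::Generation::build_line (mesh, 512, 0., 1., EDGE2);

  MeshRefinement refinement (mesh);
  refinement.uniformly_refine (10);

  mesh.print_info ();

  return 0;
}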
From: Yujie <rec...@gm...> - 2010-01-27 16:50:10
|
Dear Ben,

Thank you very much for your reply. I will recompile the code with
"METHOD=pro". What is "gprof"?

On the AMD x86_64-based cluster I actually run the code on the master node
and one slave node, with 2 CPUs. The same problem is there.

Regards,
Yujie

On Wed, Jan 27, 2010 at 10:42 AM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:

> >> When I sent the following email to the libmesh mailing list, I met one
> >> problem because of the size of the email. Could you give me some
> >> advice regarding this problem? Thanks a lot.
> >
> > It looks like it made it through eventually; just a little late.
>
> I had to approve it based on size, and it was originally sent late US time
> so I didn't get to it until this morning. This is the second approval I've
> had to make in 24 hours; I'll see if there is ...
>
> > I'm not sure if you'll get an answer, though. Ben is the one
> > responsible for find_global_indices, and he's swamped with other
> > things right now. It does a parallel sort, which can be very
> > sensitive to MPI implementation.
>
> > It only gets used for I/O and the cost should scale more slowly than
> > solves, though; for large implicit 2D/3D problems it shouldn't be an
> > issue even on inefficient MPI implementations.
>
> Yes, this issue is bizarre indeed. The code does not even do that much
> communication there... You might want to compile with METHOD=pro and run it
> through gprof - that will give you finer granularity as to what the issue
> may actually be.
>
> Can you confirm that the problem doesn't exist on one processor? What are
> the details of the mesh you are using??
>
> -Ben
|
From: Roy S. <roy...@ic...> - 2010-01-27 16:59:33
|
On Wed, 27 Jan 2010, Yujie wrote:

> Thank you very much for your reply. I will recompile the code with
> "METHOD=pro". What is "gprof"?

A userspace profiling utility; you've probably got it installed
already.

If you have a cluster to yourself you might also check out oprofile -
a bit more of a hassle to use but gives finer-grained results.
---
Roy
|
From: Yujie <rec...@gm...> - 2010-01-27 17:10:39
|
Thanks, Roy. Do I need to compile PETSc with the same parameter, that is,
"-pg"? I get link errors when I compile libMesh.

Regards,
Yujie

On Wed, Jan 27, 2010 at 10:59 AM, Roy Stogner <roy...@ic...> wrote:
>
> On Wed, 27 Jan 2010, Yujie wrote:
>
>> Thank you very much for your reply. I will recompile the code with
>> "METHOD=pro". What is "gprof"?
>
> A userspace profiling utility; you've probably got it installed
> already.
>
> If you have a cluster to yourself you might also check out oprofile -
> a bit more of a hassle to use but gives finer-grained results.
> ---
> Roy
|
From: Roy S. <roy...@ic...> - 2010-01-27 17:15:52
|
On Wed, 27 Jan 2010, Yujie wrote:

> Thanks, Roy. Do I need to compile PETSc with the same parameter,
> that is, "-pg"?

I don't believe so. You just need -pg as a compiler parameter on the
objects you want profiling data from and on the link line, and setting
METHOD should have done that.

> I get link errors when I compile libMesh.

What's the error?
---
Roy
|
From: Yujie <rec...@gm...> - 2010-01-27 20:33:05
|
Dear Ben and Roy,

I got different timings with "METHOD=pro" and "METHOD=dbg"; you can find the details in the following tables. In "dbg" the problem is still there, but in "pro" it disappears. Any advice? In both cases I ran the code on a slave node. Thanks a lot.

In METHOD=pro:

-------------------------------------------------------------------------------------------------------------
| libMesh Performance: Alive time=22.2695, Active time=14.2938                                               |
-------------------------------------------------------------------------------------------------------------
| Event                            nCalls   Total Time   Avg Time     Total Time   Avg Time     % of Active Time |
|                                           w/o Sub      w/o Sub      With Sub     With Sub     w/o S   With S   |
|-------------------------------------------------------------------------------------------------------------|
|
| DofMap
|   add_neighbors_to_send_list()   3        0.0997       0.033244     0.1065       0.035511     0.70    0.75  |
|   build_constraint_matrix()      33576    0.0186       0.000001     0.0186       0.000001     0.13    0.13  |
|   cnstrn_elem_mat_vec()          33576    0.0101       0.000000     0.0101       0.000000     0.07    0.07  |
|   compute_sparsity()             3        0.2933       0.097754     0.4115       0.137154     2.05    2.88  |
|   create_dof_constraints()       3        0.1100       0.036674     0.1637       0.054568     0.77    1.15  |
|   distribute_dofs()              3        0.1971       0.065701     0.6517       0.217241     1.38    4.56  |
|   dof_indices()                  424766   0.4175       0.000001     0.4175       0.000001     2.92    2.92  |
|   enforce_constraints_exactly()  2        0.0055       0.002742     0.0055       0.002742     0.04    0.04  |
|   old_dof_indices()              67152    0.0644       0.000001     0.0644       0.000001     0.45    0.45  |
|   prepare_send_list()            3        0.0015       0.000506     0.0015       0.000506     0.01    0.01  |
|   reinit()                       3        0.3766       0.125545     0.3766       0.125545     2.63    2.63  |
|
| FE
|   compute_affine_map()           161783   0.3852       0.000002     0.3852       0.000002     2.70    2.70  |
|   compute_face_map()             64674    0.1622       0.000003     0.1622       0.000003     1.13    1.13  |
|   compute_shape_functions()      161783   0.1218       0.000001     0.1218       0.000001     0.85    0.85  |
|   init_face_shape_functions()    54129    0.2135       0.000004     0.2135       0.000004     1.49    1.49  |
|   init_shape_functions()         115011   0.9611       0.000008     0.9611       0.000008     6.72    6.72  |
|   inverse_map()                  519958   1.4425       0.000003     1.4425       0.000003     10.09   10.09 |
|
| GMVIO
|   write_nodal_data()             1        0.1555       0.155485     0.1555       0.155485     1.09    1.09  |
|
| JumpErrorEstimator
|   estimate_error()               2        1.0333       0.516627     3.9642       1.982106     7.23    27.73 |
|
| LocationMap
|   find()                         50456    0.0286       0.000001     0.0286       0.000001     0.20    0.20  |
|   init()                         4        0.0226       0.005662     0.0226       0.005662     0.16    0.16  |
|
| Mesh
|   contract()                     2        0.0185       0.009264     0.0462       0.023103     0.13    0.32  |
|   find_neighbors()               3        0.5844       0.194807     0.6276       0.209214     4.09    4.39  |
|   read()                         1        0.2718       0.271756     0.2718       0.271756     1.90    1.90  |
|   renumber_nodes_and_elem()      8        0.1015       0.012692     0.1015       0.012692     0.71    0.71  |
|
| MeshCommunication
|   broadcast_bcs()                1        0.0012       0.001206     0.0330       0.033009     0.01    0.23  |
|   broadcast_mesh()               1        0.0422       0.042237     0.0451       0.045126     0.30    0.32  |
|   compute_hilbert_indices()      4        1.6419       0.410467     1.6419       0.410467     11.49   11.49 |
|   find_global_indices()          4        0.1052       0.026299     1.7979       0.449477     0.74    12.58 |
|   parallel_sort()                4        0.0153       0.003821     0.0470       0.011757     0.11    0.33  |
|
| MeshRefinement
|   _coarsen_elements()            4        0.0276       0.006894     0.0278       0.006938     0.19    0.19  |
|   _refine_elements()             4        0.1460       0.036506     0.2486       0.062145     1.02    1.74  |
|   add_point()                    50456    0.0483       0.000001     0.0890       0.000002     0.34    0.62  |
|   make_coarsening_compatible()   5        0.1862       0.037243     0.1862       0.037243     1.30    1.30  |
|   make_refinement_compatible()   5        0.0291       0.005817     0.0309       0.006181     0.20    0.22  |
|
| MetisPartitioner
|   partition()                    3        0.3619       0.120617     1.7711       0.590353     2.53    12.39 |
|
| Parallel
|   allgather()                    16       0.0735       0.004591     0.0735       0.004591     0.51    0.51  |
|   broadcast()                    13       0.0346       0.002663     0.0346       0.002663     0.24    0.24  |
|   gather()                       3        0.0001       0.000029     0.0001       0.000029     0.00    0.00  |
|   max()                          30       0.0736       0.002454     0.0736       0.002454     0.52    0.52  |
|   min()                          16       0.0107       0.000668     0.0107       0.000668     0.07    0.07  |
|   probe()                        26       0.0213       0.000818     0.0213       0.000818     0.15    0.15  |
|   receive()                      26       0.0033       0.000128     0.0246       0.000947     0.02    0.17  |
|   send()                         26       0.0035       0.000136     0.0035       0.000136     0.02    0.02  |
|   send_receive()                 34       0.0004       0.000012     0.0286       0.000842     0.00    0.20  |
|   sum()                          20       0.1321       0.006607     0.1321       0.006607     0.92    0.92  |
|   wait()                         26       0.0000       0.000001     0.0000       0.000001     0.00    0.00  |
|
| Partitioner
|   set_node_processor_ids()       3        0.1087       0.036232     0.1282       0.042741     0.76    0.90  |
|   set_parent_processor_ids()     3        0.0313       0.010417     0.0313       0.010417     0.22    0.22  |
|
| PetscLinearSolver
|   solve()                        3        3.2520       1.083999     3.2520       1.083999     22.75   22.75 |
|
| ProjectVector
|   operator()                     2        0.0847       0.042333     0.1667       0.083344     0.59    1.17  |
|
| System
|   assemble()                     3        0.6752       0.225067     1.6206       0.540184     4.72    11.34 |
|   project_vector()               2        0.0870       0.043506     0.3003       0.150132     0.61    2.10  |
-------------------------------------------------------------------------------------------------------------
| Totals:                          1737648  14.2938                                             100.00        |
-------------------------------------------------------------------------------------------------------------

In METHOD=dbg:

-------------------------------------------------------------------------------------------------------------
| libMesh Performance: Alive time=970.489, Active time=958.407                                               |
-------------------------------------------------------------------------------------------------------------
| Event                            nCalls   Total Time   Avg Time     Total Time   Avg Time     % of Active Time |
|                                           w/o Sub      w/o Sub      With Sub     With Sub     w/o S   With S   |
|-------------------------------------------------------------------------------------------------------------|
|
| DofMap
|   add_neighbors_to_send_list()   3        0.4857       0.161916     0.5219       0.173970     0.05    0.05  |
|   build_constraint_matrix()      33576    0.1788       0.000005     0.1788       0.000005     0.02    0.02  |
|   cnstrn_elem_mat_vec()          33576    0.1048       0.000003     0.1048       0.000003     0.01    0.01  |
|   compute_sparsity()             3        16.2741      5.424711     16.9691      5.656377     1.70    1.77  |
|   create_dof_constraints()       3        0.4682       0.156067     0.6576       0.219187     0.05    0.07  |
|   distribute_dofs()              3        1.5506       0.516868     3.0141       1.004706     0.16    0.31  |
|   dof_indices()                  424766   2.6192       0.000006     2.6192       0.000006     0.27    0.27  |
|   enforce_constraints_exactly()  2        0.0223       0.011134     0.0223       0.011134     0.00    0.00  |
|   old_dof_indices()              67152    0.4018       0.000006     0.4018       0.000006     0.04    0.04  |
|   prepare_send_list()            3        0.4080       0.135996     0.4080       0.135996     0.04    0.04  |
|   reinit()                       3        1.1455       0.381827     1.1455       0.381827     0.12    0.12  |
|
| FE
|   compute_affine_map()           161783   7.5528       0.000047     7.5528       0.000047     0.79    0.79  |
|   compute_face_map()             64674    3.3548       0.000052     3.3548       0.000052     0.35    0.35  |
|   compute_shape_functions()      161783   14.1906      0.000088     14.1906      0.000088     1.48    1.48  |
|   init_face_shape_functions()    54129    1.8255       0.000034     1.8255       0.000034     0.19    0.19  |
|   init_shape_functions()         115011   11.1598      0.000097     11.1598      0.000097     1.16    1.16  |
|   inverse_map()                  519958   3.8413       0.000007     3.8413       0.000007     0.40    0.40  |
|
| GMVIO
|   write_nodal_data()             1        0.5128       0.512829     0.5128       0.512829     0.05    0.05  |
|
| JumpErrorEstimator
|   estimate_error()               2        5.3186       2.659298     33.5234      16.761681    0.55    3.50  |
|
| LocationMap
|   find()                         50456    0.1874       0.000004     0.1874       0.000004     0.02    0.02  |
|   init()                         4        0.1253       0.031314     0.1253       0.031314     0.01    0.01  |
|
| Mesh
|   contract()                     2        0.1673       0.083652     0.2817       0.140854     0.02    0.03  |
|   find_neighbors()               3        7.1693       2.389767     7.6922       2.564061     0.75    0.80  |
|   read()                         1        1.5032       1.503193     1.5032       1.503193     0.16    0.16  |
|   renumber_nodes_and_elem()      8        0.4307       0.053843     0.4307       0.053843     0.04    0.04  |
|
| MeshCommunication
|   broadcast_bcs()                1        0.0165       0.016476     0.0202       0.020218     0.00    0.00  |
|   broadcast_mesh()               1        0.2634       0.263384     0.2666       0.266577     0.03    0.03  |
|   compute_hilbert_indices()      4        0.9906       0.247642     0.9906       0.247642     0.10    0.10  |
|   find_global_indices()          4        746.7912     186.697788   837.7610     209.440249   77.92   87.41 |
|   parallel_sort()                4        44.3904      11.097589    45.6040      11.400992    4.63    4.76  |
|
| MeshRefinement
|   _coarsen_elements()            4        0.1212       0.030298     0.1414       0.035350     0.01    0.01  |
|   _refine_elements()             4        0.5802       0.145040     1.1861       0.296525     0.06    0.12  |
|   add_point()                    50456    0.2761       0.000005     0.5098       0.000010     0.03    0.05  |
|   make_coarsening_compatible()   11       1.9512       0.177386     1.9512       0.177386     0.20    0.20  |
|   make_refinement_compatible()   11       0.3092       0.028110     0.3611       0.032829     0.03    0.04  |
|
| MetisPartitioner
|   partition()                    3        2.4345       0.811508     693.1930     231.064329   0.25    72.33 |
|
| Parallel
|   allgather()                    16       0.0325       0.002033     0.0325       0.002033     0.00    0.00  |
|   broadcast()                    13       0.0066       0.000511     0.0066       0.000511     0.00    0.00  |
|   gather()                       3        0.0001       0.000037     0.0001       0.000037     0.00    0.00  |
|   max()                          267      0.3244       0.001215     0.3244       0.001215     0.03    0.03  |
|   min()                          467      10.4474      0.022371     10.4474      0.022371     1.09    1.09  |
|   probe()                        26       29.8630      1.148579     29.8630      1.148579     3.12    3.12  |
|   receive()                      26       0.0065       0.000250     29.8696      1.148832     0.00    3.12  |
|   send()                         26       14.6244      0.562477     14.6244      0.562477     1.53    1.53  |
|   send_receive()                 34       0.0025       0.000073     44.4968      1.308729     0.00    4.64  |
|   sum()                          20       1.2742       0.063712     1.2742       0.063712     0.13    0.13  |
|   wait()                         26       0.0001       0.000004     0.0001       0.000004     0.00    0.00  |
|
| Partitioner
|   set_node_processor_ids()       3        0.9364       0.312139     1.1704       0.390143     0.10    0.12  |
|   set_parent_processor_ids()     3        0.1398       0.046587     0.1398       0.046587     0.01    0.01  |
|
| PetscLinearSolver
|   solve()                        3        3.9902       1.330075     3.9911       1.330380     0.42    0.42  |
|
| ProjectVector
|   operator()                     2        0.5253       0.262668     0.9876       0.493799     0.05    0.10  |
|
| System
|   assemble()                     3        15.0199      5.006632     33.1146      11.038210    1.57    3.46  |
|   project_vector()               2        2.0911       1.045574     3.3312       1.665623     0.22    0.35  |
-------------------------------------------------------------------------------------------------------------
| Totals:                          1738348  958.4074                                            100.00        |
-------------------------------------------------------------------------------------------------------------

Regards,
Yujie

On Wed, Jan 27, 2010 at 10:42 AM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:

> >> When I sent the following email to the libmesh mailing list, I met one
> >> problem because of the size of the email. Could you give me some
> >> advice regarding this problem? Thanks a lot.
> >
> > It looks like it made it through eventually; just a little late.
>
> I had to approve it based on size, and it was originally sent late US time
> so I didn't get to it until this morning. This is the second approval I've
> had to make in 24 hours; I'll see if there is ...
>
> > I'm not sure if you'll get an answer, though. Ben is the one
> > responsible for find_global_indices, and he's swamped with other
> > things right now. It does a parallel sort, which can be very
> > sensitive to MPI implementation.
>
> > It only gets used for I/O and the cost should scale more slowly than
> > solves, though; for large implicit 2D/3D problems it shouldn't be an
> > issue even on inefficient MPI implementations.
>
> Yes, this issue is bizarre indeed. The code does not even do that much
> communication there... You might want to compile with METHOD=pro and run it
> through gprof - that will give you finer granularity as to what the issue
> may actually be.
>
> Can you confirm that the problem doesn't exist on one processor? What are
> the details of the mesh you are using??
>
> -Ben
|
From: Roy S. <roy...@ic...> - 2010-01-27 20:40:44
|
On Wed, 27 Jan 2010, Yujie wrote:

> In "dbg", the problem is always there.

I hadn't noticed before that you were running in debug mode. Bad
performance is inherent to METHOD=dbg; we turn off optimization, we test
every assertion, we sometimes even add extra "double-checking" libMesh
code, GNU libstdc++ turns every std::vector index into a checked index...
worst of all, the GNU libstdc++ debug mode checks on std::set actually
have asymptotically greater cost than the operations they're checking!

Performance with METHOD=dbg will always be lousy. The solution is, when
you need performance, compile with METHOD=opt.
---
Roy
|
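To see the libstdc++ effect Roy describes in isolation, a small stand-alone benchmark like the one below can be built twice, once normally (say with -O2) and once with -O0 -D_GLIBCXX_DEBUG, the macro that enables the checked libstdc++ containers. The program is purely illustrative and has nothing to do with libMesh itself.

// Stand-alone micro-benchmark (not libMesh code): time a pile of std::set
// operations.  Build it twice, e.g. with -O2 and then with
// -O0 -D_GLIBCXX_DEBUG, and compare the reported times.  With the checked
// containers enabled the extra verification dominates, which is the kind
// of slowdown METHOD=dbg exposes.
#include <chrono>
#include <cstdio>
#include <set>

int main()
{
  const int n = 200000;
  std::set<int> s;

  const auto t0 = std::chrono::steady_clock::now();

  for (int i = 0; i < n; ++i)
    s.insert((i * 7919) % n);   // ordinarily O(log N) per insert

  long hits = 0;
  for (int i = 0; i < n; ++i)
    hits += s.count(i);         // ordinarily O(log N) per lookup

  const auto t1 = std::chrono::steady_clock::now();

  std::printf("%ld lookups hit, elapsed %.3f s\n", hits,
              std::chrono::duration<double>(t1 - t0).count());
  return 0;
}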
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2010-01-27 20:42:20
|
> I got different timings with "METHOD=pro" and "METHOD=dbg"; you can find
> the details in the following tables. In "dbg" the problem is still there,
> but in "pro" it disappears. Any advice? In both cases I ran the code on a
> slave node. Thanks a lot.

Whoops, probably should have asked this long ago - you were always running
with METHOD=dbg? What about METHOD=opt??

It very well could be that some pedantic error checking in that method is
causing the problem. G++ with pedantic internal C++-library error checking
can turn O(log N) operations into O(N) ones, and I've run into this
before....

-Ben
|
From: Yujie <rec...@gm...> - 2010-01-28 14:58:00
|
Sorry for the late reply. However, why is it OK in debug mode on the Intel
32-bit cluster? There is no such problem on the 32-bit cluster. Thanks a lot.

Regards,
Yujie

On Wed, Jan 27, 2010 at 2:42 PM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:

> > I got different timings with "METHOD=pro" and "METHOD=dbg"; you can find
> > the details in the following tables. In "dbg" the problem is still there,
> > but in "pro" it disappears. Any advice? In both cases I ran the code on a
> > slave node. Thanks a lot.
>
> Whoops, probably should have asked this long ago - you were always running
> with METHOD=dbg? What about METHOD=opt??
>
> It very well could be that some pedantic error checking in that method is
> causing the problem. G++ with pedantic internal C++-library error checking
> can turn O(log N) operations into O(N) ones, and I've run into this
> before....
>
> -Ben
|