|
From: edgar <edg...@cr...> - 2021-06-10 02:11:20
|
Hi,

I am humbly sharing something which I think would improve the documentation
and the logic of examples 3 and 4 a bit. I think this would apply to other
examples as well. (I was planning to keep learning from the examples and
have a more substantial contribution at the end, but it has been about a
month since I last touched libMesh, and it seems that I am going to be very
busy in the next couple of months.)

Thanks.
|
From: John P. <jwp...@gm...> - 2021-06-10 16:01:20
|
On Wed, Jun 9, 2021 at 9:11 PM edgar <edg...@cr...> wrote:
> I am humbly sharing something which I think would improve the
> documentation and the logic of examples 3 and 4 a bit. [...]

Hi Edgar,

I agree with the updates to the code comments in both files, so thanks for
those. In the ex4 diff, it looks like you move the Ke, Fe declarations from
outside the for-loop over elements to inside? This is not likely to be an
optimization, though, since creating that storage once and "resizing" it
many times in the loop avoids dynamic memory allocations... the resizes are
no-ops if the same Elem type is used at each iteration of the for-loop. If
you have some performance profiling for this example that suggests
otherwise, I'd be happy to take a look.

-- John
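[Editor's note: for reference, here is a minimal sketch of the pattern John
describes, written with the DenseMatrix/DenseVector/DofMap names used in the
libMesh introduction examples. It illustrates the idea only and is not the
exact example code.]

#include "libmesh/dense_matrix.h"
#include "libmesh/dense_vector.h"
#include "libmesh/dof_map.h"
#include "libmesh/elem.h"
#include "libmesh/mesh_base.h"

#include <vector>

using namespace libMesh;

// Element matrix/vector declared once, outside the element loop, so their
// heap storage is allocated a single time and then reused.
void assemble_sketch (const MeshBase & mesh, const DofMap & dof_map)
{
  DenseMatrix<Number> Ke;
  DenseVector<Number> Fe;
  std::vector<dof_id_type> dof_indices;

  for (const auto & elem : mesh.active_local_element_ptr_range())
    {
      dof_map.dof_indices (elem, dof_indices);
      const unsigned int n_dofs =
        static_cast<unsigned int>(dof_indices.size());

      // resize() re-zeroes the entries but does not reallocate when the
      // requested size matches the current one, which is the case when
      // every element has the same type and number of dofs.
      Ke.resize (n_dofs, n_dofs);
      Fe.resize (n_dofs);

      // ... quadrature loop filling Ke and Fe, then adding them into the
      // global system with the dof_indices ...
    }
}

Because Ke, Fe and dof_indices live outside the loop, their storage is
allocated once and reused for every element; moving the declarations inside
the loop would re-create (and re-allocate) them on each iteration.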
|
From: edgar <edg...@cr...> - 2021-06-10 17:05:27
|
On 2021-06-10 16:00, John Peterson wrote:
> I agree with the updates to the code comments in both files, so thanks
> for those. In the ex4 diff, it looks like you move the Ke, Fe
> declarations from outside the for-loop over elements to inside? [...]

In all honesty, John, I did run a performance log on them, and the
modification was faster, but I don't have it anymore. As I implied, my
intention was to implement the changes in most examples, but I just haven't
had the time. I can reproduce the logs, but I don't know when I will have
the time for that :S (sorry :( ).

I guess that the reduced time comes from the compiler recognising the
variable as short-lived within the loop and avoiding the resizing of the
matrices for each loop.

It may take some days before I reply.
|
From: John P. <jwp...@gm...> - 2021-06-10 19:28:20
|
On Thu, Jun 10, 2021 at 12:05 PM edgar <edg...@cr...> wrote:
> In all honesty, John, I did run a performance log on them, and the
> modification was faster, but I don't have it anymore. [...]

I recorded the "Active time" for the "Matrix Assembly Performance" PerfLog
in introduction_ex4 running "./example-opt -d 3 -n 40" for both the original
codepath and your proposed change, averaging the results over 5 runs. The
results were:

Original code, "./example-opt -d 3 -n 40":
  import numpy as np
  np.mean([3.91801, 3.93206, 3.94358, 3.97729, 3.90512]) = 3.93

Patch, "./example-opt -d 3 -n 40":
  import numpy as np
  np.mean([4.10462, 4.06232, 3.95176, 3.92786, 3.97992]) = 4.00

so I'd say the original code path is marginally (but still statistically
significantly) faster, although keep in mind that matrix assembly is only
about 21% of the total time for this example while the solve is about 71%.

-- John
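[Editor's note: as background on the numbers being compared, the "Alive
time" and "Active time" figures come from libMesh's PerfLog facility, which
the introduction examples wrap around their assembly code. The sketch below
shows the assumed usage; the label and the loop are illustrative rather than
the exact introduction_ex4 code, and the reading of Active vs. Alive time is
inferred from the example output rather than quoted from the documentation.]

#include "libmesh/perf_log.h"

using namespace libMesh;

void assemble_with_logging ()
{
  // The summary table printed when this object goes out of scope is where
  // figures like the ones quoted above come from (assumed behavior; the
  // examples rely on the log being printed at destruction).
  PerfLog perf_log ("Matrix Assembly Performance");

  for (int e = 0; e < 1000; ++e)
    {
      perf_log.push ("element computations");
      // ... per-element work: dof indices, FE reinit, Ke/Fe assembly ...
      perf_log.pop ("element computations");
    }

  // "Active time" accumulates only the intervals between matching
  // push()/pop() calls; "Alive time" spans the whole lifetime of perf_log.
}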
|
From: edgar <edg...@cr...> - 2021-06-10 22:55:59
|
On 2021-06-10 19:27, John Peterson wrote:
> I recorded the "Active time" for the "Matrix Assembly Performance"
> PerfLog in introduction_ex4 [...]

Super interesting! I am sending you my benchmarks. I must say that I had
initially run only 2 benchmarks, and both came out faster with the
modifications. Now, I found that
- The original code is more efficient with `-n 40'
- The modified code is more efficient with `-n 15' and `mpirun -np 4'
- When I ran the 5-test trial several times, sometimes the original code
was more efficient with `-n 15', but the first and second runs with the
modified code were always faster (my computer heating up?)

The gains are really marginal in any case. It would be interesting to run
with -O3... (I just did [1]).

It seems that the differences are now a little bit more substantial, and
that the modified code would be faster. I hope I have not made any
mistakes.
The code and the benchmarks are in the attached file.
- examples
|- introduction
|- ex4 (original code)
|- output_*_.txt.bz2 (running -n 40 with -O2)
|- output_15_*_.txt.bz2 (running -n 15 with -O2)
|- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
|- ex4_mod (modified code)
|- output_*_.txt.bz2 (running -n 40 with -O2)
|- output_15_*_.txt.bz2 (running -n 15 with -O2)
|- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
[1] I manually compiled like this (added -O3 instead of -O2; disregard
the CCFLAGS et al):
$ mpicxx -std=gnu++17 -DNDEBUG -march=amdfam10 -O3
-felide-constructors -funroll-loops -fstrict-aliasing
-Wdisabled-optimization -fopenmp -I/usr/include -I/usr/include/curl -I
-I/usr/include -I/usr/include/eigen3 -I/usr/include/vtk
-I/usr/local/petsc/linux-c-opt/include
-I/usr/local/petsc/linux-c-opt//include -I/usr/include/superlu
-I/usr/local/include -I/usr/include/scotch -I/usr/include/tirpc -c
exact_solution.C -o exact_solution.x86_64-pc-linux-gnu.opt.o
$ mpicxx -std=gnu++17 -DNDEBUG -march=amdfam10 -O3
-felide-constructors -funroll-loops -fstrict-aliasing
-Wdisabled-optimization -fopenmp -I/usr/include -I/usr/include/curl -I
-I/usr/include -I/usr/include/eigen3 -I/usr/include/vtk
-I/usr/local/petsc/linux-c-opt/include
-I/usr/local/petsc/linux-c-opt//include -I/usr/include/superlu
-I/usr/local/include -I/usr/include/scotch -I/usr/include/tirpc -c
introduction_ex4.C -o introduction_ex4.x86_64-pc-linux-gnu.opt.o
$ mpicxx -std=gnu++17 -march=amdfam10 -O3 -felide-constructors
-funroll-loops -fstrict-aliasing -Wdisabled-optimization -fopenmp
exact_solution.x86_64-pc-linux-gnu.opt.o
introduction_ex4.x86_64-pc-linux-gnu.opt.o -o example-opt -Wl,-rpath
-Wl,/usr/lib -Wl,-rpath -Wl,/lib -Wl,-rpath -Wl,/usr/lib -Wl,-rpath
-Wl,/usr/local/petsc/linux-c-opt/lib -Wl,-rpath -Wl,/usr/local/lib
-Wl,-rpath -Wl,/usr/include/scotch -Wl,-rpath -Wl,/usr/lib/openmpi
-Wl,-rpath -Wl,/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0
/usr/lib/libHYPRE.so -L/usr/lib -lmesh_opt -ltimpi_opt -L/lib
-L/usr/local/petsc/linux-c-opt/lib -L/usr/local/lib
-L/usr/include/scotch -L/usr/lib/openmpi
-L/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0 -lhdf5_cpp -lcurl -lnlopt
-lglpk -lvtkIOCore -lvtkCommonCore -lvtkCommonDataModel -lvtkFiltersCore
-lvtkIOXML -lvtkImagingCore -lvtkIOImage -lvtkImagingMath
-lvtkIOParallelXML -lvtkParallelMPI -lvtkParallelCore
-lvtkCommonExecutionModel -lpetsc -lcmumps -ldmumps -lsmumps -lzmumps
-lmumps_common -lpord -lscalapack -lumfpack -lklu -lcholmod -lbtf
-lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -lsuperlu
-lfftw3_mpi -lfftw3 -llapack -lblas -lopenblas -lesmumps -lptscotch
-lptscotcherr -lscotch -lscotcherr -lbz2 -lcgns -lmedC -lmed
-lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lmetis -lz -lOpenCL
-lyaml -lhwloc -lX11 -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh
-lmpi -lgfortran -lm -lgcc_s -lpthread -lquadmath -lstdc++ -ldl -ltirpc
-fopenmp
|
From: John P. <jwp...@gm...> - 2021-06-18 21:45:41
|
On Thu, Jun 10, 2021 at 5:55 PM edgar <edg...@cr...> wrote:
> Super interesting! I am sending you my benchmarks. [...]

Your compiler flags are definitely far more advanced/aggressive than mine,
which are just on the default of -O2. However, I think what we should
conclude from your results is that there is something slower than it needs
to be with DenseMatrix::resize(), not that we should move the DenseMatrix
creation/destruction inside the loop over elements. What I tried (see
attached patch or the "dense_matrix_resize_no_virtual" branch in my fork)
is avoiding the virtual function call to DenseMatrix::zero() which is
currently made from DenseMatrix::resize(). In my testing, this change did
not seem to make much of a difference, but I'm curious about what you would
get with your compiler args, this patch, and the unpatched ex4.

-- John
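[Editor's note: the attached patch itself is not reproduced in this thread.
As a rough illustration of the idea John describes -- avoiding the virtual
zero() call inside resize() -- a resize() along the following lines clears
the underlying storage directly. The class name, the _val member and the
layout are assumptions for the sketch, not the real DenseMatrix internals.]

#include <cstddef>
#include <vector>

// Illustration only -- not the actual "dense_matrix_resize_no_virtual"
// patch. It assumes the matrix keeps its entries contiguously in a
// std::vector member (called _val here).
template <typename T>
class DenseMatrixSketch
{
public:
  void resize (const unsigned int new_m, const unsigned int new_n)
  {
    _m = new_m;
    _n = new_n;

    // Instead of calling a virtual zero() after changing the dimensions,
    // fill the storage directly: assign() both sizes the vector and zeroes
    // every entry, with no virtual dispatch involved.
    _val.assign (static_cast<std::size_t>(new_m) * new_n, T(0));
  }

private:
  unsigned int _m = 0;
  unsigned int _n = 0;
  std::vector<T> _val;
};

Since assign() sizes and zeroes the entries in one pass and involves no
virtual call, the whole resize() can be inlined by the compiler.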
|
From: edgar <edg...@cr...> - 2021-06-19 02:54:38
|
On 2021-06-18 21:45, John Peterson wrote:
> [...] In my testing, this change did not seem to make much of a
> difference, but I'm curious about what you would get with your compiler
> args, this patch, and the unpatched ex4.

I will surely test it. I will have more time next week. Sorry for the delay.
|
From: edgar <edg...@cr...> - 2021-06-28 03:24:44
|
On 2021-06-18 21:45, John Peterson wrote:
> Your compiler flags are definitely far more advanced/aggressive than
> mine,

I cannot take credit for that, really. I only modified the -O2 to -O3, made
sure that -funroll-loops was there and customised to my processor
(amdfam10). All the other flags come directly from the Makefile provided by
libMesh.

> which are just on the default of -O2. However, I think what we should
> conclude from your results is that there is something slower than it
> needs to be with DenseMatrix::resize(), not that we should move the
> DenseMatrix creation/destruction inside the loop over elements. [...]
> In my testing, this change did not seem to make much of a difference,
> but I'm curious about what you would get with your compiler args, this
> patch, and the unpatched ex4.

There _is_ something consistently different for sure. I only ran the case
with `mpirun -np 4' and `-n 40'. The difference of the sums of times is on
the order of 1 second. For five tests of this size and my rather limited
system, I would say that your change yields marginally faster computation
and should be used. In which case, my modifications should be avoided.

In the interest of completeness, I need to say that I had to rebuild
libMesh because of compilation errors. I don't quite remember which version
it is right now, but it is not the updated master branch (due to some
issues that I am having with my Internet connection). Although this may not
affect the comparison, it should be noted.

The results are shown below and in examples/introduction/sums.org

#+name: tbl-results
#+caption: The first two columns correspond to the (patched) original code.
The last pair are the results with my modification (also with patch). In
each case, the first of the columns is alive time, and the second one is
active time. Data was copied from the .bz2 files.
| 3.65205 | 1.292   | 3.63248 | 1.31057 |
| 4.82533 | 1.76303 | 5.31107 | 1.95794 |
| 5.05955 | 1.84457 | 5.26696 | 1.964   |
| 3.86126 | 1.40952 | 3.53834 | 1.29313 |
| 3.58892 | 1.30998 | 4.369   | 1.59834 |

#+caption: calculate the sums of each column
#+begin_src python :var data=tbl-results
ex4_alive = sum((I[0] for I in data))
ex4_active = sum((I[1] for I in data))
ex4_mod_alive = sum((I[2] for I in data))
ex4_mod_active = sum((I[3] for I in data))
return [["ex4_alive", "ex4_active", "ex4_mod_alive", "ex4_mod_active"],
        None,
        [ex4_alive, ex4_active, ex4_mod_alive, ex4_mod_active]]
#+end_src

#+RESULTS:
| ex4_alive | ex4_active         | ex4_mod_alive      | ex4_mod_active |
|-----------+--------------------+--------------------+----------------|
|  20.98711 | 7.6190999999999995 | 22.117849999999997 |        8.12398 |
|
From: John P. <jwp...@gm...> - 2021-06-28 18:45:23
|
Hi Edgar,

Including your previous results, the ex4 patch alone still has the fastest
"Active" time on average:

# previous results
upstream:  AVG= 1.51775
ex4 patch: AVG= 1.29828

# current results
virtual function patch:         AVG= 1.52382
virtual function + ex4 patches: AVG= 1.6248

But since you rebuilt libmesh between then and now, I'm not sure we should
really compare the two.

Another thing to mention is that running the test in parallel is probably
counterproductive to our goals since:
1.) It reduces the overall active time, and the shorter the duration of the
thing you are trying to time, the more it is affected by the timing code
itself.
2.) It will introduce more variability in the results. Currently, the
coefficient of variation (stddev divided by mean) for these results is on
the order of 15-18%, a fair bit larger than the differences in the times
themselves.

If you are interested in investigating this further, I would suggest that
you re-check the previous results on your current libmesh build, but I'd
also run everything in serial to try and reduce the variation to the point
where one of the four possible versions is statistically faster than all
the others...

On Sun, Jun 27, 2021 at 10:03 PM edgar <edg...@cr...> wrote:
> There _is_ something consistently different for sure. [...]

-- John
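[Editor's note: to make the variability point concrete, the coefficient of
variation can be recomputed from the "active time" column of the ex4
results in the table quoted earlier. This is an illustrative sketch; the
message does not say whether the sample or the population standard
deviation was used, so the exact percentage may differ slightly.]

#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

int main ()
{
  // "Active time" column for the (patched) original ex4, taken from the
  // table in the previous message.
  const std::vector<double> t = {1.292, 1.76303, 1.84457, 1.40952, 1.30998};

  const double mean = std::accumulate(t.begin(), t.end(), 0.0) / t.size();

  double ss = 0.0;
  for (const double x : t)
    ss += (x - mean) * (x - mean);
  const double stddev = std::sqrt(ss / (t.size() - 1));  // sample std. dev.

  // Coefficient of variation: run-to-run scatter relative to the mean.
  std::cout << "mean = " << mean
            << ", stddev = " << stddev
            << ", CV = " << 100.0 * stddev / mean << " %\n";
  // Prints a CV of roughly 17 %, i.e. the noise between runs is comparable
  // to or larger than the differences between the code paths under test.
  return 0;
}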
|
From: edgar <edg...@cr...> - 2021-06-29 02:32:59
|
On 2021-06-28 18:45, John Peterson wrote:
> Hi Edgar,
>
> Including your previous results, the ex4 patch alone still has the
> fastest
> ---8<--- snip

Great! It means that your change really works.

> But since you rebuilt libmesh between then and now, I'm not sure we
> should really compare the two.

Precisely.

> ---8<--- snip
> in parallel is probably counterproductive to our goals since:
> 1.) It reduces the overall active time,
> ---8<--- snip
> it is affected by the timing code itself.
> 2.) It will introduce more variability in the results. Currently, the
> coefficient of variation (stddev divided by mean) for these results is
> on the order of 15-18%, a fair bit larger than the differences in the
> times themselves.

Right.

> If you are interested in investigating this further, I would suggest
> that
> ---8<--- snip

Although it would be interesting, it sounds like a task for someone with a
tad more time than me :P (at the moment). In the end, I am quite sure that
the changes which you introduced make the code more efficient. This task
would surely bring more formality to it.

Thanks!