From: edgar <edg...@cr...> - 2021-06-19 02:54:38
On 2021-06-18 21:45, John Peterson wrote:
> On Thu, Jun 10, 2021 at 5:55 PM edgar <edg...@cr...> wrote:
>
>> On 2021-06-10 19:27, John Peterson wrote:
>> > I recorded the "Active time" for the "Matrix Assembly Performance"
>> > PerfLog in introduction_ex4 running "./example-opt -d 3 -n 40" for
>> > both the original codepath and your proposed change, averaging the
>> > results over 5 runs. The results were:
>> >
>> > Original code, "./example-opt -d 3 -n 40"
>> > import numpy as np
>> > np.mean([3.91801, 3.93206, 3.94358, 3.97729, 3.90512]) = 3.93
>> >
>> > Patch, "./example-opt -d 3 -n 40"
>> > import numpy as np
>> > np.mean([4.10462, 4.06232, 3.95176, 3.92786, 3.97992]) = 4.00
>> >
>> > so I'd say the original code path is marginally (but still
>> > statistically significantly) faster, although keep in mind that
>> > matrix assembly is only about 21% of the total time for this example
>> > while the solve is about 71%.
>>
>> Super interesting; I am sending you my benchmarks. I must say that I
>> had initially run only 2 benchmarks, and both came out faster with the
>> modifications. Now I found that:
>> - The original code is more efficient with `-n 40'.
>> - The modified code is more efficient with `-n 15' and `mpirun -np 4'.
>> - When I ran the 5-test trial several times, the original code was
>>   sometimes more efficient with `-n 15', but the first and second runs
>>   with the modified code were always faster (my computer heating up?).
>>
>> The gains are really marginal in any case. It would be interesting to
>> run with -O3... (I just did [1]). It seems that the differences are
>> now a little more substantial, and that the modified code would be
>> faster. I hope I have not made any mistakes.
>>
>> The code and the benchmarks are in the attached file.
>> - examples
>>   |- introduction
>>      |- ex4 (original code)
>>         |- output_*_.txt.bz2 (running -n 40 with -O2)
>>         |- output_15_*_.txt.bz2 (running -n 15 with -O2)
>>         |- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
>>      |- ex4_mod (modified code)
>>         |- output_*_.txt.bz2 (running -n 40 with -O2)
>>         |- output_15_*_.txt.bz2 (running -n 15 with -O2)
>>         |- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
>>
>> [1] I manually compiled like this (added -O3 instead of -O2;
>> disregard the CCFLAGS et al.):
>>
>> $ mpicxx -std=gnu++17 -DNDEBUG -march=amdfam10 -O3
>
> Your compiler flags are definitely far more advanced/aggressive than
> mine, which are just on the default of -O2. However, I think what we
> should conclude from your results is that there is something slower
> than it needs to be with DenseMatrix::resize(), not that we should
> move the DenseMatrix creation/destruction inside the loop over
> elements. What I tried (see the attached patch or the
> "dense_matrix_resize_no_virtual" branch in my fork) is avoiding the
> virtual function call to DenseMatrix::zero() which is currently made
> from DenseMatrix::resize(). In my testing, this change did not seem to
> make much of a difference, but I'm curious about what you would get
> with your compiler args, this patch, and the unpatched ex4.

I will surely test it. I will have more time next week. Sorry for the
delay.
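For reference, a quick stdlib-only sketch of the averaging done earlier
in the thread, extended with the standard deviation of each set of runs
(reporting the run-to-run scatter is my own addition; the timing numbers
are the "Active time" samples quoted above):

```python
# Recompute the thread's averages with statistics instead of numpy,
# and also report each set's sample standard deviation.
from statistics import mean, stdev

orig  = [3.91801, 3.93206, 3.94358, 3.97729, 3.90512]  # unpatched code
patch = [4.10462, 4.06232, 3.95176, 3.92786, 3.97992]  # proposed change

for label, runs in (("original", orig), ("patched", patch)):
    print(f"{label}: {mean(runs):.3f} +/- {stdev(runs):.3f} s")
```

The scatter of the patched runs comes out roughly the same size as the
~0.07 s gap between the two means, which may be part of why repeated
trials can flip the ranking.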