From: edgar <edg...@cr...> - 2021-06-19 02:54:38
On 2021-06-18 21:45, John Peterson wrote:
> On Thu, Jun 10, 2021 at 5:55 PM edgar <edg...@cr...> wrote:
>
>> On 2021-06-10 19:27, John Peterson wrote:
>> > I recorded the "Active time" for the "Matrix Assembly Performance"
>> > PerfLog in introduction_ex4 running "./example-opt -d 3 -n 40" for
>> > both the original codepath and your proposed change, averaging the
>> > results over 5 runs. The results were:
>> >
>> > Original code, "./example-opt -d 3 -n 40"
>> > import numpy as np
>> > np.mean([3.91801, 3.93206, 3.94358, 3.97729, 3.90512]) = 3.93
>> >
>> > Patch, "./example-opt -d 3 -n 40"
>> > import numpy as np
>> > np.mean([4.10462, 4.06232, 3.95176, 3.92786, 3.97992]) = 4.00
>> >
>> > so I'd say the original code path is marginally (but still
>> > statistically significantly) faster, although keep in mind that
>> > matrix assembly is only about 21% of the total time for this example
>> > while the solve is about 71%.
>>
>> Super interesting; I am sending you my benchmarks. I must say that I
>> had initially run only 2 benchmarks, and both came out faster with the
>> modifications. Now I found that:
>> - The original code is more efficient with `-n 40'.
>> - The modified code is more efficient with `-n 15' and `mpirun -np 4'.
>> - When I ran the 5-test trial several times, the original code was
>>   sometimes more efficient with `-n 15', but the first and second runs
>>   with the modified code were always faster (my computer heating up?).
>>
>> The gains are really marginal in any case. It would be interesting to
>> run with -O3... (I just did [1]). It seems that the differences are
>> now a little more substantial, and that the modified code would be
>> faster. I hope I have not made any mistakes.
>>
>> The code and the benchmarks are in the attached file.
>> - examples
>>   |- introduction
>>      |- ex4 (original code)
>>         |- output_*_.txt.bz2 (running -n 40 with -O2)
>>         |- output_15_*_.txt.bz2 (running -n 15 with -O2)
>>         |- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
>>      |- ex4_mod (modified code)
>>         |- output_*_.txt.bz2 (running -n 40 with -O2)
>>         |- output_15_*_.txt.bz2 (running -n 15 with -O2)
>>         |- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
>>
>> [1] I manually compiled like this (added -O3 instead of -O2;
>> disregard the CCFLAGS et al.):
>>
>> $ mpicxx -std=gnu++17 -DNDEBUG -march=amdfam10 -O3
>
> Your compiler flags are definitely far more advanced/aggressive than
> mine, which are just on the default of -O2. However, I think what we
> should conclude from your results is that there is something slower
> than it needs to be with DenseMatrix::resize(), not that we should
> move the DenseMatrix creation/destruction inside the loop over
> elements. What I tried (see the attached patch or the
> "dense_matrix_resize_no_virtual" branch in my fork) is avoiding the
> virtual function call to DenseMatrix::zero() which is currently made
> from DenseMatrix::resize(). In my testing, this change did not seem to
> make much of a difference, but I'm curious about what you would get
> with your compiler args, this patch, and the unpatched ex4.

I will surely test it. I will have more time next week. Sorry for the
delay.
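For reference, a quick stdlib-only sketch of the averaging done earlier
in the thread, extended with the standard deviation of each set of runs
(reporting the run-to-run scatter is my own addition; the timing numbers
are the "Active time" samples quoted above):

```python
# Recompute the thread's averages with statistics instead of numpy,
# and also report each set's sample standard deviation.
from statistics import mean, stdev

orig  = [3.91801, 3.93206, 3.94358, 3.97729, 3.90512]  # unpatched code
patch = [4.10462, 4.06232, 3.95176, 3.92786, 3.97992]  # proposed change

for label, runs in (("original", orig), ("patched", patch)):
    print(f"{label}: {mean(runs):.3f} +/- {stdev(runs):.3f} s")
```

The scatter of the patched runs comes out roughly the same size as the
~0.07 s gap between the two means, which may be part of why repeated
trials can flip the ranking.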