Bug in openmp

Forum: Elk Developers

Creator: Youzhao Lan

Created: 2014-12-24

Updated: 2015-03-16

Youzhao Lan - 2014-12-24

Dear all,
In my last post, I reported a bug in the OpenMP build: repeated calculations with the same input give different results.
While debugging the code, I found that disabling -fopenmp for the gencore and rhomag calls temporarily fixes this bug.
However, a new issue appears: the results of the OpenMP calculation differ from those of the serial calculation.
A possible cause is a bug in the generation of evalfv, evecfv, and evecsv, but disabling -fopenmp for the relevant calls
does not fix it.
To reproduce these bugs, run the program and check the totenergy.out, or output evalfv.

Note that I have only tested tasks 0, 120, and 125; see the GaAs-NLO example.

Merry Christmas and Happy New Year
Youzhao Lan

Last edit: Youzhao Lan 2014-12-24

J. K. Dewhurst - 2015-03-10

Hi!

How different are the results? Most parallel calculations that involve floating-point arithmetic are not perfectly reproducible. This is because the internal FP registers on Intel chips are 80-bit while main memory is 64-bit, which results in a loss of precision as numbers are swapped in and out unpredictably.
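
As a general illustration of why parallel floating-point results can differ in the last digits (a minimal sketch, not taken from Elk and independent of the register-width effect described above): floating-point addition is not associative, so combining partial sums in a different order can change the result slightly. The helper `chunked_sum` and the random test data are made up for this example.

```python
import random

def chunked_sum(values, n_chunks):
    """Sum `values` as up to n_chunks partial sums, then combine them.

    This only mimics how a parallel reduction splits the work;
    it is not how Elk or OpenMP is implemented.
    """
    size = (len(values) + n_chunks - 1) // n_chunks
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)

# Floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Values spanning many orders of magnitude make the effect easier to see.
random.seed(0)
data = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
        for _ in range(10000)]
serial = chunked_sum(data, 1)
parallel_like = chunked_sum(data, 4)
print(abs(serial - parallel_like))  # typically tiny, but not necessarily zero
```

Differences of this kind are normally at the level of rounding error, which is why the size of the discrepancy matters when judging whether something is a real bug.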

Regards,
Kay.

- martin_frbg - 2015-03-10
  
  Note that this only applies if you use the "legacy" x87 FPU instead of SSE, so it should not be an issue on even moderately recent systems (unless you explicitly force the compiler to generate x87 code).
  

Youzhao Lan - 2015-03-10

Dear Kay,
Thanks for your attention.
This bug is still present in Elk 3.0.4.
As far as I can tell, the bug shows up in the calculation of the "diagonal" momentum matrix elements (i.e., P_ii).
To reproduce the bug:
1. Run GaAs-NLO from the examples directory. For simplicity, reduce the number of k-points,
e.g. 2 2 2.
2. Run Elk with full OpenMP support. You will obtain a different CHI_INTRA2w_123.OUT on each run with the same input.
3. Disabling the OpenMP directives (!$OMP) around the gencore.f90 and rhomag.f90 calls and recompiling Elk temporarily fixes this,
that is, each run with the same input then outputs exactly the same CHI_INTRA2w_123.OUT.
However, this CHI_INTRA2w_123.OUT differs from the one produced by a serial Elk calculation (i.e., with OpenMP fully disabled).
The other NLO outputs (CHI_INTER2w_123.OUT, CHI_INTERw_123.OUT, CHI_INTRAw_123.OUT) show only acceptable differences.

Note that only CHI_INTRA2w_123.OUT depends on the "diagonal" momentum matrix elements.
If you need more information, please let me know. I have checked this bug on both Windows and Linux.
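
To put a number on how much two runs differ, something like the following could be used. This is only a sketch: it assumes the output files contain whitespace-separated numeric columns, and the two sample strings are made-up values, not real Elk output. The function names `parse_numbers` and `max_abs_diff` are hypothetical helpers.

```python
def parse_numbers(text):
    """Return every token that parses as a float, skipping other tokens."""
    values = []
    for token in text.split():
        try:
            values.append(float(token))
        except ValueError:
            pass  # ignore headers, labels, or comment markers
    return values

def max_abs_diff(text_a, text_b):
    """Largest absolute difference between corresponding numbers in two outputs."""
    a = parse_numbers(text_a)
    b = parse_numbers(text_b)
    if len(a) != len(b):
        raise ValueError("outputs contain different numbers of values")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

# Made-up sample data standing in for the same output file from two runs.
run1 = "0.000 1.2345678E-03\n0.010 1.2400000E-03\n"
run2 = "0.000 1.2345679E-03\n0.010 1.2400000E-03\n"
print(max_abs_diff(run1, run2))
```

For real files, the two strings would be replaced by the contents of the saved outputs, for example `open(path).read()` on copies of CHI_INTRA2w_123.OUT kept from each run.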

Best regards
Youzhao Lan

Last edit: Youzhao Lan 2015-03-10

J. K. Dewhurst - 2015-03-11

Dear Youzhao and Martin,

There is no bug in the main code with OpenMP or (so far as I know) anything else.

The problem is with the routine nonlinopt, which is not surprising because the formula for second-harmonic generation is quite complicated.

What is happening is that there are degeneracies in the Kohn-Sham eigenvalues, and therefore the corresponding eigenstates are defined only up to a unitary transformation. This does not affect any observable, though. When you run Elk twice, the eigenvectors can change randomly because of the register-memory swapping effect. As Martin mentioned, if the code used only SSE then you wouldn't have noticed, although the results would still have been wrong.
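
The statement that degenerate eigenstates are defined only up to a unitary transformation can be checked on a toy example. This is a minimal sketch with a made-up 3x3 real symmetric matrix `H` (not a Kohn-Sham Hamiltonian) whose eigenvalue 1 is two-fold degenerate; any rotation of the two eigenvectors within that subspace is again a valid pair of eigenvectors.

```python
import math

# Toy matrix: eigenvalue 4 for (1,1,1), eigenvalue 1 twice (degenerate).
H = [[2.0, 1.0, 1.0],
     [1.0, 2.0, 1.0],
     [1.0, 1.0, 2.0]]

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def rotate_pair(u1, u2, theta):
    """Mix two vectors by a real rotation (a special case of a unitary transformation)."""
    c, s = math.cos(theta), math.sin(theta)
    w1 = [c * a + s * b for a, b in zip(u1, u2)]
    w2 = [-s * a + c * b for a, b in zip(u1, u2)]
    return w1, w2

def residual(m, v, eigenvalue):
    """Largest component of (m v - eigenvalue v); zero for an exact eigenvector."""
    mv = matvec(m, v)
    return max(abs(x - eigenvalue * y) for x, y in zip(mv, v))

# One orthonormal basis of the degenerate subspace (both orthogonal to (1,1,1)).
u1 = [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0]
u2 = [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]

for theta in (0.0, 0.3, 1.1):
    w1, w2 = rotate_pair(u1, u2, theta)
    # Both residuals stay at rounding-error level for every angle.
    print(theta, residual(H, w1, 1.0), residual(H, w2, 1.0))
```

A diagonaliser is free to return any of these rotated pairs, so tiny numerical differences between runs can select visibly different eigenvectors.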

In short, the problem is in nonlinopt which we will try and fix ASAP.

Regards,
Kay.

Youzhao Lan - 2015-03-11

Dear Kay,
Thanks for your reply.
But for the serial Elk calculation, CHI_INTRA2w_123.OUT is exactly the same for each repeated run with the same input. Any more information would be appreciated.

Best regards
Youzhao Lan

Sangeeta Sharma - 2015-03-16

Hi

The bug is not in the parallel version of the code.

The non-linear optics formulae are derived using non-degenerate perturbation theory. As a result, the intraband 2w term and the modulation term depend on the diagonal momentum matrix elements. In the case of degeneracy, the eigenstates can be any linear combination within the degenerate subspace, so these elements are not uniquely defined. This is what leads to different results on each run, as it should do.
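
A small numerical sketch of this point, under simplifying assumptions: `P` below is a made-up real symmetric 2x2 matrix standing in for an operator restricted to a two-fold degenerate subspace (real momentum matrix elements are complex and the mixing is a general unitary transformation). Rotating the basis changes the individual diagonal elements, while their sum (the trace) stays fixed.

```python
import math

# Hypothetical operator matrix in one basis of the degenerate subspace.
P = [[0.4, 0.3],
     [0.3, -0.1]]

def diagonal_elements(p, theta):
    """Diagonal elements <w_i|P|w_i> in a basis rotated by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    basis = [(c, s), (-s, c)]
    diag = []
    for w in basis:
        pw = (p[0][0] * w[0] + p[0][1] * w[1],
              p[1][0] * w[0] + p[1][1] * w[1])
        diag.append(w[0] * pw[0] + w[1] * pw[1])
    return diag

for theta in (0.0, 0.5, 1.0):
    d = diagonal_elements(P, theta)
    # The two diagonal elements vary with theta; their sum does not.
    print(theta, d, sum(d))
```

Any formula that uses the diagonal elements individually therefore inherits the arbitrariness of the basis, whereas basis-independent quantities such as the trace do not.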

The way to fix it is either to break the degeneracy slightly or to re-derive the whole formalism using degenerate perturbation theory. For the former, one can simply shift the k-point grid by a small amount, which breaks the symmetry.

For example please try
vkloff
0.015 0.005 0.022
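
To see why a small offset moves the grid away from symmetric positions, here is a rough sketch. It assumes a uniform grid in lattice coordinates with k = (i + offset) / n along each axis, i.e. the offset is taken in units of the grid spacing; that convention is an assumption here and should be checked against the Elk manual. The map k -> -k is used only as a simple stand-in for a symmetry operation, and both function names are hypothetical.

```python
from itertools import product

def kpoint_grid(ngridk, vkloff):
    """Uniform grid in lattice coordinates: k = (i + offset) / n per axis.

    Assumption: the offset is given in units of the grid spacing.
    """
    points = []
    for idx in product(*(range(n) for n in ngridk)):
        points.append(tuple((i + off) / n
                            for i, off, n in zip(idx, vkloff, ngridk)))
    return points

def count_inversion_partners(points, digits=9):
    """Count grid points k for which -k (modulo a lattice vector) is also on the grid."""
    def key(k):
        # Fold into [0, 1) and round so that equal points compare equal.
        return tuple(round(x % 1.0, digits) % 1.0 for x in k)
    on_grid = {key(k) for k in points}
    return sum(1 for k in points if key(tuple(-x for x in k)) in on_grid)

unshifted = kpoint_grid((2, 2, 2), (0.0, 0.0, 0.0))
shifted = kpoint_grid((2, 2, 2), (0.015, 0.005, 0.022))
print(count_inversion_partners(unshifted))  # 8
print(count_inversion_partners(shifted))    # 0
```

On the unshifted 2 2 2 grid every point maps onto a grid point, while on the shifted grid none does, so the sampled k-points no longer sit at high-symmetry positions.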

Best
Sangeeta

Youzhao Lan - 2015-03-16

Dear Sangeeta,
Thanks for your help. Everything works fine now.
Am I right in understanding that the present calculation is appropriate for
low-symmetry systems, but not for high-symmetry systems unless I break the degeneracy slightly?

Any help will be appreciated.
Youzhao Lan

Last edit: Youzhao Lan 2015-03-17
