Hi,
I'd like to report a bug in the "ssyrk" function for K of the form 4 n + 2 and N >= 56, and when the source matrix is transposed.
I've encountered this in the version of ATLAS that is available with Debian Linux, in one of the following packages:
ii libatlas-dev 3.8.4-9+deb7u1 all Automatically Tuned Linear Algebra Software, C header files
ii libatlas3-base 3.8.4-9+deb7u1 amd64 Automatically Tuned Linear Algebra Software, generic shared
ii libatlas3gf-base 3.8.4-9+deb7u1 all Transitional package to libatlas3-base
[uname -a -> Linux a01 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux ]
The function 'ssyrk' seems to reliably produce a NaN in the first element of the output matrix under the following conditions:
cblas_ssyrk(CblasRowMajor, CblasLower, CblasTrans, N, K, [etc.] )
where K is of the form 4 n + 2 and N >= 56. It seems to be important that the transpose-ness be "CblasTrans".
For reference, the function call is as below.
void cblas_ssyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo,
const enum CBLAS_TRANSPOSE Trans, const int N, const int K,
const float alpha, const float *A, const int lda,
const float beta, float *C, const int ldc)
Dan
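For readers who want something to compare the library result against, here is a minimal sketch of what the call above is expected to compute in the row-major, lower, transposed case, namely C := alpha * A^T * A + beta * C on the lower triangle. The routine name ref_ssyrk_rowmajor_lower_trans is hypothetical; it is not part of ATLAS or Kaldi, and it is a plain triple loop, not how ATLAS computes the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of ATLAS or Kaldi).
 * Row-major, lower triangle, transposed case:
 *   C := alpha * A^T * A + beta * C
 * A is K x N with row stride lda >= N; C is N x N with row stride ldc >= N.
 * Only entries C[i][j] with j <= i are written. */
static void ref_ssyrk_rowmajor_lower_trans(int n, int k, float alpha,
                                           const float *a, int lda,
                                           float beta, float *c, int ldc)
{
    assert(n >= 0 && k >= 0 && lda >= n && ldc >= n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            float sum = 0.0f;
            /* Column i of A dotted with column j of A. */
            for (int p = 0; p < k; ++p)
                sum += a[(size_t)p * lda + i] * a[(size_t)p * lda + j];
            float *cij = &c[(size_t)i * ldc + j];
            /* By BLAS convention, C is not read when beta is zero. */
            *cij = (beta == 0.0f) ? alpha * sum : alpha * sum + beta * *cij;
        }
    }
}
```

Running this on the same inputs as cblas_ssyrk and comparing the lower triangles element by element is one way to tell a wrong library result from a problem in the tester.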
The following gdb trace may tell you something also, although I'm not sure what.
step
kaldi::cblas_Xsyrk (trans=kaldi::kTrans, dim_c=193, other_dim_a=26, alpha=0.300000012, A=0x722e40, a_stride=196, beta=1.75432003, C=0x7ffff7fb1010, c_stride=196) at ../matrix/cblas-wrappers.h:286
286 dim_c, other_dim_a, alpha, A, a_stride, beta, C, c_stride);
(gdb) list
281 const MatrixTransposeType trans, const MatrixIndexT dim_c,
282 const MatrixIndexT other_dim_a, const float alpha, const float *A,
283 const MatrixIndexT a_stride, const float beta, float *C,
284 const MatrixIndexT c_stride) {
285 cblas_ssyrk(CblasRowMajor, CblasLower, static_cast<CBLAS_TRANSPOSE>(trans),
286 dim_c, other_dim_a, alpha, A, a_stride, beta, C, c_stride);
287 }
288
a01:matrix: ldd ./matrix-lib-test
linux-vdso.so.1 => (0x00007fff457b4000)
libfst.so.1 => /home/dpovey/kaldi-pure/tools/openfst/lib/libfst.so.1 (0x00007f2c94dc6000)
libatlas.so.3 => /usr/lib/atlas-base/libatlas.so.3 (0x00007f2c947ca000)
libf77blas.so.3 => /usr/lib/atlas-base/libf77blas.so.3 (0x00007f2c945ad000)
libcblas.so.3 => /usr/lib/atlas-base/libcblas.so.3 (0x00007f2c94390000)
liblapack_atlas.so.3 => /usr/lib/atlas-base/liblapack_atlas.so.3 (0x00007f2c94171000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2c93f3d000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2c93d39000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2c93a31000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2c937af000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2c93599000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2c9320e000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f2c92ef8000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2c951db000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f2c92cc2000)
The 3.8 series hasn't been supported since 07/10/12, when 3.10 became the new stable. Can you see if this is still a problem in a modern ATLAS?
Thanks,
Clint
I was able to replicate it with 3.10.0.
BTW, the 4 n + 2 thing turned out not to be necessary. The main thing is that the size of the matrix should be >= 56, and the argument should be transposed.
It doesn't happen every time.
Here is a stack trace for where the NaN appears.
LOG (UnitTestSymAddMat2():matrix-lib-test.cc:2792) M sum is -26.2716
Hardware watchpoint 2: * $2
Old value = 0.751994371
New value = -nan(0x7fffff)
0x0000000000920035 in ATL_spputblk_diag ()
(gdb) bt
0 0x0000000000920035 in ATL_spputblk_diag ()
1 0x00000000009193b8 in ATL_sprk_kmm ()
2 0x0000000000741195 in ATL_ssprk_rK ()
3 0x0000000000523b73 in ATL_ssyrk ()
4 0x000000000048a690 in kaldi::cblas_Xsyrk (trans=kaldi::kTrans, dim_c=74, other_dim_a=11, alpha=0.300000012, A=0xb66680, a_stride=76, beta=1.75432003, C=0xb60e90, c_stride=76) at ../matrix/cblas-wrappers.h:286
5 0x00000000004a6eec in kaldi::MatrixBase<float>::SymAddMat2 (this=0x7fffffffd510, alpha=0.300000012, A=..., transA=kaldi::kTrans, beta=1.75432003) at kaldi-matrix.cc:242
6 0x00000000004425fd in kaldi::UnitTestSymAddMat2<float> () at matrix-lib-test.cc:2795
7 0x000000000044207c in main () at matrix-lib-test.cc:4249
(gdb)
Great. Can you post a main program that generates the error so I can use it to track down the problem and confirm the bug (if you can do so with the ATLAS testers, just give me the commandline to use)?
If your present tester is a huge thing, maybe you can cut it down to something simple so I can figure it out quickly? If it's not hard for you, something in C is preferable to C++.
Very much appreciated,
Clint
Hi,
I just noticed this (for some reason it did not go to my email),
I've asked someone to try to prepare a simple test program and I'll post it here when it's done.
Dan
-- mode: compilation; default-directory: "/home/dpovey/" --
Compilation started at Tue May 13 13:41:23
Hi,
With the help of Shi Wei (shiwei@sz.pku.edu.cn) I managed to get something that somewhat reproduces the bug.
In a stand-alone C++ program we were not able to reproduce the NaN, which depends on the exact order things are allocated in, but we did manage to get Valgrind to show that there was an uninitialized value being propagated.
I've tried to get the same showing in a "C" program but was unable to.
I'm pretty sure this example is not minimal (there is some crud), but it should suffice.
From the valgrind errors, it seems that the code in src/blas/pklevel3/sprk/ATL_prk_kmm.c mallocs something, and it may be used uninitialized in some code path.
Dan
-- mode: compilation; default-directory: "/home/dpovey/" --
Compilation started at Tue May 13 14:00:44
g++ -c ssyrk_nan_bug_test.cpp -I/home/dpovey/kaldi-clean/tools/ATLAS/include -o tmp.o; g++ tmp.o /usr/lib/atlas-base/libatlas.so.3.0 /usr/lib/atlas-base/libf77blas.so.3.0 /usr/lib/atlas-base/libcblas.so.3 /usr/lib/atlas-base/liblapack_atlas.so.3 -o tmp; valgrind --track-origins=yes ./tmp
==12714== Memcheck, a memory error detector
==12714== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==12714== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==12714== Command: ./tmp
==12714==
kNoTrans: 111, kTrans: 112
dimM = 193, dimN = 2
N sum is : 386.000000
M sum is : 37249.000000
==12714== Conditional jump or move depends on uninitialised value(s)
==12714== at 0x401105: UnitTestSymAddMat2() (in /home/dpovey/tmp)
==12714== by 0x4011CB: main (in /home/dpovey/tmp)
==12714== Uninitialised value was created by a heap allocation
==12714== at 0x4C28BED: malloc (vg_replace_malloc.c:263)
==12714== by 0x4FCF772: ATL_sprk_kmm (in /usr/lib/atlas-base/libatlas.so.3.0)
==12714== by 0x4FD0F4A: ATL_ssprk_rK (in /usr/lib/atlas-base/libatlas.so.3.0)
==12714== by 0x4FD01D4: ATL_ssprk (in /usr/lib/atlas-base/libatlas.so.3.0)
==12714== by 0x4FD14E5: ATL_ssyrk (in /usr/lib/atlas-base/libatlas.so.3.0)
==12714== by 0x5651CCC: cblas_ssyrk (in /usr/lib/atlas-base/libcblas.so.3.0)
==12714== by 0x400D47: cblas_Xsyrk(MatrixTransposeType, int, int, float, float const*, int, float, float*, int) (in /home/dpovey/tmp)
==12714== by 0x400E4B: SymAddMat2(float, Matrix const*, MatrixTransposeType, Matrix*, float) (in /home/dpovey/tmp)
==12714== by 0x4010C3: UnitTestSymAddMat2() (in /home/dpovey/tmp)
==12714== by 0x4011CB: main (in /home/dpovey/tmp)
==12714==
Holy mother of god. That is pretty convoluted as a demonstration case, particularly for someone who doesn't remember a lot of C++. Can you decode it for me and tell me what the actual arguments to SYRK are, so I can see if I can reproduce in something I understand?
Unfortunately, the error message:
Conditional jump or move depends on uninitialised value(s)
is often generated completely spuriously by valgrind, so this'll only be the original problem if we are lucky.
I have removed the "in developer series" tag, because the developer series does not call this code anymore.
Hi, actually the code is almost C-style. With minor changes, like removing the static_cast operator and declaring variables outside the for loops, you can build the code with a C compiler. See the attachment. It's true that this "bug" may appear differently on different platforms. In my situation, CentOS 6.3 with GCC 4.4.7, I can produce NaNs with both the C and C++ code. The same happens on a friend's machine with Ubuntu 12.04, GCC 4.6.3 and ATLAS 3.10.0. But the elements that turned out to be NaN (to be more precise, -nan) were different.
Last edit: Wei Shi 2014-05-14
Can you confirm that this occurs with 3.10.1? I built the debug version, and the only valgrind probs I get are:
==1127== Conditional jump or move depends on uninitialised value(s)
==1127== at 0x401148: UnitTestSymAddMat2() (ssyrk_nan_bug_test.cpp:221)
==1127== by 0x401214: main (ssyrk_nan_bug_test.cpp:233)
==1127==
==1127== Conditional jump or move depends on uninitialised value(s)
==1127== at 0x401153: UnitTestSymAddMat2() (ssyrk_nan_bug_test.cpp:221)
==1127== by 0x401214: main (ssyrk_nan_bug_test.cpp:233)
which is that (often spurious) error coming from your tester, not ATLAS.
Since I'm not confident these jump/move messages mean anything, can you go back to the code you had that generated NaNs, and tell me the precise call to SYRK you made to get the NaNs (i.e., give the precise value of each parameter)? Make sure you are linking with 3.10.1 installed from my tarfile, which is the code I can use & support.
If that doesn't allow me to repeat the bug, it may either be not an ATLAS bug, or just as likely an ATLAS bug that only happens due to the tuning on a given system. So, attach the error report as described here:
http://math-atlas.sourceforge.net/faq.html#help
that will allow me to see the exact machine we are talking about, which may be key in tracking things further.
Thanks,
Clint
I cannot reproduce this problem using 3.10.1 on my Corei264AVX, at least.
With the latest tester, the only valgrind errors I get come from your tester, not ATLAS.
Your tester does print some NaNs though, and so now I have to try to figure out whether those come from the tester or ATLAS. This would be hugely easier for me to do if the tester was not so extremely complicated (defining its own types, writing wrappers to the BLAS, etc.).
I now will have to cut the tester down to the one case causing the NaNs, and then get rid of all the cruft in the tester until I can actually comprehend your call to the BLAS, and make sure it is well-formed.
If you are fast at this, please post a simplified tester.
Thanks,
Clint
I also am awaiting confirmation that this error occurs with 3.10.1 from the sourceforge tarfile.
OK, I changed your tester so that the k loop is only run once with k=0, and still got the error. Then I changed ATLAS's syrk to compute SYRK with another method, and the NaNs went away, which suggests the error is in ATLAS 3.10.1.
You seem to be using beta=1; usually NaNs come from beta=0, so this would seem to indicate something funky going on. Need to simplify tester or reproduce problem with ATLAS tester to ensure problem is not in call itself (with two different implementations handling buggy input differently).
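The beta=0 remark rests on IEEE 754 arithmetic: 0 * NaN is NaN, so a routine that multiplies C by beta without special-casing zero lets stale NaNs in C survive. The sketch below shows the two behaviours side by side. The function names are hypothetical and are not ATLAS internals; this only illustrates the arithmetic, not where the NaN in this report comes from.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Naive scaling: reads c[i] even when beta == 0, so a NaN already
 * present in C survives, because 0 * NaN is NaN. */
static void scale_naive(float beta, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = beta * c[i];
}

/* BLAS-style scaling: when beta == 0, C is overwritten without being read. */
static void scale_blas_style(float beta, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = (beta == 0.0f) ? 0.0f : beta * c[i];
}

/* Counts the NaN entries in c[0..n-1]. */
static size_t count_nans(const float *c, size_t n)
{
    size_t found = 0;
    for (size_t i = 0; i < n; ++i)
        if (isnan(c[i]))
            ++found;
    return found;
}
```

With a nonzero beta such as 1, both variants read C, so a NaN in the result has to originate in C itself, in A, or in some intermediate buffer.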
OK, when I tell l3blastst to call the exact case you use, I get the correct answer. This leads me to suspect the error is in your call, not in ATLAS, though this is not for sure.
Can you please simplify the tester to not use any derived types, but instead just directly allocate the space, initialize with simple loops, and directly call the blas, and then look for errors?
I have to be able to examine the code to make sure that the memory you have allocated is correctly described by the call you are making, and the present code is too convoluted for me to follow easily.
Thanks,
Clint
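A tester of the kind requested here (allocate directly, initialize with simple loops, call the BLAS with native types, then scan for NaNs) could look roughly like the sketch below. The function name run_syrk_case and the USE_CBLAS macro are assumptions made for illustration; this is not the tester that was attached to the ticket. Without -DUSE_CBLAS the library call is compiled out, so the harness only checks that its own allocation and initialization are NaN-free.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#ifdef USE_CBLAS
#include <cblas.h>   /* link against the CBLAS under test, e.g. ATLAS */
#endif

/* Returns the number of NaNs found in the lower triangle of C after one
 * SYRK call, or -1 if allocation fails. Layout follows the report:
 * row-major, lower, transposed, so A is K x N and C is N x N, both with
 * the same row stride. */
static int run_syrk_case(int n, int k, int stride, float alpha, float beta)
{
    assert(n > 0 && k > 0 && stride >= n);
    float *a = malloc((size_t)k * stride * sizeof *a);
    float *c = malloc((size_t)n * stride * sizeof *c);
    int nans = -1;
    if (a && c) {
        /* Initialize every element, padding included, with simple loops. */
        for (size_t i = 0; i < (size_t)k * stride; ++i) a[i] = 1.0f;
        for (size_t i = 0; i < (size_t)n * stride; ++i) c[i] = 1.0f;
#ifdef USE_CBLAS
        cblas_ssyrk(CblasRowMajor, CblasLower, CblasTrans, n, k,
                    alpha, a, stride, beta, c, stride);
#else
        (void)alpha; (void)beta;  /* call compiled out: harness self-check */
#endif
        nans = 0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j <= i; ++j)
                if (isnan(c[(size_t)i * stride + j])) ++nans;
    }
    free(a);
    free(c);
    return nans;
}
```

Calling run_syrk_case(193, 2, 196, 1.0f, 1.0f) matches the dimensions, strides, alpha and beta shown in the backtraces in this thread. Since the NaN was reported as intermittent, calling it repeatedly in a loop is probably more telling than a single run.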
Hi, I just posted a reply to you, but it's missing! Weird... OK, I will post it again. I was just testing the tester we sent to you and I found the same place that Dan found, where the problem might be. I used gdb and here's part of the backtrace:
19 0x00007ffff6162b64 in _int_malloc () from /lib64/libc.so.6
20 0x00007ffff6163911 in malloc () from /lib64/libc.so.6
21 0x00007ffff6ff51c4 in ATL_sprk_kmm () from /usr/lib64/atlas/libatlas.so.3
22 0x00007ffff7007b34 in ATL_ssprk_rK () from /usr/lib64/atlas/libatlas.so.3
23 0x00007ffff700b353 in ATL_ssyrk () from /usr/lib64/atlas/libatlas.so.3
24 0x0000000000400b9d in cblas_Xsyrk (trans=kTrans, dim_c=193, other_dim_a=2, alpha=1, A=0x603700, a_stride=196,
25 0x0000000000400c98 in SymAddMat2 (alpha=1, A=0x7fffffffe4b0, transA=kTrans, C=0x7fffffffe490, beta=1)
26 0x0000000000400f5f in UnitTestSymAddMat2 () at ssyrk_nan_bug_test_dp.c:233
27 0x0000000000401053 in main () at ssyrk_nan_bug_test_dp.c:250
So can you check that ATL_sprk_kmm() does not use the memory it allocates before initializing it? Or maybe use calloc() rather than malloc()?
As I explained before, when I run valgrind using your tester, the only
issues come from your tester, and I don't get any from the places your
trace shows.
That's unfortunate, because I built my ATLAS with debug, so that I could
get line numbers, which would tell us where this is coming from.
I also suspect a memory error, and if you will simplify your tester
enough that I can understand it, I can rule it out in the tester, which
will mean it is in ATLAS. The other possibility is that you are not
allocating memory that matches your BLAS call, and this is causing ATLAS
to have memory errors (the details of what happens varies due to linking
differences, which is why I'm not getting exact same problem as you guys).
Once your tester is a simple: allocate, call blas with native types, I
should be able to determine if the call is well-formed, and if it is,
ATLAS has a memory bug that is just not triggered by my own tester.
Thanks,
Clint
--
R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley
Hi, Clint. I just downloaded the latest stable version, 3.10.1, and compiled ATLAS in debug mode. Then via GDB I managed to find the exact line of code that may be the problem. It's in ./src/blas/pklevel3/sprk/ATL_prk_kmm.c:141. So I changed that line from:
if (i <= ATL_pkMaxMalloc || K <= NB) vC = malloc(i);
to:
if (i <= ATL_pkMaxMalloc || K <= NB) vC = calloc(1, i);
and the NaNs that the tester would produce were gone (I checked this thousands of times with random matrix dimensions).
Later I will compile our project Kaldi against the modified ATLAS to make sure that the NaNs do not appear there either.
Regarding "a simplified tester", do you think it's still necessary? If yes, I would like to do it some other time. Thanks!
Wei
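To make the suspected mechanism concrete, here is a small sketch of how a workspace that is accumulated into before being initialized can yield NaNs, and why a zero fill hides the symptom. It does not reproduce ATL_prk_kmm.c: the accumulate kernel and the function names are hypothetical stand-ins, and the 0xFF fill is an assumption used to imitate leftover heap contents deterministically (real malloc contents vary from run to run, which would fit the intermittent behaviour reported).

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a kernel that accumulates into a workspace:
 * ws[i] += x[i]. It is only correct if ws starts out defined. */
static void accumulate(float *ws, const float *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        ws[i] += x[i];
}

/* Returns 1 if the accumulated workspace contains a NaN, 0 if it does
 * not, and -1 on allocation failure. With zero_fill set, the workspace
 * comes from calloc; otherwise it is filled with 0xFF bytes, which is a
 * NaN bit pattern for IEEE 754 single precision. */
static int workspace_has_nan(size_t n, int zero_fill)
{
    float *ws = zero_fill ? calloc(n, sizeof *ws) : malloc(n * sizeof *ws);
    float *x = malloc(n * sizeof *x);
    int found = -1;
    if (ws && x) {
        if (!zero_fill)
            memset(ws, 0xFF, n * sizeof *ws);  /* simulated heap garbage */
        for (size_t i = 0; i < n; ++i)
            x[i] = 1.0f;
        accumulate(ws, x, n);
        found = 0;
        for (size_t i = 0; i < n; ++i)
            if (isnan(ws[i]))
                found = 1;
    }
    free(ws);
    free(x);
    return found;
}
```

Here workspace_has_nan(64, 0) reports a NaN and workspace_has_nan(64, 1) does not. Note that a zero fill makes the symptom disappear whether the read of undefined memory is intended by the algorithm or not, so on its own it does not show which code path is at fault.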
Yes, I still need a simplified test case, because we still don't know where the bug is. Just because you can change something in ATLAS and it goes away, does not mean (with memory errors) that the problem is in ATLAS (memory errors can be fixed with many random changes that just change memory a bit). So, I still need a call I can understand by inspection to help me figure out what is going on.
No problem to not do this now. I will return to it when you post a simplified test case, or when I myself have time to try to look more deeply. I was trying to do this now only because I was considering issuing a new stable patch that fixes several known bugs, and thought it would be nice to get this one in there if we could track it down quickly.
I can also simplify your tester, but I'm hoping you will since you already understand all that indirection, whereas I will have to wade through without doing so . . .
Many thanks,
Clint
Clint, please check the simplified tester.
Wei
Sorry for the looong delay, I am just now getting back to ATLAS after some travel.
Thank you very much for the simplified tester. These are notes to self that I'll want when I get home, so I can try to run it on the machine I used before.
In Col-major:
UPLO=Lower,
TA=NoTrans,
N=193,
K=2,
alpha=1.0,
lda=193,
beta=1.0,
ldc=193
---> Basic call looks sound.
-> Overly complicated, but Transpose looks sound on first glance
===> Case looks roughly right to me:
(1) Make sure prob shows up as-is
(2) Change init of C to 0, check prob goes away
(3) Use debug from there.
Confirmed on home machine: simplified tester produces NaNs
Ticket moved from /p/math-atlas/support-requests/941/