run_query: mysql_query: MySQL server has gone away
Status: Beta
Brought to you by: cinergi
This problem happens with 0.5 and the 20060915 snapshot:
The number of IDs looked up via MySQL is very low on the servers with the problem. It may or may not be related to the MySQL client library version: most 4.x installs are OK (but do getents often), one 5.0.18 box is OK (also heavy on the getents), and the 6 boxes running 5.x client libs with little to no getents per day are the ones having issues. I tried cronning an hourly "getent passwd" and that doesn't seem to have fixed it (wait_timeout is 28800 seconds (8 hours)).
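For reference, the hourly keepalive looks like this (crontab syntax; the file path is an assumption, the command is the one I cronned):

```shell
# /etc/cron.d/nss-keepalive -- the path is an assumption
# Hourly NSS lookup so the MySQL connections don't sit idle past
# wait_timeout (28800 s = 8 h). It did not fix the problem here.
0 * * * * root getent passwd > /dev/null 2>&1
```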
Thanks for any help you can provide.
Dale
Comment by user_id=550319 (Originator: NO):
It's not clear from your report whether a program is actually failing because of this.
nsvsd maintains a pool of connections to the MySQL server. Offhand I'm not sure in what order the connections are used (randomly, cyclically, or FIFO) -- that would affect my answer, but any of those should be normal behavior and cause no problems. You can reduce the number of threads nsvsd uses to cut down on the error messages and the chances of it happening. A low-usage server can probably run fine with 1 thread, 2 at most.
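A hypothetical config fragment along those lines (the file path and option name are assumptions, not confirmed nsvsd syntax):

```shell
# /etc/nsvsd.conf -- path and option name are assumptions
# A smaller pool means fewer idle connections for the MySQL server's
# wait_timeout to kill, and fewer "gone away" messages.
threads = 2
```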
Comment by user_id=741802 (Originator: YES):
Yes, programs start failing after nsvsd has been running for a while: Samba stops serving the /home/$user maps and 'id username' fails. Restart nsvsd and all is back to normal. I started it in the foreground with debug logging the other day; this was the last output on my screen before I killed it, since it wasn't working:
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
main_loop: Thread 3 waiting ...
termination_handler: Cancelling threads ...
termination_handler: Thread 1 canceled.
termination_handler: Thread 2 canceled.
termination_handler: Thread 3 canceled.
termination_handler: Thread 4 canceled.
termination_handler: Thread 5 canceled.
termination_handler: Thread 6 canceled.
termination_handler: Thread 7 canceled.
termination_handler: Thread 8 canceled.
termination_handler: Thread 9 canceled.
termination_handler: Thread 10 canceled.
termination_handler: Thread 11 canceled.
termination_handler: Thread 12 canceled.
termination_handler: Thread 13 canceled.
termination_handler: Thread 14 canceled.
termination_handler: Thread 15 canceled.
termination_handler: Thread 16 canceled.
termination_handler: Thread 17 canceled.
termination_handler: Thread 18 canceled.
termination_handler: Thread 19 canceled.
termination_handler: cleaning up ...
When first started, main_loop lists all 20 threads in the "waiting ..." state, in random order.
This is with the 20060915 snapshot.
Comment by user_id=550319 (Originator: NO):
Oh wow, look at that -- the threads are hung, huh? How long does it take to a) start failing and b) get to the point where 19 of the 20 threads are hung? I'll start digging into this now that I have a sense of what's going on. I'm moving this to the bugs section.
Comment by user_id=741802 (Originator: YES):
a) No longer than an hour for "Thread 1" to stop showing up in the "Thread X waiting" messages (line 7304 of 22000 in the debug log).
Threads 3, 6, and 12 are also dead around line 19000.
Threads 14 and 17 also died around line 7300.
b) It only took a few days to lose 19 of the 20 threads. I'll keep an eye on it and report back when it happens.
Comment by user_id=741802 (Originator: YES):
I'm down to two threads left (10 and 15). Way less than wait_timeout seconds, too.
Comment by user_id=741802 (Originator: YES):
OK, we're down to one thread left, but something is fishy: every so often the reported thread ID changes -- from 10 and 15, to 3, to 12, to 2 -- and it's answering queries negatively now.
Comment by user_id=550319 (Originator: NO):
When the system starts negatively answering queries, does debug show any activity other than the "waiting ..." messages?
I'm testing with 5.0.37 (I know you're on 5.0.18) and I'm unable to get a "MySQL server has gone away" message. I'm killing the connections with mysqladmin and they come right back. I've restarted the MySQL server and it reconnected. 0.5 definitely breaks, but the current snapshot is working for me.
As for the thread hangs -- I'm having trouble reproducing those, too :-( It would be easiest to debug if you set threads=1, ran in debug mode, and strace'd the process when it starts failing.
I'm going to leave mine running and come back to it later to see how it's doing ...
On the failing boxes, what build of 5.0.18 are you using (RPM, tar, gcc or Intel compiler, MySQL-provided or distro-provided, etc.)?
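For reference, my kill test and the suggested trace look roughly like this (the connection IDs are examples, and the strace invocation assumes nsvsd is already running):

```shell
# On the MySQL server: kill nsvsd's connections, then confirm the
# daemon reconnects on the next lookup.
mysql -e 'SHOW PROCESSLIST'      # note the nsvsd connection IDs
mysqladmin kill 42,43            # the IDs 42,43 are examples
getent passwd someuser           # a lookup should force a reconnect

# On the failing box: capture system calls around the hang.
strace -f -tt -o /tmp/nsvsd.strace -p "$(pidof nsvsd)"
```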
Comment by user_id=741802 (Originator: YES):
Yes, debug shows "run_query: mysql_query: MySQL server has gone away" when it stops answering queries.
Actually, I'm on 5.0.15 (client libs) on the server in question. The MySQL server that NSS queries is still 4.x, though.
I'm using the distro-provided (Arch Linux) build of 5.0.15 (there's only one failing server left; the snapshot fixed the rest).
mysql Ver 14.12 Distrib 5.0.15, for pc-linux-gnu (i686) using readline 5.0
Unfortunately it's a fairly important server, so I can't risk breaking the NFS exports it serves. The NSS lookups aren't as important and can be experimented with, however.
Right now it's stuck at "Thread 7 waiting ..." but it's still answering queries. I'll set threads=1, run it in debug mode, and send you the strace when it starts failing.
Comment by user_id=550319 (Originator: NO):
Can you run an 'lsof' on the failing nsvsd process (lsof -p pid) and see if anything looks out of whack?
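Something like this (standard lsof flags; the process name comes from this thread):

```shell
# List the failing daemon's open files and sockets.
pid=$(pidof nsvsd)
lsof -nP -p "$pid"
# Worth checking: MySQL connections stuck in CLOSE_WAIT, or a socket
# count that doesn't match the configured number of threads.
```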
Comment by user_id=550319 (Originator: NO):
I'm a little confused -- is this correct:
1) Your MySQL server is 4.x and is not the machine with the failing nsvsd process.
2) The failing nsvsd runs on (and was compiled against?) a box with MySQL 5.0.15 libs and connects back to the server above.
Do you have any other 5.0.15-based nsvsd processes that /are/ working? The reconnect fix went in in 5.0.13, so I'm starting to get a little suspicious of the 5.0.15 build.
Can you run this command (correct the path to the include files) on the server where the failing nsvsd was compiled:
grep MYSQL_VERSION_ID /usr/local/mysql/include/mysql_version.h
and let me know what it returns?
Comment by user_id=741802 (Originator: YES):
1) Yes, the server is 4.1.8 and its clients don't fail.
2) Yes, the nsvsd snapshot was compiled on the failing server.
No, I don't have any other servers running 5.0.15, so I can't tell whether that's the problem.
$ grep MYSQL_VERSION_ID /usr/include/mysql/mysql_version.h
#define MYSQL_VERSION_ID 50015
And actually, that server isn't failing right now (although it's stuck on "Thread 2 waiting" only) and it's been up for 4 days. I'm going to check whether the "working" servers have this thread-sticking problem too.
I'll run lsof on it when it starts failing.
Thanks
Comment by user_id=741802 (Originator: YES):
Yes, even the servers that started working correctly with the snapshot seem to be having the thread problem (queries are still OK).
main_loop: Thread 6 waiting ...
main_loop: Thread 18 waiting ...
main_loop: Thread 14 waiting ...
main_loop: Thread 6 waiting ...
main_loop: Thread 18 waiting ...
main_loop: Thread 14 waiting ...
main_loop: Thread 6 waiting ...
main_loop: Thread 18 waiting ...
Comment by user_id=741802 (Originator: YES):
For some reason the problematic server is still answering NSS lookups successfully. Both servers running nsvsd in debug mode are down to 1 thread (it bounces up to 2-3 and back to 1).
Comment by user_id=741802 (Originator: YES):
Hrm, both have been going strong for almost 2 weeks now, even though they only have 1-3 threads running at any time. Is this something to worry about?