|
From: David R. <dr...@gm...> - 2005-06-23 21:52:41
|
I just upgraded to 1.6.0 from 1.4.8 yesterday and have had a couple
problems with it.
I noticed this morning that sqlgrey appeared to have stopped
responding, then found these errors in /var/log/maillog:
Jun 22 16:45:21 forty sqlgrey: dbaccess: error: couldn't get reconnect
delay: ERROR: prepared statement "dbdpg_17" does not exist
Jun 22 16:45:21 forty sqlgrey: warning: Use of uninitialized value in
concatenation (.) or string at /usr/sbin/sqlgrey line 1985.
Jun 22 16:45:21 forty sqlgrey: dbaccess: warning: couldn't do query:
INSERT INTO from_awl (sender_name, sender_domain, src, first_seen,
last_seen) VALUES('tcg
rant','alumni.princeton.edu','64.97.160.33','sql error',NOW()): ERROR:
invalid input syntax for type timestamp: "sql error" , reconnecting
to DB
Jun 22 16:45:21 forty sqlgrey: warning: Use of uninitialized value in
concatenation (.) or string at /usr/sbin/sqlgrey line 2008.
Then it appears to have tried to restart itself later and I found
these messages:
Jun 23 02:58:53 forty sqlgrey: mail: failed to send: No child processes
Jun 23 02:58:53 forty sqlgrey: dbaccess: error: couldn't get reconnect
delay: ERROR: prepared statement "dbdpg_6" does not exist
Jun 23 02:58:53 forty sqlgrey: warning: Use of uninitialized value in
concatenation (.) or string at /usr/sbin/sqlgrey line 1985.
I tried killing the sqlgrey process, but would only respond to a kill -9.
Any ideas? I'm running on Fedora Core 4.
-Dave
|
|
From: Lionel B. <lio...@bo...> - 2005-06-24 11:43:23
|
David Rees wrote the following on 23.06.2005 23:52 :
>I just upgraded to 1.6.0 from 1.4.8 yesterday and have had a couple
>problems with it.
>
>I noticed this morning that sqlgrey appeared to have stopped
>responding, then found these errors in /var/log/maillog:
>
>Jun 22 16:45:21 forty sqlgrey: dbaccess: error: couldn't get reconnect
>delay: ERROR: prepared statement "dbdpg_17" does not exist
>
>
That is really puzzling. Which version of:
- perl,
- DBI,
- postgresql DBD driver,
are you using, are they all coming with your distribution ?
This is the first time ever this error gets reported. This kind of error
happens either if the prepare_cached() fails or the execute() fails. In
both cases this means that either:
- there was an error in the SQL syntax,
- the connection with the database is borked.
Either way, the code is designed to get around this (let the message
pass if we can't get a straight answer from the database) and reconnect
to the database hoping it is a temporary DB problem.
>Jun 22 16:45:21 forty sqlgrey: warning: Use of uninitialized value in
>concatenation (.) or string at /usr/sbin/sqlgrey line 1985.
>Jun 22 16:45:21 forty sqlgrey: dbaccess: warning: couldn't do query:
>INSERT INTO from_awl (sender_name, sender_domain, src, first_seen,
>last_seen) VALUES('tcg
>rant','alumni.princeton.edu','64.97.160.33','sql error',NOW()): ERROR:
> invalid input syntax for type timestamp: "sql error" , reconnecting
>to DB
>Jun 22 16:45:21 forty sqlgrey: warning: Use of uninitialized value in
>concatenation (.) or string at /usr/sbin/sqlgrey line 2008.
>
>
>
Given the first error, this is expected. I'll have to clean up a little
to avoid some of these errors though.
>Then it appears to have tried to restart itself later and I found
>these messages:
>
>Jun 23 02:58:53 forty sqlgrey: mail: failed to send: No child processes
>Jun 23 02:58:53 forty sqlgrey: dbaccess: error: couldn't get reconnect
>delay: ERROR: prepared statement "dbdpg_6" does not exist
>
>
Surprising again. AFAIK SQLgrey doesn't play with prepared statements in
a way that could explain this. If a prepared statement doesn't exist,
the call to prepare_cached() should automatically create it. So this
error is really unexpected.
>Jun 23 02:58:53 forty sqlgrey: warning: Use of uninitialized value in
>concatenation (.) or string at /usr/sbin/sqlgrey line 1985.
>
>I tried killing the sqlgrey process, but would only respond to a kill -9.
>
>
Even more surprising! I'm betting on a distro specific problem. A wild
guess: is SELinux active? What happens if you deactivate it?
>Any ideas? I'm running on Fedora Core 4.
>
>
Hum, are there any other FC4 users around?
Lionel.
|
|
From: David R. <dr...@gm...> - 2005-06-24 19:28:10
|
On 6/24/05, Lionel Bouton <lio...@bo...> wrote: >=20 > That is really puzzling. Which version of: > - perl, > - DBI, > - postgresql DBD driver, > are you using, are they all coming with your distribution ? perl-5.8.6-15 perl-DBI-1.48-4 perl-DBD-Pg-1.41-2 perl-Net-Server-0.87-4 All packages are shipped with FC4 except for perl-Net-Server which is shipped by Fedora Extras. Running postgresql 8 as well. > - the connection with the database is borked. > Either way, the code is designed to get around this (let the message > pass if we can't get a straight answer from the database) and reconnect > to the database hoping it is a temporary DB problem. I'm guessing that the connection got borked, but for some reason it didn't recover. While sqlgrey has been running pretty smoothly since yesterday, I still see a couple messages which it appears to recover from: Jun 24 00:03:54 forty sqlgrey: grey: domain awl match: updating 216.155.201(216.155.201.61), returns.groups.yahoo.com Jun 24 00:05:04 forty sqlgrey: mail: failed to send: No child processes Jun 24 00:05:04 forty sqlgrey: dbaccess: error: couldn't get reconnect delay: ERROR: prepared statement "dbdpg_73" does not exist Jun 24 00:05:04 forty sqlgrey: warning: Use of uninitialized value in concatenation (.) or string at /usr/sbin/sqlgrey line 1985. Jun 24 00:05:04 forty sqlgrey: grey: reconnect ok: 68.142.198(68.142.198.207), ted...@sb... -> dr...@gr... () Jun 24 00:05:04 forty sqlgrey: mail: failed to send: No child processes Jun 24 00:05:04 forty sqlgrey: grey: from awl: 68.142.198, ted...@sb... added Jun 24 00:05:04 forty sqlgrey: mail: failed to send: No child processes Jun 24 00:05:04 forty sqlgrey: dbaccess: warning: couldn't do query: INSERT INTO from_awl (sender_name, sender_domain, src, first_seen, last_seen) VALUES('ted.rees','sbcglobal.net','68.142.198','sql error',NOW()): ERROR: invalid input syntax for type timestamp: "sql error" , reconnecting to DB Jun 24 00:05:05 forty sqlgrey: mail: failed to send: No child processes Jun 24 00:05:05 forty sqlgrey: warning: Use of uninitialized value in concatenation (.) or string at /usr/sbin/sqlgrey line 2008. > Even more surprising! I'm betting on a distro specific problem. A wild > guess: is SELinux active? What happens if you deactivate it? Nope, SELinux is disabled... -Dave |
|
From: David R. <dr...@gm...> - 2005-06-26 16:53:30
|
Hmm, it's still getting stuck running 1.6.1 (didn't expect this to change based on the changelogs). When it gets stuck, I notice that the sqlgrey process has a zombie process attached to it, perhaps it's getting stuck reaping a child or something? The last thing in the log this time is: Jun 26 03:30:30 forty sqlgrey: perf: spent 0s cleaning: from_awl (0) domain_awl (0) connect (3) I've now turned up the log level to debug to try and trace this further. Any other things I can try to trace this problem down? -Dave |
|
From: Edward S. (s. list) <sq...@cr...> - 2005-06-26 19:55:53
|
David Rees wrote: >Hmm, it's still getting stuck running 1.6.1 (didn't expect this to >change based on the changelogs). When it gets stuck, I notice that >the sqlgrey process has a zombie process attached to it, perhaps it's >getting stuck reaping a child or something? The last thing in the log >this time is: > >Jun 26 03:30:30 forty sqlgrey: perf: spent 0s cleaning: from_awl (0) >domain_awl (0) connect (3) > >I've now turned up the log level to debug to try and trace this >further. Any other things I can try to trace this problem down? > > I'm experiencing the the problem now as well, and this just started Friday. I thought it might have had something to do with the upgrade to 1.7, but after reverting back to 1.6 (which I had no problems with), the problem persists. I've also turned up the logging level trying to track this down. I'm using Debian Sid/unstable and PostgreSQL (7.4.8) as the DB system. Hopefully we can track this problem down fairly easily... |
|
From: Edward S. (s. list) <sq...@cr...> - 2005-06-27 03:35:51
|
Edward Shornock (sqlgrey list) wrote: >I'm experiencing the the problem now as well, and this just started >Friday. I thought it might have had something to do with the upgrade to >1.7, but after reverting back to 1.6 (which I had no problems with), the >problem persists. I've also turned up the logging level trying to track >this down. I'm using Debian Sid/unstable and PostgreSQL (7.4.8) as the >DB system. > >Hopefully we can track this problem down fairly easily... > > More information (I hope it helps), just before the process hangs: Jun 26 20:01:47 darkside postfix/smtpd[18921]: connect from russian-caravan.cloud9.net[168.100.1.4] Jun 26 20:01:48 darkside sqlgrey: 2005/06/26-20:01:48 CONNECT TCP Peer: "127.0.0.1:56048" Local: "127.0.0.1:2501" Jun 26 20:01:48 darkside sqlgrey: optin: greylisting active for pos...@cr... Jun 26 20:01:48 darkside sqlgrey: grey: unknown pattern: russian-caravan.cloud9.net, 168.100.1.4: using C-class (168.100.1). Jun 26 20:01:48 darkside sqlgrey: system: forked cleanup child (18925) Jun 26 20:01:48 darkside sqlgrey: warning: ERROR: prepared statement "dbdpg_1" does not exist Jun 26 20:01:48 darkside sqlgrey: warning: message type 0x43 arrived from server while idle Jun 26 20:01:48 darkside sqlgrey: warning: message type 0x5a arrived from server while idle Jun 26 20:01:48 darkside sqlgrey: grey: domain awl match: updating 168.100.1(168.100.1.4), postfix.org ...and at this point, sqlgrey has given up. Jun 26 20:03:20 darkside sqlgrey: optin: greylisting active for pos...@cr... Jun 26 20:03:20 darkside sqlgrey: grey: unknown pattern: camomile.cloud9.net, 168.100.1.3: using C-class (168.100.1). Jun 26 20:03:20 darkside sqlgrey: perf: spent 0s cleaning: from_awl (0) domain_awl (0) connect (0) Jun 26 20:03:20 darkside sqlgrey: system: cleanup child exit (18925) Jun 26 20:04:46 darkside postfix/anvil[18364]: statistics: max connection rate 1/60s for (smtp:146.82.138.6) at Jun 26 19:54:46 Jun 26 20:04:46 darkside postfix/anvil[18364]: statistics: max connection count 1 for (smtp:146.82.138.6) at Jun 26 19:54:46 Jun 26 20:04:46 darkside postfix/anvil[18364]: statistics: max cache size 1 at Jun 26 19:54:46 Jun 26 20:05:00 darkside postfix/smtpd[18921]: warning: timeout on 127.0.0.1:2501 while reading input attribute name Jun 26 20:05:00 darkside postfix/smtpd[18921]: warning: problem talking to server 127.0.0.1:2501: Connection timed out Jun 26 20:06:41 darkside postfix/smtpd[18921]: warning: timeout on 127.0.0.1:2501 while reading input attribute name Jun 26 20:06:41 darkside postfix/smtpd[18921]: warning: problem talking to server 127.0.0.1:2501: Connection timed out Jun 26 20:06:41 darkside postfix/smtpd[18921]: NOQUEUE: reject: RCPT from camomile.cloud9.net[168.100.1.3]: 450 Server configuration problem; from=<own...@po...> to=<pos...@cr...> proto=ESMTP helo=<camomile.cloud9.net> Jun 26 20:06:41 darkside postfix/cleanup[19161]: 8BD20D6B082: message-id=<200...@da...> Jun 26 20:06:41 darkside postfix/qmgr[3099]: 8BD20D6B082: from=<dou...@da...>, size=1086, nrcpt=1 (queue active) Jun 26 20:06:41 darkside postfix/smtpd[18921]: disconnect from camomile.cloud9.net[168.100.1.3] # ps aux |grep sqlgrey sqlgrey 3754 0.0 0.9 13660 8684 ? Ss 15:43 0:00 /usr/bin/perl -w /usr/sbin/sqlgrey -d postgres 3755 0.0 1.0 18076 9096 ? S 15:43 0:00 postgres: sqlgrey sqlgrey 127.0.0.1 idle sqlgrey 18925 0.0 0.0 0 0 ? Z 20:01 0:00 [sqlgrey] <defunct> root 26635 0.0 0.0 1624 452 pts/2 S+ 22:42 0:00 grep sqlgrey strace on PID 3754 just sat there. I restarted PostgreSQL, and then the strace finally showed activity. It seems that even though sqlgrey wasn't accepting connections The logs at this time show: Jun 26 22:56:51 darkside postfix/pickup[27380]: 08C7AD6B082: uid=132 from=<sqlgrey> Jun 26 22:56:51 darkside postfix/qmgr[26777]: 08C7AD6B082: from=<sq...@da...>, size=485, nrcpt=1 (queue active) Jun 26 22:56:51 darkside amavis[15461]: (15461-06-3) LMTP::10024 /var/lib/amavis/tmp/amavis-20050626T190955-15461: <sq...@da...> -> <pos...@cr...> Received: SIZE=485 from darkside.crazeecanuck.homelinux.net ([127.0.0.1]) by localhost (darkside [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 15461-06-3 for <pos...@cr...>; Sun, 26 Jun 2005 22:56:51 -0400 (EDT) Jun 26 22:56:51 darkside sqlgrey: dbaccess: error: couldn't get now() from DB: FATAL: terminating connection due to administrator command server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. Jun 26 22:56:51 darkside sqlgrey: system: forked cleanup child (27706) Jun 26 22:56:51 darkside amavis[15461]: (15461-06-3) Checking: <sq...@da...> -> <pos...@cr...> Jun 26 22:56:51 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: the database system is shutting down Jun 26 22:56:51 darkside sqlgrey: dbaccess: error: couldn't access domain_awl table: no connection to the server Jun 26 22:56:51 darkside sqlgrey: grey: domain awl match: updating 168.100.1(168.100.1.3), postfix.org Jun 26 22:56:51 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: the database system is shutting down Jun 26 22:56:51 darkside sqlgrey: dbaccess: warning: couldn't do query: UPDATE domain_awl SET last_seen = NOW(), first_seen = first_seen WHERE sender_domain = NULL AND src = NULL: FATAL: the database system is shutting down , reconnecting to DB JJun 26 22:56:53 darkside sqlgrey: 2005/06/26-22:56:53 CONNECT TCP Peer: "127.0.0.1:52685" Local: "127.0.0.1:2501" Jun 26 22:56:53 darkside sqlgrey: optin: greylisting active for deb...@cr... Jun 26 22:56:53 darkside sqlgrey: grey: unknown pattern: murphy.debian.org, 146.82.138.6: using C-class (146.82.138). Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: the database system is shutting down Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't get now() from DB: FATAL: the database system is shutting down Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: the database system is shutting down Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't access domain_awl table: FATAL: the database system is shutting down Jun 26 22:56:53 darkside sqlgrey: grey: domain awl match: updating 146.82.138(146.82.138.6), lists.debian.org Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: the database system is shutting down Jun 26 22:56:53 darkside sqlgrey: dbaccess: warning: couldn't do query: UPDATE domain_awl SET last_seen = NOW(), first_seen = first_seen WHERE sender_domain = NULL AND src = NULL: FATAL: the database system is shutting down , reconnecting to DB Jun 26 22:56:53 darkside sqlgrey: 2005/06/26-22:56:53 CONNECT TCP Peer: "127.0.0.1:52693" Local: "127.0.0.1:2501" Jun 26 22:56:53 darkside sqlgrey: optin: greylisting active for sp...@cr... Jun 26 22:56:53 darkside sqlgrey: grey: unknown pattern: cherry.ease.lsoft.com, 209.119.0.109: using C-class (209.119.0). Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: the database system is shutting down Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't get now() from DB: FATAL: the database system is shutting down Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: could not connect to server: Connection refused Is the server running on host "localhost" and accepting TCP/IP connections on port 5432? Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't access domain_awl table: could not connect to server: Connection refused Is the server running on host "localhost" and accepting TCP/IP connections on port 5432? Jun 26 22:56:53 darkside sqlgrey: grey: domain awl match: updating 209.119.0(209.119.0.109), peach.ease.lsoft.com the PostgreSQL logs showed: 2005-06-26 20:01:48 [18926] LOG: connection authorized: user=sqlgrey database=sqlgrey 2005-06-26 20:01:48 [3755] ERROR: prepared statement "dbdpg_1" does not exist After restarting PGSQL, sqlgrey is working again. Note that I could connect to the sqlgrey database just fine. I'm now strace'ing the sqlgrey process, and dumping the output to a log file, hopefully that log will give a glue as to what's happening. at the time of hanging--I have no idea at this point. If there's anything else that I can do to help track down this bug, please let me know... |
|
From: Lionel B. <lio...@bo...> - 2005-06-28 11:08:00
|
One possible culprit could be the way SQLgrey handles the database cleanups since 1.6.0 (the zombie process points to a forked child problem). SQLgrey fork another processus which makes its own database connection in order to asynchronously delete old entries (expired AWL and connect entries). If you want to make sure this isn't the problem, you can simply replace the "$self->fork_cleanup()" calls by "$self->cleanup()". This will revert to 1.4.x behaviour. If it solves the problem, then I'll have to look more closely on how DBI reacts to forks... Lionel. |
|
From: David R. <dr...@gm...> - 2005-06-28 15:13:43
|
On 6/28/05, Lionel Bouton <lio...@bo...> wrote: > One possible culprit could be the way SQLgrey handles the database > cleanups since 1.6.0 (the zombie process points to a forked child problem= ). FWIW, Since FC4 released perl-Net-Server 0.88 I haven't had any problems with sqlgrey locking up... -Dave |
|
From: Edward J S. <ed...@cr...> - 2005-06-28 15:28:24
|
David Rees wrote: > On 6/28/05, Lionel Bouton <lio...@bo...> wrote: > >>One possible culprit could be the way SQLgrey handles the database >>cleanups since 1.6.0 (the zombie process points to a forked child problem). > > > FWIW, Since FC4 released perl-Net-Server 0.88 I haven't had any > problems with sqlgrey locking up... > Hmm...I'm still on 0.87 yet. I've tried the "workaround" that Mr. Bouton suggested, I'll post a follow-up if changing the "$self->fork_cleanup()" calls to "$self->cleanup()" resolves it. If it does, I'll try upgrading to Net::Server 0.88 as well. Thanks guys, --Ed |
|
From: Edward J S. <ed...@cr...> - 2005-06-29 12:38:48
|
Edward J Shornock wrote: > David Rees wrote: > >>On 6/28/05, Lionel Bouton <lio...@bo...> wrote: >> >> >>>One possible culprit could be the way SQLgrey handles the database >>>cleanups since 1.6.0 (the zombie process points to a forked child problem). >> >> >>FWIW, Since FC4 released perl-Net-Server 0.88 I haven't had any >>problems with sqlgrey locking up... >> > > > Hmm...I'm still on 0.87 yet. I've tried the "workaround" that Mr. Bouton > suggested, I'll post a follow-up if changing the "$self->fork_cleanup()" > calls to "$self->cleanup()" resolves it. If it does, I'll try upgrading > to Net::Server 0.88 as well. > Update: Reverting back to $self->cleanup() seemed to work, there weren't any lockups after 15 hours or so. I upgraded Net::Server to 0.88 and changed it back to $self->fork_cleanup(). I restarted sqlgrey in within a few hours the lockup happened. I've since changed back to $self->cleanup() in the code, maybe I got lucky before... I'll post another update if this *really* stops the lockups, after giving it a few days of operation. |
|
From: Lionel B. <lio...@bo...> - 2005-06-29 15:51:50
|
Edward Shornock (sqlgrey list) wrote the following on 27.06.2005 05:35 : >Edward Shornock (sqlgrey list) wrote: > > > >>I'm experiencing the the problem now as well, and this just started >>Friday. I thought it might have had something to do with the upgrade to >>1.7, but after reverting back to 1.6 (which I had no problems with), the >>problem persists. I've also turned up the logging level trying to track >>this down. I'm using Debian Sid/unstable and PostgreSQL (7.4.8) as the >>DB system. >> >>Hopefully we can track this problem down fairly easily... >> >> >> >> >More information (I hope it helps), just before the process hangs: > >Jun 26 20:01:47 darkside postfix/smtpd[18921]: connect from >russian-caravan.cloud9.net[168.100.1.4] >Jun 26 20:01:48 darkside sqlgrey: 2005/06/26-20:01:48 CONNECT TCP Peer: >"127.0.0.1:56048" Local: "127.0.0.1:2501" >Jun 26 20:01:48 darkside sqlgrey: optin: greylisting active for >pos...@cr... >Jun 26 20:01:48 darkside sqlgrey: grey: unknown pattern: >russian-caravan.cloud9.net, 168.100.1.4: using C-class (168.100.1). >Jun 26 20:01:48 darkside sqlgrey: system: forked cleanup child (18925) >Jun 26 20:01:48 darkside sqlgrey: warning: ERROR: prepared statement >"dbdpg_1" does not exist >Jun 26 20:01:48 darkside sqlgrey: warning: message type 0x43 arrived >from server while idle >Jun 26 20:01:48 darkside sqlgrey: warning: message type 0x5a arrived >from server while idle >Jun 26 20:01:48 darkside sqlgrey: grey: domain awl match: updating >168.100.1(168.100.1.4), postfix.org > >...and at this point, sqlgrey has given up. >Jun 26 20:03:20 darkside sqlgrey: optin: greylisting active for >pos...@cr... >Jun 26 20:03:20 darkside sqlgrey: grey: unknown pattern: >camomile.cloud9.net, 168.100.1.3: using C-class (168.100.1). >Jun 26 20:03:20 darkside sqlgrey: perf: spent 0s cleaning: from_awl (0) >domain_awl (0) connect (0) >Jun 26 20:03:20 darkside sqlgrey: system: cleanup child exit (18925) >Jun 26 20:04:46 darkside postfix/anvil[18364]: statistics: max >connection rate 1/60s for (smtp:146.82.138.6) at Jun 26 19:54:46 >Jun 26 20:04:46 darkside postfix/anvil[18364]: statistics: max >connection count 1 for (smtp:146.82.138.6) at Jun 26 19:54:46 >Jun 26 20:04:46 darkside postfix/anvil[18364]: statistics: max cache >size 1 at Jun 26 19:54:46 >Jun 26 20:05:00 darkside postfix/smtpd[18921]: warning: timeout on >127.0.0.1:2501 while reading input attribute name >Jun 26 20:05:00 darkside postfix/smtpd[18921]: warning: problem talking >to server 127.0.0.1:2501: Connection timed out >Jun 26 20:06:41 darkside postfix/smtpd[18921]: warning: timeout on >127.0.0.1:2501 while reading input attribute name >Jun 26 20:06:41 darkside postfix/smtpd[18921]: warning: problem talking >to server 127.0.0.1:2501: Connection timed out >Jun 26 20:06:41 darkside postfix/smtpd[18921]: NOQUEUE: reject: RCPT >from camomile.cloud9.net[168.100.1.3]: 450 Server configuration problem; >from=<own...@po...> >to=<pos...@cr...> proto=ESMTP >helo=<camomile.cloud9.net> >Jun 26 20:06:41 darkside postfix/cleanup[19161]: 8BD20D6B082: >message-id=<200...@da...> >Jun 26 20:06:41 darkside postfix/qmgr[3099]: 8BD20D6B082: >from=<dou...@da...>, size=1086, >nrcpt=1 (queue active) >Jun 26 20:06:41 darkside postfix/smtpd[18921]: disconnect from >camomile.cloud9.net[168.100.1.3] > > > ># ps aux |grep sqlgrey >sqlgrey 3754 0.0 0.9 13660 8684 ? Ss 15:43 0:00 >/usr/bin/perl -w /usr/sbin/sqlgrey -d >postgres 3755 0.0 1.0 18076 9096 ? S 15:43 0:00 >postgres: sqlgrey sqlgrey 127.0.0.1 >idle >sqlgrey 18925 0.0 0.0 0 0 ? Z 20:01 0:00 >[sqlgrey] <defunct> >root 26635 0.0 0.0 1624 452 pts/2 S+ 22:42 0:00 grep >sqlgrey > > >strace on PID 3754 just sat there. I restarted PostgreSQL, and then the >strace finally showed activity. > >It seems that even though sqlgrey wasn't accepting connections The logs >at this time show: >Jun 26 22:56:51 darkside postfix/pickup[27380]: 08C7AD6B082: uid=132 >from=<sqlgrey> >Jun 26 22:56:51 darkside postfix/qmgr[26777]: 08C7AD6B082: >from=<sq...@da...>, size=485, nrcpt=1 >(queue active) >Jun 26 22:56:51 darkside amavis[15461]: (15461-06-3) LMTP::10024 >/var/lib/amavis/tmp/amavis-20050626T190955-15461: ><sq...@da...> -> ><pos...@cr...> Received: SIZE=485 from >darkside.crazeecanuck.homelinux.net ([127.0.0.1]) by localhost (darkside >[127.0.0.1]) (amavisd-new, port 10024) with LMTP id 15461-06-3 for ><pos...@cr...>; Sun, 26 Jun 2005 22:56:51 -0400 >(EDT) >Jun 26 22:56:51 darkside sqlgrey: dbaccess: error: couldn't get now() >from DB: FATAL: terminating connection due to administrator command >server closed the connection unexpectedly This probably means the >server terminated abnormally before or while processing the request. >Jun 26 22:56:51 darkside sqlgrey: system: forked cleanup child (27706) >Jun 26 22:56:51 darkside amavis[15461]: (15461-06-3) Checking: ><sq...@da...> -> ><pos...@cr...> >Jun 26 22:56:51 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: >the database system is shutting down >Jun 26 22:56:51 darkside sqlgrey: dbaccess: error: couldn't access >domain_awl table: no connection to the server >Jun 26 22:56:51 darkside sqlgrey: grey: domain awl match: updating >168.100.1(168.100.1.3), postfix.org >Jun 26 22:56:51 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: >the database system is shutting down >Jun 26 22:56:51 darkside sqlgrey: dbaccess: warning: couldn't do query: >UPDATE domain_awl SET last_seen = NOW(), first_seen = first_seen WHERE >sender_domain = NULL AND src = NULL: FATAL: the database system is >shutting down , reconnecting to DB >JJun 26 22:56:53 darkside sqlgrey: 2005/06/26-22:56:53 CONNECT TCP Peer: >"127.0.0.1:52685" Local: "127.0.0.1:2501" >Jun 26 22:56:53 darkside sqlgrey: optin: greylisting active for >deb...@cr... >Jun 26 22:56:53 darkside sqlgrey: grey: unknown pattern: >murphy.debian.org, 146.82.138.6: using C-class (146.82.138). >Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: >the database system is shutting down >Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't get now() >from DB: FATAL: the database system is shutting down >Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: >the database system is shutting down >Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't access >domain_awl table: FATAL: the database system is shutting down >Jun 26 22:56:53 darkside sqlgrey: grey: domain awl match: updating >146.82.138(146.82.138.6), lists.debian.org >Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: >the database system is shutting down >Jun 26 22:56:53 darkside sqlgrey: dbaccess: warning: couldn't do query: >UPDATE domain_awl SET last_seen = NOW(), first_seen = first_seen WHERE >sender_domain = NULL AND src = NULL: FATAL: the database system is >shutting down , reconnecting to DB >Jun 26 22:56:53 darkside sqlgrey: 2005/06/26-22:56:53 CONNECT TCP Peer: >"127.0.0.1:52693" Local: "127.0.0.1:2501" >Jun 26 22:56:53 darkside sqlgrey: optin: greylisting active for >sp...@cr... >Jun 26 22:56:53 darkside sqlgrey: grey: unknown pattern: >cherry.ease.lsoft.com, 209.119.0.109: using C-class (209.119.0). >Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: FATAL: >the database system is shutting down >Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't get now() >from DB: FATAL: the database system is shutting down >Jun 26 22:56:53 darkside sqlgrey: dbaccess: can't connect to DB: could >not connect to server: Connection refused Is the server running on >host "localhost" and accepting TCP/IP connections on port 5432? >Jun 26 22:56:53 darkside sqlgrey: dbaccess: error: couldn't access >domain_awl table: could not connect to server: Connection refused Is >the server running on host "localhost" and accepting TCP/IP >connections on port 5432? >Jun 26 22:56:53 darkside sqlgrey: grey: domain awl match: updating >209.119.0(209.119.0.109), peach.ease.lsoft.com > >the PostgreSQL logs showed: >2005-06-26 20:01:48 [18926] LOG: connection authorized: user=sqlgrey >database=sqlgrey >2005-06-26 20:01:48 [3755] ERROR: prepared statement "dbdpg_1" does not >exist > >After restarting PGSQL, sqlgrey is working again. Note that I could >connect to the sqlgrey database just fine. I'm now strace'ing the >sqlgrey process, and dumping the output to a log file, hopefully that >log will give a glue as to what's happening. at the time of hanging--I >have no idea at this point. > >If there's anything else that I can do to help track down this bug, >please let me know... > > You may want to check the number of connections to the database. Do something like "netstat -anp | grep <sqlgrey-pid>" to check open TCP connections. You should only have one conection to the PostgreSQL server. I'm currently testing a 1.6.2 release (tar.bz2 at http://sqlgrey.bouton.name/sqlgrey-1.6.2.tar.bz2 until I consider stable enough to put it on sourceforge) with a safenet for the database connection handling: SQLgrey will now explicitly call disconnect on a buggy connection before reconnecting. All database connections are set up to be explicitly disconnected in this version too: SQLgrey doesn't rely on the DBD driver for reaping inactive connections anymore. Hope it can make a difference. If all fails, I'll make the fork configurable and inactive by default. Lionel. |
|
From: Ed S. (s. list) <sq...@cr...> - 2005-07-05 01:29:53
|
Lionel Bouton wrote: >You may want to check the number of connections to the database. Do >something like "netstat -anp | grep <sqlgrey-pid>" to check open TCP >connections. You should only have one conection to the PostgreSQL server. > > darkside:~# ps aux |grep sqlgrey sqlgrey 9082 0.0 0.9 13872 8792 ? Ss 11:50 0:01 /usr/bin/perl -w /usr/sbin/sqlgrey -d postgres 9085 0.0 0.9 18064 8948 ? S 11:50 0:00 postgres: sqlgrey sqlgrey 127.0.0.1 idle sqlgrey 31292 0.0 0.0 0 0 ? Z 18:49 0:00 [sqlgrey] <defunct> root 6135 0.0 0.0 1620 516 pts/1 S+ 21:16 0:00 grep sqlgrey darkside:~# netstat -anp |grep 9082 tcp 0 0 127.0.0.1:2501 0.0.0.0:* LISTEN 9082/perl tcp 0 0 127.0.0.1:36435 127.0.0.1:5432 ESTABLISHED9082/perl tcp 1 0 127.0.0.1:2501 127.0.0.1:34774 CLOSE_WAIT 9082/perl unix 3 [ ] STREAM CONNECTED 29082 9199/master unix 3 [ ] STREAM CONNECTED 28680 9082/perl >Hope it can make a difference. If all fails, I'll make the fork >configurable and inactive by default. > > While reverting back to self->cleanup() fixed the lockup problem with 1.6.0 and 1.7.0, I have the same lockup problem with 1.6.2. Cheers, -- Ed |