From: Alex <mys...@gm...> - 2014-06-30 14:09:04
|
Hi,

I'm using sqlgrey with postfix on three servers, configured using the DBCLUSTER layout as defined in the README. However, when one machine goes down, all three fail, with the following postfix message:

Jun 30 06:14:00 mail03 postfix/smtpd[32601]: NOQUEUE: reject: RCPT from bmail.bridgemailsystem.com[66.206.172.149]: 451 4.3.5 Server configuration problem; from=<mar...@lo...> to=<Rya...@ex...> proto=ESMTP helo=<bmail.bridgemailsystem.com>

Isn't it supposed to continue running on the remaining systems when one of them becomes disconnected?

This is my sqlgrey.conf on one of the slave machines:

loglevel = 3
log_override = whitelist:1,grey:3,spam:2
reconnect_delay = 5
db_type = mysql
db_name = sqlgrey
db_host = ns1.example.com
db_port = default
db_user = sqlgrey
db_pass = mypass
db_cleanup_hostname=ns1.example.com
db_cleandelay = 1800
clean_method = sync
db_cluster = on
read_hosts=localhost,mail02.example.com,mail03.example.com,mail01.example.com
prepend = 1
admin_mail = my...@me...

Any ideas greatly appreciated.

Thanks,
Alex
|
From: Jernej P. <jer...@ar...> - 2014-06-30 14:32:19
|
Dear Alex,

You could use hapolicy instead (http://postfwd.org/hapolicy/index.html) and run multiple instances of sqlgrey on multiple machines.

I am not sure whether I completely understand your setup: you have a three-node cluster with MySQL master-master replication?

We have successfully deployed sqlgrey with a MySQL master-slave configuration, where reads were performed on the slave nodes, while SQL writes were done on the master node. After a while, we ditched sqlgrey in favour of postfwd2 and hapolicy...

cheers, Jernej

On 30/06/14 16:08, Alex wrote:
> Hi,
>
> I'm using sqlgrey with postfix on three servers, configured using the
> DBCLUSTER layout as defined in the README. However, when one machine
> goes down, all three fail, with the following postfix message:
>
> Jun 30 06:14:00 mail03 postfix/smtpd[32601]: NOQUEUE: reject: RCPT from
> bmail.bridgemailsystem.com[66.206.172.149]: 451 4.3.5 Server
> configuration problem; from=<mar...@lo...> to=<Rya...@ex...>
> proto=ESMTP helo=<bmail.bridgemailsystem.com>
>
> Isn't it supposed to continue running on the remaining systems when one
> of them becomes disconnected?
>
> This is my sqlgrey.conf on one of the slave machines:
>
> loglevel = 3
> log_override = whitelist:1,grey:3,spam:2
> reconnect_delay = 5
> db_type = mysql
> db_name = sqlgrey
> db_host = ns1.example.com
> db_port = default
> db_user = sqlgrey
> db_pass = mypass
> db_cleanup_hostname=ns1.example.com
> db_cleandelay = 1800
> clean_method = sync
> db_cluster = on
> read_hosts=localhost,mail02.example.com,mail03.example.com,mail01.example.com
> prepend = 1
> admin_mail = my...@me...
>
> Any ideas greatly appreciated.
> Thanks,
> Alex
>
> ------------------------------------------------------------------------------
> Open source business process management suite built on Java and Eclipse
> Turn processes into business applications with Bonita BPM Community Edition
> Quickly connect people, data, and systems into organized workflows
> Winner of BOSSIE, CODIE, OW2 and Gartner awards
> http://p.sf.net/sfu/Bonitasoft
>
> _______________________________________________
> Sqlgrey-users mailing list
> Sql...@li...
> https://lists.sourceforge.net/lists/listinfo/sqlgrey-users
|
From: Alex <mys...@gm...> - 2014-06-30 19:20:02
|
Hi,

> You could use hapolicy instead (http://postfwd.org/hapolicy/index.html)
> and run multiple instances of sqlgrey on multiple machines.

If it wasn't already clear, I am running an instance of sqlgrey on each machine, all of which talk to one master, the one that happened to go down this morning. This resulted in none of them apparently being able to talk to their own sqlgrey service, and they just started rejecting mail.

> I am not sure whether I completely understand your setup: you have a
> three-node cluster with MySQL master-master replication?

I'm a mysql novice, but I think it's just a master-slave situation. They all should have their own copies of the complete greylist.

> We have successfully deployed sqlgrey with a MySQL master-slave
> configuration, where reads were performed on the slave nodes, while SQL
> writes were done on the master node. After a while, we ditched sqlgrey
> in favour of postfwd2 and hapolicy...

So did you ditch it for this reason? That sounds like how I have it set up here. Is it not possible to create a fault-tolerant sqlgrey system on its own?

Would you be able to send your postfwd2 and hapolicy configs as a reference to get started?

I also realized I made a typo in the configuration file I posted here, which doesn't exist on my production system. Here are the relevant bits. This one has db_host set properly, in case that matters for reference here:

loglevel = 3
log_override = whitelist:1,grey:3,spam:2
reconnect_delay = 5
db_type = mysql
db_name = sqlgrey
db_host = mail01.example.com
db_port = default
db_user = sqlgrey
db_pass = mypass
db_cleanup_hostname=mail01.example.com
db_cleandelay = 1800
clean_method = sync
db_cluster = on
read_hosts=localhost,mail01.example.com,mail02.example.com,mail03.example.com
prepend = 1
admin_mail = my...@me...

Thanks,
Alex
|
From: Jernej P. <jer...@ar...> - 2014-07-01 08:59:22
|
Dear Alex,

On 30/06/14 21:19, Alex wrote:
>> We have successfully deployed sqlgrey with a MySQL master-slave
>> configuration, where reads were performed on the slave nodes, while SQL
>> writes were done on the master node. After a while, we ditched sqlgrey
>> in favour of postfwd2 and hapolicy...
>
> So did you ditch it for this reason? That sounds like how I have it set
> up here. Is it not possible to create a fault-tolerant sqlgrey system on
> its own?

I think that you would need a multi-master SQL setup to be able to use sqlgrey the way you are trying to set it up. The problem is that when the MySQL master goes down, sqlgrey is unable to update the database and it fails.

I don't remember the details of sqlgrey's DB_CLUSTER setup anymore, but as I recall it only offloads the read queries to the slaves, while still relying on the write master to be up all the time. If the write node is down, sqlgrey fails...

IMHO, you have two options:
- use hapolicy in front of sqlgrey with its default action set to DUNNO, so that hapolicy answers DUNNO if sqlgrey fails after a MySQL failure
- set up an HA MySQL cluster where the write node never fails (multi-master etc. - there are a few options available)

I would go with hapolicy: it is easier to maintain, and there is no real hassle if sqlgrey fails...

> Would you be able to send your postfwd2 and hapolicy configs as a
> reference to get started?

Sorry, my setup is site-specific, however both tools have good documentation and live support mailing lists, so no worries. If you hit a barrier, just ask a question there...

cheers, Jernej
|
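For readers unfamiliar with hapolicy: it sits between Postfix and one or more policy servers, forwards each request to the first reachable backend, and answers DUNNO itself if every backend is down, so mail keeps flowing. A rough sketch of the wiring Jernej suggests follows; the port numbers, install path, and backend argument syntax here are assumptions from memory and should be checked against the hapolicy documentation:

```
# /etc/postfix/master.cf -- hapolicy runs under spawn(8)
127.0.0.1:10060 inet  n  n  n  -  0  spawn
  user=nobody argv=/usr/local/bin/hapolicy
    grey1=127.0.0.1:2501 grey2=mail02.example.com:2501

# /etc/postfix/main.cf -- point Postfix at hapolicy instead of sqlgrey
smtpd_recipient_restrictions =
    permit_mynetworks,
    reject_unauth_destination,
    check_policy_service inet:127.0.0.1:10060
```

The key property is that a dead backend costs hapolicy only a short connect attempt before it tries the next one or gives up with DUNNO, so Postfix never sees a stalled policy service.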
From: Lionel B. <lio...@bo...> - 2014-07-01 09:39:51
|
Le 01/07/2014 10:59, Jernej Porenta a écrit :
> I think that you would need a multi-master SQL setup to be able to use
> sqlgrey the way you are trying to set it up. The problem is that when
> the MySQL master goes down, sqlgrey is unable to update the database and
> it fails.

It shouldn't. I didn't write the cluster support, but with a single server I coded SQLgrey to handle database failures gracefully and stop greylisting until the database server restarts.

There's one exception: SQLgrey doesn't start correctly if the database server is unavailable; once it runs, it should not fail.

You can consider this a bug in the cluster support (and might want to test SQLgrey without a cluster setup).

Best regards,

Lionel.
|
From: Jernej P. <jer...@ar...> - 2014-07-01 09:50:33
|
On 01/07/14 11:21, Lionel Bouton wrote:
> It shouldn't. I didn't write the cluster support, but with a single
> server I coded SQLgrey to handle database failures gracefully and stop
> greylisting until the database server restarts.
> There's one exception: SQLgrey doesn't start correctly if the database
> server is unavailable; once it runs, it should not fail.

Does "stop greylisting" mean responding with DUNNO, or not responding at all?

If it responds with DUNNO, then postfix continues working normally; otherwise postfix issues a "Server configuration problem" error and defers the mail. This happens with all non-responsive policy servers...

I know that sqlgrey does a great job at reconnecting to failing mysql servers, however I don't know the details behind it...

cheers, J.
|
From: Lionel B. <lio...@bo...> - 2014-07-01 10:29:01
|
Le 01/07/2014 11:50, Jernej Porenta a écrit :
> Does "stop greylisting" mean responding with DUNNO, or not responding at
> all?

DUNNO.

> If it responds with DUNNO, then postfix continues working normally

That's the expected behaviour.

Lionel
|
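The DUNNO exchange discussed above is part of the Postfix policy delegation protocol: Postfix sends a block of name=value attribute lines terminated by an empty line, and the policy server replies with a single action line plus an empty line. A minimal sketch of that framing (the attribute values in the sample request are hypothetical):

```python
def parse_policy_request(blob: str) -> dict:
    """Parse one Postfix policy delegation request:
    name=value lines, terminated by an empty line."""
    attrs = {}
    for line in blob.splitlines():
        if not line:          # empty line ends the request
            break
        name, _, value = line.partition("=")
        attrs[name] = value
    return attrs

def policy_response(action: str) -> str:
    """Format a policy reply; Postfix expects 'action=<verdict>' plus an empty line."""
    return "action=%s\n\n" % action

# A request roughly as Postfix would send it (values are hypothetical):
req = ("request=smtpd_access_policy\n"
       "protocol_state=RCPT\n"
       "client_address=192.0.2.1\n"
       "sender=someone@example.org\n"
       "recipient=user@example.net\n"
       "\n")
print(parse_policy_request(req)["client_address"])  # 192.0.2.1
print(repr(policy_response("dunno")))               # 'action=dunno\n\n'
```

A server that answers `action=dunno` lets the current restriction list continue, which is exactly the fail-open behaviour Lionel describes for a database outage.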
From: Dan F. <da...@ha...> - 2014-07-01 12:31:44
|
Hi Alex.

I wrote the DBCluster code. I've just tested the scenario where the sql-master dies, and in my test it continues as expected, simply allowing everything through. I've tested both running sqlgrey and losing the master, and restarting sqlgrey with the master gone. If the master-db doesn't work, it simply seems to keep calm and carry on, in both cases.

When sqlgrey tries to reconnect, it fails and will keep failing. When this happens I can clearly see it in my logfile:

sqlgrey: dbaccess: Using DBIx:DBCluster
sqlgrey: warning: Could not connect to any server in WRITE_HOSTS at ./sqlgrey line 833
sqlgrey: dbaccess: can't connect to DB: Can't connect to MySQL server on '127.0.0.2' (111)
sqlgrey: dbaccess: error: couldn't access optout_domain table: Can't connect to MySQL server on '127.0.0.2' (111)

Perhaps you can find some sqlgrey log output on this issue, as the postfix error you quoted isn't telling me much.

In fact, the ONLY way I have been able to get a "Server configuration problem" in my tests is if I point db_host to a server behind a firewall that DROPS packets. This makes "connect" hang for a very long time, which makes postfix drop the connection due to timeout and cry "Server configuration problem".

- Dan

Alex wrote:
> If it wasn't already clear, I am running an instance of sqlgrey on each
> machine, all of which talk to one master, the one that happened to go
> down this morning. This resulted in none of them apparently being able
> to talk to their own sqlgrey service, and they just started rejecting
> mail.
>
> I'm a mysql novice, but I think it's just a master-slave situation. They
> all should have their own copies of the complete greylist.
>
> So did you ditch it for this reason? That sounds like how I have it set
> up here. Is it not possible to create a fault-tolerant sqlgrey system on
> its own?
|
From: Alex <mys...@gm...> - 2014-07-01 20:24:33
|
Hi,

> I wrote the DBCluster code. I've just tested the scenario where the
> sql-master dies, and in my test it continues as expected, simply
> allowing everything through.

That's good to know. I'd definitely like to see about getting sqlgrey working properly before trying alternatives, so I very much appreciate your help.

> When sqlgrey tries to reconnect, it fails and will keep failing. When
> this happens I can clearly see it in my logfile: [...]

Yes, those are the very same messages I receive:

Jun 30 06:35:16 mail03 sqlgrey: warning: Could not connect to any server in WRITE_HOSTS at /usr/sbin/sqlgrey line 827.
Jun 30 06:35:16 mail03 sqlgrey: dbaccess: can't connect to DB: Can't connect to MySQL server on 'mail02.example.com' (113)
Jun 30 06:35:16 mail03 sqlgrey: dbaccess: error: couldn't access config table: Can't connect to MySQL server on 'mail02.example.com' (113)
Jun 30 06:35:16 mail03 sqlgrey: mail: failed to send:
Jun 30 06:35:16 mail03 sqlgrey: fatal: setconfig error at /usr/sbin/sqlgrey line 195.

I have sqlgrey defined as such in master.cf:

greylist  unix  -       n       n       -       0       spawn
  user=nobody argv=/usr/bin/perl /usr/sbin/sqlgrey

and "check_policy_service inet:127.0.0.1:2501" in main.cf.

> In fact, the ONLY way I have been able to get a "Server configuration
> problem" in my tests is if I point db_host to a server behind a firewall
> that DROPS packets. This makes "connect" hang for a very long time,
> which makes postfix drop the connection due to timeout and cry "Server
> configuration problem".

Have I configured postfix incorrectly? I'll include my sqlgrey.conf again, in hopes it helps.

loglevel = 3
log_override = whitelist:1,grey:3,spam:2
reconnect_delay = 5
db_type = mysql
db_name = sqlgrey
db_host = mail02.example.com
db_port = default
db_user = sqlgrey
db_pass = mypass
db_cleanup_hostname=mail02.example.com
db_cleandelay = 1800
clean_method = sync
db_cluster = on
read_hosts=localhost,mail02.example.com,mail03.example.com,mail01.example.com
prepend = 1
admin_mail = my...@me...

Thanks again,
Alex
|
From: Dan F. <da...@ha...> - 2014-07-03 14:49:00
|
Alex wrote:
> Yes, those are the very same messages I receive:
>
> Jun 30 06:35:16 mail03 sqlgrey: warning: Could not connect to any server
> in WRITE_HOSTS at /usr/sbin/sqlgrey line 827.
> Jun 30 06:35:16 mail03 sqlgrey: dbaccess: can't connect to DB: Can't
> connect to MySQL server on 'mail02.example.com' (113)
> Jun 30 06:35:16 mail03 sqlgrey: dbaccess: error: couldn't access config
> table: Can't connect to MySQL server on 'mail02.example.com' (113)
> Jun 30 06:35:16 mail03 sqlgrey: mail: failed to send:
> Jun 30 06:35:16 mail03 sqlgrey: fatal: setconfig error at
> /usr/sbin/sqlgrey line 195.

I believe error 113 means "no route to host", and that should fail instantly. Which means you are probably not having timeout issues. And then I'm at a loss, since I cannot reproduce the issue here. And if there's nothing else in the logs from sqlgrey indicating errors, well...

I'd go with Lionel's suggestion to try and run sqlgrey without db_clustering to simplify the setup. Though I don't think it'll show any difference, it should be an easy test and it will rule out (or confirm) that it has something to do with db_clustering.

Then I'd try the same with the "spawn" setup you described below. Does it make any difference if you comment out that line and simply run it using

$ /usr/sbin/sqlgrey -d

I'm just thinking that if I cannot reproduce your error, it must be something specific to your setup.

- Dan

> I have sqlgrey defined as such in master.cf:
>
> greylist  unix  -       n       n       -       0       spawn
>   user=nobody argv=/usr/bin/perl /usr/sbin/sqlgrey
>
> and "check_policy_service inet:127.0.0.1:2501" in main.cf.
>
> Have I configured postfix incorrectly? I'll include my sqlgrey.conf
> again, in hopes it helps.
|
From: Alex <mys...@gm...> - 2014-07-03 17:29:55
|
Hi,

> I believe error 113 means "no route to host", and that should fail
> instantly. Which means you are probably not having timeout issues. And
> then I'm at a loss, since I cannot reproduce the issue here. And if
> there's nothing else in the logs from sqlgrey indicating errors, well...

Yes, the server was unreachable because it was down.

> I'd go with Lionel's suggestion to try and run sqlgrey without
> db_clustering to simplify the setup. Though I don't think it'll show any
> difference, it should be an easy test and it will rule out (or confirm)
> that it has something to do with db_clustering.

I don't really see how that's an option, though, because a client could conceivably have to try three different servers before being allowed to connect, meaning up to a fifteen-minute delay before the mail is even accepted, assuming the client even retries that many times, which I doubt it would. That's the whole reason for clustering in the first place.

> Then I'd try the same with the "spawn" setup you described below. Does
> it make any difference if you comment out that line and simply run it
> using
> $ /usr/sbin/sqlgrey -d

That is how I'm running it.

> I'm just thinking that if I cannot reproduce your error, it must be
> something specific to your setup.

The configuration isn't all that complex. Have you tested your environment and know that yours works properly? Could you post your config so I can compare with mine? How did you set up mysql?

I'd really like to stick with sqlgrey if at all possible, so I'd really appreciate your help in figuring this out.

Thanks,
Alex
|
From: Lionel B. <lio...@bo...> - 2014-07-01 21:13:30
|
Hi,

Just a heads-up about this:

> I have sqlgrey defined as such in master.cf:
>
> greylist  unix  -       n       n       -       0       spawn
>   user=nobody argv=/usr/bin/perl /usr/sbin/sqlgrey

That's odd. I didn't expect to see it run like that, and I'm not sure under which circumstances it makes sense. spawn daemons are only launched if something tries to connect to them, and they expect communication on STDIN/OUT/ERR: http://www.postfix.org/spawn.8.html

SQLgrey wasn't designed to work like that and should be launched as a separate service. I'm not sure how you make it work at all, unless this configuration in master.cf is simply not used and SQLgrey is started separately.

Best regards,

Lionel
|
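The intended wiring, pieced together from details elsewhere in this thread (port 2501, `sqlgrey -d`, `check_policy_service`), looks roughly like this; the surrounding restriction list is illustrative, not a recommendation:

```
# Start SQLgrey as a standalone daemon (init script / service manager,
# NOT from master.cf); by default it listens on 127.0.0.1:2501.
/usr/sbin/sqlgrey -d

# /etc/postfix/main.cf -- Postfix talks to the running daemon:
smtpd_recipient_restrictions =
    permit_mynetworks,
    reject_unauth_destination,
    check_policy_service inet:127.0.0.1:2501
```

With this layout there is nothing greylist-related in master.cf at all, which is what Lionel is pointing out.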
From: Alex <mys...@gm...> - 2014-07-01 21:18:31
|
Hi,

> Just a heads-up about this:
>
> > I have sqlgrey defined as such in master.cf:
> >
> > greylist  unix  -       n       n       -       0       spawn
> >   user=nobody argv=/usr/bin/perl /usr/sbin/sqlgrey
>
> That's odd. I didn't expect to see it run like that, and I'm not sure
> under which circumstances it makes sense. spawn daemons are only
> launched if something tries to connect to them, and they expect
> communication on STDIN/OUT/ERR: http://www.postfix.org/spawn.8.html
>
> SQLgrey wasn't designed to work like that and should be launched as a
> separate service. I'm not sure how you make it work at all, unless this
> configuration in master.cf is simply not used and SQLgrey is started
> separately.

Ugh, you're right. I'm starting sqlgrey separately as a standalone program. I recall now that I was experimenting with this early on, trying to get it to work the way postfwd works, but had abandoned it.

Thanks,
Alex
|
From: <da...@ha...> - 2014-07-03 20:59:10
|
On 2014-07-03 19:29, Alex wrote:
> > I believe error 113 means "no route to host", and that should fail
>
> Yes, the server was unreachable because it was down.

Yes, sorry, my point was perhaps unclear. I was just trying to say that out of the many errors you could have gotten, you got 113, and that 113 should fail fast and not hang.

But in the meantime, I'd like to revise that statement. I have actually gotten a 113 that hangs now. I finally succeeded in getting it to do so by entering db_host as 192.188.1.3, which for me apparently cannot be routed. And now I'm seeing delays which may support my original "timeout" theory.

So I need you to test something. Change your:

db_host = mail02.example.com

to:

db_host = mail02.example.com;mysql_connect_timeout=1

(same line, no extra spaces), then restart sqlgrey and see if it helps.

Also: what version of sqlgrey are you running?

> > I'd go with Lionel's suggestion to try and run sqlgrey without
> > db_clustering to simplify the setup.
>
> I don't really see how that's an option, though, because a client
> could conceivably have to try three different servers before being
> allowed to connect, meaning up to a fifteen-minute delay before the
> mail is even accepted, assuming the client even retries that many
> times, which I doubt it would. That's the whole reason for clustering
> in the first place.

Well... no. The reason you mention is the reason for using a central sql-server. The reason for db-clustering is the performance of the central sql-server. All your mail-nodes use the same write-host, and so the write host will have the same data as your read hosts.

There's no technical reason why all your mailservers couldn't use one central database, like so:

[mail1] ---> [db] <---- [mail*]

The reason I created db-clustering was because I had some 10 mailservers at the time, with one central database, and bot-nets were hammering sqlgrey, causing the db to hang sometimes due to the sheer amount of lookups. So I set up a mysql slave on each mailserver, had them replicate data from the master, and made sqlgrey read from localhost only. This removed all the "read" load from the db-master.

Under normal load, I can easily point all queries to the db-master without any problems. I just tested with db_cluster=off and I can see select queries going to the master now, instead of localhost. And everything else works fine.

> > Then I'd try the same with the "spawn" setup you described below.
> > Does it make any difference if you comment out that line and simply
> > run it using
> > $ /usr/sbin/sqlgrey -d
>
> That is how I'm running it.

Ah yes. I misread your reply to Lionel. Sorry.

> The configuration isn't all that complex. Have you tested your
> environment and know that yours works properly?

Yes. I take a node out of my production environment temporarily for testing, and everything I test acts as expected.

> Could you post your config so I can compare with mine?

loglevel = 2
reconnect_delay = 5
max_connect_age = 3
connect_src_throttle = 15
awl_age = 32
group_domain_level = 10
db_type = mysql
db_name = sqlgrey
db_host = dbmaster.example.com
db_user = sqlgreyuser
db_pass = password
db_cleandelay = 60
db_cluster = on
read_hosts=localhost
prepend = 0
optmethod = optout
discrimination = on
discrimination_add_rulenr = on
reject_first_attempt = immed
reject_early_reconnect = immed
reject_code = 451

> How did you set up mysql?

1 master and many slaves replicating. Each slave lives on the mailserver-node, together with postfix and sqlgrey. All sqlgreys use localhost for read, master for write.
|
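The replication layout Dan describes (one write master, a read slave on each mail node) can be sketched in MySQL configuration terms. This is a minimal sketch, assuming classic MySQL 5.x binlog replication; the server ids, hostnames, and the choice to replicate only the sqlgrey database are placeholders, not Dan's actual settings:

```
# master (dbmaster.example.com) -- /etc/my.cnf
[mysqld]
server-id    = 1
log-bin      = mysql-bin
binlog-do-db = sqlgrey

# each mailserver node -- /etc/my.cnf
[mysqld]
server-id       = 2        # must be unique per slave
replicate-do-db = sqlgrey
read-only       = 1        # sqlgrey writes go to the master, not here
```

Each slave is then pointed at the master with `CHANGE MASTER TO MASTER_HOST='dbmaster.example.com', ...` followed by `START SLAVE;`, after which sqlgrey's `read_hosts=localhost` queries hit the local replica while writes still go to `db_host`.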
From: Alex <mys...@gm...> - 2014-07-04 01:44:47
|
Hi, On Thu, Jul 3, 2014 at 4:59 PM, <da...@ha...> wrote: > On 2014-07-03 19:29, Alex wrote: > > > I believe the error 113 means "no route to host" and that should fail > > Yes, the server was unreachable because it was down. > > Yes, sorry, my point was perhaps unclear. > I was just trying to say, that out of the many errors you could have > gotten, you got 113. And that 113 should fail fast and not hang. > I'm not sure it actually hung. I realized it was a problem when every mail that was being received was immediately rejected due to "Server configuration error". The messages weren't queued or delayed in any way. All mail on all three systems were immediately being rejected, for more than an hour before I was able to bring the server back and restart sqlgrey on each system. > But in the meantime, id's like to revise that statement. I have actaully > gotten a 113 that hangs now. > I finally succeeded in getting it to do so, by entering db_host as > 192.188.1.3, which for me apparently cannot be routed. > Okay, maybe your definition of "hang" is different than mine, but perhaps we're really talking about the same thing. In any case, when my system fails, it just outright rejects mail across all systems, apparently because it can't talk to the master. > > And now im seeing delays which may support to my original "timeout" > theory. So i need you to test something. > change your: > db_host = mail02.example.com > to: > db_host = mail02.example.com;mysql_connect_timeout=1 > > (same line, no extra spaces) and the restart sqlgrey and see if it helps. > Please confirm that you think I should do this, given the new information about failures above. > > Also. What version of sqlgrey are you running? > sqlgrey-1.8.0 compiled here locally. > > I'd go with Lionel's suggestion to try and run sqlgrey without > > db_clustering to simplify the setup. 
Though i dont think itll show any > > difference, it should be an easy test and it will rule out (or confirm) > > that it has something to do with db_clustering. > > I don't really see how that's an option, though, because a client could > conceivably have to try three different servers before being allowed to > connect, meaning up to a fifteen minute delay before the mail is even > accepted, assuming the client even retries that many times, which I doubt > it would. That's the whole reason for clustering in the first place. > > Well.. No. The reason you mention, is the reason for using a central > sql-server. The reason for db-clustering, is the performance of the central > sql-server. > All your mail-nodes use the same write-host. And so the write host will > have the same data as your readhosts. > > Theres no technical reason why all your mailservers couldnt use one > central database, like so: > > [mail1] ---> [db] <---- [mail*] > > The reason i created dbclustering, was because i had some 10 mailservers > at the time, with one central database and bot-nets were hammering sqlgrey, > causing the db to hang sometimes, due to the sheer amount of lookups. > So i setup a mysql-slave on each mailserver, had them replicate data from > the master and made sqlgrey read from localhost only. This removed all the > "read" load from the db-master. > Yes, okay, I do understand that. I should have written that as well, but my main reason is to avoid users from being greylisted numerous times for sending mail to the same user in the same domain. > > Under normal load, i can easily point all queries to the db-master, > without any problems. I just tested with db_cluster=off and i can see > select queries going to the master now, instead of localhost. And > everything else works fine. > Okay, but if the master dies, then no queries occur, correct? Could you post your config so I can compare with mine? 
> loglevel = 2
> reconnect_delay = 5
> max_connect_age = 3
> connect_src_throttle = 15
> awl_age = 32
> group_domain_level = 10
>
> db_type = mysql
> db_name = sqlgrey
> db_host = dbmaster.example.com
> db_user = sqlgreyuser
> db_pass = password
> db_cleandelay = 60
> db_cluster = on
> read_hosts=localhost
> prepend = 0
> optmethod = optout
> discrimination = on
> discrimination_add_rulenr = on
> reject_first_attempt = immed
> reject_early_reconnect = immed
> reject_code = 451

There are a few options there that I'm not using, and don't recognize, but I don't believe the lack of any of them would cause the issue I'm having, correct? How did you set up MySQL?

> 1 master and many slaves replicating. Each slave lives on the
> mail-server node, together with postfix and sqlgrey.
> All sqlgreys use localhost for read, master for write.

Ah, I think I have it configured for all hosts to write to the one master. How can you have all hosts write to the local database, yet have any kind of synchronization between tables? I'm pretty sure I set it up according to the way it was documented, particularly given I don't know much about replication myself. Hopefully this info helps better isolate where I'm going wrong.

Thanks,
Alex
|
From: Dan F. <da...@ha...> - 2014-07-04 09:44:48
|
Alex wrote:
> Okay, maybe your definition of "hang" is different than mine, but perhaps

By hanging, I mean "any network connection or connection attempt that stalls for more than a few seconds".

You have this whole chain of individual connections:
internet -> postfix -> sqlgrey -> mysql

Each of these has a timeout value, which doesn't have to be the same. So when postfix connects to sqlgrey, it's not going to wait forever for a reply. If sqlgrey's attempt to connect to mysql "hangs" for more seconds than postfix is willing to wait, postfix kills the connection and replies "Server configuration error".

Thus, if your mysql connection attempt doesn't time out fast enough, sqlgrey never gets a chance to reply "dunno" to postfix and allow the mail to go through.

>> db_host = mail02.example.com;mysql_connect_timeout=1
>>
>> (same line, no extra spaces) and then restart sqlgrey and see if it
>> helps.
>
> Please confirm that you think I should do this, given the new information
> about failures above.

Yes, I think you should :). I have tested this with 1.7.4 and 1.8.0 and it works in both cases. What I'm doing is simply adding a connect timeout of 1 second to the mysql connection. So if the connect attempt hangs (as per my earlier definition), it will give up after one second. (Of course you could use more than 1 second, if you worry that your SQL server will ever be slower than 1 second to accept a connection.)

In my tests, this solves the issue, because postfix doesn't have to time out the connection to sqlgrey and everything remains shiny. (shiny = "mails will pass through unhindered while the SQL server is down")

> Yes, okay, I do understand that. I should have written that as well, but
> my main reason is to avoid users being greylisted numerous times for
> sending mail to the same user in the same domain.

For that, you only need 1 SQL server, shared among all mail servers, and sqlgrey running with db_cluster=off.
"db_cluster=on" is only needed if the 1 SQL server can't service all your mail servers fast enough. (I'm not saying that you're doing it wrong, I'm just pointing out the different motivations.)

>> Under normal load, i can easily point all queries to the db-master,
>> without any problems. I just tested with db_cluster=off and i can see
>
> Okay, but if the master dies, then no queries occur, correct?

Correct. But no queries occur in db_cluster=on mode either, if the master dies. sqlgrey defaults back to "allow everything" if db_host (the master) dies, and as such there is no need to do queries anymore until the master is online again.

>> read_hosts=localhost prepend = 0 optmethod = optout discrimination = on
> There are a few options there that I'm not using, and I don't recognize,
> but I don't believe the lack of any of them would cause the issue I'm
> having, correct?

No. There are no undocumented settings here that relate to database connections. In fact, the only option I'd try to change in your case would be prepend. Though I doubt it has any effect, it does change the way sqlgrey responds to postfix, and if postfix doesn't understand the response, you get "Server configuration problem".

>> 1 master and many slaves replicating. Each slave lives on the
>> mailserver-node, together with postfix and sqlgrey. All sqlgrey's use
>> localhost for read, master for write.
>
> Ah, I think I have it configured for all hosts to write to the one
> master. How can you have all hosts write to the local database, yet have
> any kind of synchronization between tables?

Hmm.. Let me just explain MySQL replication real quick:

You have a MySQL server. You do reads and writes and everything is fine. Now you'd like a "replica". So you make a NEW MySQL server, calling it "slave01". Then you instruct slave01 to "replicate" from the master. The slave is actually doing all the work, replication-wise.
The master doesn't know and doesn't care how the slave is doing, whether it's behind or whatever. And you can add more slaves and the master still doesn't know or care. The master doesn't know it's a master. It doesn't "act" differently. It can still do reads and writes, just like when it was stand-alone.

Any statement executed on the master that would change data in any way gets executed on all the slaves as well, via replication. On the slaves, you can do reads (technically they can also do writes, but writing would not be smart, as it causes inconsistencies with the master and can make replication stop dead). If writes WERE to be done on a slave, the changes would NOT be replicated to the master. That's simply not how it works. The slaves copy all INSERT, REPLACE, UPDATE, DELETE, CREATE, ALTER, etc. statements from the master and execute them on themselves.

So now you have 1 server where you can read and write all you like, and X slave servers, which should have the same data as the master, where you can do read queries.

So now that we know that slaves are just a read-only copy of the master, and the master is still just a normal MySQL server, I assume you can see why disabling db-clustering won't change anything, as long as the master doesn't suffer from poor performance. All that happens by setting db_cluster=off is that the slaves won't be used for reads anymore and all read queries will go to the master instead.

Hope that makes it clearer.

- Dan
|
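For reference, the slave-side setup Dan describes boils down to a couple of statements on each mail-server node. This is only a sketch of classic MySQL master-slave replication; the hostname, credentials and binlog coordinates below are placeholders, not values from this thread:

```sql
-- Run once on each slave (mail-server node); values are illustrative.
CHANGE MASTER TO
  MASTER_HOST = 'dbmaster.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'replpass',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS = 4;
START SLAVE;

-- Verify replication is healthy: look for Slave_IO_Running: Yes
-- and Slave_SQL_Running: Yes in the output.
SHOW SLAVE STATUS\G
```

The master additionally needs binary logging enabled and a replication user granted REPLICATION SLAVE; consult the MySQL replication documentation for the full procedure.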
From: Alex <mys...@gm...> - 2014-07-04 14:14:49
|
Hi,

> > Okay, maybe your definition of "hang" is different than mine, but perhaps
>
> By hanging, I mean "any network connection or connection attempt that
> stalls for more than a few seconds".

Okay, that's how I understand it, but that's not what's happening here. There are two scenarios where I get the "451 4.3.5 Server configuration problem" error. The first is if sqlgrey dies on any system; then that system will respond with the error. The second is when mysql is stopped on the master server. (After adding your mysql_connect_timeout=1 option, it no longer fails when mysql dies.)

However, postfix still responds with "server configuration ..." if sqlgrey is dead or inaccessible. This is the issue I need to fix now.

> You have this whole chain of individual connections:
> internet -> postfix -> sqlgrey -> mysql
>
> Each of these has a timeout value, which doesn't have to be the same.
> So when postfix connects to sqlgrey, it's not going to wait forever for a
> reply. If sqlgrey's attempt to connect to mysql "hangs" for more seconds
> than postfix is willing to wait, postfix kills the connection and replies
> "Server configuration error".
>
> Thus, if your mysql connection attempt doesn't time out fast enough,
> sqlgrey never gets a chance to reply "dunno" to postfix and allow the mail
> to go through.

That's assuming sqlgrey is still around to respond. I need to also consider the possibility where sqlgrey dies.

> >> db_host = mail02.example.com;mysql_connect_timeout=1
> >>
> >> (same line, no extra spaces) and then restart sqlgrey and see if it
> >> helps.

Okay, this did appear to solve the problem where the master mysqld is not able to respond. It no longer responds with "Server configuration ...", which is good. I don't see that option in the default documentation. Where is this documented?

> > Please confirm that you think I should do this, given the new
> > information about failures above.
>
> Yes. I think you should :).
> I have tested this with 1.7.4 and 1.8.0 and it works in both cases.
> What I'm doing is simply adding a connect timeout of 1 second to the
> mysql connection. So if the connect attempt hangs (as per my earlier
> definition), it will give up after one second. (Of course you could use
> more than 1 second, if you worry that your SQL server will ever be slower
> than 1 second to accept a connection.)

So then in my setup, where the master mysql daemon is unavailable, each client references their own database? And no updating is occurring since they aren't configured as write servers, correct?

> In my tests, this solves the issue, because postfix doesn't have to
> time out the connection to sqlgrey and everything remains shiny.
>
> (shiny = "mails will pass through unhindered while the sql-server is down")

So postfix was always waiting patiently enough; it was sqlgrey that was responding with failure too quickly?

> > Yes, okay, I do understand that. I should have written that as well,
> > but my main reason is to avoid users being greylisted numerous times
> > for sending mail to the same user in the same domain.
>
> For that, you only need 1 SQL server, shared among all mail servers, and
> sqlgrey running with db_cluster=off.
>
> "db_cluster=on" is only needed if the 1 SQL server can't service all your
> mail servers fast enough.
>
> (I'm not saying that you're doing it wrong, I'm just pointing out the
> different motivations.)

So it's okay to leave it on, correct? Wouldn't this also serve to make it possible for existing entries to be queried through the local copies while the master is unavailable?

> >> Under normal load, i can easily point all queries to the db-master,
> >> without any problems. I just tested with db_cluster=off and i can see
> >
> > Okay, but if the master dies, then no queries occur, correct?
>
> Correct. But no queries occur in db_cluster=on mode either, if the
> master dies.
> sqlgrey defaults back to "allow everything" if db_host (the master)
> dies, and as such there is no need to do queries anymore until the
> master is online again.

Each client has a local copy of the database, no? And by setting read_hosts to contain at least localhost, it should then be able to query the local database, no?

> >> read_hosts=localhost prepend = 0 optmethod = optout discrimination = on
> > There are a few options there that I'm not using, and I don't recognize,
> > but I don't believe the lack of any of them would cause the issue I'm
> > having, correct?
>
> No. There are no undocumented settings here that relate to database
> connections. In fact, the only option I'd try to change in your case
> would be prepend. Though I doubt it has any effect, it does change the
> way sqlgrey responds to postfix. And if postfix doesn't understand the
> response, you get "Server configuration problem".

I don't see where these options are defined either.

> So now that we know that slaves are just a read-only copy of the master,
> and the master is still just a normal MySQL server, I assume you can see
> why disabling db-clustering won't change anything, as long as the master
> doesn't suffer from poor performance. All that happens by setting
> db_cluster=off is that the slaves won't be used for reads anymore and
> all read queries will go to the master instead.

Okay, got it. I think I got confused, but I believe I understood it correctly, in that when the master is down, the slaves can continue to read from their local database. I think it was just the db_cluster terminology that I wasn't understanding there.

Thanks again,
Alex
|
From: <da...@ha...> - 2014-07-04 16:18:27
|
On 2014-07-04 16:14, Alex wrote:
> > By hanging, I mean "any network connection or connection attempt that
> > stalls for more than a few seconds".
>
> Okay, that's how I understand it, but that's not what's happening here.

All evidence so far points to this explanation, including the fact that my timeout fix worked. I'm unsure what you base your assumption on that this is not what's happening, as the logs won't show you this, and you'd need to do something like modifying the sqlgrey code to provide you with debugging information, or use telnet/netcat to talk to sqlgrey & postfix.

> There are two scenarios where I get the "451 4.3.5 Server
> configuration problem" error. The first is if sqlgrey dies on any
> system, then that system will respond with the error.
> That's assuming sqlgrey is still around to respond. I need to also
> consider the possibility where sqlgrey dies.

I've never experienced sqlgrey just dying on me, but if it happens, it is postfix that decides what to respond. It cannot be influenced by sqlgrey. And the error, 451, is a temporary error, so mails will be delivered once sqlgrey is running again. I don't think there's a setting in postfix to choose default answers to policy-daemon failures, so this will be the same issue with any postfix policy daemon that isn't running.

> >> db_host = mail02.example.com;mysql_connect_timeout=1
>
> I don't see that option in the default documentation. Where is this
> documented?

It's not an option. It's a hack I made up for this occasion.

sqlgrey uses a "DSN" internally for connecting to mysql. They look something like this:

DBI:mysql:sqlgrey;host=db.example.com;port=3306

And in sqlgrey, $host is just inserted into this DSN, something like this:

DBI:mysql:sqlgrey;host=$host;port=3306

Which is why, if $host = "127.0.0.2;whatever=3", the DSN will contain

DBI:mysql:sqlgrey;host=127.0.0.2;whatever=3;port=3306

and mysql_connect_timeout happens to be an option you can add to the DSN.
So it's just a hack. It's definitely something we should add as an option in a later version.

> So then in my setup, where the master mysql daemon is unavailable,
> each client references their own database? And no updating is
> occurring since they aren't configured as write servers, correct?

sqlgrey will default to "accept all mail" when the master is unavailable. So there is no need to read anything until the master is back online.

> > In my tests, this solves the issue, because postfix doesn't have to
> > time out the connection to sqlgrey and everything remains shiny.
> >
> > (shiny = "mails will pass through unhindered while the sql-server
> > is down")
>
> So postfix was always waiting patiently enough; it was sqlgrey that
> was responding with failure too quickly?

No, the other way around. sqlgrey may take 3 minutes to get a timeout from its mysql connect(). But postfix "ain't got time for that" and disconnects already after, e.g., 100 seconds. So sqlgrey is too slow to respond to postfix and postfix just disconnects. And THAT'S why you get "Server configuration problem".

> > "db_cluster=on" is only needed if the 1 SQL server can't service all
> > your mail servers fast enough.
> >
> > (I'm not saying that you're doing it wrong, I'm just pointing out the
> > different motivations.)
>
> So it's okay to leave it on, correct?

Yes, it's fine.

> Wouldn't this also serve to make it possible for existing entries to
> be queried through the local copies while the master is unavailable?

No. There is no database "high availability" here. If the master dies, all mail is accepted by default.

> > dies. And as such, there is no need to do queries anymore, until
> > the master is online again.
>
> Each client has a local copy of the database, no? And by setting
> read_hosts to contain at least localhost, it should then be able to
> query the local database, no?

In theory, we could query localhost.
But since sqlgrey will fall back to allowing all mails through, it doesn't matter what is in the database, since the mail will go through anyway. And sqlgrey doesn't really work without being able to write, so it's smarter to just accept all mail.

> > >> read_hosts=localhost prepend = 0 optmethod = optout
> > >> discrimination = on
>
> I don't see where these options are defined either.

I see all of them, with comments, in the sample config that comes with sqlgrey-1.8.0. Have a look there and see if not everything is explained.

> > All that happens by setting db_cluster=off is that the slaves won't be
> > used for reads anymore and all read queries will go to the master
> > instead.
>
> Okay, got it. I think I got confused, but I believe I understood it
> correctly, in that when the master is down, the slaves can continue to
> read from their local database. I think it was just the db_cluster
> terminology that I wasn't understanding there.

Yes. In general (non-sqlgrey) cases, when an SQL master is down, the application can still read from the slaves. sqlgrey just doesn't use this, as sqlgrey NEEDS to be able to write.

Hope that answers everything :)

- Dan
|
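The DSN interpolation Dan describes above is easy to see in miniature. sqlgrey itself is Perl and builds the string for DBI, so the Python below is purely illustrative of the string handling; the DSN shapes are taken from Dan's examples:

```python
def build_dsn(db_name: str, host: str, port: int = 3306) -> str:
    # db_host is pasted verbatim into the DBI DSN, so anything after a
    # ';' in db_host is treated by the driver as an extra DSN attribute.
    return f"DBI:mysql:{db_name};host={host};port={port}"

# Normal case:
print(build_dsn("sqlgrey", "db.example.com"))
# DBI:mysql:sqlgrey;host=db.example.com;port=3306

# The "hack": smuggling a connect timeout in via db_host:
print(build_dsn("sqlgrey", "mail02.example.com;mysql_connect_timeout=1"))
# DBI:mysql:sqlgrey;host=mail02.example.com;mysql_connect_timeout=1;port=3306
```

That second DSN is why appending `;mysql_connect_timeout=1` to db_host works without any change to sqlgrey's code.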
From: Alex <mys...@gm...> - 2014-07-06 03:15:41
|
Hi,

> > By hanging, I mean "any network connection or connection attempt that
> > stalls for more than a few seconds".
> >
> > Okay, that's how I understand it, but that's not what's happening here.
>
> All evidence so far points to this explanation, including the fact that
> my timeout fix worked.
> I'm unsure what you base your assumption on that this is not what's
> happening, as the logs won't show you this, and you'd need to do
> something like modifying the sqlgrey code to provide you with debugging
> information, or use telnet/netcat to talk to sqlgrey & postfix.

It's only based on the fact that there is no stalling or any delay here; it happens immediately when sqlgrey isn't running at all. Hopefully I'm just being pedantic here. I just mean that the connection attempt is never successful if sqlgrey isn't running. It should realize this immediately.

> > That's assuming sqlgrey is still around to respond. I need to also
> > consider the possibility where sqlgrey dies.
>
> I've never experienced sqlgrey just dying on me, but if it happens, it
> is postfix that decides what to respond. It cannot be influenced by
> sqlgrey.
> And the error, 451, is a temporary error, so mails will be delivered
> once sqlgrey is running again.

Okay, right, I should have known that's a temporary error. I do know sqlgrey can't control postfix if it's not running, of course.

> > >> db_host = mail02.example.com;mysql_connect_timeout=1
> > I don't see that option in the default documentation. Where is this
> > documented?
>
> It's not an option. It's a hack I made up for this occasion.
>
> sqlgrey uses a "DSN" internally for connecting to mysql. They look
> something like this:
>
> DBI:mysql:sqlgrey;host=db.example.com;port=3306
>
> And in sqlgrey, $host is just inserted into this DSN, something like this:
> DBI:mysql:sqlgrey;host=$host;port=3306
>
> Which is why, if $host = "127.0.0.2;whatever=3", the DSN will contain
> DBI:mysql:sqlgrey;host=127.0.0.2;whatever=3;port=3306
>
> and mysql_connect_timeout happens to be an option you can add to the DSN.
> So it's just a hack. It's definitely something we should add as an
> option in a later version.

Okay, great, got it. It's also nice to hear another version is intended at some point.

> > In my tests, this solves the issue, because postfix doesn't have to
> > time out the connection to sqlgrey and everything remains shiny.
> >
> > (shiny = "mails will pass through unhindered while the sql-server is
> > down")
>
> So postfix was always waiting patiently enough; it was sqlgrey that was
> responding with failure too quickly?
>
> No, the other way around. sqlgrey may take 3 minutes to get a timeout
> from its mysql connect(). But postfix "ain't got time for that" and
> disconnects already after, e.g., 100 seconds. So sqlgrey is too slow to
> respond to postfix and postfix just disconnects. And THAT'S why you get
> "Server configuration problem".

Right, okay. So is the "mysql_connect_timeout=1" instructing sqlgrey to wait for 1s? Or is that just an on/off thing? I'm trying to understand the postfix interaction part. In other words, postfix must have a fixed-length amount of time it waits, since you mentioned it wasn't adjustable. Hardcoded in sqlgrey is something that makes sure it waits fewer units of time than this postfix timeout default, correct?

> > Each client has a local copy of the database, no? And by setting
> > read_hosts to contain at least localhost, it should then be able to
> > query the local database, no?
>
> In theory, we could query localhost. But since sqlgrey will fall back to
> allowing all mails through, it doesn't matter what is in the database,
> since the mail will go through anyway.
>
> And sqlgrey doesn't really work without being able to write, so it's
> smarter to just accept all mail.

Okay, that's a big help.
So although mysql itself replicates the data between each host, sqlgrey isn't designed to read the data from that local host, and it doesn't make sense to do that.

> > >> read_hosts=localhost prepend = 0 optmethod = optout
> > >> discrimination = on
> > I don't see where these options are defined either.
>
> I see all of them, with comments, in the sample config that comes with
> sqlgrey-1.8.0. Have a look there and see if not everything is explained.

I'll have to look again.

> Hope that answers everything :)

Really appreciate all your hard work, both here and in the code. I've learned so much.

Thanks,
Alex
|
From: Dan F. <da...@ha...> - 2014-07-06 11:08:06
|
> It's only based on the fact that there is no stalling or any delay here;
> it happens immediately when sqlgrey isn't running at all.

You are now talking about how postfix reacts to a missing policy daemon (sqlgrey is a postfix policy daemon). As this is not something I or sqlgrey can influence, this is not what I'm talking about at all.

I am ONLY talking about the issue you specified in your original mail, which was (slightly summarized):
- You had "..configured using the DBCLUSTER.."
- and when "..one machine goes down, all three fail.."
- with error "..4.3.5 Server configuration problem.."

And as such, I believe the issue was a mysql connection attempt that took too long. This is now solved by setting the timeout to 1 second. How postfix reacts to a missing policy daemon is completely out of my hands and out of scope. If we were troubleshooting WHY sqlgrey wasn't running at the time, that would be something else entirely. :)

> Right, okay. So is the "mysql_connect_timeout=1" instructing sqlgrey to
> wait for 1s?

Yes. Hence "mysql_connect_timeout=3" would allow it to wait for up to 3 seconds for the SQL server to respond.

> In other words, postfix must have a fixed-length amount of time it
> waits, since you mentioned it wasn't adjustable.

No. It is adjustable through postfix's config option "smtpd_policy_service_timeout". (The thing I said I didn't think could be adjusted was postfix's default reaction to unexpected errors, i.e. having it do something other than "Server configuration problem".)

> Hardcoded in sqlgrey is something that makes sure it waits fewer units
> of time than this postfix timeout default, correct?

No. sqlgrey should never take more than a few seconds to do anything. The fact that we are hitting the "smtpd_policy_service_timeout" is a bug. sqlgrey simply uses the default timeout value for connecting to mysql, which is way too high when something hangs. We've remedied this by setting it to 1 in your case.

> Okay, that's a big help.
> So although mysql itself replicates the data between each host, sqlgrey
> isn't designed to read the data from that local host, and it doesn't
> make sense to do that.

Correct. Under normal operation, localhost is used for reads. When the master is dead, nothing will be read from localhost either.

- Dan
|
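The interplay Dan describes between postfix's smtpd_policy_service_timeout and the MySQL connect timeout reduces to a single comparison. A minimal sketch of that reasoning (the function and the numbers are illustrative, not sqlgrey code):

```python
def postfix_outcome(policy_timeout_s: float, mysql_connect_timeout_s: float) -> str:
    """What postfix ends up answering when sqlgrey's SQL master is unreachable."""
    if mysql_connect_timeout_s >= policy_timeout_s:
        # sqlgrey is still blocked in its MySQL connect() when postfix
        # gives up on the policy request.
        return "451 Server configuration problem"
    # sqlgrey's connect attempt fails first, so it still has time to
    # fall back to "dunno" and the mail passes.
    return "dunno"

# Default MySQL client connect timeouts can run to minutes, while
# postfix defaults to 100s -- hence the original failures:
print(postfix_outcome(100, 180))  # 451 Server configuration problem
print(postfix_outcome(100, 1))    # dunno (the mysql_connect_timeout=1 fix)
```

In words: the inner timeout must be strictly shorter than the outer one, or postfix gives up before sqlgrey can answer.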
From: Karl O. P. <ko...@me...> - 2014-07-06 12:50:28
|
On 07/06/2014 06:07:58 AM, Dan Faerch wrote:
> How Postfix reacts to a missing policy-daemon is completely out of my
> hands and out of scope. If we were troubleshooting WHY sqlgrey wasn't
> running at the time, that would be something else entirely. :)

If someone was really worried about sqlgrey dying, then there's probably a way to run it from inetd. But that just pushes the problem of a dead daemon back to inetd, so the right thing to do is work from inittab. But why? :-)

Karl <ko...@me...>
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
|
From: Dan F. <da...@ha...> - 2014-07-06 17:37:20
|
Karl O. Pinc wrote:
> On 07/06/2014 06:07:58 AM, Dan Faerch wrote:
>
> If someone was really worried about sqlgrey dying, then there's probably
> a way to run it from inetd. But that just pushes the problem of a dead
> daemon back to inetd, so the right thing to do is work from inittab.

Indeed. I had issues with postgrey 10+ years ago, before I switched to sqlgrey. And I had based an internal policy daemon upon that codebase as well, which then experienced the same problem, and I simply couldn't track down the bug. Sometimes they would just stop responding, though they were still running.

I searched a long time for a way to configure the default policy-daemon response in postfix from "defer_if_permit" to "dunno", but found nothing. I even stared at the postfix source for a while, to see if it was in there as an undocumented option. I couldn't find anything to suggest it.

So I ended up creating an ultra-simple "policy-daemon proxy", whose only job was to talk to the real policy server, have a faster timeout, and always report "dunno" if something goes wrong. A really silly hack, and it just underlines why this option should exist in postfix.

Then I went with "sqlgrey" and all my problems disappeared ;)
|
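The core of a fallback proxy like the one Dan describes fits in a few lines of Python. Everything here is hypothetical (names, ports, timeout), and a production version would also need the listening loop toward postfix, daemonization and logging; this sketch shows only the "forward, and answer dunno on any failure" logic:

```python
import socket

def query_policy(request: bytes, host: str, port: int, timeout: float = 2.0) -> bytes:
    """Forward a postfix policy request to the real policy daemon;
    answer "dunno" if it is down, slow, or talking garbage."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            s.sendall(request)
            reply = b""
            # A policy reply ends with an empty line.
            while not reply.endswith(b"\n\n"):
                chunk = s.recv(4096)
                if not chunk:
                    break
                reply += chunk
            if reply.startswith(b"action="):
                return reply
    except OSError:
        pass  # refused, timed out, reset, unreachable, ...
    # Pass no judgment, so postfix lets the mail through.
    return b"action=dunno\n\n"
```

Wired between postfix and the real policy daemon, this turns any backend failure into a harmless "dunno" instead of a "Server configuration problem".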
From: Alex <mys...@gm...> - 2014-07-17 03:52:00
|
Hi Dan,

I'm hoping you can still help me, because I'm still doing something wrong.

> I searched a long time for a way to configure the default
> policy-daemon response in postfix from "defer_if_permit" to "dunno", but
> found nothing. I even stared at the postfix source for a while, to see
> if it was in there as an undocumented option. I couldn't find anything
> to suggest it.
>
> So I ended up creating an ultra-simple "policy-daemon proxy", whose only
> job was to talk to the real policy server, have a faster timeout, and
> always report "dunno" if something goes wrong. A really silly hack, and
> it just underlines why this option should exist in postfix.
>
> Then I went with "sqlgrey" and all my problems disappeared ;)

I did some tests this evening by basically disconnecting the server with the master mysql database, and it caused all mail on the two remaining systems that were still running to bounce with the "4.3.5 Server configuration problem". You mention here that sqlgrey has solved your problems, and I apparently don't understand how you have it configured to no longer reply with a temporary error and somehow bypass the greylisting. The messages aren't queued; they're just rejected, albeit temporarily, but we can't have this single point of failure...

Thanks again for your help.
Alex
|
From: <da...@ha...> - 2014-07-17 10:59:39
|
On 2014-07-17T05:51:53 CEST, Alex wrote:
> I did some tests this evening by basically disconnecting the server
> with the master mysql database, and it caused all mail on the two
> remaining systems that were still running to bounce with the "4.3.5
> Server configuration problem".

If you made the configuration change on all your hosts, I don't know what you are experiencing, and your mail contains no new information, technical or otherwise, to go on. And that, paired with the fact that I'm fairly certain how this works and can see in my tests that it is indeed working as expected, simply makes me unable to come up with guesses as to what's troubling your system.

What I CAN do is show you how to test better, to pinpoint where the issue may lie. The way I tested this manually was by simply "telnetting" to the sqlgrey service and talking to it. That may be a bit cumbersome, so fortunately Michael Ludvig has included a test script in the tarball, simply called "tester.pl".

On my system, a normal run looks like this:
----
$ ./tester.pl --client-ip 10.0.0.1
action=451 Greylisted for 5 minutes (16)
----

By adding "time" to the beginning of the command, we can see how much time it took to complete. So here's a run where the mysql server has downed its interface for just 10 seconds:
----
$ time ./tester.pl --client-ip 10.0.0.1
action=dunno

real 0m3.062s
user 0m0.056s
sys 0m0.004s
----
"action=dunno" means sqlgrey passes no judgment, which in turn means "let it through". This "conclusion" is reached within 3 seconds (you can see that at the line "real 0m3.062s").

And this is an example of sqlgrey not running:
----
$ time ./tester.pl --client-ip 10.0.0.1
Connect failed: IO::Socket::INET: connect: Connection refused
----

Finding out how long postfix will wait is as simple as:
----
$ postconf smtpd_policy_service_timeout
smtpd_policy_service_timeout = 100s
----
In this case, 100s.
When I point my sqlgrey at a server behind a packet-dropping firewall and rerun the test
----
$ time ./tester.pl --client-ip 10.0.0.1
----
I literally had to Ctrl-C manually after ~6 minutes, which is way more than 100s, of course. So THAT would result in "Server configuration problem".

Another thing that could give "Server configuration problem" would be if any garbage output (i.e. an internal error from sqlgrey) were printed to the socket. But even that would be visible by testing like this.

As the predominant theory (and the only theory with a positive test so far) is the timeout theory, I think you'll have to try running this command while you're experiencing the problem. This should help to either prove or disprove that it's a timeout problem, and may even catch any garbage output if that were the case.

- Dan
|
From: Alex <mys...@gm...> - 2014-07-17 22:30:47
|
Hi,

> > I did some tests this evening by basically disconnecting the server
> > with the master mysql database, and it caused all mail on the two
> > remaining systems that were still running to bounce with the "4.3.5
> > Server configuration problem".
>
> If you made the configuration change on all your hosts, I don't know
> what you are experiencing, and your mail contains no new information,
> technical or otherwise, to go on. And that, paired with the fact that
> I'm fairly certain how this works and can see in my tests that it is
> indeed working as expected, simply makes me unable to come up with
> guesses as to what's troubling your system.

The problem is simply that when the server the master SQL database runs on goes down, mail is stopped on all three systems. The two systems that remain running just respond with temporary bounce messages instead of responding with "dunno" or otherwise letting the message through.

> So here's a run where the mysql server has downed its interface for just
> 10 seconds:
> ----
> $ time ./tester.pl --client-ip 10.0.0.1
> action=dunno
>
> real 0m3.062s
> user 0m0.056s
> sys 0m0.004s
> ----
> "action=dunno" means sqlgrey passes no judgment, which in turn means
> "let it through". This "conclusion" is reached within 3 seconds (you can
> see that at the line "real 0m3.062s").

Okay, I did some more testing. Live testing. At first I was surprised to see the systems continued to deliver mail after stopping the master mysqld on mail02 entirely, because I knew I was having some kind of problem. I monitored it for a while, made sure it was actually continuing to deliver mail (which it was), and looked at the tons of sqlgrey logs reporting it couldn't properly communicate with the database. Then, about seven minutes into my testing, sqlgrey quit and died on all three systems:

Jul 17 18:02:22 mail02 sqlgrey: fatal: setconfig error at /usr/sbin/sqlgrey line 195.
Jul 17 18:02:36 mail03 sqlgrey: fatal: setconfig error at /usr/sbin/sqlgrey line 195.
Jul 17 18:03:03 mail01 sqlgrey: fatal: setconfig error at /usr/sbin/sqlgrey line 195.

The testing began here:

Jul 17 17:53:30 mail01 sqlgrey: dbaccess: error: couldn't get now() from DB:
Jul 17 17:53:32 mail02 sqlgrey: dbaccess: error: couldn't get now() from DB:
Jul 17 17:53:33 mail03 sqlgrey: dbaccess: error: couldn't get now() from DB:

Between those times were hundreds of "Server configuration..." postfix errors because sqlgrey had died.

> And this is an example of sqlgrey not running:
> ----
> $ time ./tester.pl --client-ip 10.0.0.1
> Connect failed: IO::Socket::INET: connect: Connection refused
> ----
>
> Finding out how long postfix will wait is as simple as:
> ----
> $ postconf smtpd_policy_service_timeout
> smtpd_policy_service_timeout = 100s
> ----
> In this case, 100s.

I still have mine set for 1s, but I'd like an "indefinite" option for the case where I'm taking the main mysql system down for maintenance, or an unexpected event occurs where I cannot reach the system for an undetermined amount of time. I don't know if my 7m test hit some magic limit or something else happened, but I can test again if necessary, although I'd like your input first.

> When I point my sqlgrey at a server behind a packet-dropping firewall
> and rerun the test
> ----
> $ time ./tester.pl --client-ip 10.0.0.1
> ----
> I literally had to Ctrl-C manually after ~6 minutes, which is way more
> than 100s, of course. So THAT would result in "Server configuration
> problem". Another thing that could give "Server configuration problem"
> would be if any garbage output (i.e. an internal error from sqlgrey)
> were printed to the socket. But even that would be visible by testing
> like this.

So how do we explain it continuing beyond 100s when you've explicitly defined the timeout period to be 100s?

Thanks,
Alex
|