From: Mantis B. T. <no...@bu...> - 2012-06-26 19:55:31
|
The following issue has been REOPENED. ====================================================================== http://bugs.bacula.org/view.php?id=1897 ====================================================================== Reported By: peckel Assigned To: marcovw ====================================================================== Project: bacula Issue ID: 1897 Category: Director Reproducibility: always Severity: block Priority: urgent Status: feedback ====================================================================== Date Submitted: 2012-06-24 21:00 BST Last Modified: 2012-06-26 20:55 BST ====================================================================== Summary: Bacula Release 5.2.9 breaks IPv6 connectivity Description: When I tried to connect to my bacula-dir with bconsole this morning, I noticed that the connection could not be established, session hangs for a long time. All my bacula instances are running IPv6 only. I recently upgraded via yum from 5.2.6 to 5.2.9, at which time the error must have been introduced. Not only the connection from bconsole to bacula-dir is affected, but the bacula-dir also cannot access bacula-fd running on 5.2.9 - same problem, connection cannot be etsablished. Every time the bacula-dir is restarted, an error message is sent via mail and to the error log: 24-Jun 20:25 orcus-dir: Warning: Cannot bind port 9101: ERR=Address already in use: Retrying ... The listener port is set up correctly and a connection can be established, the daemon (bacula-dir and bacula-fd) just doesn't answer. Steps to Reproduce: - Install bacula 5.2.9 (complete bacula-postgresql package, fd, sd, dir and bconsole from the CentOS 5 repo at Dass-IT) - Configure the bacula-dir and bconsole to use IPv6 only: bacula-dir.conf: Director { Name = orcus-dir DirAddresses = { ipv6 = { addr = 2001:16e0:101:c00::b00:0; port = 9101 } } QueryFile = "/usr/lib64/bacula/query.sql" WorkingDirectory = "/var/lib/bacula" PidDirectory = "/var/run" Maximum Concurrent Jobs = 1 Password = "xxxx" Messages = daemon } ... bconsole.conf: Director { Name = orcus-dir DIRport = 9101 address = orcus.hindenburgring.com Password = "=MkJeUo02O.QNY_W" } ... Try to connect. Double-checked: - With 5.2.6 and the same configuration files, connection works without a problem. - Reconfiguration to use IPv4 also 'fixes' the problem. Additional Information: The problem could be reproduced with bacula-fd as well, the 5.2.9 FDs fail, 5.2.6 works fine. Some of the FDs are running on CentOS 6.2, Ubuntu 10.04 LTS and Mac OS X 10.7, Director and Storage Daemon on CentOS 5.8. The problem only occurs on 5.2.9 with IPv6. A pointer to bacula-postgresql RPMs for Version 5.2.6 would be extremely welcome. At the moment, my backup system is down. ====================================================================== ---------------------------------------------------------------------- (0006409) kern (administrator) - 2012-06-25 07:39 http://bugs.bacula.org/view.php?id=1897#c6409 ---------------------------------------------------------------------- I am sorry to hear you are having problems. There were changes in the code resolver code for IP addresses between 5.2.6 and 5.2.9, and I have asked the author to take a look at it. We will see. Sometime this later this week we will be releasing version 5.2.10, and either we will have a solution or we will back out the changes. The major problem for us is that we don't run IPv6 so we don't test it. If you have some pointers on how to install it and make it co-exist with IPv4, please let me know as it may avoid future problems. Concerning your installation, possible solutions: 1. Wait until we put out a new version and find rpms on the net, as it is unlikely CentOS will build a new version in the near future. 2. See with CentOS about installing older rpms. 3. Find older rpms on the net for your OS. 4. Download the 5.2.6 source from Source Forge and build the rpms yourself (using either the community furnished spec files or the spec files from 5.2.9) 5. Build new binaries from 5.2.6 source. 6. Get a Bacula Systems subscription, where we don't generally have these problems, and will supply new binaries in the case that we do -- of course this is a paying solution ... ---------------------------------------------------------------------- (0006410) marcovw (developer) - 2012-06-25 09:41 http://bugs.bacula.org/view.php?id=1897#c6410 ---------------------------------------------------------------------- The 24-Jun 20:25 orcus-dir: Warning: Cannot bind port 9101: ERR=Address already in use: Retrying ... Seems to indicate that you have duplicate addresses in your config of something is resolving to duplicate addresses. When the cannot bind happens it keeps on retrying and the daemon never gets ready for service so that might explain why you cannot connect. The fist steps would be the obvious ones: - output from bacula-dir -d 100 to have some debugging on what it wants to bind too etc. - output of bconsole -d 100 to see if the bconsole tries to connect to the correct instance. - netstat -an | grep 9101 to see if it binds at all and at what address. We can take it from there. ---------------------------------------------------------------------- (0006411) peckel (reporter) - 2012-06-25 10:58 http://bugs.bacula.org/view.php?id=1897#c6411 ---------------------------------------------------------------------- There is no duplicate address involved, though the message might indicate so. After a downgrade to 5.2.6 (I did this after opening the change, as you suggested later on :-)), everything works fine without further changes. There seems to be a problem in the IP socket allocation code for IPv6 that was introduced between 5.2.6 and 5.2.9. What's interesting as well is that the bacula-dir (and the bacula-fd, as far as I can see) open the listener port correctly as indicated by netstat, but don't send out any packages through it. I can connect via telnet to port 9101, but there is never any answer. I am using the repositories linked here <http://www.dass-it.de/news/download/dassit-bacula-rpm-repositories>, specifically <http://download.opensuse.org/repositories/home:dassit:bacula/CentOS_CentOS-5/home:dassit:bacula.repo> and <http://download.opensuse.org/repositories/home:dassit:bacula/CentOS_CentOS-6/home:dassit:bacula.repo> As for enabling IPv6, there is nothing much to be done on a recent operating system. For example, CentOS 5/6 already have IPv6 enabled by default, as you can see in ifconfig: eth0 Link encap:Ethernet HWaddr 00:16:35:06:16:BD inet addr:192.168.42.11 Bcast:192.168.42.255 Mask:255.255.255.0 inet6 addr: fe80::216:35ff:fe06:16bd/64 Scope:Link inet6 addr: 2001:16e0:101:c00::b00:0/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:407412022 errors:0 dropped:0 overruns:0 frame:0 TX packets:277820712 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:473186707500 (440.6 GiB) TX bytes:463697439671 (431.8 GiB) If you did not enable IPv6 on the routers inside your network, you will probably not see a 'Scope:Global' address. However, a link-local address (starting with FE80::, marked 'Scope:Link') will always be available, and that should be enough for testing. Just edit the bacula-dir config so that the 'DIRaddresses' instance matches mine (using your link local address instead of my global one, of course), and modify the bconsole-Setup's 'Director' instance to look like this: Director { Name = your-dir DIRport = 9101 address = fe80::216:35ff:fe06:16bd # change to your link-local address! Password = "your-password" } After restarting the director, you should see the link-local address as the listener address for the director with 'netstat -anTtup', like that: tcp 0 0 fe80::216:35ff:fe06:16bd:9101 :::* LISTEN 27685/bacula-dir I did not get a debug level 100 log, but I have a higher level one ... I'll upload it in a second. ---------------------------------------------------------------------- (0006412) peckel (reporter) - 2012-06-25 11:13 http://bugs.bacula.org/view.php?id=1897#c6412 ---------------------------------------------------------------------- Started the 5.2.9 bacula-dir again for testing. netstat: tcp 0 0 2001:16e0:101:c00::b00:0:9101 :::* LISTEN 4063/bacula-dir debug log for bconsole, and another one for bacula-dir will follow shortly. ---------------------------------------------------------------------- (0006413) peckel (reporter) - 2012-06-25 11:32 http://bugs.bacula.org/view.php?id=1897#c6413 ---------------------------------------------------------------------- I ran another test with the 5.2.6 and the 5.2.9 bacula-dir, logs are attached. The bconsole was always the 5.2.9 one, the client side of the connection obviously does not have any issues. What strikes me is that 5.2.9 seems to see the same IPv6 address three times, while 5.2.6 (correctly) only sees one address: 5.2.9: orcus-dir: bnet_server.c:95-0 Addresses host[ipv6:2001:16e0:101:c00::b00:0:9101] host[ipv6:2001:16e0:101:c00::b00:0:9101] host[ipv6:2001:16e0:101:c00::b00:0:9101] 5.2.6: orcus-dir: bnet_server.c:95-0 Addresses host[ipv6:2001:16e0:101:c00::b00:0:9101] So I'd suspect the problem occurs somewhere in bnet_server.c, which would also explain that all servers (dir, fd, and probably sd as well) seem to have the same issue with IPv6. ---------------------------------------------------------------------- (0006414) marcovw (developer) - 2012-06-25 12:06 http://bugs.bacula.org/view.php?id=1897#c6414 ---------------------------------------------------------------------- Ok looks quite simple indead we just need a duplicate supression in the code which remove duplicates from the list. For whatever reason the new getaddrinfo code seems to find duplicate entries and as such it tries to listen multiple times on the same address which isn't going to work. As it hanging in bnet_server its kind of expected that you don't get a response. ---------------------------------------------------------------------- (0006415) marcovw (developer) - 2012-06-25 20:34 http://bugs.bacula.org/view.php?id=1897#c6415 ---------------------------------------------------------------------- Attached is a patch against 5.2.9 that does address suppression on both listening sockets and outgoing sockets. When I defined 2 times the same ipv6 address on the old code it would give me 6 addresses and the new one only leaves one. So if you can test this code that would be nice and we can take it from there. ---------------------------------------------------------------------- (0006416) peckel (reporter) - 2012-06-25 21:43 http://bugs.bacula.org/view.php?id=1897#c6416 ---------------------------------------------------------------------- Hi Marco, I just applied the patch to the 5.2.9 distribution and recompiled using the exact same options I used for the working 5.2.6 installation (except that I used a different directory so I'll be able to select which release to run). configure/make/make install went smoothly. Started all three daemons from the test install directory, all come up as expected: bacula 9851 1 0 22:35 ? 00:00:00 /usr/test/sbin/bacula-dir -u bacula -g bacula -c /etc/bacula/bacula-dir.conf bacula 9873 1 0 22:35 ? 00:00:00 /usr/test/sbin/bacula-sd -u bacula -g bacula -c /etc/bacula/bacula-sd.conf root 9894 1 0 22:35 ? 00:00:00 /usr/test/sbin/bacula-fd -u root -g bacula -c /etc/bacula/bacula-fd.conf Started bconsole, which connected to bacula-dir just fine. 'status storage' and 'status client' (which are both configured to connect to the local daemons, i.e. the patched 5.2.9 executables) work fine as well. So from a practical point of view the patch fixes the problem. I would be very interested to know why getaddrinfo returns the same address multiple times, though, which seems to be the root cause of the whole trouble. Anyway: Great job, thanks a lot! I assume the fix will be in 5.2.10? ---------------------------------------------------------------------- (0006417) marcovw (developer) - 2012-06-25 21:49 http://bugs.bacula.org/view.php?id=1897#c6417 ---------------------------------------------------------------------- Yes the root cause is something I plan on looking at when I have the time with a debugger. This solution also fixes a problem we have had for a long time e.g. when you defined multiple listeners with resolved to the same address. And yes you shouldn't do that but if it happens it kind of nice if the daemon just ignores duplicate entries and suppresses those. And yes this will be part of 5.2.10 I will at least commit this patch and if I find something debugging the root cause an extra patch may follow but for now we at least we work around any problems and a smarter fix may follow but this safeguard will stay in place. ---------------------------------------------------------------------- (0006418) peckel (reporter) - 2012-06-26 09:52 http://bugs.bacula.org/view.php?id=1897#c6418 ---------------------------------------------------------------------- Since the solution also fixes another problem at least there was some benefit from it :-) Just for the record: I built new 5.2.9 RPMs for CentOS 5 and CentOS 6 and distributed them to my server and my client machines, and all backups now run well again. With Mac OS X and Ubuntu it wasn't a problem to get the older versions which work as well. I'm looking forward to 5.2.10 anyway, since I'm also bitten by the 'catalog backup' bug 1893. ---------------------------------------------------------------------- (0006423) marcovw (developer) - 2012-06-26 18:32 http://bugs.bacula.org/view.php?id=1897#c6423 ---------------------------------------------------------------------- I will look into the root cause later but as we want to cut a new release I'm closing this bug report. ---------------------------------------------------------------------- (0006424) marcovw (developer) - 2012-06-26 20:55 http://bugs.bacula.org/view.php?id=1897#c6424 ---------------------------------------------------------------------- Also found the real problem. Needed to be more explicit in the hints for getaddrinfo and only say we are interested in SOCK_STREAM socket types and IPPROTO_TCP as protocol. If you don't do that you get - SOCK_DGRAM/UDP - SOCK_STREAM/TCP - SOCK_RAW/NONE which explains that the address is 3 times in the list. But the bug was even deeper we call inet_pton and check for a return value > 1 in bnet_host2ipaddr but we should test there for == 1 as a return value of 1 means conversion of ipv6 address is successfull. Then we don't call resolv_host at all which did the wrong getaddrinfo call. Its much nicer if you give a plain ipv6 address that we just convert it and don't try to resolve it at all. That is how is was designed only it wasn't working properly. So in the end we ended up with 3 patches which fix 3 problems so not to bad in the end for something that looked rather obvious. Issue History Date Modified Username Field Change ====================================================================== 2012-06-24 21:00 peckel New Issue 2012-06-25 07:39 kern Note Added: 0006409 2012-06-25 07:39 kern Status new => acknowledged 2012-06-25 09:41 marcovw Note Added: 0006410 2012-06-25 10:24 marcovw Assigned To => marcovw 2012-06-25 10:24 marcovw Status acknowledged => assigned 2012-06-25 10:58 peckel Note Added: 0006411 2012-06-25 11:00 peckel File Added: bacula-debug.log.gz 2012-06-25 11:13 peckel Note Added: 0006412 2012-06-25 11:27 peckel File Added: bacula-bconsole-5.2.6-debug100.log.gz 2012-06-25 11:28 peckel File Added: bacula-dir-5.2.6-debug100.log.gz 2012-06-25 11:28 peckel File Added: bacula-bconsole-5.2.9-debug100.log.gz 2012-06-25 11:28 peckel File Added: bacula-dir-5.2.9-debug100.log.gz 2012-06-25 11:32 peckel Note Added: 0006413 2012-06-25 12:06 marcovw Note Added: 0006414 2012-06-25 20:33 marcovw File Added: 0001-First-try-at-fixing-bug-1897.patch 2012-06-25 20:34 marcovw Note Added: 0006415 2012-06-25 20:35 marcovw Status assigned => feedback 2012-06-25 21:43 peckel Note Added: 0006416 2012-06-25 21:43 peckel Status feedback => assigned 2012-06-25 21:49 marcovw Note Added: 0006417 2012-06-26 08:35 peckel Note Added: 0006418 2012-06-26 09:52 peckel Note Edited: 0006418 2012-06-26 18:32 marcovw Note Added: 0006423 2012-06-26 18:32 marcovw Assigned To marcovw => 2012-06-26 18:32 marcovw Status assigned => closed 2012-06-26 18:32 marcovw Resolution open => fixed 2012-06-26 18:32 marcovw Fixed in Version => 5.2.10 2012-06-26 20:55 marcovw Assigned To => marcovw 2012-06-26 20:55 marcovw Note Added: 0006424 2012-06-26 20:55 marcovw Status closed => feedback 2012-06-26 20:55 marcovw Resolution fixed => reopened ====================================================================== |