From: Mantis B. T. <no...@bu...> - 2011-04-12 21:14:16
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=1612 ====================================================================== Reported By: akoch Assigned To: ebollengier ====================================================================== Project: bacula Issue ID: 1612 Category: Director Reproducibility: always Severity: minor Priority: normal Status: feedback ====================================================================== Date Submitted: 2010-07-20 11:06 GMT Last Modified: 2011-04-12 17:53 GMT ====================================================================== Summary: Bacula does not save a checksum for hard links during full backup, but expects one during accurate differential backups Description: 1) Bacula does not save a checksum for hard links during a full backup. E.g., here's the output of FD with debug value 400 for a hard link: /usr/share/postgresql-8.4/man/man7/table.7.bz2 gundabad-fd: backup.c:326-32 FT_LNKSAVED hard link: /usr/share/postgresql-8.4/man/man7/table.7.bz2 => /usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: backup.c:424-32 bfiled: sending /usr/share/postgresql-8.4/man/man7/table.7.bz2 to stored gundabad-fd: backup.c:1102-32 encode_and_send_attrs fname=/usr/share/postgresql-8.4/man/man7/table.7.bz2 gundabad-fd: backup.c:1114-32 File /usr/share/postgresql-8.4/man/man7/table.7.bz2 attribs=PwA Dky9 IGk C A A A 6 BAA I BMBmj+ BMBmj+ BMBmkF BW A C attribsEx= gundabad-fd: backup.c:1133-32 >stored: attrhdr 138 1 0 gundabad-fd: backup.c:1226-32 No strip for /usr/share/postgresql-8.4/man/man7/table.7.bz2 gundabad-fd: backup.c:1151-32 Link /usr/share/postgresql-8.4/man/man7/table.7.bz2 to /usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: backup.c:1167-32 >stored: attr len=157: 138 1 /usr/share/postgresql-8.4/man/man7/table.7.bz2 gundabad-fd: backup.c:532-32 do_read=0 gundabad-fd: find_one.c:480-32 FT_LNKSAVED FI=138 LinkFI=86 file=/usr/share/postgresql-8.4/man/man7/with.7.bz2 Note the absence of the crypto_digest line. 2) The checksum _is_ saved for the regular file pointed to by the hard link: gundabad-fd: find_one.c:357-32 File ----: /usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: find_one.c:494-32 added to hash FI=85 file=/usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: backup.c:333-32 FT_REG saving: /usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: backup.c:424-32 bfiled: sending /usr/share/postgresql-8.4/man/man7/with.7.bz2 to stored gundabad-fd: crypto.c:607-32 crypto_digest_new jcr=207b978 gundabad-fd: backup.c:1102-32 encode_and_send_attrs fname=/usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: backup.c:1114-32 File /usr/share/postgresql-8.4/man/man7/with.7.bz2 attribs=PwA Dky9 IGk C A A A 6 BAA I BMBmj+ BMBmj+ BMBmkF A A C attribsEx= gundabad-fd: backup.c:1133-32 >stored: attrhdr 86 1 0 gundabad-fd: backup.c:1226-32 No strip for /usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: backup.c:1167-32 >stored: attr len=109: 86 3 /usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: backup.c:532-32 do_read=1 gundabad-fd: bfile.c:900-32 open file /usr/share/postgresql-8.4/man/man7/with.7.bz2 gundabad-fd: bfile.c:924-32 Open file 7 gundabad-fd: backup.c:786-32 Saving data, type=3 gundabad-fd: backup.c:863-32 >stored: datahdr 86 2 0 gundabad-fd: backup.c:1023-32 Send data to SD len=58 gundabad-fd: bfile.c:965-32 Close file 7 gundabad-fd: backup.c:715-32 bfiled>stored:header 86 10 0 gundabad-fd: find_one.c:518-32 FT_REG FI=86 linked=1 file=/usr/share/postgresql-8.4/man/man7/with.7.bz2 3) When doing a Differential Backup in Accurate pins1 mode building on that Full backup, Bacula does expect a checksum to be present for hard links. Since that checksum was never stored in the first place, all hard links cause warnings: Warning: Can't verify checksum for (linkname...) Steps to Reproduce: 1) Do a full backup of a FileSet containing one or more hard links. 2) Do an Accurate Differential of that same FileSet. All hard links in the set will cause the warning due to missing checksum. ====================================================================== ---------------------------------------------------------------------- (0005582) ebollengier (administrator) - 2010-08-31 13:24 http://bugs.bacula.org/view.php?id=1612#c5582 ---------------------------------------------------------------------- We need to add the checksum in the hardlink node and send it like with other files. ---------------------------------------------------------------------- (0005762) ebollengier (administrator) - 2011-02-19 14:23 http://bugs.bacula.org/view.php?id=1612#c5762 ---------------------------------------------------------------------- Can you try the proposed patch and give us your feedbacks ? ---------------------------------------------------------------------- (0005764) bart613 (reporter) - 2011-02-20 10:15 http://bugs.bacula.org/view.php?id=1612#c5764 ---------------------------------------------------------------------- For me, after applying attached patch on client, director and storage daemon, it caused client to fail and eat up all memory causing OOM killer to kick-in and kill everything around. After downgrading to version without this patch everything went back to normal. I can replicate this as many times as you want as well as I can provide SRPM/RPM with attached patch. If you need me to perform any other operations let me know - I will happily do. ---------------------------------------------------------------------- (0005773) ebollengier (administrator) - 2011-02-24 19:50 http://bugs.bacula.org/view.php?id=1612#c5773 ---------------------------------------------------------------------- A bit strange, I will probably simplify the patch and just skip the message. ---------------------------------------------------------------------- (0005831) bart613 (reporter) - 2011-04-11 10:08 http://bugs.bacula.org/view.php?id=1612#c5831 ---------------------------------------------------------------------- This bug is becoming very annoying. Especially when I am trying to read Bacula daily backups report emails -- have to scroll down every single one through several hundreds lines to get to the interesting bit. As I mentioned before, I tested the patch on three installations, and it caused OOM killer on all three. Is there anything I could do to help you nail down the problem? ---------------------------------------------------------------------- (0005832) ebollengier (administrator) - 2011-04-11 13:01 http://bugs.bacula.org/view.php?id=1612#c5832 ---------------------------------------------------------------------- I would like also fix this problem before the 5.2 release, but I'm running a bit out of time nowdays. We have two choices, the first one is to fix the current patch, which is pretty good I think, the second solution is to discard the warning message. I think that you can add an extra check when storing the checksum in the hardlink struct and run your tests again. in find_one.c:ff_pkt_set_digest() - if (ff_pkt->linked) + if (ff_pkt->linked && (ff_pkt->linked->digest == NULL)) ---------------------------------------------------------------------- (0005833) bart613 (reporter) - 2011-04-11 14:15 http://bugs.bacula.org/view.php?id=1612#c5833 ---------------------------------------------------------------------- Same result -- when I tried to either run Full or Incremental backup, bacula-fd is using almost all memory and server ending up with OOM killer rushing around. Shortly before server ran out of memory: [...] root 16036 2.5 89.9 2825240 942896 ? Ssl 17:53 0:59 /usr/sbin/bacula-fd -u root -g bacula -c /etc/bacula/bacula-fd.conf Tried to strace bacula-fd during this time but can't see anything interesting there: # strace -tt -f -p 16036 Process 16036 attached with 4 threads - interrupt to quit [pid 16169] 18:32:46.345833 select(6, [5], NULL, NULL, {1, 0} <unfinished ...> [pid 16037] 18:32:46.354453 clock_gettime(CLOCK_REALTIME, <unfinished ...> [pid 16036] 18:32:46.354554 futex(0x2b21448df380, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 16037] 18:32:46.354653 <... clock_gettime resumed> {1302543166, 354506000}) = 0 [pid 16037] 18:32:46.354722 futex(0x2b21448dfe44, FUTEX_WAIT_PRIVATE, 389, {13, 105518000} <unfinished ...> [pid 16169] 18:32:47.353145 <... select resumed> ) = 0 (Timeout) [pid 16169] 18:32:47.353393 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) [pid 16169] 18:32:52.353624 select(6, [5], NULL, NULL, {5, 0} <unfinished ...> Ready for further testing and feeding back :-) ---------------------------------------------------------------------- (0005835) bart613 (reporter) - 2011-04-11 14:19 http://bugs.bacula.org/view.php?id=1612#c5835 ---------------------------------------------------------------------- Also, more to add, when I killed bacula-fd I got following as a backup job report: 11-Apr 17:56 some.host.com-dir JobId 925: Start Backup JobId 925, Job=donkey_FS.2011-04-11_17.56.23_05 11-Apr 17:56 some.host.com-dir JobId 925: Using Device "olartek-FileStorage" 11-Apr 17:56 some.host.com JobId 925: Volume "olartek-0026" previously written, moving to end of data. 11-Apr 17:56 some.host.com JobId 925: Error: Bacula cannot write on disk Volume "olartek-0026" because: The sizes do not match! Volume=22855298947 Catalog=22852718467 11-Apr 17:56 some.host.com JobId 925: Marking Volume "olartek-0026" in Error in Catalog. 11-Apr 17:56 some.host.com-dir JobId 925: Created new Volume "olartek-0029" in catalog. 11-Apr 17:56 some.host.com JobId 925: Labeled new Volume "olartek-0029" on device "olartek-FileStorage" (/var/lib/bacula/storage/). 11-Apr 17:56 some.host.com JobId 925: Wrote label to prelabeled Volume "olartek-0029" on device "olartek-FileStorage" (/var/lib/bacula/storage/) 11-Apr 18:33 some.host.com JobId 925: Fatal error: append.c:242 Network error reading from FD. ERR=No data available 11-Apr 18:33 some.host.com-dir JobId 925: Fatal error: Network error with FD during Backup: ERR=No data available 11-Apr 18:33 some.host.com JobId 925: Job write elapsed time = 00:36:35, Transfer rate = 36.60 K Bytes/second 11-Apr 18:33 some.host.com-dir JobId 925: Fatal error: No Job status returned from FD. 11-Apr 18:33 some.host.com-dir JobId 925: Error: Bacula some.host.com-dir 5.0.3 (04Aug10): 11-Apr-2011 18:33:04 Build OS: x86_64-redhat-linux-gnu redhat JobId: 925 Job: donkey_FS.2011-04-11_17.56.23_05 Backup Level: Full Client: "some.host.com-fd" 5.0.3 (04Aug10) x86_64-redhat-linux-gnu,redhat, FileSet: "donkey FileSet" 2011-01-29 02:05:00 Pool: "olartek-polka Pool" (From Job resource) Catalog: "PWS" (From Client resource) Storage: "olartek-polka" (From Pool resource) Scheduled time: 11-Apr-2011 17:56:19 Start time: 11-Apr-2011 17:56:26 End time: 11-Apr-2011 18:33:04 Elapsed time: 36 mins 38 secs Priority: 10 FD Files Written: 0 SD Files Written: 0 FD Bytes Written: 0 (0 B) SD Bytes Written: 0 (0 B) Rate: 0.0 KB/s Software Compression: None VSS: no Encryption: no Accurate: yes Volume name(s): olartek-0029 Volume Session Id: 54 Volume Session Time: 1302262981 Last Volume Bytes: 215 (215 B) Non-fatal FD errors: 0 SD Errors: 0 FD termination status: Error SD termination status: Running Termination: *** Backup Error *** ---------------------------------------------------------------------- (0005836) ebollengier (administrator) - 2011-04-11 15:24 http://bugs.bacula.org/view.php?id=1612#c5836 ---------------------------------------------------------------------- How many hardlinks do you have on your system? It's hard to believe that a single strdup of 10 checksums can cause a out of memory. If you have millions of them, we will have to find an other solution. It would be nice to test with a small set of files. My tests didn't show any orphan buffers. I can provide you a patch for the current 5.1 release. ---------------------------------------------------------------------- (0005837) bart613 (reporter) - 2011-04-12 07:50 http://bugs.bacula.org/view.php?id=1612#c5837 ---------------------------------------------------------------------- There is probably around 8.000. Most of the usual hardlinks that are in pretty much every single CentOS 5 installation these days. It rather seems like something is dead-looping somewhere at least it uses all the memory. If you provide patch for 5.1 release I will upgrade and double-check, perhaps it will nicely resolve itself there. ---------------------------------------------------------------------- (0005838) ebollengier (administrator) - 2011-04-12 12:59 http://bugs.bacula.org/view.php?id=1612#c5838 ---------------------------------------------------------------------- I know why my patch doesn't work, the digest string is a binary, so using strdup() is not a good idea with it. I have to figure how to change that. ---------------------------------------------------------------------- (0005839) ebollengier (administrator) - 2011-04-12 15:46 http://bugs.bacula.org/view.php?id=1612#c5839 ---------------------------------------------------------------------- The new version of the patch should work as expected now. ---------------------------------------------------------------------- (0005840) bart613 (reporter) - 2011-04-12 17:53 http://bugs.bacula.org/view.php?id=1612#c5840 ---------------------------------------------------------------------- Is this one for 5.0.3 or for 5.1? It does not seem to apply to 5.0.3 cleanly... Issue History Date Modified Username Field Change ====================================================================== 2010-07-20 11:06 akoch New Issue 2010-07-20 13:04 mnalis Issue Monitored: mnalis 2010-08-31 13:24 ebollengier Note Added: 0005582 2010-08-31 13:24 ebollengier Status new => acknowledged 2011-02-13 10:45 bart613 Issue Monitored: bart613 2011-02-19 14:22 ebollengier File Added: 0001-Fix-1612-about-checksum-for-hardlinks.patch 2011-02-19 14:23 ebollengier Note Added: 0005762 2011-02-19 14:23 ebollengier Assigned To => ebollengier 2011-02-19 14:23 ebollengier Status acknowledged => feedback 2011-02-20 10:15 bart613 Note Added: 0005764 2011-02-24 19:50 ebollengier Note Added: 0005773 2011-04-11 10:08 bart613 Note Added: 0005831 2011-04-11 13:01 ebollengier Note Added: 0005832 2011-04-11 14:15 bart613 Note Added: 0005833 2011-04-11 14:18 bart613 Note Added: 0005834 2011-04-11 14:18 bart613 Note Deleted: 0005834 2011-04-11 14:19 bart613 Note Added: 0005835 2011-04-11 15:25 ebollengier Note Added: 0005836 2011-04-11 15:26 ebollengier File Deleted: 0001-Fix-1612-about-checksum-for-hardlinks.patch 2011-04-11 15:26 ebollengier File Added: 0001-Fix-1612-about-checksum-for-hardlinks.patch 2011-04-12 07:50 bart613 Note Added: 0005837 2011-04-12 12:59 ebollengier Note Added: 0005838 2011-04-12 15:45 ebollengier File Deleted: 0001-Fix-1612-about-checksum-for-hardlinks.patch 2011-04-12 15:46 ebollengier File Added: 0001-Fix-1612-about-checksum-for-hardlinks.patch 2011-04-12 15:46 ebollengier Note Added: 0005839 2011-04-12 17:53 bart613 Note Added: 0005840 ====================================================================== |