From: Mantis B. T. <no...@bu...> - 2009-02-20 13:38:11
|
The following issue has been REOPENED. ====================================================================== http://bugs.bacula.org/view.php?id=1061 ====================================================================== Reported By: RalfGross Assigned To: ====================================================================== Project: bacula Issue ID: 1061 Category: Director Reproducibility: sometimes Severity: minor Priority: normal Status: feedback ====================================================================== Date Submitted: 2008-03-04 07:40 UTC Last Modified: 2009-02-20 11:39 UTC ====================================================================== Summary: Fatal error: bsock.c: 415 Packet size too big Description: Lately I get these "Packet size too big" errors more often. Until now it only happens during verify jobs. A second verify job completes mostly without an error. I use the Client option to do the verify job on the director/sd, so there shouldn't be any networking be involved. SMTCZB0003 = fd VUMEM004-fd = director, fd and sd Job { Name = VerifySMTCZB0003 Type = Verify Level = VolumeToCatalog Verify Job = SMTCZB0003 FileSet = SMTCZB0003 Schedule = "Verify Regular Backup SMTCZB0003" Client = VUMEM004-fd Messages = "Regular Backup" Pool = Default Storage = NEC-T40A Priority = 10 } http://bugs.bacula.org/view.php?id=4-Mär 07:05 VUMEM004-dir JobId 1394: Verifying against JobId=1384 Job=SMTCZB0003.2008-03-02_00.05.09 04-Mär 07:05 VUMEM004-dir JobId 1394: Bootstrap records written to /opt/bacula/var/bacula/working/VUMEM004-dir.restore.1.bsr 04-Mär 07:05 VUMEM004-dir JobId 1394: 04-Mär 07:05 VUMEM004-dir JobId 1394: The job will require the following Volume(s) Storage(s) SD Device(s) =========================================================================== 04-Mär 07:05 VUMEM004-dir JobId 1394: 04-Mär 07:05 VUMEM004-dir JobId 1394: 06D135L3 NEC-T40A NEC-T40A 04-Mär 07:05 VUMEM004-dir JobId 1394: 04-Mär 07:05 VUMEM004-dir JobId 1394: Start Verify JobId=1394 Level=VolumeToCatalog Job=VerifySMTCZB0003.2008-03-04_07.05.08 04-Mär 07:05 VUMEM004-dir JobId 1394: Using Device "LTO3" 04-Mär 07:05 VUMEM004-sd JobId 1394: 3307 Issuing autochanger "unload slot 4, drive 0" command. 04-Mär 07:07 VUMEM004-sd JobId 1394: 3304 Issuing autochanger "load slot 16, drive 0" command. 04-Mär 07:08 VUMEM004-sd JobId 1394: 3305 Autochanger "load slot 16, drive 0", status is OK. 04-Mär 07:08 VUMEM004-sd JobId 1394: 3301 Issuing autochanger "loaded? drive 0" command. 04-Mär 07:08 VUMEM004-sd JobId 1394: 3302 Autochanger "loaded? drive 0", result is Slot 16. 04-Mär 07:08 VUMEM004-sd JobId 1394: Ready to read from volume "06D135L3" on device "LTO3" (/dev/ULTRIUM-TD3). 04-Mär 07:08 VUMEM004-sd JobId 1394: Forward spacing Volume "06D135L3" to file:block 0:1. 04-Mär 07:08 VUMEM004-sd JobId 1394: End of file 1 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D135L3" 04-Mär 07:09 VUMEM004-sd JobId 1394: End of file 2 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D135L3" 04-Mär 07:10 VUMEM004-sd JobId 1394: End of file 3 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D135L3" 04-Mär 07:10 VUMEM004-sd JobId 1394: End of file 4 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D135L3" 04-Mär 07:12 VUMEM004-sd JobId 1394: End of file 5 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D135L3" 04-Mär 07:12 VUMEM004-sd JobId 1394: End of file 6 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D135L3" 04-Mär 07:16 VUMEM004-sd JobId 1394: End of file 7 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D135L3" 04-Mär 07:16 VUMEM004-dir JobId 1394: Fatal error: bsock.c:415 Packet size too big from "Client: VUMEM004-fd:10.60.1.231:9102. Terminating connection. 04-Mär 07:16 VUMEM004-dir JobId 1394: The following files are in the Catalog but not on the Volume(s): [...] I asked on the user list what the source of this error message could be, but got no response yet. What can I do find the source of the problem? Run the fd with -d100 or should I increase the directors debug level? ====================================================================== ---------------------------------------------------------------------- (0003183) kern (administrator) - 2008-03-04 18:23 http://bugs.bacula.org/view.php?id=1061#c3183 ---------------------------------------------------------------------- I suspect that you have some other program writing to the Director's port -- most likely a HP printer on the network, or possibly you are running two simultaneous Verify jobs for the job, which is a feature that is not supported. However, just to check, please upload your bacula-dir.conf and bacula-fd.conf files. ---------------------------------------------------------------------- (0003187) RalfGross (reporter) - 2008-03-04 20:20 http://bugs.bacula.org/view.php?id=1061#c3187 ---------------------------------------------------------------------- Just to be clear, all involved bacula daemons are running on the same server (dir, sd and fd). I've split my conf files into parts (clients, storage, pools etc), I'll attach them tomorrow morning as tar file. ---------------------------------------------------------------------- (0003190) kern (administrator) - 2008-03-05 06:56 http://bugs.bacula.org/view.php?id=1061#c3190 ---------------------------------------------------------------------- Yes, I noticed that everything is running on one machine, otherwise I would have told you that the most probable problem is a bad network card or network driver. Just the same, you might try replacing your network card or updating the kernel network driver as the problem may be a 32/64 bit problem (not likely a Bacula problem). ---------------------------------------------------------------------- (0003195) RalfGross (reporter) - 2008-03-05 13:10 http://bugs.bacula.org/view.php?id=1061#c3195 ---------------------------------------------------------------------- I've uploaded the config files (bacula-conf.tar.gz). Regarding the HP printers: The bacula server is located in a datacenter, there are no printers. Do you mean that it's a problem if another host contacts the the bacula server on port 9101? That would be a perfect denial of service situation (maybe I simply missunderstood you). There were no other verify jobs running at that time. example: differential backup of 2 hosts + catalog and 3 verify jobs starting at differnt times in the morning. The last verify job failed. | 1,348 | SMTCZB0003 | 2008-02-24 00:07:13 | B | D | 29,147 | 3,082,305,184 | T | | 1,349 | VU0EM003 | 2008-02-24 00:17:52 | B | D | 137,537 | 123,991,016,288 | T | | 1,350 | BackupCatalog | 2008-02-24 01:51:24 | B | F | 1 | 1,367,148,780 | T | | 1,351 | VerifySMTCZB0003 | 2008-02-24 12:05:00 | V | O | 29,147 | 0 | T | | 1,352 | VerifyCatalog | 2008-02-24 12:07:53 | V | O | 1 | 0 | T | | 1,353 | VerifyVU0EM003 | 2008-02-24 12:09:27 | V | O | 121,494 | 0 | f 24-Feb 12:09 VU0EM005-dir JobId 1353: Verifying against JobId=1349 Job=VU0EM003.2008-02-24_00.06.22 24-Feb 12:09 VU0EM005-dir JobId 1353: Bootstrap records written to /opt/bacula/var/bacula/working/VU0EM005-dir.restore.5.bsr 24-Feb 12:09 VU0EM005-dir JobId 1353: 24-Feb 12:09 VU0EM005-dir JobId 1353: The job will require the following Volume(s) Storage(s) SD Device(s) =========================================================================== 24-Feb 12:09 VU0EM005-dir JobId 1353: 24-Feb 12:09 VU0EM005-dir JobId 1353: 06D125L3 NEC-T40A NEC-T40A 24-Feb 12:09 VU0EM005-dir JobId 1353: 24-Feb 12:09 VU0EM005-dir JobId 1353: Start Verify JobId=1353 Level=VolumeToCatalog Job=VerifyVU0EM003.2008-02-24_12.06.50 24-Feb 12:09 VU0EM005-dir JobId 1353: Using Device "LTO3" 24-Feb 12:10 VU0EM005-sd JobId 1353: Ready to read from volume "06D125L3" on device "LTO3" (/dev/ULTRIUM-TD3). 24-Feb 12:10 VU0EM005-sd JobId 1353: Forward spacing Volume "06D125L3" to file:block 4:0. 24-Feb 12:10 VU0EM005-sd JobId 1353: End of file 5 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D125L3" [.....] 24-Feb 12:43 VU0EM005-sd JobId 1353: End of file 120 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D125L3" 24-Feb 12:43 VU0EM005-sd JobId 1353: End of file 121 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D125L3" 24-Feb 12:44 VU0EM005-sd JobId 1353: End of file 122 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D125L3" 24-Feb 12:44 VU0EM005-sd JobId 1353: End of file 123 on device "LTO3" (/dev/ULTRIUM-TD3), Volume "06D125L3" 24-Feb 12:44 VU0EM005-dir JobId 1353: Fatal error: bsock.c:415 Packet size too big from "Client: VU0EM005-fd:10.60.1.250:9102. Terminating connection. 24-Feb 12:44 VU0EM005-dir JobId 1353: The following files are in the Catalog but not on the Volume(s): ---------------------------------------------------------------------- (0003201) RalfGross (reporter) - 2008-03-06 22:10 http://bugs.bacula.org/view.php?id=1061#c3201 ---------------------------------------------------------------------- I just added a new client and did a full backup + verify. The first verify failed with the "Packet size too big" error. The next attempt was successful. Then I restarted the verify job a couple of times to see how many jobs may fail. From 8 verify jobs of the same client, 6 failed (fd, sd, dir on the same client). It also failed for one job where I used a remote fd. Would it help to attach a tcpdump, or increase the debug level of the fd (or sd/dir)? ---------------------------------------------------------------------- (0003212) RalfGross (reporter) - 2008-03-15 13:41 http://bugs.bacula.org/view.php?id=1061#c3212 ---------------------------------------------------------------------- One of my verify jobs failed again. The other 3 jobs completed without problems. I then reran the job 5 times, it always failed with the bsock error. I'v never seen this error during backup, only with verify jobs. I monitored the network traffic with tcpdump, there was no other network device sending packages to the bacula ports. I ran the bacula-dir, bacula-fd and bacula-sd all with debug level 50. But I can't find anything regarding "bsock" in the debug files. 15-Mär 14:31 VUMEM004-dir JobId 1591: Fatal error: bsock.c:415 Packet size too big from "Client: VUMEM004-fd:10.60.1.231:9102. Terminating connection. the last thing I see in the bacula-dir debug file: VUMEM004-dir: verify.c:570-0 dird<filed: stream=1 /server/cvsroot/iprep/lane_recog/laneRecogBirdview/Makefile,v VUMEM004-dir: verify.c:571-0 dird<filed: attr=P0C FyjC IEk B BQu Sg A Dxj BAA g BH2lJ/ BH2lJ/ BH2lJ/ A A C VUMEM004-dir: sql_get.c:127-0 Get_file_record JobId=1566 FilenameId=324150 PathId=753441 VUMEM004-dir: sql_get.c:133-0 get_file_record num_rows=1 VUMEM004-dir: verify.c:570-0 dird<filed: stream=1 /server/cvsroot/iprep/lane_recog/laneRecogBirdview/add_path,v VUMEM004-dir: verify.c:571-0 dird<filed: attr=P0C FyjD IFt B BQu Sg A Uj BAA I BH2lJ/ BH2lJ/ BH2lJ/ A A C VUMEM004-dir: sql_get.c:127-0 Get_file_record JobId=1566 FilenameId=537150 PathId=753441 VUMEM004-dir: sql_get.c:133-0 get_file_record num_rows=1 VUMEM004-dir: job.c:1161-0 wstore=VU0EM003-DISK where= VUMEM004-dir: cram-md5.c:80-0 send: auth cram-md5 <969594968.1205587899@VUMEM004-dir> ssl=0 VUMEM004-dir: cram-md5.c:99-0 Authenticate OK FS+h45/M0hkCu9+pCC+feD VUMEM004-dir: job.c:1161-0 wstore=VU0EM003-DISK where= VUMEM004-dir: cram-md5.c:80-0 send: auth cram-md5 <2024047328.1205587904@VUMEM004-dir> ssl=0 VUMEM004-dir: cram-md5.c:99-0 Authenticate OK R/9kF9cDv6/kQ7/jD3h3lD VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP TABLE DelCandidates VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP INDEX DelInx1 VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP INDEX DelInx1 VUMEM004-dir: ua_prune.c:208-0 select now=1205587904 period=2592000 sql=SELECT count(*) from Job WHERE JobTDate<1202995904 AND ClientId=19 AND PurgedFiles=0 VUMEM004-dir: job.c:340-0 ======== End Job stat=f ========== VUMEM004-dir: ua_status.c:115-0 status:status client=VUMEM004-fd: VUMEM004-dir: btimers.c:225-0 thread timer 5c9768 kill thread tid=44007960 at 1205587962. VUMEM004-dir: fd_cmds.c:102-0 Opened connection with File daemon VUMEM004-dir: authenticate.c:186-0 Sent: Hello Director VUMEM004-dir calling VUMEM004-dir: cram-md5.c:80-0 send: auth cram-md5 <1622592909.1205587962@VUMEM004-dir> ssl=0 VUMEM004-dir: cram-md5.c:99-0 Authenticate OK x/+9b5g6Tg/z185wRE/CVC VUMEM004-dir: ua_status.c:360-0 Connected to file daemon and the bacula-fd debug file: VUMEM004-fd: verify_vol.c:219-0 bfiled>bdird: MD5 len=42: msg=73864 3 QmGFRhCiAjcF36dEas/CwA *MD5-73864* VUMEM004-fd: verify_vol.c:102-0 Got hdr: FilInx=73865 Stream=1. VUMEM004-fd: verify_vol.c:115-0 Got stream data, len=149 VUMEM004-fd: verify_vol.c:149-0 Got Attr: FilInx=73865 type=3 VUMEM004-fd: verify_vol.c:102-0 Got hdr: FilInx=73865 Stream=2. VUMEM004-fd: verify_vol.c:115-0 Got stream data, len=1902 VUMEM004-fd: verify_vol.c:102-0 Got hdr: FilInx=73865 Stream=15. VUMEM004-fd: verify_vol.c:115-0 Got stream data, len=33 VUMEM004-fd: verify_vol.c:102-0 Got hdr: FilInx=73865 Stream=3. VUMEM004-fd: verify_vol.c:115-0 Got stream data, len=16 VUMEM004-fd: verify_vol.c:219-0 bfiled>bdird: MD5 len=42: msg=73865 3 LlNUMVjjwI2HyOzzIl0v8Q *MD5-73865* VUMEM004-fd: verify_vol.c:102-0 Got hdr: FilInx=73866 Stream=1. VUMEM004-fd: verify_vol.c:115-0 Got stream data, len=149 VUMEM004-fd: verify_vol.c:149-0 Got Attr: FilInx=73866 type=3 VUMEM004-fd: verify_vol.c:266-0 End Verify-Vol. Files=73866 Bytes=0 VUMEM004-fd: job.c:252-0 Quit command loop. Canceled=1 VUMEM004-fd: cram-md5.c:73-0 send: auth cram-md5 <1540906308.1205587962@VUMEM004-fd> ssl=0 and the last part of the bacula-sd debug file: VUMEM004-sd: read.c:59-0 Start read data. VUMEM004-sd: acquire.c:108-0 MediaType dcr=File1 dev=File1 VUMEM004-sd: acquire.c:210-0 opened dev "VU0EM003-DISK" (/data/bacula-storage/vu0em003) OK VUMEM004-sd: acquire.c:213-0 calling read-vol-label VUMEM004-sd: reserve.c:313-0 jid=1591 reserve_volume vu0em003-inc-0099 VUMEM004-sd: reserve.c:181-0 jid=1591 List from begin reserve_volume: 06D126L3 at 58a418 on device "LTO3" (/dev/ULTRIUM-TD3) VUMEM004-sd: reserve.c:238-0 jid=1591 new Vol=vu0em003-inc-0099 at 592f88 dev="VU0EM003-DISK" (/data/bacula-storage/vu0em003) VUMEM004-sd: reserve.c:181-0 jid=1591 List from end new volume: 06D126L3 at 58a418 on device "LTO3" (/dev/ULTRIUM-TD3) VUMEM004-sd: reserve.c:181-0 jid=1591 List from end new volume: vu0em003-inc-0099 at 592f88 on device "VU0EM003-DISK" (/data/bacula-storage/vu0em003) Volume Label: Id : Bacula 1.0 immortal VerNo : 11 VolName : vu0em003-inc-0099 PrevVolName : VolFile : 0 LabelType : VOL_LABEL LabelSize : 190 PoolName : VU0EM003-Disk-Incremental MediaType : File1 PoolType : Backup HostName : VUMEM004 Date label written: 12-Mär-2008 00:10 VUMEM004-sd: acquire.c:293-0 Dec reserve=0 dev="VU0EM003-DISK" (/data/bacula-storage/vu0em003) VUMEM004-sd: acquire.c:108-0 MediaType dcr=File1 dev=File1 VUMEM004-sd: acquire.c:210-0 opened dev "VU0EM003-DISK" (/data/bacula-storage/vu0em003) OK VUMEM004-sd: acquire.c:213-0 calling read-vol-label VUMEM004-sd: reserve.c:313-0 jid=1591 reserve_volume vu0em003-inc-0120 VUMEM004-sd: reserve.c:181-0 jid=1591 List from begin reserve_volume: 06D126L3 at 58a418 on device "LTO3" (/dev/ULTRIUM-TD3) VUMEM004-sd: reserve.c:181-0 jid=1591 List from begin reserve_volume: vu0em003-inc-0099 at 592f88 on device "VU0EM003-DISK" (/data/bacula-storage/vu0em003) VUMEM004-sd: reserve.c:336-0 jid=1591 reserve_vol free vol=vu0em003-inc-0099 at 592f88 VUMEM004-sd: reserve.c:493-0 jid=1591 free_volume vu0em003-inc-0099 dev="VU0EM003-DISK" (/data/bacula-storage/vu0em003) VUMEM004-sd: reserve.c:181-0 jid=1591 List from free_volume: 06D126L3 at 58a418 on device "LTO3" (/dev/ULTRIUM-TD3) VUMEM004-sd: reserve.c:181-0 jid=1591 List from reserve_vol free: 06D126L3 at 58a418 on device "LTO3" (/dev/ULTRIUM-TD3) VUMEM004-sd: reserve.c:238-0 jid=1591 new Vol=vu0em003-inc-0120 at 592f88 dev="VU0EM003-DISK" (/data/bacula-storage/vu0em003) VUMEM004-sd: reserve.c:181-0 jid=1591 List from end new volume: 06D126L3 at 58a418 on device "LTO3" (/dev/ULTRIUM-TD3) VUMEM004-sd: reserve.c:181-0 jid=1591 List from end new volume: vu0em003-inc-0120 at 592f88 on device "VU0EM003-DISK" (/data/bacula-storage/vu0em003) Volume Label: Id : Bacula 1.0 immortal VerNo : 11 VolName : vu0em003-inc-0120 PrevVolName : VolFile : 0 LabelType : VOL_LABEL LabelSize : 190 PoolName : VU0EM003-Disk-Incremental MediaType : File1 PoolType : Backup HostName : VUMEM004 Date label written: 15-Mär-2008 00:19 VUMEM004-sd: cram-md5.c:73-0 send: auth cram-md5 <2043364569.1205587459@VUMEM004-sd> ssl=0 VUMEM004-sd: cram-md5.c:73-0 send: auth cram-md5 <1935863610.1205587899@VUMEM004-sd> ssl=0 VUMEM004-sd: cram-md5.c:73-0 send: auth cram-md5 <1204045562.1205587904@VUMEM004-sd> ssl=0 ---------------------------------------------------------------------- (0003213) RalfGross (reporter) - 2008-03-16 12:09 http://bugs.bacula.org/view.php?id=1061#c3213 ---------------------------------------------------------------------- An additional note: Today two verify jobs ran at the same time, same backup client, but different filesets. The second one failed with the bsock error. 1597 Volu 70,241 0 Error 16-Mär008 12:29 VerifyVU0EM003 The last few lines of the debug file: VUMEM004-dir: verify.c:570-0 dird<filed: stream=1 /data/mdat_radar/VPU/vp0064_0251/BASPLUS_ohneCMSWT_20080214_101744/BASPLUS_ohneCMSWT_20080214_101744.01f VUMEM004-dir: verify.c:571-0 dird<filed: attr=hQ ea+Lj IG4 B BON MZ2 A E3K BAA o BHvaS8 BHvaS8 BHxRyr A A C VUMEM004-dir: sql_get.c:127-0 Get_file_record JobId=1590 FilenameId=2040945 PathId=762391 VUMEM004-dir: sql_get.c:133-0 get_file_record num_rows=1 VUMEM004-dir: verify.c:570-0 dird<filed: stream=1 /data/mdat_radar/VPU/vp0064_0251/BASPLUS_ohneCMSWT_20080214_101744/BASPLUS_ohneCMSWT_20080214_101744.01c VUMEM004-dir: verify.c:571-0 dird<filed: attr=hQ ea+Lk IG4 B BON MZ2 A E3E BAA o BHvaS8 BHvaS8 BHxRyr A A C VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP TABLE DelCandidates VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP INDEX DelInx1 VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP INDEX DelInx1 VUMEM004-dir: ua_prune.c:208-0 select now=1205666947 period=2592000 sql=SELECT count(*) from Job WHERE JobTDate<1203074947 AND ClientId=19 AND PurgedFiles=0 VUMEM004-dir: job.c:340-0 ======== End Job stat=f ========== The related backup job was a differential backup, backing up http://bugs.bacula.org/view.php?id=12#c30 GB with no problem. The first verify job that started today is still running and had no bsock error yet (http://bugs.bacula.org/view.php?id=2#c4 TB of data). Running Jobs: JobId 1595 Job VerifyVU0EM003-VPU.2008-03-16_09.56.25 is running. Verify Job started: 16-Mär008 09:56 Files=684,701 Bytes=0 Bytes/sec=0 Errors=0 Files Examined=684,701 Processing file: /data/mdat_radar/VPU/vp0083_1556/CMS2_ES0_20080226_131146/CMS2_ES0_20080226_131146.08d SDReadSeqNo=29600705 fd=5 I just can't find any reason for the bsock errors in the bacula-dir debug file. Nothing with the string bsock in any debug file. ---------------------------------------------------------------------- (0003224) RalfGross (reporter) - 2008-03-22 21:34 http://bugs.bacula.org/view.php?id=1061#c3224 ---------------------------------------------------------------------- I had no verify errors since last weekend, but today it happend again. 22-Mär 10:13 VUMEM004-dir JobId 1661: Fatal error: bsock.c:415 Packet size too big from "Client: VUMEM004-fd:10.60.1.231:9102. Terminating connection. 22-Mär 10:14 VUMEM004-sd JobId 1661: Job VerifyVU0EI001.2008-03-22_10.05.04 marked to be canceled. 22-Mär 10:14 VUMEM004-dir JobId 1661: Fatal error: No Job status returned from FD. bacula-dir debug: VUMEM004-dir: sql_get.c:127-0 Get_file_record JobId=1657 FilenameId=2 PathId=766960 VUMEM004-dir: sql_get.c:133-0 get_file_record num_rows=1 VUMEM004-dir: verify.c:570-0 dird<filed: stream=1 /var/www/fit-2.17.6-erw/template/de-de/.svn/tmp/wcprops/ VUMEM004-dir: verify.c:571-0 dird<filed: attr=gQ YAYDh EHt C P0 P0 A G BAA A BH<FF><FF><FF><FA>31+l BDhA07 BH31+l A A VUMEM004-dir: sql_get.c:127-0 Get_file_record JobId=1657 FilenameId=2 PathId=766963 VUMEM004-dir: sql_get.c:133-0 get_file_record num_rows=1 VUMEM004-dir: job.c:1161-0 wstore=ITD-DISK where= VUMEM004-dir: cram-md5.c:80-0 send: auth cram-md5 <1779860052.1206177241@VUMEM004-dir> ssl=0 VUMEM004-dir: cram-md5.c:99-0 Authenticate OK ZT+vNkBbREQMe3BSW5hSgA VUMEM004-dir: job.c:1161-0 wstore=ITD-DISK where= VUMEM004-dir: cram-md5.c:80-0 send: auth cram-md5 <312989958.1206177246@VUMEM004-dir> ssl=0 VUMEM004-dir: cram-md5.c:99-0 Authenticate OK X6+UTyJ827B3M7ZXR6wQkB VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP TABLE DelCandidates VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP INDEX DelInx1 VUMEM004-dir: postgresql.c:513-0 Result status failed: DROP INDEX DelInx1 VUMEM004-dir: ua_prune.c:208-0 select now=1206177246 period=2592000 sql=SELECT count(*) from Job WHERE JobTDate<1203585246 AND ClientId=19 AND PurgedFiles=0 VUMEM004-dir: job.c:340-0 ======== End Job stat=f ========== The next verify job ran without problems. BTW: it'd be really helpful if the debug output would have some timestamps. I can't find any network card related problems, nothing in /var/log/messages. And I've no problems backing up TB's of data. It's just that the verify jobs fail now and then. Here're the last few lines of a tcpdump done during a verify job (tcpdump -i lo port 9102 -s0) 21:14:39.553628 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: P 3704010:3714762(10752) ack 1 win 97 <nop,nop,timestamp 328767822 328767769> 21:14:39.553634 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3714762 win 0 <nop,nop,timestamp 328767822 328767822> 21:14:39.765630 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: . ack 1 win 97 <nop,nop,timestamp 328767875 328767822> 21:14:39.765635 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3714762 win 0 <nop,nop,timestamp 328767875 328767822> 21:14:40.032692 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3714762 win 35 <nop,nop,timestamp 328767941 328767822> 21:14:40.032697 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: P 3714762:3720394(5632) ack 1 win 97 <nop,nop,timestamp 328767941 328767941> 21:14:40.032701 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3720394 win 26 <nop,nop,timestamp 328767941 328767941> 21:14:40.241638 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: P 3720394:3733706(13312) ack 1 win 97 <nop,nop,timestamp 328767994 328767941> 21:14:40.241645 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3733706 win 0 <nop,nop,timestamp 328767994 328767994> 21:14:40.449640 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: . ack 1 win 97 <nop,nop,timestamp 328768046 328767994> 21:14:40.449645 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3733706 win 0 <nop,nop,timestamp 328768046 328767994> 21:14:40.865645 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: . ack 1 win 97 <nop,nop,timestamp 328768150 328768046> 21:14:40.865649 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3733706 win 0 <nop,nop,timestamp 328768150 328767994> 21:14:41.697658 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: . ack 1 win 97 <nop,nop,timestamp 328768358 328768150> 21:14:41.697662 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3733706 win 0 <nop,nop,timestamp 328768358 328767994> 21:14:43.361682 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: . ack 1 win 97 <nop,nop,timestamp 328768774 328768358> 21:14:43.361686 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3733706 win 0 <nop,nop,timestamp 328768774 328767994> 21:14:46.689733 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: . ack 1 win 97 <nop,nop,timestamp 328769606 328768774> 21:14:46.689737 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3733706 win 0 <nop,nop,timestamp 328769606 328767994> 21:14:53.349833 IP 10.60.1.231.bacula-fd > 10.60.1.231.52607: . ack 1 win 97 <nop,nop,timestamp 328771270 328769606> 21:14:53.349837 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: . ack 3733706 win 0 <nop,nop,timestamp 328771270 328767994> 21:15:00.506921 IP 10.60.1.231.52607 > 10.60.1.231.bacula-fd: R 1:1(0) ack 3733706 win 257 <nop,nop,timestamp 328773060 328767994> The traffic is not using any physical network interface (eth0, eth1 or eth2), all traffic is going over lo because all involved daemons are running on the same server. So a broken network device can't be the reason! Next I did a successful restore of the job where the verify was constantly failing today (I tried it http://bugs.bacula.org/view.php?id=9#c20 times). 22-Mär 22:29 VUMEM004-dir JobId 1681: Bacula VUMEM004-dir 2.2.8 (26Jan08): 22-Mär-2008 22:29:19 Build OS: x86_64-pc-linux-gnu debian 4.0 JobId: 1681 Job: RestoreFiles.2008-03-22_22.15.34 Restore Client: VUMEM004-fd Start time: 22-Mär-2008 22:15:10 End time: 22-Mär-2008 22:29:19 Files Expected: 67,221 Files Restored: 67,221 Bytes Restored: 30,147,456,137 Rate: 35509.4 KB/s FD Errors: 0 FD termination status: OK SD termination status: OK ---------------------------------------------------------------------- (0003230) RalfGross (reporter) - 2008-03-24 19:32 http://bugs.bacula.org/view.php?id=1061#c3230 ---------------------------------------------------------------------- I modified src/lib/bsock.c to show the packet size in the bsock error message. 24-Mär 18:48 VUMEM004-dir JobId 1699: New file: /server/cvsroot/iprep/ANTSRT/SRC/Components/UTA2/ImageRectificationOpenGl/ip/ip_hi����ghlevel/src/pelCont.cc,v 24-Mär 18:48 VUMEM004-dir JobId 1699: Fatal error: bsock.c:416 DEBUG: Packet size 541261824 too big from "Client: VUMEM004-fd:10.60.1.231. Terminating connection. 24-Mär 18:48 VUMEM004-dir JobId 1699: Fatal error: bsock.c:419 Packet size too big from "Client: VUMEM004-fd:10.60.1.231:9102. Terminating connection. 24-Mär 18:48 VUMEM004-sd JobId 1699: Job VerifyVU0EM003.2008-03-24_18.30.05 marked to be canceled. This verify job failed in the first attempt yesterday, I then reran the job, which succeeded. Now, one day later, this job fails again with a packet size of 541261824. I see from bsock.c that the maximum size is 1000000. I've no idea what to check next. Is there something I can check on the fd side? Additional note. I just checked the dir's debug entries. I don't know if this is related to the bsock error problem. Failed verify jobs: sql_get.c:73-0 db_get_file_att_record +fname=/server/cvsroot/iprep/ANTSRT/SRC/Components/UTA2/ImageRectificationOpenGl/ip/ip_hi<FF><FF><FF><FA>ghlevel/ +src/pelCont.cc,v sql_get.c:127-0 Get_file_record JobId=1683 FilenameId=539133 PathId=0 sql_get.c:129-0 Query=SELECT FileId, LStat, MD5 FROM File WHERE File.JobId=1683 AND File.PathId=0 AND +File.FilenameId=539133 sql_get.c:133-0 get_file_record num_rows=0 verify.c:580-0 File not in catalog: +/server/cvsroot/iprep/ANTSRT/SRC/Components/UTA2/ImageRectificationOpenGl/ip/ip_hi<FF><FF><FF><FA>ghlevel/src/pe +lCont.cc,v Successfull verify job (same jobid, but yesterday): sql_get.c:73-0 db_get_file_att_record +fname=/server/cvsroot/iprep/ants-rt/src/components/vehicle/rectification/ip_highlevel/src/pelCont.cc,v sql_get.c:127-0 Get_file_record JobId=1683 FilenameId=539133 PathId=750055 sql_get.c:129-0 Query=SELECT FileId, LStat, MD5 FROM File WHERE File.JobId=1683 AND File.PathId=750055 AND +File.FilenameId=539133 sql_get.c:133-0 get_file_record num_rows=1 getmsg.c:110-0 bget_dirmsg 42: 89449 3 p3STS8yKrYA5KQzv5yelcg *MD5-89449* The difference is the File.PathId=0 and the strange characters in the path to the file. ---------------------------------------------------------------------- (0003232) kern (administrator) - 2008-03-25 07:03 http://bugs.bacula.org/view.php?id=1061#c3232 ---------------------------------------------------------------------- The root of the problem is in your most recent output where in the middle of the record sent from the FD to the DIR, there is a FFFFFFFA. This FFFFFFFA is equivalent to an integer -6, which is the signal that Bacula sends to indicate a heartbeat. So, it is almost certain that the heartbeat code (your 300 second interval) is interfering with the Verify data transmission. Temporary workaround until I confirm the problem and find a fix is to turn off heartbeat. Heartbeat is not needed on a loopback device. ---------------------------------------------------------------------- (0003266) RalfGross (reporter) - 2008-04-03 09:26 http://bugs.bacula.org/view.php?id=1061#c3266 ---------------------------------------------------------------------- I haven't had a verify error since I disabled the fd heartbeat option. ---------------------------------------------------------------------- (0003267) kern (administrator) - 2008-04-03 10:13 http://bugs.bacula.org/view.php?id=1061#c3267 ---------------------------------------------------------------------- Thanks for the feedback/confirmation, and thanks for continuing to work on this until we found the cause. I do have a solution programmed in the SVN trunk, but it is a fair amount of new code that is not really appropriate for 2.2.x (perhaps if there is a 2.2.10). Now that you have confirmed this, I will close the bug. The workaround is to turn off heartbeat or ensure that it is longer than the Verify job. ---------------------------------------------------------------------- (0003958) RalfGross (reporter) - 2009-02-20 11:39 http://bugs.bacula.org/view.php?id=1061#c3958 ---------------------------------------------------------------------- Sorry for reopening this old bug. Just wanted to let you know that a heartbeat signal during a verify in 2.4.4 still triggers a verify error. In note 0003267 you wrote that there is new code that should prevent that. The verify job shows verify differences as state and doesn't do anything after that (no more files get checked) and it does not terminate. I've to cancel the job in bconsole. Issue History Date Modified Username Field Change ====================================================================== 2008-03-04 07:40 RalfGross New Issue 2008-03-04 18:23 kern Note Added: 0003183 2008-03-04 18:23 kern Status new => feedback 2008-03-04 20:20 RalfGross Note Added: 0003187 2008-03-05 06:56 kern Note Added: 0003190 2008-03-05 12:52 RalfGross File Added: bacula-conf.tar.gz 2008-03-05 13:07 RalfGross Note Added: 0003195 2008-03-05 13:10 RalfGross Note Edited: 0003195 2008-03-06 22:10 RalfGross Note Added: 0003201 2008-03-15 13:41 RalfGross Note Added: 0003212 2008-03-16 12:09 RalfGross Note Added: 0003213 2008-03-22 09:31 RalfGross Note Added: 0003224 2008-03-22 20:42 RalfGross Note Edited: 0003224 2008-03-22 21:34 RalfGross Note Edited: 0003224 2008-03-24 18:03 RalfGross Note Added: 0003230 2008-03-24 19:22 RalfGross Note Edited: 0003230 2008-03-24 19:32 RalfGross Note Edited: 0003230 2008-03-25 07:03 kern Note Added: 0003232 2008-04-03 09:26 RalfGross Note Added: 0003266 2008-04-03 10:13 kern Note Added: 0003267 2008-04-03 10:13 kern Status feedback => closed 2008-04-03 10:13 kern Resolution open => fixed 2008-04-03 10:13 kern Fixed in Version => 3.0.0 2009-02-20 11:39 RalfGross Note Added: 0003958 2009-02-20 11:39 RalfGross Status closed => feedback 2009-02-20 11:39 RalfGross Resolution fixed => reopened ====================================================================== |