From: Dan L. <da...@la...> - 2008-12-23 20:47:26
|
In case this provides clues to a bug... Begin forwarded message: > From: Conor Edberg <co...@ls...> > Date: December 23, 2008 11:53:03 AM EST > To: Jason Dixon <jd...@om...> > Cc: bac...@li... > Subject: Re: [Bacula-users] Solaris "Packet size too big" failures > > > On Dec 9, 2008, at 2:36 PM, Jason Dixon wrote: > >> On Tue, Dec 09, 2008 at 09:26:38PM +0000, Allan Black wrote: >>> Jason Dixon wrote: >>>> Alas, I spoke too soon. The CatalogBackup job failed again last >>>> night, >>>> usual symptoms. >>> >>> OK. need to find out what the FD is doing. I would recommend: >>> >>> truss -o filename -f -a -e -v all -w 2 -p <FD pid> >>> >>> Is it possible to run the catalog backup during the day, by hand? >>> That way you could avoid leaving the truss running (and collecting >>> data) all night. >> >> I've run it 6 times today, no failures yet. Frustrating. Here are >> the >> results showing all the previous failures, then the successes today. >> >> -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole | >> grep BackupCatalog | grep '| f ' >> | 60 | BackupCatalog | 2008-11-08 00:36:12 | B | F >> | 1 | 965,869,568 | f | >> | 116 | BackupCatalog | 2008-11-11 23:14:05 | B | F >> | 1 | 614,727,680 | f | >> | 286 | BackupCatalog | 2008-11-22 23:14:28 | B | F >> | 1 | 5,278,400,512 | f | >> | 298 | BackupCatalog | 2008-11-23 23:14:32 | B | F >> | 1 | 4,723,965,952 | f | >> | 336 | BackupCatalog | 2008-11-26 23:14:30 | B | F >> | 1 | 3,979,280,384 | f | >> | 361 | BackupCatalog | 2008-11-28 23:16:11 | B | F >> | 1 | 2,101,936,128 | f | >> | 385 | BackupCatalog | 2008-11-30 23:13:21 | B | F >> | 1 | 3,863,216,128 | f | >> >> -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole | >> grep BackupCatalog | grep '2008-12-09' >> | 486 | BackupCatalog | 2008-12-09 16:46:51 | B | F >> | 1 | 3,183,911,677 | T | >> | 487 | BackupCatalog | 2008-12-09 16:55:42 | B | F >> | 1 | 3,183,912,220 | T | >> | 488 | BackupCatalog | 2008-12-09 17:03:22 | B | F >> | 1 | 3,183,912,724 | T | >> | 489 | BackupCatalog | 2008-12-09 17:18:03 | B | F >> | 1 | 3,183,913,233 | T | >> | 490 | BackupCatalog | 2008-12-09 17:23:26 | B | F >> | 1 | 3,183,913,740 | T | >> | 491 | BackupCatalog | 2008-12-09 17:29:53 | B | F >> | 1 | 3,183,914,246 | T | >> >> >> -- >> Jason Dixon >> OmniTI Computer Consulting, Inc. >> jd...@om... >> 443.325.1357 x.241 >> > > > > I'm seeing similar problems to the one Jason described. However, in > addition to the 'Packet Size too big' failure, I see a variety of > other error messages as well. FD, SD, and dir are all running on the > same machine, disk to disk backup. Some examples of the failures: > > 20-Dec 05:01 server-sd JobId 328: Fatal error: append.c:190 File index > from FD not positive or sequential > 20-Dec 05:01 server-sd JobId 328: Job write elapsed time = 00:18:22, > Transfer rate = 10.48 M bytes/second > 20-Dec 05:01 server-fd JobId 328: Fatal error: backup.c:1003 Network > send error to SD. ERR=Broken pipe > 20-Dec 05:01 server-fd JobId 328: Error: bsock.c:341 Write error > sending -1 bytes to Storage daemon:192.168.1.123:9103: ERR=Broken pipe > 20-Dec 05:01 server-dir JobId 328: Error: Bacula server-dir 2.5.16- > beta (20Oct08): 20-Dec-2008 05:01:23 > Build OS: i386-pc-solaris2.11 solaris 5.11 > > 05-Dec 13:30 server-fd JobId 275: Fatal error: backup.c:957 Network > send error to SD. ERR=I/O error > 05-Dec 13:30 server-sd JobId 275: Fatal error: append.c:259 Network > error on data channel. ERR=No data available > 05-Dec 13:30 server-sd JobId 275: Job write elapsed time = 01:36:26, > Transfer rate = 50.96 M bytes/second > 05-Dec 13:30 server-sd JobId 275: Error: bsock.c:485 Read error from > client:192.168.1.123:36643: ERR=No data available > 05-Dec 13:30 server-dir JobId 275: Bacula server-dir 2.5.16-beta > (20Oct08): 05-Dec-2008 13:30:43 > Build OS: i386-pc-solaris2.11 solaris 5.11 > > 10-Oct 10:15 server-sd JobId 130: Fatal error: append.c:180 Malformed > data header from FD: > 10-Oct 10:15 server-sd JobId 130: Job write elapsed time = 00:03:34, > Transfer rate = 13.56 M bytes/second\ > 10-Oct 10:15 server-fd JobId 130: Fatal error: backup.c:892 Network > send error to SD. ERR=Broken pipe\ > 10-Oct 10:15 server-dir JobId 130: Error: Bacula server-dir 2.4.2 > (26Jul08): 10-Oct-2008 10:15:49\ > Build OS: i386-pc-solaris2.11 solaris 5.11 > > 19-Sep 23:05 server-sd JobId 23: Ready to append to end of Volume > "test1" size=140938726135 > 19-Sep 23:20 server-sd JobId 23: Fatal error: append.c:159 Error > reading data header from FD. ERR=No data available > 19-Sep 23:20 server-sd JobId 23: Job write elapsed time = 00:15:04, > Transfer rate = 8.229 M bytes/second > 19-Sep 23:20 server-sd JobId 23: Fatal error: bsock.c:415 Packet size > too big from "client:127.0.0.1:36643. Terminating connection. > 19-Sep 23:20 server-fd JobId 23: Fatal error: backup.c:892 Network > send error to SD. ERR=Broken pipe > 19-Sep 23:20 server-fd JobId 23: Error: bsock.c:306 Write error > sending 32768 bytes to Storage daemon:127.0.0.1:9103: ERR=Broken pipe > 19-Sep 23:20 server-dir JobId 23: Error: Bacula server-dir 2.4.2 > (26Jul08): 19-Sep-2008 23:20:18 > Build OS: i386-pc-solaris2.11 solaris 5.11 > JobId: 23 > Job: Client1.2008-09-19_23.05.49 > > > > Here's the truss of a failure (I think this is the relevant bit), > using the options from Alan's message above: > > 14457/53: write(6, "\0\080\0 m k Y\r\n M 3 m".., 32772) = > 32772 > 14457/53: read(14, " a p L S p L 9 f 6 + l 2".., 32768) = > 32768 > 14457/53: write(6, "\0\080\0 a p L S p L 9 f".., 32772) = > 32772 > 14457/53: read(14, " I T F a T F B i m 4 S V".., 32768) = > 32768 > 14457/53: write(6, "\0\080\0 I T F a T F B i".., 32772) = > 32772 > 14457/53: read(14, " C O / B f / / v\r\n 9 f".., 32768) = > 32768 > 14457/53: write(6, "\0\080\0 C O / B f / / v".., 32772) = > 32772 > 14457/54: pollsys(0xFE2DFDF0, 1, 0xFE2DFE98, 0x00000000) = 1 > 14457/54: fd=6 ev=POLLRDNORM rev=POLLRDNORM > 14457/54: timeout: 5.000000000 sec > 14457/53: read(14, " 6 d L r f 3 3\r\n 3 p q".., 32768) = > 32768 > 14457/54: read(6, 0xFE2DFF7C, 4) > Err#131 ECONNRESET > 14457/54: lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = > 0xFFBFFEFF [0x0000FFFF] > 14457/54: lwp_exit() > 14457/53: write(6, "\0\080\0 6 d L r f 3 3\r".., 32772) Err#32 > EPIPE > 14457/53: Received signal #13, SIGPIPE [ignored] > 14457/53: time() = > 1229375022 > 14457/53: time() = > 1229375022 > 14457/53: write(4, "\0\0\094 J m s g J o b".., 152) = 152 > 14457/53: close(14) = 0 > 14457/53: close(13) = 0 > 14457/53: close(12) = 0 > 14457/53: close(11) = 0 > 14457/53: close(10) = 0 > 14457/53: close(9) = 0 > 14457/53: close(8) = 0 > 14457/53: close(7) = 0 > 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 > 14457/53: tmout: 0.050000000 sec > 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 > 14457/53: tmout: 0.050000000 sec > 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 > 14457/53: tmout: 0.050000000 sec > 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 > 14457/53: tmout: 0.050000000 sec > > > -- > Conor Edberg > Letters & Science IT > UC Santa Barbara > > ------------------------------------------------------------------------------ > _______________________________________________ > Bacula-users mailing list > Bac...@li... > https://lists.sourceforge.net/lists/listinfo/bacula-users -- Dan Langille http://langille.org/ |