From: Arno L. <al...@it...> - 2006-01-04 17:24:51
|
Hello, On 1/4/2006 12:13 PM, Kern Sibbald wrote: > Hello Arno and everyone, >=20 > On Tuesday 03 January 2006 23:51, Arno Lehmann wrote: >=20 >>Hello, >> >>I just saw that Kern seems to be back... I hope you had a nice vacation= ! >=20 >=20 > Yes, that is an accurate description -- I am sort of back -- lots of li= ttle=20 > errands to run before I settle into doing someting productive for Bacul= a=20 > later this afternoon ... Great, then I've got just the right thing *now* ;-) Unfortunately, and sorry for the trouble, but my jobs failed again=20 without need... you find the details below. >>Well, I assume you're already :-) busy with this project again, so I >>thought I'd give some more details concerning my earlier problem report= s. >> >>Simple things: >>- Performance is back to normal >>- Concurrent jobs now work. This might be magic, but I suppose that the= >>situation which resulted in errors, has not happened again. I saw jobs >>held because the volumes from their pool wasn't available, which is >>something that failed with an earlier 1.38.3beta version. >>- I'm running 1.38.3 beta 22 Dec in production mode since about Dec 28,= >>and I had no problems with it (unless I blame hardware problems to >>Bacula...). I didn't try any tricks yet, like manually starting jobs >>which, would want different volumes in one device. I'll try that as soo= n >>as I finished my (currently running) full backups. >>- The problems I reported on or about Dec 28 didn't happen again, but I= >>still have got the corresponding SD debug logs. If I can reproduce the >>issue I'll see if I can find anything enlighening in them. >=20 >=20 > Oh, this is really good news. It means that I can now release 1.38.3 t= o=20 > production, and start working on the little unfinished details for 1.38= =2E4,=20 > then on to doing our new projects list ... I suggest you look at the follow problem description... I still think=20 there's something to be fixed in the code :-( > While on vacation, I think I worked out a solution to a nasty bug that = has=20 > been reported by Daniel Holtkamp, which causes a seg fault when the wat= ch dog=20 > kicks in and cancels overdue jobs=20 great - I *think* I saw that one, too, but I had other things to track do= wn. >... Once I get 1.38.3 released, it is the=20 > first thing that I will implement as part of 1.38.4 ... >=20 >=20 >> >>And, of course, a happy new year to all of you! >=20 >=20 > Yes, and I hope that everyone had as nice a New Year as I did. I was i= n=20 > Vienna and had the good fortune to attend the Viennea Philharmonic=20 > presentation on the 30th, and even progress a little bit with my rudime= ntary=20 > German :-) Das freut mich. Wie steht's mit dem Besuch in M=FCnchen? Or, in english, = I'm glad to hear that, and what's about the visit to Munich? Well, then, the not-so-nice things. First, the situation: I still use two autochangers. The basic setup is one DLT autochanger for = Full backups. This one is only operated with volumes from pool Full. The = other one is a DDS autochanger with magazines for 6 cartridges. I use=20 one magazine with tapes from pool Incr, and another one is loaded with=20 volumes for pool Diff. The MediaTypes are "DLTIV" for the DLT tapes and=20 "DDS2" for the DDS ones. Usually, I have the Incr volumes in the DDS autoloader. On Tuesday and=20 Wednesday, I run weekly differential backups, so I use the folowing=20 schedule: Mon: Incr backups for all jobs start at 8:25 Tuesday: at 8:20, start Incr backups for windows clients at 8:25, start Diff backups for linux clients. Wednesday: at 8:20, start Diff backups for windows clients at 8:25, start Incr backups for linux clients. This is simplified, but represents what I actually do. The idea is to=20 only change magazines twice: Once Incrementals on Tuesday are done, I=20 want to oad the mag with Diff pool tapes, and on Wednesday, I swap them=20 back. I always do the mag change like this: "umount HPDAT", exchange=20 magazines, "update slots scan storage=3DHPDAT", "mount HPDAT". I assume=20 that this has nothing to do with the problems I observe. Yesterday, due to a hardware problem, I started the Backups manually. I=20 started some full backups, but these don't matter here, and I ran some=20 jos to Diff pool. Everything worked fine. Today, the setup was an "inverse" Wednesday one: Volumes from Incr pool=20 available, jobs are started in the expected order. First, the jobs for Diff got started. This job, naturally, required the=20 pool Diff. I got the mount request, but it took me a while to start=20 working on the backups, so the whole system had to wait a while. Then, later: > 04-Jan 08:25 goblin-dir: Start Backup JobId 4692, Job=3DGoblin.2006-01-= 04_08.25.00 > 04-Jan 08:44 goblin-sd: Failed command: 1000 OK VolName=3DDAT-120-0001 = VolJobs=3D4 VolFiles=3D4 VolBlocks=3D43911 VolBytes=3D2832697280 VolMount= s=3D44 VolErrors=3D0 VolWrites=3D1241890 MaxVolBytes=3D0 VolCapacityBytes= =3D0 VolStatus=3DAppend Slot=3D1 MaxVolJobs=3D0 MaxVolFiles=3D0 InChanger= =3D1 VolReadTime=3D0 VolWriteTime=3D0 EndFile=3D0 EndBlock=3D0 VolParts=3D= 0 LabelType=3D0 >=20 > 04-Jan 08:44 goblin-sd: Goblin.2006-01-04_08.25.00 Fatal error:=20 > Device "HP DAT" with MediaType "DDS2" requested by DIR not found i= n SD Device resources. > 04-Jan 08:44 goblin-dir: Goblin.2006-01-04_08.25.00 Error: Bacula 1.38.= 3 (22Dec05): 04-Jan-2006 08:44:45 > JobId: 4692 This is from the job report for the first job that failed. Still the=20 same I already observed, I fear. Excerpts from the debug log follow. This is the first occurence of that job in todays log, I assume this was = at about the start time (8:25): > goblin-sd: dircmd.c:187 Message channel init completed. > goblin-sd: job.c:70 <dird: JobId=3D4692 job=3DGoblin.2006-01-04_08.25.0= 0 job_name=3DGoblin client_ > name=3Dgoblin-fd type=3D66 level=3D73 FileSet=3DLinuxGoblin NoAttr=3D0 = SpoolAttr=3D0 FileSetMD5=3Dl4Uaa3+C > V7xORjcBJ6hIPC SpoolData=3D1 WritePartAfterJob=3D0 PreferMountedVols=3D= 1 > goblin-sd: job.c:123 >dird: 3000 OK Job SDid=3D11 SDtime=3D1136284947 A= uthorization=3DALCP-MODF-PI > JO-CAEF-LILK-PGGO-JACC-CIMA > goblin-sd: pythonlib.c:216 No startup module. > goblin-sd: reserve.c:319 <dird: use storage=3DHPDAT media_type=3DDDS2 p= ool_name=3DIncr pool_type=3DB > ackup append=3D1 copy=3D0 stripe=3D0 > goblin-sd: reserve.c:342 <dird device: use device=3DHP^ADAT > goblin-sd: reserve.c:416 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D= 0 any=3D0 > goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D= 0 Later, I find: > goblin-sd: block.c:728 Leave terminate_writing_volume -- OK > goblin-sd: dev.c:1312 bsf_dev > goblin-sd: dev.c:1414 bsr_dev > goblin-sd: device.c:87 Enter fixup_device_block_write_error > goblin-sd: dev.c:621 rewind res=3D0 fd=3D8 "Quantum DLT 0" (/dev/nst2) > goblin-sd: reserve.c:755 reserve_append device is tape > goblin-sd: reserve.c:786 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D= 0 any=3D0 > goblin-sd: reserve.c:819 failed: want mounted -- no vol JobId=3D4691 > goblin-sd: reserve.c:759 can_reserve_drive!=3D1 > goblin-sd: reserve.c:669 Not OK. > goblin-sd: reserve.c:540 Search res for HP DAT > goblin-sd: reserve.c:548 Try changer device HP DAT 0 > goblin-sd: reserve.c:627 Try reserve HP DAT 0 jobid=3D4692 > goblin-sd: askdir.c:242 >dird: CatReq Job=3DGoblin.2006-01-04_08.25.00 = FindMedia=3D1 pool_name=3DI > ncr media_type=3DDDS2 > goblin-sd: reserve.c:424 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D= 0 any=3D0 > goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D= 0 > goblin-sd: askdir.c:165 <dird 1000 OK VolName=3DDAT-120-0001 VolJobs=3D= 4 VolFiles=3D4 VolBlocks=3D43 > 911 VolBytes=3D2832697280 VolMounts=3D44 VolErrors=3D0 VolWrites=3D1241= 890 MaxVolBytes=3D0 VolCapacity > Bytes=3D0 VolStatus=3DAppend Slot=3D1 MaxVolJobs=3D0 MaxVolFiles=3D0 In= Changer=3D1 VolReadTime=3D0 VolWrit > eTime=3D0 EndFile=3D0 EndBlock=3D0 VolParts=3D0 LabelType=3D0 > goblin-sd: reserve.c:643 Looking for Volume=3DDAT-120-0001 > goblin-sd: reserve.c:755 reserve_append device is tape > goblin-sd: reserve.c:786 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D= 0 any=3D0 > goblin-sd: reserve.c:819 failed: want mounted -- no vol JobId=3D4692 > goblin-sd: reserve.c:759 can_reserve_drive!=3D1 > goblin-sd: reserve.c:669 Not OK. > goblin-sd: reserve.c:424 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D= 0 any=3D0 > goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D= 0 > goblin-sd: reserve.c:540 Search res for HP DAT > goblin-sd: reserve.c:548 Try changer device HP DAT 0 > goblin-sd: reserve.c:627 Try reserve HP DAT 0 jobid=3D4693 > goblin-sd: askdir.c:242 >dird: CatReq Job=3DElfSys.2006-01-04_08.25.01 = FindMedia=3D1 pool_name=3DI > ncr media_type=3DDDS2 > goblin-sd: askdir.c:165 <dird 1000 OK VolName=3DDAT-120-0001 VolJobs=3D= 4 VolFiles=3D4 VolBlocks=3D43 > 911 VolBytes=3D2832697280 VolMounts=3D44 VolErrors=3D0 VolWrites=3D1241= 890 MaxVolBytes=3D0 VolCapacity > Bytes=3D0 VolStatus=3DAppend Slot=3D1 MaxVolJobs=3D0 MaxVolFiles=3D0 In= Changer=3D1 VolReadTime=3D0 VolWrit > eTime=3D0 EndFile=3D0 EndBlock=3D0 VolParts=3D0 LabelType=3D0 > goblin-sd: reserve.c:643 Looking for Volume=3DDAT-120-0001 > goblin-sd: reserve.c:755 reserve_append device is tape > goblin-sd: reserve.c:786 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D= 0 any=3D0 > goblin-sd: reserve.c:819 failed: want mounted -- no vol JobId=3D4693 > goblin-sd: reserve.c:759 can_reserve_drive!=3D1 > goblin-sd: reserve.c:669 Not OK. > goblin-sd: reserve.c:424 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D= 0 any=3D0 > goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D= 0 > goblin-sd: reserve.c:540 Search res for HP DAT > goblin-sd: reserve.c:548 Try changer device HP DAT 0 > goblin-sd: reserve.c:627 Try reserve HP DAT 0 jobid=3D4695 In the above snippet, I can't determine which lines relate to the job in = question, though. Still I think you can see that reserving the device=20 didn't work: can_reserve_drive!=3D1 and Not OK show that. There are no further occurences of the job name=20 Goblin.2006-01-04_08.25.00, so I assume that the failedreservation=20 resulted in aborting the job. The log is rather long and contains entries for several different jobs,=20 some of them running simultaneously, and as there are no timestamps it's = hard to understand, but if you need it, I can mail it to you. Arno --=20 IT-Service Lehmann al...@it... Arno Lehmann http://www.its-lehmann.de |