Re: [Bacula-devel] Bacula 1.38.3 beta status

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hello,

On 1/4/2006 12:13 PM, Kern Sibbald wrote:
> Hello Arno and everyone,
>=20
> On Tuesday 03 January 2006 23:51, Arno Lehmann wrote:
>=20
>>Hello,
>>
>>I just saw that Kern seems to be back... I hope you had a nice vacation=
!
>=20
>=20
> Yes, that is an accurate description -- I am sort of back -- lots of li=
ttle=20
> errands to run before I settle into doing someting productive for Bacul=
a=20
> later this afternoon ...

Great, then I've got just the right thing *now* ;-)

Unfortunately, and sorry for the trouble, but my jobs failed again=20
without need... you find the details below.

>>Well, I assume you're already :-) busy with this project again, so I
>>thought I'd give some more details concerning my earlier problem report=
s.
>>
>>Simple things:
>>- Performance is back to normal
>>- Concurrent jobs now work. This might be magic, but I suppose that the=

>>situation which resulted in errors, has not happened again. I saw jobs
>>held because the volumes from their pool wasn't available, which is
>>something that failed with an earlier 1.38.3beta version.
>>- I'm running 1.38.3 beta 22 Dec in production mode since about Dec 28,=

>>and I had no problems with it (unless I blame hardware problems to
>>Bacula...). I didn't try any tricks yet, like manually starting jobs
>>which, would want different volumes in one device. I'll try that as soo=
n
>>as I finished my (currently running) full backups.
>>- The problems I reported on or about Dec 28 didn't happen again, but I=

>>still have got the corresponding SD debug logs. If I can reproduce the
>>issue I'll see if I can find anything enlighening in them.
>=20
>=20
> Oh, this is really good news.  It means that I can now release 1.38.3 t=
o=20
> production, and start working on the little unfinished details for 1.38=
=2E4,=20
> then on to doing our new projects list ...

I suggest you look at the follow problem description... I still think=20
there's something to be fixed in the code :-(

> While on vacation, I think I worked out a solution to a nasty bug that =
has=20
> been reported by Daniel Holtkamp, which causes a seg fault when the wat=
ch dog=20
> kicks in and cancels overdue jobs=20

great - I *think* I saw that one, too, but I had other things to track do=
wn.

>...  Once I get 1.38.3 released, it is the=20
> first thing that I will implement as part of 1.38.4 ...
>=20
>=20
>>
>>And, of course, a happy new year to all of you!
>=20
>=20
> Yes, and I hope that everyone had as nice a New Year as I did.  I was i=
n=20
> Vienna and had the good fortune to attend the Viennea Philharmonic=20
> presentation on the 30th, and even progress a little bit with my rudime=
ntary=20
> German :-)

Das freut mich. Wie steht's mit dem Besuch in M=FCnchen? Or, in english, =

I'm glad to hear that, and what's about the visit to Munich?

Well, then, the not-so-nice things.

First, the situation:

I still use two autochangers. The basic setup is one DLT autochanger for =

Full backups. This one is only operated with volumes from pool Full. The =

other one is a DDS autochanger with magazines for 6 cartridges. I use=20
one magazine with tapes from pool Incr, and another one is loaded with=20
volumes for pool Diff. The MediaTypes are "DLTIV" for the DLT tapes and=20
"DDS2" for the DDS ones.

Usually, I have the Incr volumes in the DDS autoloader. On Tuesday and=20
Wednesday, I run weekly differential backups, so I use the folowing=20
schedule:

Mon: Incr backups for all jobs start at 8:25
Tuesday: at 8:20, start Incr backups for windows clients
          at 8:25, start Diff backups for linux clients.
Wednesday: at 8:20, start Diff backups for windows clients
            at 8:25, start Incr backups for linux clients.

This is simplified, but represents what I actually do. The idea is to=20
only change magazines twice: Once Incrementals on Tuesday are done, I=20
want to oad the mag with Diff pool tapes, and on Wednesday, I swap them=20
back.

I always do the mag change like this: "umount HPDAT", exchange=20
magazines, "update slots scan storage=3DHPDAT", "mount HPDAT". I assume=20
that this has nothing to do with the problems I observe.

Yesterday, due to a hardware problem, I started the Backups manually. I=20
started some full backups, but these don't matter here, and I ran some=20
jos to Diff pool.

Everything worked fine.

Today, the setup was an "inverse" Wednesday one: Volumes from Incr pool=20
available, jobs are started in the expected order.

First, the jobs for Diff got started. This job, naturally, required the=20
pool Diff. I got the mount request, but it took me a while to start=20
working on the backups, so the whole system had to wait a while.

Then, later:

> 04-Jan 08:25 goblin-dir: Start Backup JobId 4692, Job=3DGoblin.2006-01-=
04_08.25.00
> 04-Jan 08:44 goblin-sd: Failed command: 1000 OK VolName=3DDAT-120-0001 =
VolJobs=3D4 VolFiles=3D4 VolBlocks=3D43911 VolBytes=3D2832697280 VolMount=
s=3D44 VolErrors=3D0 VolWrites=3D1241890 MaxVolBytes=3D0 VolCapacityBytes=
=3D0 VolStatus=3DAppend Slot=3D1 MaxVolJobs=3D0 MaxVolFiles=3D0 InChanger=
=3D1 VolReadTime=3D0 VolWriteTime=3D0 EndFile=3D0 EndBlock=3D0 VolParts=3D=
0 LabelType=3D0
>=20
> 04-Jan 08:44 goblin-sd: Goblin.2006-01-04_08.25.00 Fatal error:=20
>      Device "HP DAT" with MediaType "DDS2" requested by DIR not found i=
n SD Device resources.
> 04-Jan 08:44 goblin-dir: Goblin.2006-01-04_08.25.00 Error: Bacula 1.38.=
3 (22Dec05): 04-Jan-2006 08:44:45
>   JobId:                  4692

This is from the job report for the first job that failed. Still the=20
same I already observed, I fear.

Excerpts from the debug log follow.

This is the first occurence of that job in todays log, I assume this was =

at about the start time (8:25):

> goblin-sd: dircmd.c:187 Message channel init completed.
> goblin-sd: job.c:70 <dird: JobId=3D4692 job=3DGoblin.2006-01-04_08.25.0=
0 job_name=3DGoblin client_
> name=3Dgoblin-fd type=3D66 level=3D73 FileSet=3DLinuxGoblin NoAttr=3D0 =
SpoolAttr=3D0 FileSetMD5=3Dl4Uaa3+C
> V7xORjcBJ6hIPC SpoolData=3D1 WritePartAfterJob=3D0 PreferMountedVols=3D=
1
> goblin-sd: job.c:123 >dird: 3000 OK Job SDid=3D11 SDtime=3D1136284947 A=
uthorization=3DALCP-MODF-PI
> JO-CAEF-LILK-PGGO-JACC-CIMA
> goblin-sd: pythonlib.c:216 No startup module.
> goblin-sd: reserve.c:319 <dird: use storage=3DHPDAT media_type=3DDDS2 p=
ool_name=3DIncr pool_type=3DB
> ackup append=3D1 copy=3D0 stripe=3D0
> goblin-sd: reserve.c:342 <dird device: use device=3DHP^ADAT
> goblin-sd: reserve.c:416 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D=
0 any=3D0
> goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D=
0

Later, I find:

> goblin-sd: block.c:728 Leave terminate_writing_volume -- OK
> goblin-sd: dev.c:1312 bsf_dev
> goblin-sd: dev.c:1414 bsr_dev
> goblin-sd: device.c:87 Enter fixup_device_block_write_error
> goblin-sd: dev.c:621 rewind res=3D0 fd=3D8 "Quantum DLT 0" (/dev/nst2)
> goblin-sd: reserve.c:755 reserve_append device is tape
> goblin-sd: reserve.c:786 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D=
0 any=3D0
> goblin-sd: reserve.c:819 failed: want mounted -- no vol JobId=3D4691
> goblin-sd: reserve.c:759 can_reserve_drive!=3D1
> goblin-sd: reserve.c:669 Not OK.
> goblin-sd: reserve.c:540 Search res for HP DAT
> goblin-sd: reserve.c:548 Try changer device HP DAT 0
> goblin-sd: reserve.c:627 Try reserve HP DAT 0 jobid=3D4692
> goblin-sd: askdir.c:242 >dird: CatReq Job=3DGoblin.2006-01-04_08.25.00 =
FindMedia=3D1 pool_name=3DI
> ncr media_type=3DDDS2
> goblin-sd: reserve.c:424 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D=
0 any=3D0
> goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D=
0
> goblin-sd: askdir.c:165 <dird 1000 OK VolName=3DDAT-120-0001 VolJobs=3D=
4 VolFiles=3D4 VolBlocks=3D43
> 911 VolBytes=3D2832697280 VolMounts=3D44 VolErrors=3D0 VolWrites=3D1241=
890 MaxVolBytes=3D0 VolCapacity
> Bytes=3D0 VolStatus=3DAppend Slot=3D1 MaxVolJobs=3D0 MaxVolFiles=3D0 In=
Changer=3D1 VolReadTime=3D0 VolWrit
> eTime=3D0 EndFile=3D0 EndBlock=3D0 VolParts=3D0 LabelType=3D0
> goblin-sd: reserve.c:643 Looking for Volume=3DDAT-120-0001
> goblin-sd: reserve.c:755 reserve_append device is tape
> goblin-sd: reserve.c:786 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D=
0 any=3D0
> goblin-sd: reserve.c:819 failed: want mounted -- no vol JobId=3D4692
> goblin-sd: reserve.c:759 can_reserve_drive!=3D1
> goblin-sd: reserve.c:669 Not OK.
> goblin-sd: reserve.c:424 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D=
0 any=3D0
> goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D=
0
> goblin-sd: reserve.c:540 Search res for HP DAT
> goblin-sd: reserve.c:548 Try changer device HP DAT 0
> goblin-sd: reserve.c:627 Try reserve HP DAT 0 jobid=3D4693
> goblin-sd: askdir.c:242 >dird: CatReq Job=3DElfSys.2006-01-04_08.25.01 =
FindMedia=3D1 pool_name=3DI
> ncr media_type=3DDDS2
> goblin-sd: askdir.c:165 <dird 1000 OK VolName=3DDAT-120-0001 VolJobs=3D=
4 VolFiles=3D4 VolBlocks=3D43
> 911 VolBytes=3D2832697280 VolMounts=3D44 VolErrors=3D0 VolWrites=3D1241=
890 MaxVolBytes=3D0 VolCapacity
> Bytes=3D0 VolStatus=3DAppend Slot=3D1 MaxVolJobs=3D0 MaxVolFiles=3D0 In=
Changer=3D1 VolReadTime=3D0 VolWrit
> eTime=3D0 EndFile=3D0 EndBlock=3D0 VolParts=3D0 LabelType=3D0
> goblin-sd: reserve.c:643 Looking for Volume=3DDAT-120-0001
> goblin-sd: reserve.c:755 reserve_append device is tape
> goblin-sd: reserve.c:786 PrefMnt=3D1 exact=3D1 suitable=3D0 chgronly=3D=
0 any=3D0
> goblin-sd: reserve.c:819 failed: want mounted -- no vol JobId=3D4693
> goblin-sd: reserve.c:759 can_reserve_drive!=3D1
> goblin-sd: reserve.c:669 Not OK.
> goblin-sd: reserve.c:424 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D=
0 any=3D0
> goblin-sd: reserve.c:499 PrefMnt=3D1 exact=3D0 suitable=3D0 chgronly=3D=
0
> goblin-sd: reserve.c:540 Search res for HP DAT
> goblin-sd: reserve.c:548 Try changer device HP DAT 0
> goblin-sd: reserve.c:627 Try reserve HP DAT 0 jobid=3D4695

In the above snippet, I can't determine which lines relate to the job in =

question, though. Still I think you can see that reserving the device=20
didn't work: can_reserve_drive!=3D1 and Not OK show that.

There are no further occurences of the job name=20
Goblin.2006-01-04_08.25.00, so I assume that the failedreservation=20
resulted in aborting the job.

The log is rather long and contains entries for several different jobs,=20
some of them running simultaneously, and as there are no timestamps it's =

hard to understand, but if you need it, I can mail it to you.

Arno

--=20
IT-Service Lehmann                    al...@it...
Arno Lehmann                  http://www.its-lehmann.de