From: Gauthier D. <gau...@de...> - 2008-03-31 16:01:11
Hello,
I have been setting up a new installation for the past two weeks, and I would
like to share my setup to get your feedback and, I hope, advice on a better
strategy. I also had a job crash that I cannot explain.
First, my setup:
The Director, named BACULA, runs on a virtualized Debian Etch machine
(32-bit) with 256 MB of RAM and MySQL 5 with a customized my.cnf (based on
my-large.cnf, with a few values divided by two). Packages: bacula-common
2.2.8-4~bpo40+1, bacula-console 2.2.8-4~bpo40+1, bacula-director-common
2.2.8-4~bpo40+1, bacula-director-mysql 2.2.8-4~bpo40+1.
The SD and FD run on AZURITE: Scientific Linux 5 (64-bit), 16 GB of memory,
two quad-core Xeons, with bacula-mtx-2.2.8-2 and bacula-mysql-2.2.8-2.
AZURITE is a Lustre client; the filesystem to back up is 20 TB (currently
only 10 TB used) and holds more than 10 million files.
The library is an Overland NEO 2000 with an FCO3 card and two HP LTO-3 960
drives (to be exchanged for LTO-4 tomorrow), also connected via FC. The FCO3
and the two LTO-3 drives are connected through an FC switch to the Emulex HBA
on AZURITE. The HBA link speed is 4 Gb/s and both LTO-3 links run at 2 Gb/s.
Lustre filesystem performance is impressive when working with large files and
several clients, but in our case there are millions of small files (source
code) to back up, and read performance can drop to 10 MB/s for most folders.
I divided the whole filesystem into three jobs and created two pools for data
spooling (on the same Lustre FS). I am facing two problems: low read
performance (into the spool), then low performance during despooling
(65 MB/s without compression, 40 MB/s with compression).
I have attached my Director and SD configuration files in case someone sees
something wrong in them.
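To check whether the slow spool phase is really small-file read speed on
Lustre rather than Bacula itself, one simple test is to stream-read a
representative source tree and time it. This is only a sketch: SRC is a
placeholder path, and the loop just builds a stand-in tree of 4 KB files so
the commands can be tried anywhere before pointing them at the real mount.

```shell
# Placeholder: point SRC at a representative source-code folder on Lustre.
SRC=${SRC:-/tmp/manyfiles}
# Stand-in tree of 100 small files so the test is self-contained.
mkdir -p "$SRC"
for i in $(seq 1 100); do head -c 4096 /dev/zero > "$SRC/f$i"; done
# Stream-read every file once without paying any write cost; tree size
# divided by elapsed time gives the per-folder read rate Bacula would see.
# (Note: "tar cf /dev/null" would NOT read file data -- GNU tar special-cases
# a /dev/null archive -- so pipe to /dev/null instead.)
time tar cf - "$SRC" 2>/dev/null | cat > /dev/null
```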
As an example of the performance, see the error message at the end of this
email, and also this excerpt from the current status:
Running Jobs:
JobId 23 Job lustreArgile.2008-03-28_22.33.33 is running.
Backup Job started: 28-Mar-08 22:32
Files=3,403,844 Bytes=4,195,243,945,301 Bytes/sec=17,631,075 Errors=0
Files Examined=3,403,844
Processing file: /mnt/lustre/home/argile/somedata........
SDReadSeqNo=5 fd=12
Director connected at: 31-Mar-08 17:38
====
Here is a small sample of the filesystem performance in the spool folder:
************************
[root@azurite test]# dd if=/dev/zero of=10G bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB) copied, 43.1407 seconds, 243 MB/s
[root@azurite test]# date&&cp 10G 10G.1&&sync && date
Mon Mar 31 17:32:32 CEST 2008
Mon Mar 31 17:33:19 CEST 2008
[root@azurite test]# ls -lh
total 20G
-rw-r--r-- 1 root root 9.8G Mar 31 17:32 10G
-rw-r--r-- 1 root root 9.8G Mar 31 17:33 10G.1
************************
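For comparison, the cp test above moved 10,485,760,000 bytes between 17:32:32
and 17:33:19, i.e. simultaneous read and write on the same FS over 47
seconds. A quick back-of-the-envelope check:

```shell
# 10485760000 bytes copied (read + write together) in 47 seconds
echo "$((10485760000 / 47 / 1000000)) MB/s"   # prints "223 MB/s"
```

So sequential throughput on the spool area looks healthy and well above the
65 MB/s despool rate; the spool disk itself does not seem to be the
sequential bottleneck.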
I also had a job crash, and I am still wondering why. Here are a few lines
from the log:
28-mar 00:44 bacula-dir JobId 16: No prior Full backup Job record found.
28-mar 00:44 bacula-dir JobId 16: No prior or suitable Full backup found in catalog. Doing FULL backup.
28-mar 00:45 bacula-dir JobId 16: Start Backup JobId 16, Job=lustre.2008-03-28_00.44.05
28-mar 00:45 bacula-dir JobId 16: Using Device "Drive-1"
28-mar 00:44 azurite-sd JobId 16: Spooling data ...
....
30-mar 01:16 azurite-sd JobId 16: Spooling data again ...
30-mar 03:12 azurite JobId 16: Fatal error: backup.c:1051 Network send error to SD. ERR=Connection reset by peer
30-mar 03:12 azurite JobId 16: Error: bsock.c:306 Write error sending 11 bytes to Storage daemon:azurite.andra.fr:9103: ERR=Connection reset by peer
30-mar 03:14 bacula-dir JobId 16: Error: Bacula bacula-dir 2.2.8 (26Jan08): 30-mar-2008 03:14:12
Build OS: i486-pc-linux-gnu debian 4.0
JobId: 16
Job: lustre.2008-03-28_00.44.05
Backup Level: Full (upgraded from Incremental)
Client: "azurite-fd" 2.2.8 (26Jan08) x86_64-redhat-linux-gnu,redhat,
FileSet: "lustre" 2008-03-28 00:44:58
Pool: "Default" (From Job resource)
Storage: "Autochanger" (From Job resource)
Scheduled time: 28-mar-2008 00:44:57
Start time: 28-mar-2008 00:45:00
End time: 30-mar-2008 03:14:12
Elapsed time: 2 days 1 hour 29 mins 12 secs
Priority: 10
FD Files Written: 5,562,241
SD Files Written: 0
FD Bytes Written: 1,963,406,238,087 (1.963 TB)
SD Bytes Written: 0 (0 B)
Rate: 11021.0 KB/s
Software Compression: None
VSS: no
Storage Encryption: no
Volume name(s): KN9884L3|KN9897L3|KN9891L3
Volume Session Id: 6
Volume Session Time: 1206660491
Last Volume Bytes: 431,657,017,344 (431.6 GB)
Non-fatal FD errors: 1
SD Errors: 0
FD termination status: Error
SD termination status: Error
Termination: *** Backup Error ***
I also saw another error related to the library in the log, for a different backup that is still running:
30-mar 12:24 azurite-sd JobId 23: End of medium on Volume "KN9892L3" Bytes=889,312,693,248 Blocks=13,785,228 at 30-Mar-2008 12:24.
30-mar 12:24 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 7, drive 1" command.
30-mar 12:24 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 6, drive 0" command.
30-mar 12:25 azurite-sd JobId 23: 3304 Issuing autochanger "load slot 6, drive 1" command.
30-mar 12:26 azurite-sd JobId 23: 3305 Autochanger "load slot 6, drive 1", status is OK.
30-mar 12:26 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:26 azurite-sd JobId 23: 3991 Bad autochanger "loaded? drive 1" command: ERR=Child exited with code 1.
Results=mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
READ ELEMENT STATUS Command Failed
30-mar 12:26 azurite-sd JobId 23: Volume "KN9891L3" previously written, moving to end of data.
30-mar 12:27 azurite-sd JobId 23: Error: Bacula cannot write on tape Volume "KN9891L3" because:
The number of files mismatch! Volume=432 Catalog=431
30-mar 12:27 azurite-sd JobId 23: Marking Volume "KN9891L3" in Error in Catalog.
30-mar 12:29 bacula-dir JobId 23: Using Volume "KN9898L3" from 'Scratch' pool.
30-mar 12:27 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:27 azurite-sd JobId 23: 3302 Autochanger "loaded? drive 1", result is Slot 6.
30-mar 12:27 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 6, drive 1" command.
30-mar 12:27 azurite-sd JobId 23: 3304 Issuing autochanger "load slot 8, drive 1" command.
30-mar 12:28 azurite-sd JobId 23: 3305 Autochanger "load slot 8, drive 1", status is OK.
30-mar 12:28 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:28 azurite-sd JobId 23: 3991 Bad autochanger "loaded? drive 1" command: ERR=Child exited with code 1.
Results=mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
Thanks for any comments or help (and of course for this nice piece of software :-) )
Gauthier