Re: [Dar-support] Incremental backup of large amount of files is very slow
From: J. R. <jo...@an...> - 2022-02-26 08:29:41
On Friday, February 25, 2022 6:54:49 PM CET Denis Corbin wrote:
> Le 24/02/2022 à 21:38, J. Roeleveld via Dar-support a écrit :
> > Hi all,
>
> Hi Joost,
>
> > Is there a way to speed up incremental backups of directories with
> > large amounts of files?
>
> First things first: could you check that you have
> "Large dir. speed optimi." set to "YES" in the compile-time features
> shown by issuing 'dar -V'?

It's enabled. For completeness:

 dar version 2.7.1, Copyright (C) 2002-2021 Denis Corbin
   Long options support          : YES

 Using libdar 6.4.0 built with compilation time options:
   gzip compression (libz)       : YES
   bzip2 compression (libbzip2)  : YES
   lzo compression (liblzo2)     : NO
   xz compression (liblzma)      : YES
   zstd compression (libzstd)    : YES
   lz4 compression (liblz4)      : NO
   Strong encryption (libgcrypt) : YES
   Public key ciphers (gpgme)    : YES
   Extended Attributes support   : YES
   Large files support (> 2GB)   : YES
   ext2fs NODUMP flag support    : YES
   Integer size used             : 64 bits
   Thread safe support           : YES
   Furtive read mode support     : YES
   Linux ext2/3/4 FSA support    : YES
   Mac OS X HFS+ FSA support     : NO
   Linux statx() support         : YES
   Detected system/CPU endian    : little
   Posix fadvise support         : YES
   Large dir. speed optimi.      : YES
   Timestamp read accuracy       : 1 nanosecond
   Timestamp write accuracy      : 1 nanosecond
   Restores dates of symlinks    : YES
   Multiple threads (libthreads) : NO
   Delta compression (librsync)  : NO
   Remote repository (libcurl)   : NO
   argon2 hashing (libargon2)    : NO

> > Backing up my mail storage (1 file per email) takes a very long
> > time. I'm not sure what it's doing, but as it takes a long time
> > before it actually starts writing the backup file, I feel it's
> > checking the content of every file against the catalogue.
>
> Only metadata is used to decide whether or not to back up a given
> file. The file content is then read only if the decision to back it
> up has been taken.
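For reference, the compile-time flag can be checked without reading the whole feature list. A minimal sketch; it assumes dar is on the PATH, and the expected line from the output above is inlined here so the check itself stays runnable:

```shell
# With dar installed you would run:
#   dar -V | grep 'Large dir. speed optimi.'
# The matching line from the output above is inlined for illustration:
line='Large dir. speed optimi.      : YES'
case "$line" in
  *YES) echo "large-directory optimization: enabled" ;;
  *)    echo "large-directory optimization: disabled" ;;
esac
```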
> When using large directories, and this is the reason why the large
> directory speed optimization was added some years ago, in addition to
> reading each directory's content only once and storing the metadata
> associated with each file in memory, as usually done, this data is
> also indexed in a sorted list for fast lookup (the lookup used when
> comparing each file's metadata to its previous status).
>
> Searching a sorted list has logarithmic complexity, so it is quite
> efficient (this search algorithm implementation relies on the
> standard C++ library). For the rest, the process has the same cost
> per file, whatever the number of files per directory.
>
> Points to consider are thus:
> - does your system struggle for disk I/O?
> - does your system struggle for CPU load?
> - does your system struggle for memory (and start swapping)?
> - if backing up over the network, is network congestion occurring?

It's actually running directly on the NAS; there is overcapacity on
I/O, CPU and memory. Swap is present, but never used.

> Depending on what you report, we can investigate in the appropriate
> direction.
>
> > If this is the case, is there a way to have dar only compare the
> > name (if it exists) and file size?
>
> This is already the case: the metadata is gathered in a single system
> call (from the stat() family) that returns the whole file's metadata
> structure at once: the file type, file size if appropriate, dates
> (mtime, atime, ctime, birthtime if available), permissions, ... Dar
> uses most of this to decide whether a file has changed or not (it
> does not use the atime, for example). Anyway, I cannot see how to do
> this with less than that, or any faster.

Ok, guess I'll need to live with it then. It's only this filesystem
that is taking a long time. The files are all numbered, with numbers
increasing when a new mail is added to the folder. There are currently
2.5 million inodes in use for just 171GB of storage, which means 71K
per file on average.
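The four bottleneck checks above can be run from a shell on the NAS while the backup is in progress. A rough sketch, assuming a Linux system with the usual /proc interface (iostat and ss are left commented out since sysstat/iproute2 may not be installed on a NAS):

```shell
# Memory headroom and swap activity (Linux /proc interface):
grep -E 'MemAvailable|SwapFree' /proc/meminfo
# CPU load averages over 1/5/15 minutes:
cat /proc/loadavg
# Disk utilization would come from e.g. 'iostat -x 5' (sysstat package),
# and network congestion from 'ss -ti'; both commented out here:
# iostat -x 5
# ss -ti
```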
> > I am willing to risk minor losses for most of the incrementals. I
> > would be doing regular "masters" where this option would not be
> > used.
>
> Some features that have an impact on performance:
> - compression: algorithm, compression level, block/stream mode,
>   number of threads (see -z and -G options)
> - ciphering: algorithm, number of threads used (see -K/-J and -G
>   options)
> - if disk I/O is not the problem, you can disable the lookup for
>   sparse files (which, depending on the data under backup, can
>   however save a lot of storage space, in particular at restoration
>   time, something compression alone cannot do). See the
>   --sparse-file-min-size option
> - you could disable tape marks (see the -at option) to reduce CPU
>   usage (if this was the point of contention) at the cost of not
>   being able to repair a truncated backup, losing a redundancy level
>   of information about the backup content (the catalogue), and losing
>   the ability to read the backup sequentially (only direct access
>   would be available, which is fast, but may not fit all needs, like
>   storing backups on tape).

I'll have a look at these options and see if any of them improve the
time it takes.

Thanks,
Joost
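As one way to try those knobs together, a hypothetical incremental invocation might look like the sketch below. The archive basenames and the source path are invented for illustration, and the exact option syntax should be double-checked against the man page of the installed dar version; the script only prints the command so the flag set can be reviewed before running it.

```shell
#!/bin/sh
# Hypothetical incremental dar run exercising the options discussed above;
# archive basenames and paths are made up.
#   -c  create a new (incremental) archive
#   -A  reference catalogue (the full "master" backup)
#   -R  root of the tree to back up
#   -z  compression algorithm and level
#   --sparse-file-min-size 0  disables the sparse-file lookup
#   -at disables tape marks (with the caveats listed above)
set -- dar \
  -c /backup/mail_incr \
  -A /backup/mail_full \
  -R /srv/mail \
  -z zstd:3 \
  --sparse-file-min-size 0 \
  -at
# Print instead of execute, so the flags can be inspected first:
echo "$@"
```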