Re: [Dar-support] Incremental backup of large amount of files is very slow
From: J. R. <jo...@an...> - 2022-02-26 08:29:41
On Friday, February 25, 2022 6:54:49 PM CET Denis Corbin wrote:
> Le 24/02/2022 à 21:38, J. Roeleveld via Dar-support a écrit :
> > Hi all,
>
> Hi Joost,
>
> > Is there a way to speed up incremental backups of directories with
> > large amounts of files?
>
> First things first: could you check that you have
> "Large dir. speed optimi." set to "YES" in the compile-time features
> shown by issuing 'dar -V'?

It's enabled. For completeness:

 dar version 2.7.1, Copyright (C) 2002-2021 Denis Corbin
   Long options support          : YES

 Using libdar 6.4.0 built with compilation time options:
   gzip compression (libz)       : YES
   bzip2 compression (libbzip2)  : YES
   lzo compression (liblzo2)     : NO
   xz compression (liblzma)      : YES
   zstd compression (libzstd)    : YES
   lz4 compression (liblz4)      : NO
   Strong encryption (libgcrypt) : YES
   Public key ciphers (gpgme)    : YES
   Extended Attributes support   : YES
   Large files support (> 2GB)   : YES
   ext2fs NODUMP flag support    : YES
   Integer size used             : 64 bits
   Thread safe support           : YES
   Furtive read mode support     : YES
   Linux ext2/3/4 FSA support    : YES
   Mac OS X HFS+ FSA support     : NO
   Linux statx() support         : YES
   Detected system/CPU endian    : little
   Posix fadvise support         : YES
   Large dir. speed optimi.      : YES
   Timestamp read accuracy       : 1 nanosecond
   Timestamp write accuracy      : 1 nanosecond
   Restores dates of symlinks    : YES
   Multiple threads (libthreads) : NO
   Delta compression (librsync)  : NO
   Remote repository (libcurl)   : NO
   argon2 hashing (libargon2)    : NO

> > Backing up my mail storage (1 file per email) takes a very long
> > time. I'm not sure what it's doing, but as it takes a long time
> > before it actually starts writing the backup file, I feel it's
> > checking the content of every file against the catalogue.
>
> Only metadata is used to decide whether or not to back up a given
> file. The file content is then read only if the decision to back it
> up has been taken.
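For reference, the compile-time flag can be checked without reading the whole feature list. A minimal sketch; it assumes dar is on the PATH, and the expected line from the output above is inlined here so the check itself stays runnable:

```shell
# With dar installed you would run:
#   dar -V | grep 'Large dir. speed optimi.'
# The matching line from the output above is inlined for illustration:
line='Large dir. speed optimi.      : YES'
case "$line" in
  *YES) echo "large-directory optimization: enabled" ;;
  *)    echo "large-directory optimization: disabled" ;;
esac
```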
> When using large directories, and this is the reason why the large
> directory speed optimization was added some years ago, in addition to
> reading each directory's content only once and storing the metadata
> associated with each file in memory, as usually done, this data is
> also indexed in a sorted list for fast lookup (the lookup used when
> comparing each file's metadata to its previous status).
>
> Searching a sorted list has logarithmic complexity, so it is quite
> efficient (this search algorithm implementation relies on the
> standard C++ library). For the rest, the process has the same cost
> per file, whatever the number of files per directory.
>
> Points to consider are thus:
> - does your system struggle for disk I/O?
> - does your system struggle for CPU load?
> - does your system struggle for memory (and start swapping)?
> - if backing up over the network, is network congestion occurring?

It's actually running directly on the NAS; there is overcapacity on
I/O, CPU and memory. Swap is present, but never used.

> Depending on what you report, we can investigate in the appropriate
> direction.
>
> > If this is the case, is there a way to have dar only compare the
> > name (if it exists) and file size?
>
> This is already the case: the metadata is gathered in a single system
> call (from the stat() family) that returns the whole file's metadata
> structure at once: the file type, file size if appropriate, dates
> (mtime, atime, ctime, birthtime if available), permissions, ... Dar
> uses most of this to decide whether a file has changed or not (it
> does not use the atime, for example). Anyway, I cannot see how to do
> this with less than that, or any faster.

Ok, guess I'll need to live with it then. It's only this filesystem
that is taking a long time. The files are all numbered, with numbers
increasing when a new mail is added to the folder. There are currently
2.5 million inodes in use for just 171GB of storage, which means 71K
per file on average.
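The four bottleneck checks above can be run from a shell on the NAS while the backup is in progress. A rough sketch, assuming a Linux system with the usual /proc interface (iostat and ss are left commented out since sysstat/iproute2 may not be installed on a NAS):

```shell
# Memory headroom and swap activity (Linux /proc interface):
grep -E 'MemAvailable|SwapFree' /proc/meminfo
# CPU load averages over 1/5/15 minutes:
cat /proc/loadavg
# Disk utilization would come from e.g. 'iostat -x 5' (sysstat package),
# and network congestion from 'ss -ti'; both commented out here:
# iostat -x 5
# ss -ti
```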
> > I am willing to risk minor losses for most of the incrementals. I
> > would be doing regular "masters" where this option would not be
> > used.
>
> Some features that have an impact on performance:
> - compression: algorithm, compression level, block/stream mode,
>   number of threads (see -z and -G options)
> - ciphering: algorithm, number of threads used (see -K/-J and -G
>   options)
> - if disk I/O is not the problem, you can disable the lookup for
>   sparse files (which, depending on the data under backup, can
>   however save a lot of storage space, in particular at restoration
>   time, something compression alone cannot do). See the
>   --sparse-file-min-size option
> - you could disable tape marks (see the -at option) to reduce CPU
>   usage (if this was the point of contention) at the cost of not
>   being able to repair a truncated backup, losing a redundancy level
>   of information about the backup content (the catalogue), and losing
>   the ability to read the backup sequentially (only direct access
>   would be available, which is fast, but may not fit all needs, like
>   storing backups on tape).

I'll have a look at these options and see if any of them improve the
time it takes.

Thanks,
Joost
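As one way to try those knobs together, a hypothetical incremental invocation might look like the sketch below. The archive basenames and the source path are invented for illustration, and the exact option syntax should be double-checked against the man page of the installed dar version; the script only prints the command so the flag set can be reviewed before running it.

```shell
#!/bin/sh
# Hypothetical incremental dar run exercising the options discussed above;
# archive basenames and paths are made up.
#   -c  create a new (incremental) archive
#   -A  reference catalogue (the full "master" backup)
#   -R  root of the tree to back up
#   -z  compression algorithm and level
#   --sparse-file-min-size 0  disables the sparse-file lookup
#   -at disables tape marks (with the caveats listed above)
set -- dar \
  -c /backup/mail_incr \
  -A /backup/mail_full \
  -R /srv/mail \
  -z zstd:3 \
  --sparse-file-min-size 0 \
  -at
# Print instead of execute, so the flags can be inspected first:
echo "$@"
```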