Re: [Dar-support] dar without recursion
For full, incremental, compressed and encrypted backups or archives
From: Denis C. <dar...@fr...> - 2022-05-23 12:09:31
On 22/05/2022 at 22:12, Jean-Baptiste Denis wrote:
> Hi Per, Hi Denis,
>
> Per, thank you for the suggestion. I tried something approaching without luck before asking the list ;)
>
> Denis, thank you very much for the detailed explanation. Your first suggestion looks perfect to me. It's elegant
> and almost obvious after reading it ;) Do you consider this solution not perfect because of the (empty) directory
> creation, or something else? I can live with that.

Yes, exactly: you end up with empty directories. But if you can live with that, it is less imperfect... ;)

[...]

>> I don't see the use case of your requirement, can you develop?
>
> Here is the simplified story:
>
> - I've got two NAS (nas0 and nas1) that I access using NFSv3.
> - I need to back up nas0 onto nas1 (300 TBytes, 100+ million files), rebuild nas0 and restore its content from nas1.
> - nas0 is actively used until the downtime.
> - I need to minimize the downtime.
>
> My actual strategy is to run rsync multiple times until the scheduled downtime. I'm running multiple rsync
> processes in parallel from multiple servers with the help of fpart/fpsync (https://www.fpart.org/). fpsync creates
> sets of files (with a maximum total size or a maximum file number, whichever comes first). I do a last
> synchronization pass (using fpsync) during the downtime to catch up with the last modifications. I do what I have
> to do, and I restore using the same tools.
>
> It works reasonably well, and I've done multiple migrations using this strategy. But I've got a lot of small files
> (<1 KByte), sometimes millions of files at the same level in a single directory, sometimes nicely distributed in
> sub-directories. I've also got big files (sometimes with weird characters in the filenames ;)). Any tool using
> "posix" access will struggle in this situation. Splitting by number of files helps in such a case by preventing a
> single rsync process from handling 1 million 1K files (for a total of 1 GB).
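The fpart/fpsync splitting strategy quoted above could be sketched like this (the mount points, worker count and part limits below are illustrative assumptions, not values from this thread):

```shell
#!/bin/sh
# Sketch of the quoted fpsync strategy: split the source tree into parts
# bounded by both file count and total size, then sync the parts with
# several rsync workers in parallel. Paths and limits are assumptions.

SRC=/mnt/nas0/data          # NFS mount of the source NAS (assumed path)
DST=/mnt/nas1/backup        # NFS mount of the destination NAS (assumed path)

# -n 8   : up to 8 concurrent rsync workers
# -f N   : at most N files per part (caps the "millions of tiny files" case)
# -s N   : at most N bytes per part (caps the "few huge files" case)
fpsync -n 8 -f 100000 -s $((100 * 1024 * 1024 * 1024)) "$SRC/" "$DST/"

# Re-run as often as needed before the downtime; the final pass during the
# downtime only has to transfer what changed since the previous run.
```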
> With that said, using this method, I'm paying the small-file overhead twice: reading and writing those files
> during the backup from source to destination, and again during the restoration.
>
> dar looks like a tool I could use to cut this overhead in half: keeping the overhead during the read phase of the
> backup and the write phase of the restore. All other steps would deal with dar slices (writing "big" slices during
> backup and reading them during restore). dar also allows me to keep the iterative backup passes before the
> scheduled downtime.
>
> I'm just exploring ideas at the moment, and I do not have a definitive idea on how to make it work. Hence my
> questions about recursion, limiting the number of files per slice, and parallelism.

It seems your use of dar is more copy oriented (I mean the backup is used and destroyed right after creation) than
backup oriented (long-term storage in dar format). If so, and as you target dar for small files, you will gain some
CPU cycles without much penalty by disabling the sparse file feature (use "--sparse-file-min-size 0" for that).

You can also perform the backup and restoration without any intermediate storage requirement, using pipes through
ssh [1] or netcat [2] (for a lower CPU requirement).

[1] http://dar.linux.free.fr/doc/usage_notes.html#ssh
[2] http://dar.linux.free.fr/doc/usage_notes.html#netcat

Done that way, you should be able to saturate either the network bandwidth or the disk I/O, whichever is the most
limiting. dar will still compute CRCs and tape marks (in order to be able to read a dar backup from a pipe).

If you want to iterate the process incrementally, I suggest making on-fly isolated catalogues. That way you can
incrementally back up and restore over the network using locally stored extracted catalogues:

STEP 1, on nas0:

  dar -c - -R / -g directory -P "directory/*/*" --on-fly-isolate cat_full ...
  | ssh or netcat

and on nas1:

  dar -x - --sequential-read -R /some/where

STEP 2, a bit later on nas0:

  dar -c - -A cat_full -R / -g directory -P "directory/*/*" --on-fly-isolate cat_diff1 ... | ssh or netcat

and at the same time on nas1:

  dar -x - --sequential-read -R /some/where -w [or -r]

and so on.

> The devil is in the details of course, but your suggestion of a single-slice backup, maybe per directory, without
> recursion, looks like a good starting point to play with.

I suspect the bottleneck will be the network bandwidth (unless you have 10 Gbit/s or more); in that case, having
concurrent dar/backup processes on different data sets will not bring much more improvement than a single backup
made without intermediate storage, as described above.

My 2 cents.

> Thank you again for your initial answer !

You're welcome,

> Jean-Baptiste

Denis

[...]
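Putting the two steps above together, the incremental backup-over-pipe cycle might be scripted as follows. The hostname (nas1) and the restore path are placeholders; the dar options are the ones discussed above, with the suggested "--sparse-file-min-size 0" added:

```shell
#!/bin/sh
# Sketch of the incremental dar-over-ssh workflow described above.
# The hostname nas1 and the path /some/where are placeholder assumptions.

# STEP 1 -- full backup of "directory" (one level deep, no deeper recursion),
# streamed to nas1, keeping an isolated catalogue (cat_full) locally so the
# later differential backups need no access to the transferred archive:
dar -c - -R / -g directory -P "directory/*/*" \
    --sparse-file-min-size 0 --on-fly-isolate cat_full \
  | ssh nas1 'dar -x - --sequential-read -R /some/where'

# STEP 2 -- a bit later, a differential backup against the local catalogue;
# -w on the restore side allows overwriting files restored in STEP 1:
dar -c - -A cat_full -R / -g directory -P "directory/*/*" \
    --sparse-file-min-size 0 --on-fly-isolate cat_diff1 \
  | ssh nas1 'dar -x - --sequential-read -R /some/where -w'

# Repeat STEP 2 (with cat_diff1 as the new reference, producing cat_diff2,
# and so on) until the scheduled downtime, then run one last differential
# pass while nas0 is quiescent.
```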