Re: [Dar-support] MAYBE BUG in dar 2.7.2 with hardlinks handling
From: Denis C. <dar...@fr...> - 2021-12-08 12:52:03
On 08/12/2021 12:14, Alexey Loukianov via Dar-support wrote:
> On 04.12.2021 20:39, Denis Corbin wrote:
>> I'm not sure to follow what you plan to do, but I guess it is
>> possible...
>
> Based on your answers I guess that you are probably missing the main
> point of "my plan" here. The idea was to introduce a new mode of
> "filesystem traversal" in which we totally get rid of real FS
> traversal and instead get the list of files to work with from some
> "external source". Then we apply filters in the usual manner based on
> include/exclude rules, etc.
>
> The expected use pattern would look like this:
>
> # find <backup root dir> <filtering options> | sort -u > file-list-to-include-into-backup.txt
> # dar -R <backup root dir> --files-to-consider-list file-list-to-include-into-backup.txt -c <new backup name> -A <previous backup> .....
>
> This way the user would be in full control of the ordering of files to
> back up, so it would be possible to make sure that files from the
> "older" subdirectory come before files from the "new" subdir, as in my
> failing use case described in an earlier message.

OK, I understand. But things are not implemented that way. The provided
list of files is just yet another filtering mechanism; it is not an
instruction telling dar which files to back up, nor in which order. The
engine under the hood is still the same:

- open a directory, read and store its content at once, in the order
  provided by the system
- parse the obtained listing and loop over each entry:
  - apply the filters to the current entry (-g/-P/-X/-I/-[/-]/-u/-U/...)
  - if it passes the filters:
    - check the backup of reference (if present) for that same entry
    - modify the saving scheme (full/inode-only/just a placeholder)
    - save what has to be saved (abstracted action, this is more complex)
  - else:
    - put a temporary "ignored" entry in the catalogue
  - end if
  - if the entry was a directory, recurse into it (recursive call)
  - loop to the next entry of the current directory listing
- at the end of the backup, check all entries that are present in the
  backup of reference but not in the backup under process, and add a
  "deleted" entry for each (if a file has been filtered out, the added
  "ignored" entry avoids having it considered as removed from the
  backup of reference)
- remove the "ignored" entries from the catalogue
- write down the catalogue at the end of the backup

>> I tried playing with the merging operation ... If the new backup does
>> not contain the old hardlinked inodes, at restoration time the hard
>> links may not be restored (the hard link is missing in the restored
>> directory): this is because the order in which the directory content
>> was returned by the operating system is kept in the backup...
>
> Tried to do the same, and the observation is that for a hardlink to be
> restored, two conditions should be met:
>
> 1. The target file to hardlink to should exist.
> 2. The archive entry pointing to the target file to be hardlinked to
>    (i.e. one that is not "[Saved]" and is inside a dir that is not
>    "[Saved]" either) should come before any other entry which is a
>    hardlink to the same inode residing in a dir that is "[Saved]".
>
> Let me illustrate with an example. Given that we have a file named
> "79/abc.txt" in place on the disk.
> Then with an archive looking like this:
>
> [Data ][D][ EA ][FSA][Compr][S]| Permission | User | Group | Size   | Date                     | filename
> --------------------------------+------------+-------+-------+---------+-------------------------------+------------
> [     ][-]   [---][     ][ ]      drwxr-xr-x   root   root    0        Fri Dec  3 02:14:30 2021   79
> [     ][ ]   [---][-----][ ]     *-rw-r--r--   root   root    15 kio   Fri Dec  3 02:14:30 2021   79/abc.txt [0]
> [Saved][-]   [---][     ][ ]      drwxr-xr-x   root   root    0        Fri Dec  3 02:14:30 2021   80
> [     ][ ]   [---][-----][ ]     *-rw-r--r--   root   root    15 kio   Fri Dec  3 02:14:30 2021   80/abc.txt [0]
>
> ... hardlink 80/abc.txt will be restored.
>
> But in case the archive lists like this:
>
> [Data ][D][ EA ][FSA][Compr][S]| Permission | User | Group | Size   | Date                     | filename
> --------------------------------+------------+-------+-------+---------+-------------------------------+------------
> [Saved][-]   [---][     ][ ]      drwxr-xr-x   root   root    0        Fri Dec  3 02:14:30 2021   80
> [     ][ ]   [---][-----][ ]     *-rw-r--r--   root   root    15 kio   Fri Dec  3 02:14:30 2021   80/abc.txt [0]
> [     ][-]   [---][     ][ ]      drwxr-xr-x   root   root    0        Fri Dec  3 02:14:30 2021   79
> [     ][ ]   [---][-----][ ]     *-rw-r--r--   root   root    15 kio   Fri Dec  3 02:14:30 2021   79/abc.txt [0]
>
> ... then the hardlink won't be restored.
>
> In your experiments at least one or both of the conditions listed
> above were not met, due to the ordering of the entries in the archive,
> which was determined by the initial ordering at creation time,
> matching the order in which the OS returned entries during filesystem
> traversal. That would not be the case with the "filelist-based backup
> approach" described above, as the entry order would be guaranteed to
> be predictable and stable, thanks to the manual effort taken to sort
> (and manipulate in any other way required) the list of
> files/directories to be backed up.

I agree, however this would lead to a complete change of design in dar
for a need that is a bit outside its target: backup of a given
operating system/filesystem over time.
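The two ordering conditions quoted above can be illustrated with a toy simulation. The Python model below is my own simplification, not dar's actual restoration code: each archive entry is reduced to a (path, saved, inode) triple, and a hardlink whose data is not "[Saved]" can only be created once an earlier entry has already tied its inode to a path that exists on disk.

```python
# Toy model (an editorial assumption, NOT dar's real code) of how entry
# ordering decides whether a hardlink can be restored.
# Each archive entry is (path, saved, inode); inode is None for directories.

def restore(archive, disk):
    """archive: list of (path, saved, inode); disk: set of paths already present."""
    inode_location = {}                  # inode -> on-disk path seen so far
    for path, saved, inode in archive:
        if inode is None:                # directory entry
            if saved:
                disk.add(path)           # "[Saved]" directory is (re)created
            continue
        if path in disk:                 # file already on disk: remember its inode
            inode_location.setdefault(inode, path)
        elif inode in inode_location:
            disk.add(path)               # hardlink made to the known location
        # else: inode unknown and no data in this archive -> link is lost
    return disk

# Order matching the first listing: dir 79 comes before the "[Saved]" dir 80
good = [("79", False, None), ("79/abc.txt", False, 0),
        ("80", True, None), ("80/abc.txt", False, 0)]
# Order matching the second listing: the "[Saved]" dir 80 comes first
bad = [("80", True, None), ("80/abc.txt", False, 0),
       ("79", False, None), ("79/abc.txt", False, 0)]

print("80/abc.txt" in restore(good, {"79", "79/abc.txt"}))  # True  (restored)
print("80/abc.txt" in restore(bad, {"79", "79/abc.txt"}))   # False (lost)
```

Reordering the same four entries flips the outcome, which matches the two listings shown in the quoted message.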
The problem we face here is still present, but it only occurs in very
rare situations and, while suboptimal, it does not lead to a loss of
data. In your context it occurs much more often, due to the way the
hard links are used by rsync over time.

I guess a first scan of the whole filesystem to back up would be
necessary to detect such a condition before starting to save any file.
It would put pressure on memory, and if a file were removed, modified
or added between the time the scan started and the time that file came
to be saved after the scan had completed, the problem would still
persist...

I'm pretty sure you have good reasons to use rsync + dar that way, but
a simpler approach would be to use dar only. It would simplify the
whole backup process and still bring rsync binary delta (see the
--delta option), wrapping everything with compression and ciphering in
a single process; dar supports multi-threading for that, by the way
(see the -G option). Using dar_manager on top will also ease recovering
a particular file version from a large set of full, differential and
incremental backups.

As a side effect, your backups would require less storage space and
the data would be better protected against corruption over time
(detection with integrated checksums, and repair by adding par2 parity
data to the whole backup or to each slice).

You can check this benchmark if you want some concrete numbers:
http://dar.linux.free.fr/doc/benchmark.html

Cheers,
Denis
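For reference, the dar-only scheme suggested above might be sketched as follows. This is an editorial illustration, not taken from the mail: the archive names and the thread count are invented, and only the options Denis names (-c, -R, -A, --delta, -G) are used; consult dar's documentation for the exact semantics in your version.

```shell
# Full backup of /data, storing delta signatures for later binary-delta diffs
# (archive name "full_backup" is illustrative)
dar -c full_backup -R /data --delta sig

# Differential backup against the full one; with signatures present,
# only binary deltas of changed files are saved
dar -c diff_backup -R /data -A full_backup --delta sig

# Same, with multi-threading enabled (-G, as mentioned in the mail;
# the thread count is illustrative)
dar -c diff_backup -R /data -A full_backup --delta sig -G 2
```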