
#50 Detect moved files when making differential backups

Status: open
Owner: None
Labels: requested
Priority: 5
Updated: 2018-08-09
Created: 2005-12-28
Private: No

I think this could be a good parameter. A simple hash
table, with comparisons between the files found and the
files already in the backup (file size, checksum),
should work. In fact, this would even enable detection
of -equal- files within a backup.

There could be a parameter specifying the minimum file
size to check for a file move, as there is for
compression.

Well, some people may not believe in the "same-size,
same-checksum" comparison, so a final byte-by-byte
comparison should be available as a parameter.

IMO, on some systems this could save some megabytes in
the backups. But that's a lot of parameters, isn't it? :)
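The proposal can be sketched roughly as follows (hypothetical Python, not dar's actual code; `detect_moves`, the `(size, digest)` index, and the `min_size` parameter are illustrative names based on the description above):

```python
import hashlib
import os

def file_digest(path, algo="sha256"):
    """Hash a file's contents in chunks, so large files need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def detect_moves(previous_index, current_files, min_size=0):
    """previous_index maps (size, digest) -> old path, as recorded in the
    archive of reference.  Returns {new_path: old_path} for likely moves."""
    moves = {}
    for path in current_files:
        size = os.path.getsize(path)
        if size < min_size:          # skip small files, as proposed above
            continue
        key = (size, file_digest(path))
        if key in previous_index and previous_index[key] != path:
            moves[path] = previous_index[key]
    return moves
```

The optional byte-by-byte comparison mentioned above would be a final confirmation step on each candidate pair before trusting the (size, checksum) match.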

Discussion

  • Lorand Szollosi

    Lorand Szollosi - 2006-03-30

    Logged In: YES
    user_id=618851

    I find it a great idea, but it should be extended with
    the following before being implemented: why not also use
    the same engine on the files of one single backup? I
    mean, if you have many identical files (such as an NFS
    server with many exported roots), then you only need to
    store them once and identify the duplicates
    automatically. On my home server's system it would reduce
    the size of root backups by an estimated factor of 2..3.

    -lorro

     
  • Denis Corbin

    Denis Corbin - 2006-10-16


    Hello,

    First, a hash does not give any proof that a "moved"
    file is really the same file. Two files may share the
    same hash result while being completely different.

    This may give the user the impression that his data has
    been saved, while dar would just have seen a file move
    where there was none.

    Moreover, there would be an important CPU overhead plus
    I/O overhead: computing the hash of each file at every
    backup requires CPU time and disk access to all of each
    file's data, and the lookup in the computed hash table
    for each potentially moved file will also cost some
    additional CPU cycles.

    The last overhead would be the need to store the hash
    table, as it must be computed before any file gets
    saved. At that time you cannot know which files will
    have to be saved (unless you replay the file
    filtering...). Thus, either you compute the hash table
    for all files under the -R root, or you have to "replay"
    file filtering for each file before being able to save
    the first one... still CPU overhead and memory overhead.

    Even as an option (where the user assumes the risk of
    losing his data), this feature would move dar one step
    further toward being an elephant (big and slow) ;-)

    Waiting for your comments,

    Regards,
    Denis.

     
  • Lluís Batlle i Rossell


    I was thinking of a hash table being built (and kept in
    memory) on the fly, as files get added to the archive.
    Each new file to be added is checked against the hash
    table, and if a match exists, a byte-by-byte check can
    be done for the paranoid (where possible, because the
    full archive may not be reachable).

    I don't think keeping that hash table in memory is such
    an elephant thing on current machines. And at most, the
    user could limit that memory and let the hash table be
    self-adapting (discarding entries according to some
    algorithm, maybe). And all of that would be optional.

    About the probability of a hash collision... using
    SHA-256 or similar, it shouldn't be that high. Serious
    data-storage filesystems, e.g. Venti, are built on the
    assumption that hash collisions are extremely
    improbable.
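    A minimal sketch of such a memory-capped, self-adapting table (hypothetical Python; the `BoundedHashIndex` name and the oldest-first eviction policy are my own assumptions, not an actual dar design):

```python
import hashlib
from collections import OrderedDict

class BoundedHashIndex:
    """In-memory digest -> path index built on the fly while archiving.
    When the cap is hit, the oldest entries are discarded (a self-adapting
    table); a dropped entry only costs a missed dedup, never data loss."""
    def __init__(self, max_entries=1_000_000):
        self.max_entries = max_entries
        self.table = OrderedDict()

    def check_and_add(self, path, data, paranoid=False):
        """Return the path of an already-archived duplicate, or None."""
        digest = hashlib.sha256(data).digest()
        match = self.table.get(digest)
        if match is not None and paranoid:
            # byte-by-byte confirmation for those who distrust hashes
            with open(match, "rb") as f:
                if f.read() != data:
                    match = None         # genuine collision: treat as new
        if match is None:
            self.table[digest] = path
            if len(self.table) > self.max_entries:
                self.table.popitem(last=False)   # evict oldest entry
        return match
```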

     
  • Lluís Batlle i Rossell


    Sorry, in my last comment I was thinking of checking for
    duplicated files within a single backup, not of the
    check in differential backups.

    Yes, in that case the hash should be stored in the index
    of the archive. In fact, I imagine a hash of each file
    is already stored by now.

    And IMO usually there are plenty of CPU cycles to be used
    during computers' I/O. :)

     
  • Lorand Szollosi

    Lorand Szollosi - 2006-10-17


    Just a note: finding a string that collides with both a
    long SHA and the proposed MD6 would probably be worth
    much more than all the data on your (or my) system.
    Let's face it: most of the data you back up arrived on
    the client's machine protected by such hash comparisons.
    Sure, don't use this switch in a nuclear plant;
    otherwise it's safe, I think.

     
  • Denis Corbin

    Denis Corbin - 2006-10-17


    Hello,

    First, I guess you are writing about the hashes used in
    network algorithms to verify packet integrity. There is
    a huge difference, because the CRC (or other algorithm)
    is not used to compare two different packets but to
    detect whether a packet has been corrupted. The
    underlying network is usually safe and the probability
    of corruption is low, thus the probability of an
    undetected corruption is even lower, though I agree it
    is not zero. It is different here, as we would use the
    hash algorithm to compare byte sequences of any type and
    any arbitrary length.

    If the hash result is, say, 100 bytes long, then
    considering only files of exactly 1 MiB, there are
    256^1,048,576 possible files but only 256^100 different
    hash values. Thus, on average, a single hash is shared
    by 256^(1,048,576 - 100) = 256^1,048,476 files, which is
    far more than my computer's CPU can handle.

    The hash value can be stored in the archive for each
    file, but to detect a file move you have to compute the
    hash of every file under the -R root directory of the
    filesystem. Only then are you able to know where a given
    absent file has been moved to and/or what it has been
    renamed as.
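    The counting in this argument can be checked directly (the 100-byte hash and 1 MiB file size are the figures from the text; note this pigeonhole count says nothing about how likely a collision is for any two given files):

```python
# Pigeonhole arithmetic from the discussion: 1 MiB files vs. a 100-byte
# hash.  The counts are astronomically large, so we reason in exponents
# of 256 instead of materializing the numbers.
file_bytes = 1_048_576          # bytes in one 1 MiB file
hash_bytes = 100                # hypothetical 100-byte hash result

files_exp = file_bytes          # possible files  = 256 ** files_exp
hashes_exp = hash_bytes         # possible hashes = 256 ** hashes_exp

# On average, each hash value is shared by 256 ** (files_exp - hashes_exp)
# distinct 1 MiB files:
shared_exp = files_exp - hashes_exp
assert shared_exp == 1_048_476  # matches 256^1,048,476 in the text
```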

    Regards,
    Denis.

     
  • Lluís Batlle i Rossell


    The hash comparison, as I said, is used in the Venti
    filesystem. That filesystem is designed for backup, and
    its data storage is based on hashes: if two blocks (of
    arbitrary length) have the same hash, they are taken to
    have the same data. The filesystem designers did the
    probability calculations, and according to their words
    it is far more probable that the storage media burns out
    first. And that filesystem has been in use for years.

    As for calculating the hash of every file under the -R
    root... well, I don't think calculating a hash is much
    more time-consuming than reading the file from disk. And
    the file has to be read in any case, doesn't it? Relying
    on timestamps may not be trustworthy either.

     
  • Denis Corbin

    Denis Corbin - 2006-10-18


    OK, let's assume that the risk of meeting two different
    files with the same hash has a very low probability. I
    agree that the CPU cycles required by the hash table are
    negligible, but there still remains the need to read all
    existing files to be able to compute their hashes. This
    means reading not just the inode but the whole data.
    Thus if you have 80 GB of data under the -R root
    directory, you will have to wait for all of it to be
    read and the hash table to be built.

    Only then can you start the backup, and you have to read
    the files you want to back up once again. :-(

    But suppose that during the backup process you find a
    new file; so far so good, you save it in the archive.
    But some files later you realize that a file is missing,
    so you look it up in the hash table computed on the
    existing filesystem and see that the file has been
    renamed/moved. Too late: you have already saved it as a
    new file just before.

    Conclusion: you cannot consult the hash table only for
    files that have been destroyed. You must look up the
    hash table for each file before saving it (by the way, I
    guess this will need an important additional amount of
    CPU cycles...)

    Last, it is difficult to know whether a file is missing
    from a differential backup at the time you process each
    file. Even if the file is in the filesystem, it may be
    excluded by filters. The current solution used by dar is
    to wait for the backup to complete and then to
    sequentially read the catalogue of reference. For each
    entry found, dar makes a fast lookup (fast thanks to the
    directory tree structure) in the catalogue of the
    current archive. If it is not found, a special entry is
    added to the current catalogue to record the file as
    deleted.

    Thus, you have to wait at least until the end of each
    directory to know whether a file is present in the
    filesystem and whether it has to be saved or not. Note
    that due to recursion the first directories opened are
    closed last; for example, the root of the backup (-R
    option) is closed only when all files have been
    considered. I guess you understand what happens if you
    want to know whether a file or directory at the root is
    present or not. As it is not possible to wait (as we
    will see further on), you would have to scan the
    directory tree three times: once to build the hash
    table, a second time to consider each file, and a third
    time to learn whether a file has been deleted. This last
    case can be much heavier than you might imagine at
    first: you have to scan a whole directory as many times
    as you have to look for deleted files in it, and
    directories with several thousand entries are common in
    Unix filesystems.

    In conclusion, if you are about to save a file, you
    first have to compare the hash table computed on the
    existing filesystem with the one stored in the archive
    of reference. Then consider the case where the same hash
    is found. Three possibilities can occur: a file
    movement, a file change, or a new hard link to an
    already existing file. To know which case it is, you
    have to look in the filesystem at the place recorded in
    the archive of reference to know whether the file still
    exists [here takes place a third scan of the directory
    tree].
    The possibilities we can then meet are: no file, same
    file, or a different file, according to hash and inode
    information. Assuming it is the same file, you must be
    sure that it will not be excluded by filters, so you
    have to re-run the filters on that particular file. If
    it is excluded by the filters you can assume a file
    movement; otherwise you have to consider a new hard link
    to the same file; or, in the case of a file change, you
    end up re-saving the whole data.

    Today's dar algorithm is much simpler: for each file to
    save, check whether it passes the filters; if it does,
    look for the same entry in the archive of reference,
    then either save its data+EA, record a new hard link to
    data already saved in this archive, or record the file
    as already saved in the archive of reference. And yet
    some people complain that dar is too slow and too fond
    of memory!

    As you can see, the proposed algorithm is much more
    complicated than the current one and would require far
    more time to complete (especially because of the time
    necessary to build the hashes). Be assured also that at
    this level of description I have skipped many annoying
    details and features that would have to be considered,
    too.

    The fact that the algorithm is much more complicated
    than the current one is not a good thing for bug-free
    software, but OK, that's my concern. The fact that the
    time to complete such a backup would be incredibly long
    will probably make this feature usable only for small
    backups (where it lacks much interest). And last, the
    fact that this feature leaves some probability, even a
    very, very small one, that data is not properly saved is
    not acceptable to me. I guess it will not be acceptable
    to many users either, but OK, if it is an option and is
    well documented, no one should complain. Still, this
    feature would probably not be used often (lack of trust
    in the feature, or the required execution time), and
    spending much of my time (of which I have not that much
    for dar) on a complicated feature that would rarely be
    used does not bring me any motivation to implement it.

    This is, I think, a correct picture of my point of view.

    Regards,
    Denis.

     
  • Lorand Szollosi

    Lorand Szollosi - 2006-10-18


    Denis: I totally understand your point of view. However, let
    me clarify one thing: two uses were proposed for the hash table.

    1st use: replace the current algorithm that decides whether
    a file has changed. This is clearly impractical after
    reading your mail.

    2nd use: keep the current decision algorithm, but add a
    further level of indirection to file storage. I.e., a
    filename in an archive would refer to a hash value
    (provided it's a regular file), and the hash-value /
    file-content pairs would be stored afterwards. The
    hypothesis here is that a hash collision between files
    with different contents is negligible (far less likely
    than hardware malfunction), while files with equal
    contents automatically get stored only once, no matter
    how many times they appear. This would be feasible and
    would only add a little overhead to the process.

    I find the 2nd use very promising in company-wide backup
    systems. Correct me if I'm wrong.
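    The indirection described in this 2nd use can be modelled in a few lines (an illustrative sketch only; `ContentAddressedArchive` and its methods are hypothetical names, not dar's design):

```python
import hashlib

class ContentAddressedArchive:
    """Toy model of the proposed indirection: filenames map to a content
    hash, and each distinct content is stored exactly once."""
    def __init__(self):
        self.names = {}     # path -> digest
        self.blobs = {}     # digest -> content

    def add(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.names[path] = digest
        self.blobs.setdefault(digest, data)  # stored once, however many names

    def read(self, path):
        return self.blobs[self.names[path]]

    def stored_bytes(self):
        return sum(len(b) for b in self.blobs.values())
```

    With many identical files (the NFS-server case mentioned earlier in the thread), only one copy of each distinct content ends up in `blobs`.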

     
  • Denis Corbin

    Denis Corbin - 2006-10-19


    Yep, the 2nd use is possible to implement. I guess it
    can be a subset of another feature that has been
    requested besides this one: making dar able to do
    binary diffs.

    In other words, make dar able to save not all the data
    of a file that has changed since the archive of
    reference, but only some chunks of data. There are
    several ways to implement this feature, but I think one
    could be acceptable for both features:

    Instead of storing for each file the offset where the
    data lies, or a flag telling that the data is saved in
    the archive of reference, the archive would now contain
    for each file a list of hashes and "dead" hashes. Each
    hash would refer to a hash table where the length of the
    corresponding block and its offset within the archive
    could be found. The dead hashes would provide a hash key
    and the length of the block they were applied to, but no
    pointer to real data, meaning that the corresponding
    block is saved in the archive of reference.

    At restoration time, dar would then read the catalogue
    and the hash table (also stored at the end of the
    archive). It would then be able to restore fully stored
    files from the ordered hash/block list, as well as files
    that have been partially saved in the differential
    backup: a "dead hash" would let dar check that the
    corresponding block in the existing file still yields
    the same hash result, and thus that global file
    coherence is kept across several differential backups.

    This would of course require some additional execution time,
    mainly to compute the hashes of the existing files, which
    can be done when each file is being processed.

    This would address your request (2nd use), as two
    identical files' data would only be saved once, under a
    single set of hash/block correspondences. It would also
    address the "binary diff" feature request, as a file
    with only a few changed bits would not have its whole
    data saved during a differential backup.
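    The per-file hash/block list described above might look like this with a fixed block size (illustrative only; the block size, the live/dead encoding, and the function names are assumptions, not dar's actual format):

```python
import hashlib

CHUNK = 64 * 1024   # illustrative fixed block size

def chunk_hashes(data):
    """Split file data into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def delta_blocks(reference_hashes, data):
    """Build a per-block list for the differential archive: ('dead', h)
    when the block is unchanged (its data stays in the archive of
    reference), ('live', h, block) when the block must be stored anew."""
    entries = []
    for i, h in enumerate(chunk_hashes(data)):
        block = data[i * CHUNK:(i + 1) * CHUNK]
        if i < len(reference_hashes) and reference_hashes[i] == h:
            entries.append(("dead", h))
        else:
            entries.append(("live", h, block))
    return entries
```

    A file with one changed byte then produces one "live" block and "dead" entries for everything else, which is the storage saving both feature requests are after.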

    The main drawback I see is that at restoration time this may
    lead the user to play a bit more often with the slices as a
    file's data could be spread in blocks among several distant
    slices.

    Of course this feature would be an option, as the user
    must still trust the hash algorithm; which algorithm, by
    the way, remains to be defined (a fast one would be
    interesting :-) ). Which one would you choose?

    Regards,
    Denis.

     
  • beentoo

    beentoo - 2012-04-21

    Forking a separate process that inspects the target
    location (backup files already present), as suggested for
    https://sourceforge.net/tracker/?func=detail&aid=3520053&group_id=65612&atid=511615
    might also be useful to handle the issues you mentioned here.
    The separate process could be queried to determine whether a file (or chunk) is already present in the existing backup with the same hash.

    The hash table would need to be computed completely only on the first backup run, and could then be saved. Later runs would only need to compute the hashes of new or changed files (modification time changed).
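    That incremental re-hashing idea could be sketched like this (hypothetical Python; `refresh_hash_cache` is an illustrative name, and a real tool would persist the cache between runs):

```python
import hashlib
import os

def refresh_hash_cache(root, cache):
    """cache maps path -> (size, mtime, digest).  Recompute a digest only
    when size or mtime changed since the previous run; entries for files
    that no longer exist are dropped automatically."""
    fresh = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            old = cache.get(path)
            if old and old[0] == st.st_size and old[1] == st.st_mtime:
                fresh[path] = old                  # reuse cached digest
            else:
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 16), b""):
                        h.update(chunk)
                fresh[path] = (st.st_size, st.st_mtime, h.hexdigest())
    return fresh
```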

    Concerning the risk of declaring two files/chunks equal based only on their hashes: the risk should be much lower if the hashes are only consulted when the sizes already match. (Checking the name would prevent the detection of renamed files; checking the modification time would unfortunately prevent the detection of two separate instances of the same file.) So offering an optional whole or partial byte-for-byte comparison, when a file or chunk seems to be equal based on its hash, might be useful.

    NB: As binary diffs are quite a complex topic in themselves (handling small additions, periodic changes, variable chunk-size optimizations, etc.), maybe it is better to rely on an advanced external project for that?

     
  • beentoo

    beentoo - 2012-04-21

    Ah, there is also the inode number by which unchanged (moved/renamed) files could be identified.

     
  • Cálestyo

    Cálestyo - 2018-07-31

    I was thinking about this issue (how an incremental/differential backup could track movements of files) for quite a while now...

    First, I think the typical use case for this is not system backups, at least not the "system parts" like /usr, /lib, /var, and so on... as these locations are typically rather static and files don't move that much.
    It's more about "precious" data in some archives, things like photography collections and so on.
    I have myself a big archive of data, that piled up over my life... all photography I've made, stuff that I've digitalised, papers from university and so on.
    If part of this is contained in say:
    /archive/pictures
    /archive/documents
    /archive/unsorted
    ...
    and I move e.g. 2 TB of previously unsorted files to some other locations... or if I just want to rename "pictures" to "photography"... I'm currently screwed, as these will be identified as new files and stored again.
    Also, this is less of a problem for the "system parts" like /usr, since those are rarely bigger than some 30-40 GB... but it is a big issue for such file archives.

    In the above discussion, I think two things have been mixed up:
    1) tracking moved files (or having some heuristics for that)
    2) deduplication

    It sounds tempting to use hashes to do (2) as well, i.e. when having two files with the same content, to store that content only once.
    But I think at the level of whole files this is of little practical relevance... how often does it really happen that one has very large files with exactly the same contents?
    The typical use case is rather things like VM images... but these only share many blocks and are not identical as whole files. So to actually gain something, dar would need to hash at the block level and deduplicate there as well.
    If such a feature were ever really needed... well, I think it would go beyond the scope of backup/archive files as dar makes them.
    Also, it would make recovery of partially broken archives far more difficult, since files would no longer be stored sequentially in one piece.

    I think the main motivation should be to catch cases where /archive is renamed to /Archive, i.e. just movement of files, not deduplicating them at the file or some block level.
    So I won't lose many more words on (2).

    Now for (1).
    In principle one can do this securely... e.g. btrfs can do it. It has send/receive, which allows (incrementally) sending changes from one fs to the same (or another) fs.
    It even works at the block level, I think.
    I'm not really sure how they do it, but I strongly suspect they use the fact that extents have a unique UUID and a generation (which increases whenever an extent is written to)... thereby they can guarantee to catch any change.

    Obviously we don't have this (BUT... one may think about providing some special code for dar being used on btrfs or filesystems with similar features).

    Using hash tables was mentioned above.
    I don't think this should be done (at least not alone, neither as the primary criterion).
    Hash collisions are real: https://shattered.io/
    These are just some normal PDF files... anyone who'd have them downloaded and wants to archive them with dar... would already get troubles if we make this hash based.

    OTOH... what does dar actually check normally to find out whether a file has changed (since the archive of reference)? I'd assume it just works by paths and by looking at mtimes? Or does it look at ctimes as well (which are harder, though not impossible, to (accidentally) fake)?
    A file may very well have been changed internally with its mtime kept the same... so dar couldn't notice it was different and would not back it up, right?
    Or does it really compare the stored CRCs?
    So if it's like I assume... one could also say: well, allow people to have something based on hashes if they want it (e.g. as one option for tracking movements).

    What else could we do?

    a) As beentoo already mentions, we have the inode numbers.
    And since a backup may go over multiple fileystems, with each having collisions of such numbers,... we have UUIDs for the filessystem.
    What could dar do now? (And admittedly, I don't know it's internal details too much, so maybe my thinking is too naive)

    The first full backup is made:

    • if "movement detection mode" is enabled...
    • whenever dar encounters a new filesystem among its processed files, it checks whether this UUID was already used by another fs, and errors out when the UUID collides... which could happen artificially when a fs was cloned (don't do this with btrfs... it will corrupt) or when a fs has a very limited ID (e.g. FAT)
    • for every file it additionally stores the fs-UUID and the inode number in the catalogue (and possibly also in the data stream itself... as for sequential mode)
    • everything else (times, file type and so on) is anyway already stored in the catalogue, isn't it?

    Now if an incremental/differential backup is made:

    • dar would anyway already go through all pathnames and compare them (and the mtimes) with the catalogue of the archive of reference, to see whether files were added, changed, or deleted... right?
    • now, in addition, it would go through the already known fs-UUID/inode-number tuples... if for a processed file the same fs-UUID/inode-number is already found in the catalogue AND the mtime/ctime are the same, then I'd guess this would have to mean the file was just moved, wouldn't it?
      So dar could just store some special record indicating that the previous filepath "/foo/bar" (which it knows from the catalogue of the archive of reference) is to be renamed to "/baz/blub/" (which is the one currently being processed when creating the new incremental archive).

    Sure, we now depend on the mtime/ctime being changed when the file content changes... but we do that already anyway, don't we?
    Can it be a problem that the same inode numbers may be re-used in a fs? Hmm, not really sure about that... I would expect it's not a problem, because if e.g. a file with inode 12345 is stored in the first backup... then deleted, and later another file gets 12345 as well... then it would have another mtime/ctime (wouldn't it), and thus at the time of the incremental backup, dar would notice that it has to store it.
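    The lookup described in these bullets can be condensed into a few lines (a hypothetical sketch; the catalogue is modelled as a plain dict rather than dar's actual data structures):

```python
def find_moves(reference, current):
    """Both arguments map (fs_uuid, inode) -> (path, mtime, ctime); the
    reference one comes from the catalogue of the archive of reference.
    A file is reported as moved when the same (fs_uuid, inode) reappears
    with unchanged times under a different path."""
    moves = []
    for key, (path, mtime, ctime) in current.items():
        old = reference.get(key)
        if old and old[0] != path and old[1:] == (mtime, ctime):
            moves.append((old[0], path))   # record an "old -> new" marker
    return moves
```

    A changed mtime/ctime under the same (fs-UUID, inode) falls through to the normal "save the file" path, which matches the fail-safe behaviour argued for above.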

    b) Allow the user to use custom user XATTRs for this.
    The basic scheme would be the same as above in (a)... except that, in addition to or instead of the fs-UUID/inode-number/times, one would allow the user to use certain XATTRs.

    For example I do the following with my own data archive already:
    Each file has an XATTR attached, that contains the SHA512 sum of the file.
    Setting these (verifying them and keeping them up to date, when I modify the files) is beyond the scope of dar (or we could provide a tool doing it).
    When I run dar, I basically assert: I'm sure that all files match their XATTR SHA512 value.
    dar could then pick these up and use them as another criterion to check whether the file contents have changed since the last backup of reference... i.e. instead of just trusting the mtime/ctime, it would trust the user-set hash xattrs.
    If archive.of.reference.xattr.foo == current.file.xattr.foo => the file contents are equal... file has just moved, no need to store it again, but only record the move.

    Using such hash xattr could be done in addition or instead of checking the c/mtimes.
    As soon as one thing considered wouldn't match... better store the full file contents in the new incremental backup.

    For detecting the move (not the content change), one could either again just trust on the fs-UUID+inode-number ... or do the following (or both):
    Another xattr per file, which contains e.g. a file UUID.

    So the user would need to employ another tool, which goes over all his files, and for files which have none, set some unique UUID into some XATTR, which resembles the ID of that file.
    The tool could make sure, that no (UU)ID is used twice (which for UUID should be already guaranteed by the algo).

    Again, dar would store this the first time in the full backup... and when creating the incremental backup it would go through the catalogue... looking for the same UUID... if found (and by either times and/or hash XATTR determined to be of the same content)... it would just store a special "move record".
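    A sketch of this XATTR-based variant (xattrs are simulated as plain dicts here; `user.file-uuid` and `user.sha512` are hypothetical attribute names based on the description above, not names dar defines):

```python
def xattr_moves(reference, current):
    """reference and current map path -> xattr dict.  A move is recorded
    when the same file UUID shows up under a new path with the same
    content hash; a hash mismatch means 'store the file again'."""
    by_uuid = {attrs["user.file-uuid"]: (path, attrs["user.sha512"])
               for path, attrs in reference.items()}
    moves, resave = [], []
    for path, attrs in current.items():
        old = by_uuid.get(attrs["user.file-uuid"])
        if old is None or old[0] == path:
            continue
        if old[1] == attrs["user.sha512"]:
            moves.append((old[0], path))   # same ID, same content: moved
        else:
            resave.append(path)            # same ID but content changed
    return moves, resave
```

    In a real implementation the attributes would be read with the platform xattr API; the decision logic stays the same.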

    Obviously you may now say: what about files that have a given inode (and/or hash and/or ID XATTRs)... and they are rewritten in the fashion vim uses on saving?
    vim doesn't write to the same file, but rather creates a new one at a temporary location, stores into that, and moves it over the original file.
    If that's done, the inode number is definitely gone; as for the XATTRs... well, some programs keep them, others don't.

    That the inode is gone doesn't really matter... it just means we'd store the file again in the differential backup, whether its content has changed or not (we could have just saved/re-created it with the same content)... so nothing would be lost... except that we don't save as much space in the new archive as we could.

    Same if the XATTRs are lost... they'd either get new values before the backup is made, or not have them at all, and dar could somehow handle this (exiting, warning, just storing them as-is, etc.).
    If they're kept, however, it may be problematic: the file contents may have changed, but the hash XATTR not.
    Well... as said before... it's up to the user to guarantee that the hashes match the files; if he doesn't do that (by e.g. verifying it before the backup)... fault's on him.
    Plus... we could still check the file times, and if the program that wrote the file didn't tamper with them, we'd still notice something fishy (no better and no worse than we do now).

    So the hash XATTRS could be just some additional (to the times) criterion to determine file change... even if they're out of sync we'd still be fine by the times, or at least not worse than we're anyway already.

    What would then be the actual benefit of ID XATTRs, since they cannot do anything that inode numbers can't already do?
    Well, one could allow the user to select a mode where dar really only looks at ID XATTRs and not at inode numbers.
    That way... if files are rewritten for some reason, but the user is sure their contents stayed the same... he could just set the same ID XATTR, something he cannot really do for inode numbers.

    Isn't it problematic if files are edited/stored and get a new inode / lose their XATTRs?
    Well, I think the worst thing would be that we store them again without any need... but in practice I think the space loss is little... because many kinds of big files (videos, images, etc.) are rarely rewritten; this is more typical for text documents and the like.

    Cheers,
    Chris.

     
  • Denis Corbin

    Denis Corbin - 2018-08-05

    Hi,

    as you have described, there is no simple solution to detect file moves without increasing either the memory requirement (hash/UUID comparison table) or the I/O count (byte-per-byte comparison), both also increasing CPU cycles with the number of files to back up. The alternative is making assumptions about the way dar is used, like what "/usr" or "/bin" means in terms of preciousness in the eye of the user, for example (which usually leads to creating stupid tools that pretend to "think better" in place of the user, like high-end cars today, where you cannot open the window unless the door is closed and/or the engine started, and the engine will not start until the seat belt is locked and your foot is pressing the brake... while you just wanted to open the window to reduce the heat in the car during a picnic...)

    File moves are not something that occurs often, unless you have a real problem organizing your data and anticipating your needs ;-) However, this may happen from time to time... I know :-)=)

    So, comparing the heavy development cost such a feature would require with either an approximate result (hash collisions) or a very slow process and the data/time required to do it (byte-per-byte comparison), and considering how frequently the memory/CPU investment would provide some benefit, I have postponed this feature request to implement more interesting ones like binary delta, which addresses a much more common problem when you back up VMs: that feature avoids dar saving the whole VM image if it has just been powered up and down for a second since the last backup was made (it will be part of release 2.6.0 in a few months). This situation arises very frequently, and the feature will save a lot of storage space for little additional CPU and memory cost.

    In brief, everything is possible, but it should not explode the execution time or memory requirements of dar/libdar... and last but not least, development resources are limited (mainly my free time). Thus I had to make choices, and so far this feature was really not at the top of the todo list... though it is not impossible that things change in the future... who knows.

    Thanks for your understanding

    Regards,
    Denis

     

    Last edit: Denis Corbin 2018-08-05
  • Cálestyo

    Cálestyo - 2018-08-07

    as you have described, there is no simple solution to
    detect file moves without increasing either the memory
    requirement (hash/UUID comparison table) or the I/O
    count

    Well I think the two methods proposed above ((a)=based on fs UUID + inode-#,.. (b)=based on custom user XATTR UUID) should be pretty fast, shouldn't they?

    AFAIU dar loads the catalogue anyway already fully into RAM when doing an differential backup doesn't it?
    If it does, it could just put the IDs (either (fs+UUID+inode) or (custom user XATTR UUID)) in some well suited tree., which would make lookup pretty fast (even a poor man two level tree, like git uses for the sha1 objects greatly increases speed).
    When then a new file is processed, it could look up the tree, and if a match is found, only a special marker is stored in the new backup (like the "detriuid" marker for deleted files) to denote the new location (but same data).
    Let's assume we have 2x 128-bit UUIDs; then we'd be at 32 additional bytes per already-known file.
    For 10,000,000 files that would lead to an additional ~305 MiB of memory being needed... not that much, I think.
    Differentiating between records of type (a) and (b) could be done either by allowing just one of the two per archive... or by allowing both (intermixed) and keeping the records in two different trees...
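The lookup described here could be as simple as a hash map keyed by the identity pair. A rough sketch, where the catalogue is modeled as a plain dict (field names and structure are hypothetical, not dar's real catalogue format), including the ~305 MiB back-of-envelope check:

```python
def detect_moves(reference, current):
    """Flag entries of the current scan whose identity is already
    known from the reference catalogue but whose path changed.

    `reference` and `current` are dicts {(fs_uuid, inode): path}.
    Returns {new_path: old_path} for detected moves. Files with
    an unknown identity are simply not in the result and would be
    saved in full, as today.
    """
    moves = {}
    for ident, new_path in current.items():
        old_path = reference.get(ident)
        if old_path is not None and old_path != new_path:
            moves[new_path] = old_path
    return moves

# Back-of-envelope memory estimate from the discussion:
# two 128-bit UUIDs = 32 bytes of key material per file.
def extra_bytes(n_files, bytes_per_id=32):
    return n_files * bytes_per_id

# 10,000,000 files -> 320,000,000 bytes, roughly 305.2 MiB
mib = extra_bytes(10_000_000) / 2**20
```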

    The benefit of fsUUID+inode# would obviously be that this information is already there for all files (unlike the custom user XATTR IDs, which the user would need to set somehow).
    The disadvantage, of course, would be that creating a differential backup that "finds" moved files would only be possible on the same fs (and, as said previously, only for files that haven't been edited in the way vi does it (store+move)).
    The custom user XATTR IDs would solve the first part (e.g. you could do this even if the data was moved (or previously restored) to another fs).

    But even if, for one of these reasons, an actually moved file isn't detected as such... we don't lose anything, we just store it yet another time (as happens right now).

    Extraction would work similarly to what we have with deleted files. Well, more or less ;-)
    Assume we've already restored the archive of reference... and now we restore the differential archive.
    dar reaches a file that has been recorded as "moved"...
    It should now move the restored file to its new place.
    The problem is, it cannot search for the old file by fsUUID+inode# (because these are different for the restored file), nor by custom user XATTR ID (it could search for that, but that would be painfully slow).

    So the "file-has-moved" marker in the differential archive would also need to contain the old path.
    Luckily, at creation time (of the differential archive), we know that path, as we have it in the catalogue of the archive of reference.
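With the old path stored in the marker, restoring a move then boils down to a rename. A minimal sketch, assuming a hypothetical `MovedMarker` record (dar has no such type today):

```python
import os
from dataclasses import dataclass

@dataclass
class MovedMarker:
    """Hypothetical catalogue entry: no file data, only paths.
    The old path is resolved at backup time from the catalogue
    of the archive of reference, as suggested above."""
    old_path: str
    new_path: str

def restore_move(marker: MovedMarker, root: str) -> None:
    """Apply a move marker under the restoration root.

    The file was already restored (from the archive of
    reference) at its old location, so we only need to
    rename it; no data is read from the differential archive.
    """
    src = os.path.join(root, marker.old_path.lstrip("/"))
    dst = os.path.join(root, marker.new_path.lstrip("/"))
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.replace(src, dst)
```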

    Of course in sequential mode, all this probably wouldn't work at all...

    I wonder what else one would need to store in the special "file-was-moved" marker...? Obviously not the data (that's the goal)... but apart from that, probably at least the whole inode (which includes the new path and new times) and, as stated before, the previous full pathname of the file (to be able to move that file on restoration of the differential archive).

    Anything more? Like the archive/slice in which the real data is found?
    I'd guess this wouldn't be needed for the dar files themselves... but probably for the DBs of dar_manager.
    Probably dar_manager would handle this similarly to how it handles deleted files.

    Actually, I think the whole handling would be pretty similar to deleted files... just that, in addition, we need a way to find out whether a file has been moved or not (and if so, where it was previously)...
    The two ways (fsUUID+inode# or custom user XATTR ID) would be how to get that... and the thing with (pre-stored) hash sums in user xattrs would just be an additional way (besides mtimes) to detect file modification.
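The custom user XATTR identity could be assigned by an external tool, as suggested. A sketch of that tool's core, assuming a made-up attribute name (`user.backup_uuid` is not something dar defines); the xattr accessors are injectable so the logic can be exercised even on filesystems without user-xattr support:

```python
import os
import uuid

XATTR_NAME = "user.backup_uuid"  # hypothetical attribute name

def ensure_file_id(path, getxattr=os.getxattr, setxattr=os.setxattr):
    """Return the file's stable UUID, assigning one if missing.

    A custom user extended attribute serves as an identity that
    survives moves, and even copies to another filesystem or a
    restore, unlike fsUUID+inode.
    """
    try:
        return getxattr(path, XATTR_NAME).decode()
    except OSError:
        # No attribute yet (or unreadable): assign a fresh UUID.
        new_id = str(uuid.uuid4())
        setxattr(path, XATTR_NAME, new_id.encode())
        return new_id
```

`os.getxattr`/`os.setxattr` are Linux-only in the Python standard library, which matches the Linux-centric context of this discussion.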

    The "hash feature/idea" would just be an optional addition to get (possibly) "better" modification detection than the one without it (i.e. based on mtime/ctime), which is what we have now.
    So I think that would be totally optional, and thus whether the user wants to spend that extra IO/CPU time would be up to him. And as I've said: this is nothing I'd do from inside dar(!)... dar would merely use the EAs when in place; it wouldn't actually read/create/verify them... this would be up to another tool.

    File move is not something that occurs often, unless you
    have a real problem organizing your data and anticipating
    your needs ;-)

    You may underestimate this ;-) (or maybe my data is just toooooooo unsorted ;-D)
    I work at a university and I'm responsible for some 2-3 PiB of LHC data... many people have thought for years about how to organise that data... and even there, reorganisation/movement happens every now and then.
    Also consider simple cases like a university's /home with many user dirs, where people may get their family names changed because of a wedding.
    Or simply you have 1 TB of /holiday-pictures/italy/ which you need to rename later to italy_someYear, because you went there a second time.

    Anyway... thanks for keeping it on some long-term todo list... and I'd be happy if you picked it up sooner or later :-)

    Best wishes,
    Chris.

  • Cálestyo

    Cálestyo - 2018-08-09

    Three things to add here that came to my mind:

    1) In such "file-has-been-moved" markers... one should consider whether it makes any sense to include not just the most recent previous pathname (i.e. the old pathname of the file from the most recent archive of reference) but also all previous pathnames.
    Right now I cannot think of any case where this would be beneficial, but there may be one.

    2) For very small files, dar should possibly check whether storing both:

    • the full file (including data)
    • the deletion marker

    would be smaller, in terms of space usage, than storing such a "file-has-been-moved" marker, and it should optionally allow not tracking moved files in these cases.
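That check in point 2 amounts to a simple size comparison. A sketch, where all record sizes are illustrative estimates, not dar's real on-disk record sizes:

```python
def should_store_marker(data_size, marker_size, deletion_marker_size):
    """Decide whether a move marker pays off for a small file.

    Storing the file again costs its (possibly compressed) data
    plus a deletion marker for the old entry; a move marker costs
    a fixed-size record carrying the old path. For tiny files the
    plain re-store can win, so tracking moves could be skipped.
    """
    return marker_size < data_size + deletion_marker_size
```

For example, a 50-byte file against a 200-byte marker (with a 64-byte deletion marker) would just be stored again.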

    3) One big question that I haven't thought about above is the following:
    What if we have a large number of files in e.g. /archive/ and this is moved to /ARCHIVE/: would we store a "file-has-moved" marker just for that directory... or for each and every file below it?

    Obviously, in the latter case... we'd get a huge amount of such "file-has-moved" markers.
    In the former case, one would at least need to make sure that any moves/renames are applied in the correct order / at the correct point in time.
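The directory-level variant of point 3 could be detected after the fact: if every per-file move is explained by a single directory rename, one marker could replace them all. A rough sketch (not dar code):

```python
def collapse_to_dir_move(moves, old_dir, new_dir):
    """Check whether every (old_path, new_path) pair in `moves`
    is explained by one rename of old_dir to new_dir.

    If so, a single directory-level "file-has-moved" marker
    could replace the per-file markers.
    """
    old_dir = old_dir.rstrip("/") + "/"
    new_dir = new_dir.rstrip("/") + "/"
    for old, new in moves:
        if not (old.startswith(old_dir)
                and new.startswith(new_dir)
                and old[len(old_dir):] == new[len(new_dir):]):
            return False
    return True
```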

