
#50 Detect moved files when making differential backups

Status: open
Owner: None
Labels: requested
Priority: 5
Updated: 2018-08-09
Created: 2005-12-28
Private: No

I think this could be a good parameter. A simple hash
table, with comparisons between the files found and the
files already in the backup (file size, checksum),
should work. In fact, this would even enable detection
of -equal- files within a backup.

There could be a parameter specifying the minimum file
size to check for a file move, as there is for
compression.

Well, some people may not believe in the "same-size,
same-checksum" comparison, so a final byte-by-byte
comparison should be available as a parameter.

IMO, on some systems this could save some megabytes in
the backups. But that's a lot of parameters, isn't it? :)
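The proposal can be sketched roughly as follows (hypothetical Python, not dar's actual code; `detect_moves`, the `(size, digest)` index, and the `min_size` parameter are illustrative names based on the description above):

```python
import hashlib
import os

def file_digest(path, algo="sha256"):
    """Hash a file's contents in chunks, so large files need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def detect_moves(previous_index, current_files, min_size=0):
    """previous_index maps (size, digest) -> old path, as recorded in the
    archive of reference.  Returns {new_path: old_path} for likely moves."""
    moves = {}
    for path in current_files:
        size = os.path.getsize(path)
        if size < min_size:          # skip small files, as proposed above
            continue
        key = (size, file_digest(path))
        if key in previous_index and previous_index[key] != path:
            moves[path] = previous_index[key]
    return moves
```

The optional byte-by-byte comparison mentioned above would be a final confirmation step on each candidate pair before trusting the (size, checksum) match.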

Discussion

  • Lorand Szollosi

    Lorand Szollosi - 2006-03-30

    Logged In: YES
    user_id=618851

    I find it a great idea, but it should be extended with
    the following before being implemented: why not also use
    the same engine on the files of one single backup? I
    mean, if you have many identical files (such as an NFS
    server with many exported roots), then you only need to
    store them once and identify the duplicates
    automatically. On my home server's system it would reduce
    the size of root backups by an estimated factor of 2..3.

    -lorro

     
  • Denis Corbin

    Denis Corbin - 2006-10-16


    Hello,

    First, a hash does not give any proof that a "moved"
    file is really the same file. Two files may share the
    same hash result while being completely different.

    This may give the user the impression that his data has
    been saved, while dar would just have seen a file move
    where there was none.

    Moreover, there would be an important CPU overhead plus
    I/O overhead: computing the hash of each file at every
    backup requires CPU time and disk access to all of each
    file's data, and the lookup in the computed hash table
    for each potentially moved file will also cost some
    additional CPU cycles.

    The last overhead would be the need to store the hash
    table, as it must be computed before any file gets
    saved. At that time you cannot know which files will
    have to be saved (unless you replay the file
    filtering...). Thus, either you compute the hash table
    for all files under the -R root, or you have to "replay"
    file filtering for each file before being able to save
    the first one... still CPU overhead and memory overhead.

    Even as an option (where the user assumes the risk of
    losing his data), this feature would move dar one step
    further toward being an elephant (big and slow) ;-)

    Waiting for your comments,

    Regards,
    Denis.

     
  • Lluís Batlle i Rossell


    I was thinking of a hash table being built (and kept in
    memory) on the fly, as files get added to the archive.
    Each new file to be added is checked against the hash
    table, and if a match exists, a byte-by-byte check can
    be done for the paranoid (where possible, because the
    full archive may not be reachable).

    I don't think keeping that hash table in memory is such
    an elephant thing on current machines. And at most, the
    user could limit that memory and let the hash table be
    self-adapting (discarding entries according to some
    algorithm, maybe). And all of that would be optional.

    About the probability of a hash collision... using
    SHA-256 or similar, it shouldn't be that high. Serious
    data-storage filesystems, e.g. Venti, are built on the
    assumption that hash collisions are extremely
    improbable.
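    A minimal sketch of such a memory-capped, self-adapting table (hypothetical Python; the `BoundedHashIndex` name and the oldest-first eviction policy are my own assumptions, not an actual dar design):

```python
import hashlib
from collections import OrderedDict

class BoundedHashIndex:
    """In-memory digest -> path index built on the fly while archiving.
    When the cap is hit, the oldest entries are discarded (a self-adapting
    table); a dropped entry only costs a missed dedup, never data loss."""
    def __init__(self, max_entries=1_000_000):
        self.max_entries = max_entries
        self.table = OrderedDict()

    def check_and_add(self, path, data, paranoid=False):
        """Return the path of an already-archived duplicate, or None."""
        digest = hashlib.sha256(data).digest()
        match = self.table.get(digest)
        if match is not None and paranoid:
            # byte-by-byte confirmation for those who distrust hashes
            with open(match, "rb") as f:
                if f.read() != data:
                    match = None         # genuine collision: treat as new
        if match is None:
            self.table[digest] = path
            if len(self.table) > self.max_entries:
                self.table.popitem(last=False)   # evict oldest entry
        return match
```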

     
  • Lluís Batlle i Rossell


    Sorry, in my last comment I was thinking of checking for
    duplicated files within a single backup, not of the
    check in differential backups.

    Yes, in that case the hash should be stored in the index
    of the archive. In fact, I imagine a hash of each file
    is already stored by now.

    And IMO usually there are plenty of CPU cycles to be used
    during computers' I/O. :)

     
  • Lorand Szollosi

    Lorand Szollosi - 2006-10-17


    Just a note: finding a string that collides with both a
    long SHA and the proposed MD6 would probably be worth
    much more than all the data on your (or my) system.
    Let's face it: most of the data you back up arrived on
    the client's machine protected by such hash comparisons.
    Sure, don't use this switch in a nuclear plant;
    otherwise it's safe, I think.

     
  • Denis Corbin

    Denis Corbin - 2006-10-17


    Hello,

    First, I guess you are writing about the hashes used in
    network algorithms to verify packet integrity. There is
    a huge difference, because the CRC (or other algorithm)
    is not used to compare two different packets but to
    detect whether a packet has been corrupted. The
    underlying network is usually safe and the probability
    of corruption is low, thus the probability of an
    undetected corruption is even lower, though I agree it
    is not zero. It is different here, as we would use the
    hash algorithm to compare byte sequences of any type and
    any arbitrary length.

    If the hash result is, say, 100 bytes long, then
    considering only files of exactly 1 MiB, there are
    256^1,048,576 possible files but only 256^100 different
    hash values. Thus, on average, a single hash is shared
    by 256^(1,048,576 - 100) = 256^1,048,476 files, which is
    far more than my computer's CPU can handle.

    The hash value can be stored in the archive for each
    file, but to detect a file move you have to compute the
    hash of every file under the -R root directory of the
    filesystem. Only then are you able to know where a given
    absent file has been moved to and/or what it has been
    renamed as.
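    The counting in this argument can be checked directly (the 100-byte hash and 1 MiB file size are the figures from the text; note this pigeonhole count says nothing about how likely a collision is for any two given files):

```python
# Pigeonhole arithmetic from the discussion: 1 MiB files vs. a 100-byte
# hash.  The counts are astronomically large, so we reason in exponents
# of 256 instead of materializing the numbers.
file_bytes = 1_048_576          # bytes in one 1 MiB file
hash_bytes = 100                # hypothetical 100-byte hash result

files_exp = file_bytes          # possible files  = 256 ** files_exp
hashes_exp = hash_bytes         # possible hashes = 256 ** hashes_exp

# On average, each hash value is shared by 256 ** (files_exp - hashes_exp)
# distinct 1 MiB files:
shared_exp = files_exp - hashes_exp
assert shared_exp == 1_048_476  # matches 256^1,048,476 in the text
```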

    Regards,
    Denis.

     
  • Lluís Batlle i Rossell


    The hash comparison, as I said, is used in the Venti
    filesystem. That filesystem is designed for backup, and
    its data storage is based on hashes: if two blocks (of
    arbitrary length) have the same hash, they are taken to
    have the same data. The filesystem designers did the
    probability calculations, and according to their words
    it is far more probable that the storage media burns out
    first. And that filesystem has been in use for years.

    As for calculating the hash of every file under the -R
    root... well, I don't think calculating a hash is much
    more time-consuming than reading the file from disk. And
    the file has to be read in any case, doesn't it? Relying
    on timestamps may not be trustworthy either.

     
  • Denis Corbin

    Denis Corbin - 2006-10-18


    OK, let's assume that the risk of meeting two different
    files with the same hash has a very low probability. I
    agree that the CPU cycles required by the hash table are
    negligible, but there still remains the need to read all
    existing files to be able to compute their hashes. This
    means reading not just the inode but the whole data.
    Thus if you have 80 GB of data under the -R root
    directory, you will have to wait for all of it to be
    read and the hash table to be built.

    Only then can you start the backup, and you have to read
    the files you want to back up once again. :-(

    But suppose that during the backup process you find a
    new file; so far so good, you save it in the archive.
    But some files later you realize that a file is missing,
    so you look it up in the hash table computed on the
    existing filesystem and see that the file has been
    renamed/moved. Too late: you have already saved it as a
    new file just before.

    Conclusion: you cannot consult the hash table only for
    files that have been destroyed. You must look up the
    hash table for each file before saving it (by the way, I
    guess this will need an important additional amount of
    CPU cycles...)

    Last, it is difficult to know whether a file is missing
    from a differential backup at the time you process each
    file. Even if the file is in the filesystem, it may be
    excluded by filters. The current solution used by dar is
    to wait for the backup to complete and then to
    sequentially read the catalogue of reference. For each
    entry found, dar makes a fast lookup (fast thanks to the
    directory tree structure) in the catalogue of the
    current archive. If it is not found, a special entry is
    added to the current catalogue to record the file as
    deleted.

    Thus, you have to wait at least until the end of each
    directory to know whether a file is present in the
    filesystem and whether it has to be saved or not. Note
    that due to recursion the first directories opened are
    closed last; for example, the root of the backup (-R
    option) is closed only when all files have been
    considered. I guess you understand what happens if you
    want to know whether a file or directory at the root is
    present or not. As it is not possible to wait (as we
    will see further on), you would have to scan the
    directory tree three times: once to build the hash
    table, a second time to consider each file, and a third
    time to learn whether a file has been deleted. This last
    case can be much heavier than you might imagine at
    first: you have to scan a whole directory as many times
    as you have to look for deleted files in it, and
    directories with several thousand entries are common in
    Unix filesystems.

    In conclusion, if you are about to save a file, you
    first have to compare the hash table computed on the
    existing filesystem with the one stored in the archive
    of reference. Then consider the case where the same hash
    is found. Three possibilities can occur: a file
    movement, a file change, or a new hard link to an
    already existing file. To know which case it is, you
    have to look in the filesystem at the place recorded in
    the archive of reference to know whether the file still
    exists [here takes place a third scan of the directory
    tree].
    The possibilities we can then meet are: no file, same
    file, or a different file, according to hash and inode
    information. Assuming it is the same file, you must be
    sure that it will not be excluded by filters, so you
    have to re-run the filters on that particular file. If
    it is excluded by the filters you can assume a file
    movement; otherwise you have to consider a new hard link
    to the same file; or, in the case of a file change, you
    end up re-saving the whole data.

    Today's dar algorithm is much simpler: for each file to
    save, check whether it passes the filters; if it does,
    look for the same entry in the archive of reference,
    then either save its data+EA, record a new hard link to
    data already saved in this archive, or record the file
    as already saved in the archive of reference. And yet
    some people complain that dar is too slow and too fond
    of memory!

    As you can see, the proposed algorithm is much more
    complicated than the current one and would require far
    more time to complete (especially because of the time
    necessary to build the hashes). Be assured also that at
    this level of description I have skipped many annoying
    details and features that would have to be considered,
    too.

    The fact that the algorithm is much more complicated
    than the current one is not a good thing for bug-free
    software, but OK, that's my concern. The fact that the
    time to complete such a backup would be incredibly long
    will probably make this feature usable only for small
    backups (where it lacks much interest). And last, the
    fact that this feature leaves some probability, even a
    very, very small one, that data is not properly saved is
    not acceptable to me. I guess it will not be acceptable
    to many users either, but OK, if it is an option and is
    well documented, no one should complain. Still, this
    feature would probably not be used often (lack of trust
    in the feature, or the required execution time), and
    spending much of my time (of which I have not that much
    for dar) on a complicated feature that would rarely be
    used does not bring me any motivation to implement it.

    This is, I think, a correct picture of my point of view.

    Regards,
    Denis.

     
  • Lorand Szollosi

    Lorand Szollosi - 2006-10-18


    Denis: I totally understand your point of view. However, let
    me clarify one thing: two uses were proposed for the hash table.

    1st use: replace the current algorithm that decides whether
    a file has changed. This is clearly impractical after
    reading your mail.

    2nd use: keep the current decision algorithm, but add a
    further level of indirection to file storage. I.e., a
    filename in an archive would refer to a hash value
    (provided it's a regular file), and the hash-value /
    file-content pairs would be stored afterwards. The
    hypothesis here is that a hash collision between files
    with different contents is negligible (far less likely
    than hardware malfunction), while files with equal
    contents automatically get stored only once, no matter
    how many times they appear. This would be feasible and
    would only add a little overhead to the process.

    I find the 2nd use very promising in company-wide backup
    systems. Correct me if I'm wrong.
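    The indirection described in this 2nd use can be modelled in a few lines (an illustrative sketch only; `ContentAddressedArchive` and its methods are hypothetical names, not dar's design):

```python
import hashlib

class ContentAddressedArchive:
    """Toy model of the proposed indirection: filenames map to a content
    hash, and each distinct content is stored exactly once."""
    def __init__(self):
        self.names = {}     # path -> digest
        self.blobs = {}     # digest -> content

    def add(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.names[path] = digest
        self.blobs.setdefault(digest, data)  # stored once, however many names

    def read(self, path):
        return self.blobs[self.names[path]]

    def stored_bytes(self):
        return sum(len(b) for b in self.blobs.values())
```

    With many identical files (the NFS-server case mentioned earlier in the thread), only one copy of each distinct content ends up in `blobs`.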

     
  • Denis Corbin

    Denis Corbin - 2006-10-19


    Yep, the 2nd use is possible to implement. I guess it
    can be a subset of another feature that has been
    requested besides this one: making dar able to do
    binary diffs.

    In other words, make dar able to save not all the data
    of a file that has changed since the archive of
    reference, but only some chunks of data. There are
    several ways to implement this feature, but I think one
    could be acceptable for both features:

    Instead of storing for each file the offset where the
    data lies, or a flag telling that the data is saved in
    the archive of reference, the archive would now contain
    for each file a list of hashes and "dead" hashes. Each
    hash would refer to a hash table where the length of the
    corresponding block and its offset within the archive
    could be found. The dead hashes would provide a hash key
    and the length of the block they were applied to, but no
    pointer to real data, meaning that the corresponding
    block is saved in the archive of reference.

    At restoration time, dar would then read the catalogue
    and the hash table (also stored at the end of the
    archive). It would then be able to restore fully stored
    files from the ordered hash/block list, as well as files
    that have been partially saved in the differential
    backup: a "dead hash" would let dar check that the
    corresponding block in the existing file still yields
    the same hash result, and thus that global file
    coherence is kept across several differential backups.

    This would of course require some additional execution time,
    mainly to compute the hashes of the existing files, which
    can be done when each file is being processed.

    This would address your request (2nd use), as two
    identical files' data would only be saved once, under a
    single set of hash/block correspondences. It would also
    address the "binary diff" feature request, as a file
    with only a few changed bits would not have its whole
    data saved during a differential backup.
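    The per-file hash/block list described above might look like this with a fixed block size (illustrative only; the block size, the live/dead encoding, and the function names are assumptions, not dar's actual format):

```python
import hashlib

CHUNK = 64 * 1024   # illustrative fixed block size

def chunk_hashes(data):
    """Split file data into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def delta_blocks(reference_hashes, data):
    """Build a per-block list for the differential archive: ('dead', h)
    when the block is unchanged (its data stays in the archive of
    reference), ('live', h, block) when the block must be stored anew."""
    entries = []
    for i, h in enumerate(chunk_hashes(data)):
        block = data[i * CHUNK:(i + 1) * CHUNK]
        if i < len(reference_hashes) and reference_hashes[i] == h:
            entries.append(("dead", h))
        else:
            entries.append(("live", h, block))
    return entries
```

    A file with one changed byte then produces one "live" block and "dead" entries for everything else, which is the storage saving both feature requests are after.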

    The main drawback I see is that at restoration time this may
    lead the user to play a bit more often with the slices as a
    file's data could be spread in blocks among several distant
    slices.

    Of course this feature would be an option, as the user
    must still trust the hash algorithm; which algorithm, by
    the way, remains to be defined (a fast one would be
    interesting :-) ). Which one would you choose?

    Regards,
    Denis.

     
  • beentoo

    beentoo - 2012-04-21

    Forking a separate process that inspects the target
    location (backup files already present), as suggested for
    https://sourceforge.net/tracker/?func=detail&aid=3520053&group_id=65612&atid=511615
    might also be useful to handle the issues you mentioned here.
    The separate process could be queried to determine whether a file (or chunk) is already present in the existing backup with the same hash.

    The hash table would need to be computed completely only on the first backup run, and could then be saved. Later runs would only need to compute the hashes of new or changed files (modification time changed).
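    That incremental re-hashing idea could be sketched like this (hypothetical Python; `refresh_hash_cache` is an illustrative name, and a real tool would persist the cache between runs):

```python
import hashlib
import os

def refresh_hash_cache(root, cache):
    """cache maps path -> (size, mtime, digest).  Recompute a digest only
    when size or mtime changed since the previous run; entries for files
    that no longer exist are dropped automatically."""
    fresh = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            old = cache.get(path)
            if old and old[0] == st.st_size and old[1] == st.st_mtime:
                fresh[path] = old                  # reuse cached digest
            else:
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 16), b""):
                        h.update(chunk)
                fresh[path] = (st.st_size, st.st_mtime, h.hexdigest())
    return fresh
```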

    Concerning the risk of declaring two files/chunks equal based only on their hashes: the risk should be much lower if the hashes are only consulted when the sizes already match. (Checking the name would prevent the detection of renamed files; checking the modification time would unfortunately prevent the detection of two separate instances of the same file.) So offering an optional whole or partial byte-for-byte comparison, when a file or chunk seems to be equal based on its hash, might be useful.

    NB: As binary diffs are quite a complex topic in themselves (handling small additions, periodic changes, variable chunk-size optimizations, etc.), maybe it is better to rely on an advanced external project for that?

     
  • beentoo

    beentoo - 2012-04-21

    Ah, there is also the inode number by which unchanged (moved/renamed) files could be identified.

     
  • Cálestyo

    Cálestyo - 2018-07-31

    I was thinking about this issue (how an incremental/differential backup could track movements of files) for quite a while now...

    First, I think the typical use case for this is not system backups, at least not the "system parts" like /usr, /lib, /var, and so on... as these locations are typically rather static and files don't move that much.
    It's more about "precious" data in some archives, things like photography collections and so on.
    I have myself a big archive of data, that piled up over my life... all photography I've made, stuff that I've digitalised, papers from university and so on.
    If part of this is contained in say:
    /archive/pictures
    /archive/documents
    /archive/unsorted
    ...
    and I move e.g. 2 TB of previously unsorted files to some other locations... or if I just want to rename "pictures" to "photography"... I'm currently screwed, as these will be identified as new files and stored again.
    Also, this is less of a problem for the "system parts" like /usr, since those are rarely bigger than some 30-40 GB... but it is a big issue for such file archives.

    In the above discussion, I think two things have been mixed up:
    1) tracking moved files (or having some heuristics for that)
    2) deduplication

    It sounds tempting to use hashes to do (2) as well, i.e. when having two files with the same content, to store that content only once.
    But I think at the level of whole files this is of little practical relevance... how often does it really happen that one has very large files with exactly the same contents?
    The typical use case is rather things like VM images... but these only share many blocks and are not identical as whole files. So to actually gain something, dar would need to hash at the block level and deduplicate there as well.
    If such a feature were ever really needed... well, I think it would go beyond the scope of backup/archive files as dar makes them.
    Also, it would make recovery of partially broken archives far more difficult, since files would no longer be stored sequentially in one piece.

    I think the main motivation should be to catch cases where /archive is renamed to /Archive, i.e. just movement of files, not deduplicating them at the file or some block level.
    So I won't lose many more words on (2).

    Now for (1).
    In principle one can do this securely... e.g. btrfs can do it. It has send/receive, which allows (incrementally) sending changes from one fs to the same (or another) fs.
    It even works at the block level, I think.
    I'm not really sure how they do it, but I strongly suspect they use the fact that extents have a unique UUID and a generation (which increases whenever an extent is written to)... thereby they can guarantee to catch any change.

    Obviously we don't have this (BUT... one may think about providing some special code for dar being used on btrfs or filesystems with similar features).

    Using hash tables was mentioned above.
    I don't think this should be done (at least not alone, neither as the primary criterion).
    Hash collisions are real: https://shattered.io/
    These are just some normal PDF files... anyone who'd have them downloaded and wants to archive them with dar... would already get troubles if we make this hash based.

    OTOH... what does dar actually check normally to find out whether a file has changed (since the archive of reference)? I'd assume it just works by paths and by looking at mtimes? Or does it look at ctimes as well (which are harder, though not impossible, to (accidentally) fake)?
    A file may very well have been changed internally with its mtime kept the same... so dar couldn't notice it was different and would not back it up, right?
    Or does it really compare the stored CRCs?
    So if it's like I assume... one could also say: well, allow people to have something based on hashes if they want it (e.g. as one option for tracking movements).

    What else could we do?

    a) As beentoo already mentions, we have the inode numbers.
    And since a backup may go over multiple fileystems, with each having collisions of such numbers,... we have UUIDs for the filessystem.
    What could dar do now? (And admittedly, I don't know it's internal details too much, so maybe my thinking is too naive)

    The first full backup is made:

    • if "movement detection mode" is enabled...
    • whenever dar encounters a new filesystem among its processed files, it checks whether this UUID was already used by another fs, and errors out when the UUID collides... which could happen artificially when a fs was cloned (don't do this with btrfs... it will corrupt) or when a fs has a very limited ID (e.g. FAT)
    • for every file it additionally stores the fs-UUID and the inode number in the catalogue (and possibly also in the data stream itself... as for sequential mode)
    • everything else (times, file type and so on) is anyway already stored in the catalogue, isn't it?

    Now if an incremental/differential backup is made:

    • dar would anyway already go through all pathnames and compare them (and the mtimes) with the catalogue of the archive of reference, to see whether files were added, changed, or deleted... right?
    • now, in addition, it would go through the already known fs-UUID/inode-number tuples... if for a processed file the same fs-UUID/inode-number is already found in the catalogue AND the mtime/ctime are the same, then I'd guess this would have to mean the file was just moved, wouldn't it?
      So dar could just store some special record indicating that the previous filepath "/foo/bar" (which it knows from the catalogue of the archive of reference) is to be renamed to "/baz/blub/" (which is the one currently being processed when creating the new incremental archive).

    Sure, we now depend on the mtime/ctime being changed when the file content changes... but we do that already anyway, don't we?
    Can it be a problem that the same inode numbers may be re-used in a fs? Hmm, not really sure about that... I would expect it's not a problem, because if e.g. a file with inode 12345 is stored in the first backup... then deleted, and later another file gets 12345 as well... then it would have another mtime/ctime (wouldn't it), and thus at the time of the incremental backup, dar would notice that it has to store it.
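    The lookup described in these bullets can be condensed into a few lines (a hypothetical sketch; the catalogue is modelled as a plain dict rather than dar's actual data structures):

```python
def find_moves(reference, current):
    """Both arguments map (fs_uuid, inode) -> (path, mtime, ctime); the
    reference one comes from the catalogue of the archive of reference.
    A file is reported as moved when the same (fs_uuid, inode) reappears
    with unchanged times under a different path."""
    moves = []
    for key, (path, mtime, ctime) in current.items():
        old = reference.get(key)
        if old and old[0] != path and old[1:] == (mtime, ctime):
            moves.append((old[0], path))   # record an "old -> new" marker
    return moves
```

    A changed mtime/ctime under the same (fs-UUID, inode) falls through to the normal "save the file" path, which matches the fail-safe behaviour argued for above.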

    b) Allow the user to use custom user XATTRs for this.
    The basic scheme would be the same as above in (a)... except that, in addition to or instead of the fs-UUID/inode-number/times, one would allow the user to use certain XATTRs.

    For example I do the following with my own data archive already:
    Each file has an XATTR attached, that contains the SHA512 sum of the file.
    Setting these (verifying them and keeping them up to date, when I modify the files) is beyond the scope of dar (or we could provide a tool doing it).
    When I run dar, I basically assert: I'm sure that all files match their XATTR SHA512 value.
    dar could then pick these up and use them as another criterion to check whether the file contents have changed since the last backup of reference... i.e. instead of just trusting the mtime/ctime, it would trust the user-set hash xattrs.
    If archive.of.reference.xattr.foo == current.file.xattr.foo => the file contents are equal... file has just moved, no need to store it again, but only record the move.

    Using such hash xattr could be done in addition or instead of checking the c/mtimes.
    As soon as one thing considered wouldn't match... better store the full file contents in the new incremental backup.

    For detecting the move (not the content change), one could either again just trust on the fs-UUID+inode-number ... or do the following (or both):
    Another xattr per file, which contains e.g. a file UUID.

    So the user would need to employ another tool, which goes over all his files, and for files which have none, set some unique UUID into some XATTR, which resembles the ID of that file.
    The tool could make sure, that no (UU)ID is used twice (which for UUID should be already guaranteed by the algo).

    Again, dar would store this the first time in the full backup... and when creating the incremental backup it would go through the catalogue... looking for the same UUID... if found (and by either times and/or hash XATTR determined to be of the same content)... it would just store a special "move record".
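    A sketch of this XATTR-based variant (xattrs are simulated as plain dicts here; `user.file-uuid` and `user.sha512` are hypothetical attribute names based on the description above, not names dar defines):

```python
def xattr_moves(reference, current):
    """reference and current map path -> xattr dict.  A move is recorded
    when the same file UUID shows up under a new path with the same
    content hash; a hash mismatch means 'store the file again'."""
    by_uuid = {attrs["user.file-uuid"]: (path, attrs["user.sha512"])
               for path, attrs in reference.items()}
    moves, resave = [], []
    for path, attrs in current.items():
        old = by_uuid.get(attrs["user.file-uuid"])
        if old is None or old[0] == path:
            continue
        if old[1] == attrs["user.sha512"]:
            moves.append((old[0], path))   # same ID, same content: moved
        else:
            resave.append(path)            # same ID but content changed
    return moves, resave
```

    In a real implementation the attributes would be read with the platform xattr API; the decision logic stays the same.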

    Obviously you may now say: what about files that have a given inode (and/or hash and/or ID XATTRs)... and they are rewritten in the fashion vim uses on saving?
    vim doesn't write to the same file, but rather creates a new one at a temporary location, stores into that, and moves it over the original file.
    If that's done, the inode number is definitely gone; as for the XATTRs... well, some programs keep them, others don't.

    That the inode is gone doesn't really matter... it just means we'd store the file again in the differential backup, whether its content has changed or not (we could have just saved/re-created it with the same content)... so nothing would be lost... except that we don't save as much space in the new archive as we could.

    Same if the XATTRs are lost... they'd either get new values before the backup is made, or not have them at all, and dar could somehow handle this (exiting, warning, just storing them as-is, etc.).
    If they're kept, however, it may be problematic: the file contents may have changed, but the hash XATTR not.
    Well... as said before... it's up to the user to guarantee that the hashes match the files; if he doesn't do that (by e.g. verifying it before the backup)... fault's on him.
    Plus... we could still check the file times, and if the program that wrote the file didn't tamper with them, we'd still notice something fishy (no better and no worse than we do now).

    So the hash XATTRS could be just some additional (to the times) criterion to determine file change... even if they're out of sync we'd still be fine by the times, or at least not worse than we're anyway already.

    What would then be the actual benefit of ID XATTRs, since they cannot do anything that inode numbers can't already do?
    Well, one could allow the user to select a mode where dar really only looks at ID XATTRs and not at inode numbers.
    That way... if files are rewritten for some reason, but the user is sure their contents stayed the same... he could just set the same ID XATTR, something he cannot really do for inode numbers.

    Isn't it problematic if files are edited/stored and get a new inode / lose their XATTRs?
    Well, I think the worst thing would be that we store them again without any need... but in practice I think the space loss is little... because many kinds of big files (videos, images, etc.) are rarely rewritten; this is more typical for text documents and the like.

    Cheers,
    Chris.

     
  • Denis Corbin

    Denis Corbin - 2018-08-05

    Hi,

    as you have described, there is no simple solution to detect file moves without increasing either the memory requirement (hash/UUID comparison table) or the I/O count (byte-per-byte comparison), both also increasing CPU cycles with the number of files to back up. The alternative is making assumptions about the way dar is used, like what "/usr" or "/bin" means in terms of preciousness in the eye of the user, for example (which usually leads to creating stupid tools that pretend to "think better" in place of the user, like high-end cars today, where you cannot open the window unless the door is closed and/or the engine started, and the engine will not start until the seat belt is locked and your foot is pressing the brake... while you just wanted to open the window to reduce the heat in the car during a picnic...)

    File moves are not something that occurs often, unless you have a real problem organizing your data and anticipating your needs ;-) However, this may happen from time to time... I know :-)=)

    So, comparing the heavy development cost such a feature would require with either an approximate result (hash collisions) or a very slow process and the data/time required to do it (byte-per-byte comparison), and considering how frequently the memory/CPU investment would provide some benefit, I have postponed this feature request to implement more interesting ones like binary delta, which addresses a much more common problem when you back up VMs: that feature avoids dar saving the whole VM image if it has just been powered up and down for a second since the last backup was made (it will be part of release 2.6.0 in a few months). This situation arises very frequently, and the feature will save a lot of storage space for little additional CPU and memory cost.

    In brief, everything is possible, but it should not explode the execution time or memory requirements of dar/libdar... and last but not least, development resources are limited (mainly my free time). Thus I had to make choices, and so far this feature was really not at the top of the todo list... though it is not impossible that things change in the future... who knows.

    Thanks for your understanding

    Regards,
    Denis

     

    Last edit: Denis Corbin 2018-08-05
  • Cálestyo

    Cálestyo - 2018-08-07

    as you have described, there is no simple solution to
    detect file moves without increasing either the memory
    requirement (hash/UUID comparison table) or the I/O
    count

    Well I think the two methods proposed above ((a)=based on fs UUID + inode-#,.. (b)=based on custom user XATTR UUID) should be pretty fast, shouldn't they?

    AFAIU dar loads the catalogue anyway already fully into RAM when doing an differential backup doesn't it?
    If it does, it could just put the IDs (either (fs+UUID+inode) or (custom user XATTR UUID)) in some well suited tree., which would make lookup pretty fast (even a poor man two level tree, like git uses for the sha1 objects greatly increases speed).
    When then a new file is processed, it could look up the tree, and if a match is found, only a special marker is stored in the new backup (like the "detriuid" marker for deleted files) to denote the new location (but same data).
    Let's assume we have 2x 128-bit UUIDs; then we'd be at 32 additional bytes per already-known file.
    For 10,000,000 files that would lead to an additional ~305 MiB of memory being needed... not that much, I think.
    Differentiating between records of type (a) and (b) could be done either by allowing just one of the two per archive... or by allowing both (intermixed) and keeping the records in two different trees...
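The lookup described here could be as simple as a hash map keyed by the identity pair. A rough sketch, where the catalogue is modeled as a plain dict (field names and structure are hypothetical, not dar's real catalogue format), including the ~305 MiB back-of-envelope check:

```python
def detect_moves(reference, current):
    """Flag entries of the current scan whose identity is already
    known from the reference catalogue but whose path changed.

    `reference` and `current` are dicts {(fs_uuid, inode): path}.
    Returns {new_path: old_path} for detected moves. Files with
    an unknown identity are simply not in the result and would be
    saved in full, as today.
    """
    moves = {}
    for ident, new_path in current.items():
        old_path = reference.get(ident)
        if old_path is not None and old_path != new_path:
            moves[new_path] = old_path
    return moves

# Back-of-envelope memory estimate from the discussion:
# two 128-bit UUIDs = 32 bytes of key material per file.
def extra_bytes(n_files, bytes_per_id=32):
    return n_files * bytes_per_id

# 10,000,000 files -> 320,000,000 bytes, roughly 305.2 MiB
mib = extra_bytes(10_000_000) / 2**20
```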

    The benefit of fsUUID+inode# would obviously be that this information is already there for all files (unlike the custom user XATTR IDs, which the user would need to set somehow).
    The disadvantage, of course, would be that creating a differential backup that "finds" moved files would only be possible on the same fs (and, as said previously, only for files that haven't been edited in the way vi does it (store+move)).
    The custom user XATTR IDs would solve the first part (e.g. you could do this even if the data was moved (or previously restored) to another fs).

    But even if, for one of these reasons, an actually moved file isn't detected as such... we don't lose anything, we just store it yet another time (as happens right now).

    Extraction would work similarly to what we have with deleted files. Well, more or less ;-)
    Assume we've already restored the archive of reference... and now we restore the differential archive.
    dar reaches a file that has been recorded as "moved"...
    It should now move the restored file to its new place.
    The problem is, it cannot search for the old file by fsUUID+inode# (because these are different for the restored file), nor by custom user XATTR ID (it could search for that, but that would be painfully slow).

    So the "file-has-moved" marker in the differential archive would also need to contain the old path.
    Luckily, at creation time (of the differential archive), we know that path, as we have it in the catalogue of the archive of reference.
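With the old path stored in the marker, restoring a move then boils down to a rename. A minimal sketch, assuming a hypothetical `MovedMarker` record (dar has no such type today):

```python
import os
from dataclasses import dataclass

@dataclass
class MovedMarker:
    """Hypothetical catalogue entry: no file data, only paths.
    The old path is resolved at backup time from the catalogue
    of the archive of reference, as suggested above."""
    old_path: str
    new_path: str

def restore_move(marker: MovedMarker, root: str) -> None:
    """Apply a move marker under the restoration root.

    The file was already restored (from the archive of
    reference) at its old location, so we only need to
    rename it; no data is read from the differential archive.
    """
    src = os.path.join(root, marker.old_path.lstrip("/"))
    dst = os.path.join(root, marker.new_path.lstrip("/"))
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.replace(src, dst)
```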

    Of course in sequential mode, all this probably wouldn't work at all...

    I wonder what else one would need to store in the special "file-was-moved" marker...? Obviously not the data (that's the goal)... but apart from that, probably at least the whole inode (which includes the new path and new times) and, as stated before, the previous full pathname of the file (to be able to move that file on restoration of the differential archive).

    Anything more? Like the archive/slice in which the real data is found?
    I'd guess this wouldn't be needed for the dar files themselves... but probably for the DBs of dar_manager.
    Probably dar_manager would handle this similarly to how it handles deleted files.

    Actually, I think the whole handling would be pretty similar to deleted files... just that, in addition, we need a way to find out whether a file has been moved or not (and if so, where it was previously)...
    The two ways (fsUUID+inode# or custom user XATTR ID) would be how to get that... and the thing with (pre-stored) hash sums in user xattrs would just be an additional way (besides mtimes) to detect file modification.
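The custom user XATTR identity could be assigned by an external tool, as suggested. A sketch of that tool's core, assuming a made-up attribute name (`user.backup_uuid` is not something dar defines); the xattr accessors are injectable so the logic can be exercised even on filesystems without user-xattr support:

```python
import os
import uuid

XATTR_NAME = "user.backup_uuid"  # hypothetical attribute name

def ensure_file_id(path, getxattr=os.getxattr, setxattr=os.setxattr):
    """Return the file's stable UUID, assigning one if missing.

    A custom user extended attribute serves as an identity that
    survives moves, and even copies to another filesystem or a
    restore, unlike fsUUID+inode.
    """
    try:
        return getxattr(path, XATTR_NAME).decode()
    except OSError:
        # No attribute yet (or unreadable): assign a fresh UUID.
        new_id = str(uuid.uuid4())
        setxattr(path, XATTR_NAME, new_id.encode())
        return new_id
```

`os.getxattr`/`os.setxattr` are Linux-only in the Python standard library, which matches the Linux-centric context of this discussion.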

    The "hash feature/idea" would just be an optional addition to get (possibly) "better" modification detection than the one without it (i.e. based on mtime/ctime), which is what we have now.
    So I think that would be totally optional, and thus whether the user wants to spend that extra IO/CPU time would be up to him. And as I've said: this is nothing I'd do from inside dar(!)... dar would merely use the EAs when in place; it wouldn't actually read/create/verify them... this would be up to another tool.

    File move is not something that occurs often, unless you
    have a real problem organizing your data and anticipating
    your needs ;-)

    You may underestimate this ;-) (or maybe my data is just toooooooo unsorted ;-D)
    I work at a university and I'm responsible for some 2-3 PiB of LHC data... many people have thought for years about how to organise that data... and even there, reorganisation/movement happens every now and then.
    Also consider simple cases like a university's /home with many user dirs, where people may get their family names changed because of a wedding.
    Or simply you have 1 TB of /holiday-pictures/italy/ which you need to rename later to italy_someYear, because you went there a second time.

    Anyway... thanks for keeping it on some long-term todo list... and I'd be happy if you picked it up sooner or later :-)

    Best wishes,
    Chris.

  • Cálestyo

    Cálestyo - 2018-08-09

    Three things to add here that came to my mind:

    1) In such "file-has-been-moved" markers... one should consider whether it makes any sense to include not just the most recent previous pathname (i.e. the old pathname of the file from the most recent archive of reference) but also all previous pathnames.
    Right now I cannot think of any case where this would be beneficial, but there may be one.

    2) For very small files, dar should possibly check whether storing both:

    • the full file (including data)
    • the deletion marker

    would be smaller, in terms of space usage, than storing such a "file-has-been-moved" marker, and it should optionally allow not tracking moved files in these cases.
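That check in point 2 amounts to a simple size comparison. A sketch, where all record sizes are illustrative estimates, not dar's real on-disk record sizes:

```python
def should_store_marker(data_size, marker_size, deletion_marker_size):
    """Decide whether a move marker pays off for a small file.

    Storing the file again costs its (possibly compressed) data
    plus a deletion marker for the old entry; a move marker costs
    a fixed-size record carrying the old path. For tiny files the
    plain re-store can win, so tracking moves could be skipped.
    """
    return marker_size < data_size + deletion_marker_size
```

For example, a 50-byte file against a 200-byte marker (with a 64-byte deletion marker) would just be stored again.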

    3) One big question that I haven't thought about above is the following:
    What if we have a large number of files in e.g. /archive/ and this is moved to /ARCHIVE/: would we store a "file-has-moved" marker just for that directory... or for each and every file below it?

    Obviously, in the latter case... we'd get a huge amount of such "file-has-moved" markers.
    In the former case, one would at least need to make sure that any moves/renames are applied in the correct order / at the correct point in time.
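The directory-level variant of point 3 could be detected after the fact: if every per-file move is explained by a single directory rename, one marker could replace them all. A rough sketch (not dar code):

```python
def collapse_to_dir_move(moves, old_dir, new_dir):
    """Check whether every (old_path, new_path) pair in `moves`
    is explained by one rename of old_dir to new_dir.

    If so, a single directory-level "file-has-moved" marker
    could replace the per-file markers.
    """
    old_dir = old_dir.rstrip("/") + "/"
    new_dir = new_dir.rstrip("/") + "/"
    for old, new in moves:
        if not (old.startswith(old_dir)
                and new.startswith(new_dir)
                and old[len(old_dir):] == new[len(new_dir):]):
            return False
    return True
```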

