
#178 Extend CRC/Hash

Milestone: 2.4.0
Status: closed
Labels: None
Progression: Implemented
Priority: 5
Updated: 2018-08-07
Created: 2015-12-17
Creator: xloem
Private: No
  1. Include a CRC in the isolated catalog. This lets an isolated catalog detect when file data has changed but the file timestamp has not, and saves space when a file was touched but its data was not changed.
  2. Allow selection of the CRC algorithm, as for the hash algorithm, to prevent an adversary from exploiting hash collisions to alter data.

Discussion

  • Denis Corbin

    Denis Corbin - 2015-12-17
    • status: unread --> open
    • assigned_to: Denis Corbin
     
  • Denis Corbin

    Denis Corbin - 2015-12-17
    1. isolated catalogs do contain a data CRC
    2. better to use strong encryption than to build something complicated that will not completely solve the problem. CRCs are not used to detect file changes but to detect corruption inside the archive.
     
  • xloem

    xloem - 2015-12-19

    Ah, I was misled by the xml output not including it. I've submitted a patch to fix that.

    I mention my interest in #2 in the response in the previous feature request. I think it would be great if dar doubled as data verification, but perhaps that is beyond its scope.

    Thank you.

     
  • Denis Corbin

    Denis Corbin - 2017-12-31

    for item #2, whatever the number of hashes you add to reduce the risk of collision, the risk will always remain; this is by definition of a hash (a function N -> M where the set N contains many more values than the set M).

    Now, preventing one from altering data in a dar archive cannot rely only on a per-file CRC. You can reduce the risk with the --hash feature (hashing per slice), but better to use strong encryption inside dar (-K option) to drastically (completely?) suppress this risk, no?
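    The pigeonhole argument above is easy to demonstrate empirically. The sketch below (mine, not dar code) brute-forces a CRC32 collision among random inputs; the birthday bound makes finding one near-certain after a few hundred thousand samples:

```python
import random
import zlib

# Illustration (not dar code): any fixed-size checksum must collide once
# enough distinct inputs exist (pigeonhole), and the birthday bound makes
# a 32-bit CRC collision likely after only a few hundred thousand inputs.
def find_crc32_collision(samples=500_000, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    seen = {}
    for _ in range(samples):
        data = rng.getrandbits(128).to_bytes(16, "big")
        crc = zlib.crc32(data)
        if crc in seen and seen[crc] != data:
            return seen[crc], data  # two distinct inputs, same CRC32
        seen[crc] = data
    return None

collision = find_crc32_collision()
```

    A 512-bit digest run through the same search would, in practice, never terminate with a collision, which is the whole of the size argument below.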

     
  • Denis Corbin

    Denis Corbin - 2017-12-31
    • Progression: requested --> abandoned
     
  • Cálestyo

    Cálestyo - 2018-07-30

    removed, was bogus

     

    Last edit: Cálestyo 2018-07-31
  • Cálestyo

    Cálestyo - 2018-07-31

    I'd still like to see the feature to have one or even several other digest types per file, and I think there is quite some value to this.

    First, any hash alone of course does not protect against intentional modification (because the hash can be forged too)... but it does protect against accidental modification.
    And there, other algorithms have far better resistance than CRC(32?)... CRC32 is 32 bits, SHA3-512 is 512 bits... that's obviously many orders of magnitude more, giving much better collision resistance, especially for large files (since I assume dar computes one hash per file, regardless of how big that is... and not e.g. one hash per, say, up to 1 GiB of data, resulting in multiple hashes per file).
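    The bit-width comparison above can be made concrete with the usual birthday-bound approximation (a back-of-envelope formula for random digests, not anything dar itself computes):

```python
import math

# Probability of at least one collision among n uniformly random b-bit
# digests is approximately 1 - exp(-n^2 / 2^(b+1)) (birthday bound).
def collision_probability(n, bits):
    return -math.expm1(-n * n / 2 ** (bits + 1))

p32 = collision_probability(1_000_000, 32)    # 32-bit CRC over a million blocks
p512 = collision_probability(1_000_000, 512)  # SHA3-512 over the same blocks
```

    For a million hashed blocks the 32-bit figure is essentially 1, while the 512-bit figure is vanishingly small, which is the "many orders of magnitude" claim in numbers.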

     
  • Denis Corbin

    Denis Corbin - 2018-08-04

    Since release 2.4.0 an isolated catalogue contains the CRC of the original files, which lets the isolated catalogue be used as a backup of an internal catalogue, and even be compared against the filesystem (using CRC calculation to detect differences).

    My bad for not having updated this feature request properly, sorry for that!

     
  • Denis Corbin

    Denis Corbin - 2018-08-04
    • status: open --> closed
    • Progression: abandoned --> Implemented
    • Milestone: none --> 2.4.0
     
  • Cálestyo

    Cálestyo - 2018-08-04

    And what about my last comment, where I argued that extending the hash algorithms would make sense? Shall I open a new feature request for this?

     
  • Denis Corbin

    Denis Corbin - 2018-08-05

    the CRC size used per file depends on the file size (the larger the file, the larger the CRC). It is at least 4 bytes (32 bits) and grows by 32 additional bits for each new gibibyte of file length.

    In other words:

    32-bit CRC for files from 0 to 1 GiB
    64-bit CRC for files from 1 to 2 GiB
    96-bit CRC for files from 2 to 3 GiB
    and so on

    This way the risk of CRC collision should not depend on file length.
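    As a sketch of that width rule (the function name and the exact boundary handling are my assumptions, not dar's actual code):

```python
GIB = 1024 ** 3  # one gibibyte

# 32 CRC bits per started GiB, per the rule above: the per-byte collision
# risk then stays roughly constant regardless of file length.
def crc_width_bits(file_size):
    return 32 * (file_size // GIB + 1)
```

    So a 512 MiB file gets a 32-bit CRC, a 1.5 GiB file a 64-bit one, and so on.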

    When considering the main factor of corruption to be passing time (bits flipping over the years on some poor media), it would rather be advisable to use Parchive in addition to dar, in order to be able to repair an archive years after its creation.

    At the other end, when considering the main factor of corruption to be a bad copy (bad memory/cache/disk cache/...), using dar's --hash option lets you detect corruption that took place during the archive creation process (and you can use sha512 there if you want) and prevents you from relying on an archive that does not reflect the backed-up filesystem.
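    Assuming --hash writes a sha512sum-compatible sidecar file next to each slice (the sidecar naming and format here are my assumptions about dar's output), a later verification pass could be sketched as:

```python
import hashlib
from pathlib import Path

# Hedged sketch: recompute a slice's SHA-512 and compare it against the
# "<hex-digest>  <filename>" sidecar assumed to sit next to the slice.
def verify_slice(slice_path):
    sidecar = Path(str(slice_path) + ".sha512")
    expected = sidecar.read_text().split()[0]
    actual = hashlib.sha512(Path(slice_path).read_bytes()).hexdigest()
    return actual == expected
```

    Any mismatch means the slice was altered after the hash was recorded, e.g. by a bad copy.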

    Of course you can combine both methods (--hash option and par2/parchive).

    The internal per-file CRC is probably not perfect, but it is interesting for its scalability, efficiency and independence from external libraries.

    Hope this answers your need.
    Regards,
    Denis

     
  • Cálestyo

    Cálestyo - 2018-08-07

    Hey Denis.

    I see.
    Actually I think this should be documented somewhere in the manpage, as it's valuable information about the resilience of the dar format.

    I think the way you do this (a hash per amount of storage vs. one hash per file) is really nice; actually I was already about to suggest this in addition to using something "stronger" than CRC32.

    As you say, having the on-the-fly calculated hash from --hash is really nice as well (especially because of things like disk corruption and so on, if one is not on a checksumming fs like btrfs).

    The advantage of the "internal" hash over the "external" one is that one can compute it more easily when only a few files are extracted/tested. And of course, again, the "external" --hash covers much more data and is thus much more prone to collision.

    That being said... I think you're right and there's no critical need to replace CRC32... but still, other algorithms have much better collision resistance (and are also hardware-accelerated)... so maybe you'd consider putting this on a possible feature list for future major versions of dar.
    :-)

    And even if you were to allow e.g. SHA3 or SHA2 for the internal hash... I still suggest keeping the above scheme of one hash per amount of bytes rather than per file. :-)

    Thanks,
    Chris.

     

