Most likely this has already been considered and rejected...
But wouldn't it make sense to separate all data block hashes into a separate hash file that can be read on demand, similar to how the parity files work?
In theory this would eliminate the need to keep the hashes in RAM, which would make SnapRAID's RAM requirement very small.
In its least complicated form it would be a matrix with a fixed location for every hash, laid out in stripes like this:
Byte 0-15: hash for d1 block 0
Byte 16-31: hash for d2 block 0
Byte 32-47: hash for d3 block 0
Byte 48-63: hash for d1 block 1
Byte 64-79: hash for d2 block 1
...
During fix, scrub and check it would be very simple to predict which hashes are needed and collect them in a single disk read for every stripe of blocks, either in parallel or in advance while working.
Sync could build a new hash file in parallel with the old one and collect all still-valid hashes from the old hash file.
In the end, when it is complete, the hash file could be appended (basically stored) as a separate section at the end of the content file, so the end-user has fewer files to keep track of.
The obvious downsides would be the time to implement it, the loss of backwards compatibility, and probably a lot of head scratching over how to optimize the layout so it doesn't have to be a complete matrix with tons of dead space for less-used disks, while still making it possible to pinpoint exactly where to find a specific hash.
Additionally it might be a good idea to complement this with a hash over every stripe of hashes, to guard against corruption of the hashes themselves.
Personally I have very limited need for this, since I am quite happy with the balance of block size and RAM in my setup. So it is only an idea for a general improvement, in case you see a need for it.
Hi Leifi,
Yes. Storing the hashes in that order would allow fast access, as the hash file would be read sequentially in most operations.
I suppose that if such low-memory support is ever implemented, it's likely to be something like that. Maybe with two files: a ".content" for the other information and a ".hash" only for the hashes.
The issue is that if enough memory is available the present implementation is going to be faster anyway, so there is only a small incentive to do so.
Ciao,
Andrea
SnapRAID uses so little memory as it is, I prefer faster. Heck, I would welcome a switch or something that says it can use a LOT more memory, if available, to make it even faster... but I doubt something like that can be done (or would make any difference).
I have 16GB RAM, so plenty to play with.