1. Goals and brief description of functionality
2. Detailed description of functionality
Goals:
Checking (validating) the integrity of data on a storage device using hash signatures of the files that were generated earlier.
The validation process can be started manually or automatically by using a schedule.
Due to a defective hard drive or malware, your data can be damaged. It is likely that this won't be noticed until only defective copies of the data are left in your backups. Validating can prevent this.
You can make backups of your data, but almost no developer of backup software takes into account that the source data itself can be corrupted.
This can happen due to a defective hard drive or malware, and it is very likely that the damage will not be noticed for a long time.
When it is finally noticed, chances are that all backups have already been overwritten with defective copies of the data, so there is no valid copy left.
The result is that, despite a regular and redundant backup strategy and discipline, data will be lost.
The only way to prevent this is to also validate the source data.
When is a file considered to be corrupt?
If the date of change (in the metadata) is still the same as when its hash was generated, but the hash has changed.
I have been informed that not all applications update the date of change in a file's metadata when writing, as they are supposed to do.
If this is true, it definitely is an issue: it has to be clearly communicated to the user that a detected change of a file despite an unchanged date of change (in the metadata) does not necessarily mean that the file is damaged.
This is a problem.
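Even with that caveat, the detection rule itself is simple. A minimal C++ sketch (the types and names here are my own illustration, not a fixed design):

C++ (sketch):
#include <string>

struct FileRecord {
    long long mtime;   // date of change when the hash was generated
    std::string hash;  // hash signature generated earlier
};

enum class FileState {
    Unchanged,        // same date of change, same hash
    Modified,         // date of change differs: treat as a new version
    SuspectedCorrupt  // same date of change, different hash: possible damage
};

// Classify a file against its stored record, following the rule above.
// "SuspectedCorrupt" rather than "Corrupt", because of the caveat that
// some applications do not update the date of change when writing.
FileState classify(const FileRecord &rec, long long current_mtime,
                   const std::string &current_hash) {
    if (current_mtime != rec.mtime) return FileState::Modified;
    if (current_hash != rec.hash)   return FileState::SuspectedCorrupt;
    return FileState::Unchanged;
}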
Nevertheless, continuing to leave important data completely unchecked does not seem like a good idea to me, and for data that is not changed (anymore) the case is even clearer:
- Data of finished projects.
- Libraries or original photographs/recordings, as well as any kind of archived data (a photographer will surely want to keep unedited original photos; in desktop publishing and music production there are libraries such as stock photos, fonts and sound libraries, which are often huge in size)
- Previous backups (checking old backups for data integrity, maybe every 6 months or so, and before overwriting the backup copy with the longest turn-around interval*)
* Here I have a backup scenario in mind with multiple redundant backups at daily, weekly and monthly intervals. It is not a bad idea to keep permanent backup copies, but I assume that most typical users will eventually overwrite all backups.
In the example scenario, if files become damaged and this is not noticed for 1 month, then the last existing copy might be overwritten.
In most cases, files that have been backed up before and did not change won't be copied again, which reduces the risk of propagating data corruption.
But this does not change the problem you have if either
(a) there is just one backup copy plus the original, and one of the two is defective, or
(b) there are multiple backups, but the original or one of the copies differs from the others (professional backup programs can at least compare original and backup).
In the first case you practically have no backup (because it's damaged and you don't know it).
In the second case you would have to decide which file is OK, original or copy. With 1 original and 2 copies, chances are that at least 2 are identical, so you can make a decision. With only 1 backup you are on your own: you have to open the files manually and decide, which can be easy or impossible.
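The 2-of-3 decision could even be automated as a simple majority vote over the hashes of the original and its copies. A short C++ sketch (the helper is hypothetical and assumes all hashes are already available):

C++ (sketch):
#include <map>
#include <string>
#include <vector>

// Return the hash that a strict majority of the given versions agrees
// on, or "" if there is no majority and the user must decide manually.
std::string majority_hash(const std::vector<std::string> &hashes) {
    std::map<std::string, int> votes;
    for (const std::string &h : hashes) ++votes[h];
    for (const auto &v : votes)
        if (2 * v.second > (int)hashes.size()) return v.first;
    return "";
}

With 1 original and 2 copies this returns a result whenever at least 2 of the 3 are identical; with only 1 original and 1 copy it never does, which matches the situation described above.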
Long-term goal:
Integrating Data Integrity Monitor into a backup application. The application should, however, always be available as a separate program.
Main platform: Mac OS X. Hopefully also Linux. If someone would like to do a Windows version, I have nothing against it.
Language: Objective-C or C++
As far as I know, Unix and its derivatives already include a command-line tool that calculates hashes.
It is obvious to use this.
Here it is:
bash shell:
for i in $(find .); do if test -f "$i"; then md5sum "$i" >> /tmp/sum.md5; fi; done
This calculates MD5 checksums for all files in the current directory/folder and its sub-directories and saves them in the file /tmp/sum.md5. (Filenames containing spaces break the $(find .) loop; find . -type f -exec md5sum {} + >> /tmp/sum.md5 is more robust. On Mac OS X the tool is called md5 rather than md5sum.)
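The program itself could simply wrap that system tool. A minimal C++ sketch, assuming md5sum-style output ("<hash>  <path>"); the function name and the naive quoting are my own illustration:

C++ (sketch):
#include <stdio.h>
#include <stdexcept>
#include <string>

// Compute the MD5 hash of one file by invoking the system tool.
// Naive single-quoting: paths containing ' would need escaping.
std::string md5_of_file(const std::string &path) {
    std::string cmd = "md5sum '" + path + "'";
    FILE *pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("could not run md5sum");
    std::string output;
    char buf[256];
    while (fgets(buf, sizeof(buf), pipe)) output += buf;
    pclose(pipe);
    if (output.size() < 32) throw std::runtime_error("unexpected output");
    return output.substr(0, 32);  // the hash is the first 32 hex digits
}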
Detailed description of functionality:
The program will save hashes for all files within folders/directories (including sub-folders/sub-directories) that the user can select.
The saved hash signatures can then be used later to validate that the data has not been damaged.
The user can also define paths/volumes that contain copies (backups) of the source data; this makes automatic validation easier when external hard drives are used (more about that later).
Typically, the source selection will be the home folder/directory.
It should be possible to exclude folders/directories or files by selecting them (a list view like the Finder's, with a checkbox for each object) or by using include/exclude lists (I would call this a filter), where the user can add rules for objects (files/folders) to include or exclude depending on properties like filename, file type, size and date of change. A sketch of such a rule follows below.
By default, caches would be excluded (by the filter).
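A minimal C++ sketch of one such filter rule (field names and matching semantics are my own illustration; how rules combine is an open design question):

C++ (sketch):
#include <fnmatch.h>
#include <string>

// One include/exclude rule of the filter described above.
struct FilterRule {
    bool exclude;              // exclude (true) or include (false)
    std::string name_pattern;  // shell-style pattern, e.g. "*Cache*"
    long long min_size;        // size bound in bytes; 0 = no bound
};

// Does this rule apply to the given file?
bool rule_matches(const FilterRule &r, const std::string &filename,
                  long long size) {
    return fnmatch(r.name_pattern.c_str(), filename.c_str(), 0) == 0
        && size >= r.min_size;
}

The default cache exclusion could then be expressed as a built-in exclude rule such as { true, "*Cache*", 0 }.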
This set of hash signatures derived from the source is a snapshot or hash-set.
The snapshot or hash-set can then be used to verify/validate the original data and copies of it (backups).
The snapshot has to be updated when new versions of existing files are saved and when new files are added.
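A sketch of the snapshot/hash-set and its update rule in C++ (names are illustrative only). Keying records by path also means that two jobs covering the same file can share a single record:

C++ (sketch):
#include <map>
#include <string>

struct SnapshotEntry {
    long long mtime;   // date of change when the hash was generated
    std::string hash;  // hash signature
};
using Snapshot = std::map<std::string, SnapshotEntry>;  // keyed by path

// Update rule for one file found while scanning the source:
// new path -> add a record; changed date of change -> store the new hash;
// unchanged date of change -> keep the stored hash so that validation can
// still detect damage. (A real implementation would compute the hash
// lazily, only for new or changed files.)
void update_entry(Snapshot &snap, const std::string &path,
                  long long current_mtime, const std::string &current_hash) {
    auto it = snap.find(path);
    if (it == snap.end() || it->second.mtime != current_mtime)
        snap[path] = SnapshotEntry{ current_mtime, current_hash };
}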
Automatic updating & validation - schedules, intervals, start on mounting a backup volume:
Updating the snapshot (calculating hash signatures of new source data and deleting the hashes of files that have been deleted; deleting should optionally be turned off for versioned backups, more about that later) and validating source data and backups (copies of the source data) can be started manually.
But both (updating the snapshot and validating data) can also be done automatically.
Options for this:
- Interval: The action is performed every x days. The time of day can be selected specifically.
- Schedule: The user can select a specific day of the week (also multiple weekdays) or day(s) of a month on which the update of the snapshot and/or the validation of data is performed.
- Start on mounting a backup volume (with optional delay): Not all external hard drives used for backups are always connected to the computer, so it is desirable that the program starts validating whenever a volume is mounted that has been defined as a copy of the source data.
However, in that case a backup will typically be performed first, so it would be great if the validation waited until the backup is finished.
Different ways to do that:
(1) Some backup programs can run AppleScripts when they are finished; such a script could start the validation job.
(2) The program could wait until the hard drive is no longer being used (using Activity Monitor, which comes with OS X) and then start validating, maybe with a delay of x minutes (to make sure the backup process is really finished and/or to let the backup drive cool down).
(3) Using a delay of x minutes after the backup volume is mounted before starting to validate the data, as sketched below.
The delay would be an estimate of how long the backup program takes to finish. This is not ideal, but it is probably the easiest way and sufficient for an early implementation of the feature.
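A minimal C++ sketch of option (3), assuming a fixed mount point under /Volumes and a fixed delay (both are examples); a real implementation on OS X would presumably listen for NSWorkspaceDidMountNotification instead of polling:

C++ (sketch):
#include <sys/stat.h>
#include <unistd.h>

// If the backup volume is mounted, wait the configured delay and then
// tell the caller to start the validation job.
bool wait_then_validate(const char *mount_point, int delay_minutes) {
    struct stat st;
    if (stat(mount_point, &st) != 0) return false;  // not mounted
    sleep(delay_minutes * 60);  // rough estimate for the backup to finish
    return true;                // caller starts validating now
}

Usage would be something like wait_then_validate("/Volumes/Backup", 45).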
Option: Work in advance. This means that the program would use the time when the computer is not being used to update the snapshot and/or to validate data. So when the scheduled action starts, a lot of the work is already done and the job can be completed more quickly.
It should be possible to use different settings for automatic updating and validation for different selections of data (to validate important data more frequently).
Obviously, separate snapshots/hash-sets for different jobs should be avoided: if two validation jobs partly include the same data, the hash signatures should not be calculated and stored twice.
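With the snapshot keyed by path (as sketched earlier), a job only needs to reference the shared hash-set rather than own one. A short C++ sketch (the Job structure is hypothetical):

C++ (sketch):
#include <string>
#include <vector>

// A job stores only its scope and schedule; the hash signatures live in
// the single shared snapshot, so overlapping jobs reuse the same records.
struct Job {
    std::vector<std::string> roots;  // folders/volumes this job covers
    int interval_days;               // how often the job runs
};

// Is the given path inside one of the job's root folders?
bool job_covers(const Job &job, const std::string &path) {
    for (const std::string &root : job.roots)
        if (path.rfind(root, 0) == 0) return true;  // path starts with root
    return false;
}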
Things I am not yet sure how to organize:
It should be possible to use separate schedules/intervals for updating the snapshot and for validating.
I am not sure how to separate the settings for the snapshot and its copies (backups) from the scheduling in the GUI, because I think separate schedules would make sense, but of course this makes things more complicated.
Various things:
Optional easy setup using templates for typical uses.
For example:
Snapshot of complete home folder excluding caches. Updating snapshot once a week, validating source once a month. Work in advance (whenever computer is unused) activated for updating the snapshot.
The user would only have to set the desired time of day for updating the snapshot and validating, and add the backups.
It is important that the snapshot is saved separately from the job data (scheduling), because then the same snapshot can be used for jobs with different schedules, different inclusions/exclusions for validating source data, and various copies of the source data (backups).
This would also allow the user to create additional validation/update jobs that include just the most important folders/files and use a more frequent update/validation schedule (while still using the same snapshot).
When a file is found that is suspected to be damaged (because the date of change is still the same but the hash no longer matches), the program should offer to copy the file from a backup to the internal HD and validate that copy.
Data integrity of the snapshot itself has to be guaranteed.
The place to store the snapshot can be determined by the user.
The estimated storage space required for the snapshot should be shown to the user.
It is possible that new files are matched by old exclusion rules without the user realizing it.
So, as an option, it would be good if new exclusions first had to be shown to the user, who has to confirm that these files will be excluded; until confirmed, the exclusion is inactive.
This should be an option, since not everybody will want it:
(checkbox) Always show newly excluded files/folders on start-up
(checkbox) New exclusions have to be confirmed by the user before they become active (to avoid new files being unintentionally excluded by old rules)
The time needed for updating the snapshot and for validating data should be remembered, to give a better estimate of how long a job will take in the future.
Before starting a job (updating the snapshot, validating data), the program should (optionally) show a summary with the amount of data to be processed (total file size and number of files).
Maybe it is even possible to estimate how long it will take (a sketch follows below).
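A trivial C++ sketch of such an estimate, using the measured throughput of the previous run (function and parameter names are my own illustration):

C++ (sketch):
// Estimate a job's duration from the throughput of an earlier run.
// Returns -1 if there is no history to base the estimate on.
double estimate_seconds(long long bytes_to_process,
                        long long bytes_last_run, double seconds_last_run) {
    if (bytes_last_run <= 0 || seconds_last_run <= 0) return -1;
    double bytes_per_second = bytes_last_run / seconds_last_run;
    return bytes_to_process / bytes_per_second;
}

For example, if the last run validated 200 GB in 2 hours, a 50 GB job would be estimated at about 30 minutes.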
It should be possible to pause a snapshot update or a validation run.
Option:
The program asks the user to connect/mount a backup device/volume when the scheduled time for validation has come.
It should also be possible to do this when the last complete validation was more than x days ago.
It should be possible to show a list of all files excluded by the rules in a window. The window should have 2 panels: when you click on any of the excluded files in the list in the upper part of the window, the rule that excludes this file is shown in the lower part, where it can be edited or deleted. You can also select any excluded file and (from the context menu, right-click) add it to the include list.
Log:
The brief summary is stored as a separate log from the detailed log.
Filenames:
YYYY-MM-DD-HH-MM-SS-detailed.log
YYYY-MM-DD-HH-MM-SS-brief summary.log
Handled this way, old detailed logs can be deleted after a given time while the brief summaries are kept longer.
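Generating such filenames is straightforward; a small C++ sketch (the suffix parameter is illustrative):

C++ (sketch):
#include <ctime>
#include <string>

// Build a log filename in the YYYY-MM-DD-HH-MM-SS scheme above,
// e.g. log_filename("detailed") -> "2024-01-31-14-30-00-detailed.log".
std::string log_filename(const std::string &suffix) {
    char buf[32];
    std::time_t now = std::time(nullptr);
    std::strftime(buf, sizeof(buf), "%Y-%m-%d-%H-%M-%S",
                  std::localtime(&now));
    return std::string(buf) + "-" + suffix + ".log";
}

Because the timestamp prefix sorts chronologically, pruning old detailed logs while keeping the summaries is a simple filename comparison.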
Pause HD access to give hard drives a rest.
Read/write access can be paused based on these conditions:
(1) After x minutes of reading/writing from/to the hard drive
(2) Depending on the temperature values of the hard drive's/CPU's built-in sensors
(3) Depending on location, time of day, season, weather data
I think (3) is unnecessary, because it should be possible to read the sensor data for (2) easily; there are a few freeware programs that already do this.