Hi,
this approach might also be interesting for other OSes and file systems that have journals.
I was surprised that I couldn't find this idea/enhancement in the forum already (or did I miss it?).
I know the primary focus of SnapRAID is on "static" data, but why not use journal information to identify added, deleted, and modified files (and directories in the future?) instead of scanning the entire disk?
Today I identified 2 silent corruptions that were actually none: an application or the OS changed the file content and "restored" the original last-modified date. That's really bad behavior, but out of my control. Right now, the only option I have is to exclude this data from SnapRAID...
I couldn't find the Microsoft NTFS Change Journal API docs yet, but they're public and should be there... somewhere ;-)
If the journals no longer contain the logged changes, SnapRAID could still fall back to the scanning approach.
Let me know your thoughts.
Cheers,
Jens.
Ok, I missed that one: https://sourceforge.net/p/snapraid/discussion/1677233/thread/716be5d9/#b54f. It was more a question, but Ludwig had the same expectation there: faster than a file scan...
I'm more interested in catching silent data changes that aren't corruptions...
I found this demo accessing the NTFS journal, incl. source code: http://www.codeproject.com/Articles/11594/Eyes-on-NTFS
Also some C code on MSDN: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365736%28v=vs.85%29.aspx
I'll check how to integrate it into SnapRAID over the next couple of days, as in my SR config 95% of the entire sync time goes into scanning the data drives' folders ;-)
Hi Jens,
Yep. Reading the journal is for sure an interesting approach!
Anyway, my next major TODO entry for SnapRAID is to use multiple threads to scan the disks for changes. This should already provide some kind of improvement. I prefer to do this first, as it's something that will work everywhere.
But feel free to experiment :)
Ciao,
Andrea
Hi Andrea,
you mean a thread per disk scan? I'm sure that will improve the total scan time.
I was a bit surprised by the disk scanning time I get with SR in my environment (Windows 2012 on an HP MicroServer N54L), and before coding a test tool I thought, why not google for an existing one, and found https://sourceforge.net/projects/dirtree/. I thought I could test multi-threaded scanning even on a single disk (though I did not really expect any improvement) with that tool, but it was single-threaded as well...
Anyway, I have an interesting finding: though the CPU utilization for scanning a sample disk of some 40k directories with 400k files was pretty much the same between SR and dirtree, dirtree completed the scan a perceived 10 times faster...
Are you doing much more than just scanning the disk and retrieving all folder and file details? I cannot really explain that huge difference in scanning performance between SR and dirtree.
Regarding the NTFS journal I found these old but really nice articles from 2009 for my reading before going through the API:
Keeping an Eye on Your NTFS Drives: the Windows 2000 Change Journal Explained
http://www.microsoft.com/msj/0999/journal/journal.aspx
Keeping an Eye on Your NTFS Drives, Part II: Building a Change Journal Application
http://www.microsoft.com/msj/1099/journal2/journal2.aspx
Looks more complex than I initially expected - haha ;-)
Cheers,
Jens.
Hi Jens,
Yes. SnapRAID does something more than a normal directory listing. It has also to gather information about the physical location of the files on the disk.
Anyway, likely some optimizations for Windows are possible.
Could you please try this special 7.0 version at: http://snapraid.sourceforge.net/alpha/
First try a "snapraid diff" and then a "snapraid --test-force-order-dir diff".
The diff command has no risk involved as it's read only, so, it's safe to try.
Is it faster than the normal SnapRAID ?
Ciao,
Andrea
Hi Andrea,
cool, quick alpha binaries and I had to test it today :-)
Here are the stats of my tests (avg of two executions per diff test):
SR 6.3 diff: 7:23 minutes
SR 7.0 diff: 3:51 minutes
SR 7.0 --test-force-order-dir diff: 0:36 minutes !!!
The standard diff performance of 7.0 is already hugely improved, but with --test-force-order-dir it's over 12 times faster compared to 6.3!
Now, I did not test sync with 7.0, as 3 files on my snapshot disks have an unexpected 0 size that required the --force-zero option for a complete diff execution. As 6.3 is reporting no differences, I'm a bit cautious though...
Anyway, the new option improves scanning like hell!
Great job!!!
Once I can use it in production, I'll monitor general sync throughput changes (if any ;-)!
Cheers,
Jens.
Hi Jens,
Thanks for the prompt report. Now I know I'm working in the right direction :)
Do you have some more info about those three files with 0 size? Are they normal files? I suppose they are not really 0 size. What is their real size?
This is somewhat unexpected...
In case you are interested, the reason for these timings is that 6.3 has to make two slow calls for each file, to read the inode and the physical address.
7.0 uses a different and fast way to read the inode, which almost halves the time. The --test-force-order-dir option also removes the need for the physical address, and then it becomes really fast.
The plan is to be more selective about the physical address, as it's needed only when a file is seen for the first time. So, in theory, the super-fast speed is the goal for 7.0 :)
Thanks,
Andrea
Hi Andrea,
regarding the 3 files, I'm also surprised. One is actually 0 bytes, one is 154 bytes, and the last one is 52,445,871 bytes. They only have one thing in common: they had open write file handles during NTFS shadow creation followed by the SR sync. But with SR 7.0 it's the first time I encounter this warning:
"The file 'C:/yadda' has unexpected zero size! If this is an expected state
you can 'diff' anyway using 'snapraid --force-zero diff'.
Instead, it's possible that after a kernel crash this file was lost,
and you can use 'snapraid --filter yadda fix' to recover it."
I've been using disk volume shadow copies as SR data disks for almost a year; really stable...
Cheers,
Jens.
Hi Andrea,
quick update on SR 7 alpha running on VSS snapshot disks:
I'm using commands like mklink /d "F:\$RECYCLE.BIN\snapraid\" "\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy35\" to mount the shadow copy on a path, and I use "F:\$RECYCLE.BIN\snapraid\" as a data disk. Now I executed a SR 7 sync and got the same zero-size error. After adding --force-zero I received many read errors on one of the files:
Unexpected size change at file 'F:/$RECYCLE.BIN/snapraid/Save.tv & Co/Sick-Beard/mr-orange_Sick-Beard.git/Logs/sickbeard.log' from 79942318 to 79943426.
WARNING! You cannot modify files during a sync.
That's really strange because the data on the mount point is a shadow copy, and that data cannot change.
Is SR 7 resolving mount points somehow, even on the data disk root level?
Cheers,
Jens.
Hi Jens,
The reason for this behavior is the Windows filesystem cache. The new way to read directories uses it, and such a cache may return information that is up to five minutes old.
Anyway, I've now changed SnapRAID to ensure that any modified file uses the old method to get updated information. This should be a good compromise: the fast method for static files, and the slow one for dynamic files.
The same applies to reading the physical offsets of files. It's now done only for the files that really need it. So, now it should always be fast, without the need for the --test-force-order-dir option.
Please let me know how it behaves.
I've also added a new undocumented option --test-force-scan-winfind, to force the use of the old directory listing on all files.
Available at: http://snapraid.sourceforge.net/alpha/
Ciao,
Andrea
Hi Andrea,
thx for the quick updated alpha. I've tested the performance with and without the --test-force-scan-winfind option:
snapraid diff: 41 seconds, but with 3 warnings: "WARNING! Detected uncached size change for file 'yadda'. It's better if you run SnapRAID without other processes running." (same files mentioned earlier)
snapraid --test-force-scan-winfind diff: 250 seconds, with no warnings
But I'm really surprised about "Windows filesystem cache" behavior for disk shadow mounts.
I'll check if using different mounting ways might prevent it or other ways exist to free the cache...
Anyway, in my scenario, getting the new directory scanning approach to work in the end would be fast enough, and NTFS journal scanning might not be required (though I'm still interested in getting it tested ;-).
Regarding the NTFS journal: it knows only about filenames (no path names) and File Reference Numbers. My question for you: are these the same ones you are using in SR?
The reason I'm asking is that resolving an FRN to a path name is not that straightforward and requires maintaining a local "database", which SR is already doing...
Cheers,
Jens.
Hi Andrea,
I've now tested the SR 7 alpha sync on my system and it looks stable, though the "uncached" warnings are a little strange, as they don't disappear even after running sync...
Cheers,
Jens.
Hi Jens,
Don't worry about these "uncached" warnings. They are present just to confirm that the problem was the Windows cache. And indeed it is.
I will remove them in the final release.
About this cache problem: it's a general one, not related to shadow copies. The only official reference to it is in the MSDN FindFirstFile documentation.
Unfortunately, in Windows there are a lot of ways to list directories, but none that reads all the info quickly :( SnapRAID has to do some kind of magic to get to it.
Anyway, I think we've got a good solution now. Really, thanks for your tests!
About your question: yes, File Reference Numbers are the ones that SnapRAID uses as inode info. Getting the list of FRNs of the modified and new files would allow SnapRAID to avoid scanning the directories.
Ciao,
Andrea
Hi Andrea,
cool, I will check on the NTFS journal. It will be a playback of all changes, including file & folder renames, modifications, creates, and deletes. And it should fall back to regular scanning when the journal is not available or is incomplete since the last run...
Cheers,
Jens.
Hi Andrea,
I've reviewed the scan code of SR and have to say it looks really good.
Also, the Win API regarding the NTFS journal is nice and very easy to use & compile.
Let's see how far I get :-)
Cheers,
Jens.
Last edit: Jens Bornemann 2014-10-22
Hi Andrea,
I'm testing some code changes to the dir handling in SR (in the content data and the tommy structures) to support NTFS journal changes in scan. First I've tested treating every dir as stored in the content file (not only the empty dirs). I also added the dir inode (now remember my first feature request of restoring folder modify dates - haha ;-).
But to fully support journal folder renames and moves within a disk, I'm really thinking the flat file/dir structure should be changed to a tree, representing the same hierarchy as it exists on the file system. I'm a little unsure about that bigger change in terms of possible performance impacts, memory consumption, and general stability, though I don't expect significant performance or memory impacts.
I really want to contribute (at least NTFS) journal support to SR and your feedback on that would really help me finding the right direction.
Once my forked SR git repo works as I expect it and I'm confident with code changes, will push it to the right place for your review/feedback.
Cheers,
Jens.
Hi Jens,
My recommendation is to keep the changes small. The smaller the changeset, the more likely it will be integrated.
Big changes need a lot of work to get stable. Better to do one small step at a time. For example, restoring the dir timestamp can be a separate feature that can be integrated a lot more easily by itself, and it's surely a good starting point.
About the NTFS journal: the first thing you can do is check for any speed improvement compared to the new fast dir scanning of 7.0. Obviously, using the journal must be faster to justify its integration.
About changing the internal data structure, hmmmm. It's the kind of big change I would avoid. But maybe that's just because I miss the potential advantage of it.
Anyway, I'm really interested to see your work :)
Thanks,
Andrea
Hi Andrea,
I try to keep the changes small, but organizing them in aligned branches is something I need to do ;-)
I think the changes look good; I will push to my fork soon.
Quick question about state.c (line 4577/8) / method state_filter: why are you using filter_dir() and not filter_path() for each dir? I was testing my changes and tried to fix only one directory with "-f /test/", and realized that directories on all disks were being restored.
Great job with the new version and your support - as always!
Cheers,
Jens.
Hi Jens,
The difference between filter_dir() and filter_path() is to tell whether the passed string is a dir or a file.
We have different exclusion rules for them, so the filter has to know which it is.
Not sure about restoring all dirs. It's true that part is only for empty dirs, but likely you modified that.
Ciao,
Andrea
Hi Andrea,
thx - I will review it later again (not that important right now).
Now, though I've made many tests and good progress, I realized I'm fishing too much in the unknown: Win API, journal, SR database extensions, new methods here and there...
So today I've decided I need test cases for every NTFS change that gets traced back by iterating the journal, before completing the journal support in SR. But first I'll clean up my work and push it, before starting on the journal test and validation tool.
Brief summary of my changes (off the top of my head ;-):
- 'R'/"Dir" tags replace 'r'/"dir", with mtime (ns) and inode support
- 'S'/"Symlink" and 'A'/"Hardlink" replace the link tags, with mtime (ns) and inode support (though I think I might not need that anymore, as only symlinks have their own mtime and inode -> that's why I need a test tool testing all cases)
- renamed the emptydir methods and comments to dir (all dirs)
- added mtime restore for dirs (succeeded also with symlinks, but hardlinks probably don't support it... for good reasons?!?...)
- and probably some more...
I try to define consistent commits, as putting it all in one might also be confusing for me.
(Hey, my first git commit/work ever ;-)
Cheers,
Jens.