Thread: [sleuthkit-users] hashing a file system
From: Stuart M. <st...@ap...> - 2014-09-04 22:27:27
I am tracking recent efforts in STIX and Cybox and all things Mitre. One indicator of compromise is an md5 hash of some file. Presumably you compare the hash with all files on some file system to see if there is a match. Obviously this requires a walk of the host fs, using e.g. fls or fiwalk or the tsk library in general.

Is this a common activity, the hashing of a complete filesystem that is? If yes, some experiments I have done with minimising total disk seek time by ordering Runs, reading content from the ordered Runs and piecing each file's hash back together show that this is indeed a worthy optimization, since it can decrease the time spent deriving the full hash table considerably.

I did see a slide deck by Simson G where he alluded to a similar win when disk reads are ordered so as to minimise seek time, but I wonder if much has been published on the topic, specifically in the digital forensics arena, i.e. when an entire file system's contents is to be read in a single pass for the purposes of producing an 'md5 -> file path' map.

Opinions and comments welcomed.

Stuart
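Concretely, the baseline 'md5 -> file path' map can be sketched in a few lines of Python (a minimal sketch using only hashlib, walking a mounted copy of the file system rather than going through fls/fiwalk/TSK; the chunk size is arbitrary):

    import hashlib, os

    def md5_of_file(path, chunk=1 << 20):
        """Hash a file incrementally so the whole file never sits in RAM."""
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(chunk), b''):
                h.update(block)
        return h.hexdigest()

    def build_md5_map(root):
        """Return {md5 hex digest: [paths]} for every regular file under root."""
        table = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path):          # skip sockets, broken links, etc.
                    table.setdefault(md5_of_file(path), []).append(path)
        return table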
From: Stuart M. <st...@ap...> - 2014-09-04 23:16:17
On 09/04/2014 03:46 PM, Simson Garfinkel wrote:
> The MD5 algorithm won't let you combine a partial hash from the middle of the file with one from the beginning. You need to start at the beginning and hash through the end. ... So I believe that the only approach is sorting the files by the sector number of the first run, and just leaving it at that.

Hi Simson, currently I have just got as far as noting the 'seek distances' between consecutive runs, across ALL files. I have yet to actually read the file content. But I don't think it's that hard. As you point out, md5 summing must be done with the file content in the correct order. I see an analogy between 'runs ordered by block address but not necessarily file offset' and the problem the IP layer has in tcp/ip as it tries to reassemble the fragments of a datagram that may arrive in any order. We may have to have some 'pending data' structure for runs whose content has been read but which cannot yet be offered to the md5 hasher because an as-yet-unread run is needed first.

I'll let you know if/when I nail this. Perhaps Autopsy could benefit? Is fiwalk doing it the 'regular way' too, i.e. reading all the content of each file as the walk proceeds?

Stuart
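The 'pending data' structure Stuart describes might look roughly like this (a sketch, assuming whatever reads the disk delivers (file_id, file_offset, data) tuples in physical block order; the class and field names are invented for illustration):

    import hashlib

    class FileHashState:
        """Reassemble one file's fragments in logical order and hash incrementally."""
        def __init__(self):
            self.md5 = hashlib.md5()
            self.next_offset = 0      # next file offset the hasher expects
            self.pending = {}         # file offset -> bytes read out of order

        def add_fragment(self, offset, data):
            if offset == self.next_offset:
                self.md5.update(data)
                self.next_offset += len(data)
                # Drain any buffered fragments that are now contiguous.
                while self.next_offset in self.pending:
                    chunk = self.pending.pop(self.next_offset)
                    self.md5.update(chunk)
                    self.next_offset += len(chunk)
            else:
                self.pending[offset] = data   # hold until the earlier runs arrive

    def hash_in_block_order(fragments):
        """fragments: iterable of (file_id, file_offset, data) in disk-block order."""
        states = {}
        for file_id, offset, data in fragments:
            states.setdefault(file_id, FileHashState()).add_fragment(offset, data)
        return {fid: s.md5.hexdigest() for fid, s in states.items()}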
From: Simson G. <si...@ac...> - 2014-09-05 00:22:25
Yes, fiwalk hashes in SleuthKit order. If you want to hash in block order you need to generate the DFXML for the entire drive and sort by the index of the first run.

On Sep 4, 2014, at 7:53 PM, Stuart Maclean <st...@ap...> wrote:
> Is fiwalk doing it the 'regular way' too, i.e. reading all the content of each file as the walk proceeds?
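A rough sketch of that DFXML route in Python (the element and attribute names, 'fileobject', 'filename' and 'byte_run' with an 'img_offset' attribute, are assumptions based on what fiwalk typically emits; namespace handling is reduced to matching local names):

    import xml.etree.ElementTree as ET

    def local(tag):
        """Strip any XML namespace so we can match on local element names."""
        return tag.rsplit('}', 1)[-1]

    def files_by_first_run(dfxml_path):
        """Return (img_offset_of_first_run, filename) pairs sorted by physical position."""
        entries = []
        for _event, elem in ET.iterparse(dfxml_path):
            if local(elem.tag) != 'fileobject':
                continue
            name, first = None, None
            for child in elem.iter():
                if local(child.tag) == 'filename' and name is None:
                    name = child.text
                elif local(child.tag) == 'byte_run' and first is None:
                    off = child.get('img_offset')
                    if off is not None:
                        first = int(off)
            if name is not None and first is not None:
                entries.append((first, name))
            elem.clear()              # keep memory bounded on big DFXML files
        return sorted(entries)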
From: Simson G. <si...@ac...> - 2014-09-04 23:31:44
Hi Stuart.

You are correct — I put this in numerous presentations but never published it.

The MD5 algorithm won't let you combine a partial hash from the middle of the file with one from the beginning. You need to start at the beginning and hash through the end. (That's one of the many problems with MD5 for forensics, BTW.) So I believe that the only approach is sorting the files by the sector number of the first run, and just leaving it at that.

I saw speedup with both HDs and SSDs, strangely enough, but not as much with SSDs. There may be a prefetch thing going on here.

I think that the Autopsy framework should hash this way, but currently it doesn't. On the other hand, it may be more useful to hash based on the "importance" of the files.

Simson

On Sep 4, 2014, at 7:04 PM, Stuart Maclean <st...@ap...> wrote:
> If yes, some experiments I have done with minimising total disk seek time by ordering Runs, reading content from the ordered Runs and piecing each file's hash back together show that this is indeed a worthy optimization, since it can decrease the time spent deriving the full hash table considerably.
From: Luís F. N. <lfc...@gm...> - 2014-09-05 23:02:09
Hi Simson,

I have had thoughts about implementing this "sort by sector number of first run" approach in a forensic tool based on TskJavaBindings, but I did not see how to get a file's first sector number through the API. Do you know if it is possible with the tsk java bindings?

Regards,
Luis Nassif

2014-09-04 20:13 GMT-03:00 Simson Garfinkel <si...@ac...>:
> So I believe that the only approach is sorting the files by the sector number of the first run, and just leaving it at that.
From: RB <ao...@gm...> - 2014-09-05 00:51:31
On Thu, Sep 4, 2014 at 5:04 PM, Stuart Maclean <st...@ap...> wrote:
> Is this a common activity, the hashing of a complete filesystem that is?

Yes, this is a question we've actually discussed on this list in recent memory. Fiwalk/DFXML is great for automation, but you can use "tsk_gettimes -m" to both pull listings and checksums at the same time for a quick win. The output is in traditional bodyfile form with the md5 field actually populated (instead of being "0"). I've incorporated it into a script that does other things (pulls other useful files from the disk), but all told it takes 8-40 minutes (averaging around 20) to burn through an average 120-300GB disk, usually CPU- or IOPS-bound. This is, as you indirectly noted, heavily affected by fragmentation.

Although md5 is not a subdivisible hash (as Simson pointed out), one could conceivably still do a single-pass checksum of a filesystem; the tradeoff would be the memory consumption of tens of thousands of "simultaneous" checksum calculation states.
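For reference, turning that bodyfile output into an 'md5 -> file path' map is just a matter of splitting fields (a sketch assuming the TSK 3.x bodyfile layout, md5 first and name second; adjust if your version differs):

    def md5_map_from_bodyfile(bodyfile_path):
        """Build {md5: [paths]} from a bodyfile (md5|name|inode|mode|uid|gid|size|...)."""
        table = {}
        with open(bodyfile_path, 'r', errors='replace') as f:
            for line in f:
                fields = line.rstrip('\n').split('|')
                if len(fields) < 2:
                    continue
                md5, name = fields[0], fields[1]
                if md5 and md5 != '0':        # '0' means the hash was not computed
                    table.setdefault(md5, []).append(name)
        return table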
From: Simson G. <si...@ac...> - 2014-09-05 01:01:01
On Sep 4, 2014, at 8:51 PM, RB <ao...@gm...> wrote:
> Although md5 is not a subdivisible hash (as Simson pointed out), one could conceivably still do a single-pass checksum of a filesystem; the tradeoff would be the memory consumption of tens of thousands of "simultaneous" checksum calculation states.

This doesn't work unless you are prepared to buffer the later fragments of a file when they appear on disk before earlier fragments. So in the worst case, you need to hold the entire disk in RAM.
From: RB <ao...@gm...> - 2014-09-05 01:11:34
On Thu, Sep 4, 2014 at 7:01 PM, Simson Garfinkel <si...@ac...> wrote:
> This doesn't work unless you are prepared to buffer the later fragments of a file when they appear on disk before earlier fragments. So in the worst case, you need to hold the entire disk in RAM.

Perhaps I'm being dense, but "dd if=file | md5sum -" in no way holds the entire file in RAM, and the process can be slept/interrupted/etc; all this means that md5 can be calculated over a stream.

Looking at the API for the Perl & Python MD5 libraries (expected to be the simplest), they have standard functionality for adding data to a hash object, and I don't expect it holds that in memory either. This would mean you should be able to make a linear scan through the disk and, as you read blocks associated with a file, append them to the md5 object for that file, and move on. You'd have a lot of md5 objects in memory, but it shouldn't be of a size equivalent to the entire [used] disk.
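What RB describes amounts to keeping one hash object per file and appending blocks as the linear scan reaches them. A minimal sketch (as Simson points out in the next message, the digests are only correct for files whose blocks happen to lie in logical order):

    import hashlib

    def naive_single_pass(blocks):
        """blocks: iterable of (file_id, data) pairs in physical disk order."""
        hashers = {}
        for file_id, data in blocks:
            # Each hashlib.md5 object holds only its small internal state, not the
            # data hashed so far, so tens of thousands of them are cheap to keep.
            hashers.setdefault(file_id, hashlib.md5()).update(data)
        return {fid: h.hexdigest() for fid, h in hashers.items()}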
From: Simson G. <si...@ac...> - 2014-09-05 01:15:59
On Sep 4, 2014, at 9:11 PM, RB <ao...@gm...> wrote:
> Perhaps I'm being dense, but "dd if=file | md5sum -" in no way holds the entire file in RAM, and the process can be slept/interrupted/etc; all this means that md5 can be calculated over a stream.

You are confusing the physical layout of the disk with the logical layout of the files. You have proposed reading the disk in physical block order. If you are reading the disk in block order, what happens if you have a 30GB file where the first block of the file is at the end of the disk and the rest of the file is at the beginning? You have to buffer the portions of the file that come first on the disk but logically later in the file. Then, when you reach the beginning of the file (at the end of the disk), you can start hashing.

The problem is that files are fragmented, and frequently the second fragment of a file comes earlier on the disk than the first fragment.
From: RB <ao...@gm...> - 2014-09-05 01:21:19
On Thu, Sep 4, 2014 at 7:15 PM, Simson Garfinkel <si...@ac...> wrote:
> You are confusing the physical layout of the disk with the logical layout of the files. <snip> The problem is that files are fragmented, and frequently the second fragment of a file comes earlier on the disk than the first fragment.

Thanks for your patience, I was indeed failing to take this into account.
From: Stuart M. <st...@ap...> - 2014-09-05 17:09:35
Hi all, I'm glad to have provoked some conversation on the merits (or otherwise!) of md5 sums as useful representations of the state of a file system.

Can anyone enlighten me as to the meaning of the 'flags' member in a TSK_FS_ATTR_RUN? Specifically, what does this comment mean?

TSK_FS_ATTR_RUN_FLAG_FILLER = 0x01, ///< Entry is a filler for a run that has not been seen yet in the processing (or has been lost)

In a fs I am walking and inspecting the runs for, I am seeing run structs with addr 0 and flags 1. I was under the impression that any run address of 0 represented a 'missing run', i.e. that this part of the file content is N zeros, where N = run.length * fs.blocksize. I presume that would be the case were the run flags value 2:

TSK_FS_ATTR_RUN_FLAG_SPARSE = 0x02 ///< Entry is a sparse run where all data in the run is zeros

If I use istat, I can see inodes which have certain 'Direct Blocks' of value 0, and when I see M consecutive 0 blocks that matches up to a 'missing run' when inspecting the runs using the tsk lib (actually my tsk4j Java binding, which is now finally showing its worth, since I can do all data structure manipulation in Java, nicer than in C, for me at least).

My worry is that, being 'filler' and not 'sparse', the partial file content represented by the run(s) with addr 0 is not necessarily a sequence of zeros.

Can anyone shed light on this? Brian?

Thanks

Stuart
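For what it's worth, the distinction Stuart is drawing could be handled roughly like this (a sketch only: the run object with .addr, .len and .flags is assumed to mirror TSK's TSK_FS_ATTR_RUN struct, and read_blocks stands for whatever actually pulls blocks from the image; the flag values are the ones quoted above):

    TSK_FS_ATTR_RUN_FLAG_FILLER = 0x01   # placeholder for a run not yet seen / lost
    TSK_FS_ATTR_RUN_FLAG_SPARSE = 0x02   # run is all zeros, nothing stored on disk

    def content_for_run(run, read_blocks, block_size):
        """Return the bytes a run contributes to the file, or None if unknowable."""
        if run.flags & TSK_FS_ATTR_RUN_FLAG_SPARSE:
            # Sparse run: defined to be zeros, safe to feed to the hasher.
            return b'\x00' * (run.len * block_size)
        if run.flags & TSK_FS_ATTR_RUN_FLAG_FILLER:
            # Filler run: the real run is missing, so the content is NOT known zeros.
            return None
        # Normal run: read run.len blocks starting at block address run.addr.
        return read_blocks(run.addr, run.len)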
From: Stuart M. <st...@ap...> - 2014-09-05 19:04:08
On 09/04/2014 06:01 PM, Simson Garfinkel wrote:
> This doesn't work unless you are prepared to buffer the later fragments of a file when they appear on disk before earlier fragments. So in the worst case, you need to hold the entire disk in RAM.

I have been experimenting with exactly this approach. And yes, you are right, in the worst case there is simply too much to buffer. In a 250GB ext4 image, one file was 8GB. Its block allocation was such that I struggled to buffer 4GB of it, at which point my Java VM collapsed out of memory.

I guess we could use a hybrid approach: all files under some size limit use the 'in block order' logic for hashing, with monster files defaulting to the regular 'file-offset order' logic. Somehow I think this will largely defeat the whole point of the exercise and negate most of the time gains.

Another approach would be to externally store the 'pending data', which might be feasible if you had some Live CD with the tools plus some raw (i.e. no file system) usb or other data drive to use for scratch storage.

Stuart
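The hybrid dispatch itself is simple to express; whether it preserves the gains is the open question. A sketch (hash_in_block_order and hash_in_logical_order stand for the two strategies discussed in this thread and are placeholders, as is the 1 GiB cutoff):

    def plan_hashing(files, limit=1 << 30):
        """Split files into a buffered block-order pass and a fallback logical pass.

        files: iterable of objects with a .size attribute in bytes.
        limit: largest file we are willing to buffer out-of-order runs for.
        """
        small, monsters = [], []
        for f in files:
            (small if f.size <= limit else monsters).append(f)
        return small, monsters

    # small    -> hash_in_block_order(small)       # single pass, buffer out-of-order runs
    # monsters -> hash_in_logical_order(monsters)  # seek per file, bounded memory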
From: Simson G. <si...@ac...> - 2014-09-05 19:14:01
As I indicated, I spent a significant amount of time looking at this, and decided that simply sorting by the block number of the first file fragment and then imaging each file in order provided an excellent speed-up. There was no need to do any in-memory buffering.

On Sep 5, 2014, at 3:41 PM, Stuart Maclean <st...@ap...> wrote:
> I guess we could use a hybrid approach: all files under some size limit use the 'in block order' logic for hashing, with monster files defaulting to the regular 'file-offset order' logic. Somehow I think this will largely defeat the whole point of the exercise and negate most of the time gains.
From: Luís F. N. <lfc...@gm...> - 2014-09-05 23:37:29
Hi Stuart,

Yes, I think so. I can read file contents from some starting offset within the file, but did not know how to query a file's data runs. The API lets you convert a virtual file (e.g. unallocated) offset to an image offset, but not a regular file offset.

I think the idea of sorting by file starting offset before doing any kind of processing of the files will result in great speedups when ingesting images stored on spinning magnetic drives, as said by Simson.

Luis

2014-09-05 20:53 GMT-03:00 Stuart Maclean <st...@ap...>:
> Hi Luis, I have slowly been developing my own set of Java bindings to tsk. The ones that exist seem to only be for extraction of data from some db?? I wanted to use Java in the actual data acquisition phase. I have yet to upload it to github but will do so shortly.
>
> Stuart
From: Brian C. <ca...@sl...> - 2014-09-10 02:06:05
Sorry to join the party late.

I'm curious what types of speed-ups you see by doing this sorting.

As for whether Autopsy could benefit, I think it depends on what type of investigation you are doing and whether you are more interested in the fastest overall time or in interesting results sooner. Autopsy currently assumes that you want analysis results from user content ASAP more than you care about overall run time. I say this because of two "features":

- Files inside of "\Documents and Settings" or "\Users" are analyzed before other files.
- The keyword search module will commit its index every 5 minutes and do a search of pre-defined keywords. This makes the analysis process take longer, but means that you have keyword results in minutes versus hours or days.

So, yes, the overall analysis time of Autopsy could benefit from doing this type of sorting, but it could mean that for the first 60 minutes Autopsy is just analyzing Windows OS files and the user is patiently waiting for interesting results.

brian

On Sep 4, 2014, at 7:53 PM, Stuart Maclean <st...@ap...> wrote:
> I'll let you know if/when I nail this. Perhaps Autopsy could benefit? Is fiwalk doing it the 'regular way' too, i.e. reading all the content of each file as the walk proceeds?
From: Brian C. <ca...@sl...> - 2014-09-10 02:13:10
The FILLER entries are there for basic record keeping, because NTFS makes no guarantees that the runs will be stored in consecutive order. TSK adds the FILLER entries when it gets runs out of order and pops them out as it finds them.

Is the data you are describing below from the same Ext4 image you mentioned before?

brian

On Sep 5, 2014, at 1:46 PM, Stuart Maclean <st...@ap...> wrote:
> In a fs I am walking and inspecting the runs for, I am seeing run structs with addr 0 and flags 1. [...] My worry is that, being 'filler' and not 'sparse', the partial file content represented by the run(s) with addr 0 is not necessarily a sequence of zeros.
From: Stuart M. <st...@ap...> - 2014-09-11 18:34:30
On 09/09/2014 07:13 PM, Brian Carrier wrote:
> The FILLER entries are there for basic record keeping, because NTFS makes no guarantees that the runs will be stored in consecutive order. TSK adds the FILLER entries when it gets runs out of order and pops them out as it finds them.
>
> Is the data you are describing below from the same Ext4 image you mentioned before?

Hi Brian, yes it is. Does that shed any light on things? I am still confused as to what fillers actually are and whether a file system is suspect should tsk tools announce that they have found some ;)

Stuart
From: Brian C. <ca...@sl...> - 2014-09-10 02:15:42
We have just started an effort to make a STIX / Cybox module in Autopsy as part of a DHS S&T effort. In Autopsy, the hash value is stored in the DB after the hash lookup module runs, so you can do the Cybox analysis either on each file as it is analyzed or after all of the files have been analyzed.

On Sep 4, 2014, at 7:04 PM, Stuart Maclean <st...@ap...> wrote:
> I am tracking recent efforts in STIX and Cybox and all things Mitre. One indicator of compromise is an md5 hash of some file. Presumably you compare the hash with all files on some file system to see if there is a match.
From: Simson G. <si...@ac...> - 2014-09-10 11:16:52
Brian,

You could sector-sort the files in the "\Users" and "\Documents and Settings" folders for improved performance.

On Sep 9, 2014, at 10:05 PM, Brian Carrier <ca...@sl...> wrote:
> So, yes, the overall analysis time of Autopsy could benefit from doing this type of sorting, but it could mean that for the first 60 minutes Autopsy is just analyzing Windows OS files and the user is patiently waiting for interesting results.
From: Jon S. <jo...@li...> - 2014-09-10 13:12:54
Sorry to veer off-topic with this thread (stupid gmail won't let me change the subject), but I'm now more confused/concerned by this explanation regarding FILLER entries.

1. Under what circumstances can you get a FILLER ATTR_RUN?

2. What can you do about it? How does one wait on TSK to go find the missing run?

Thanks,

Jon

On Tue, Sep 9, 2014 at 10:13 PM, Brian Carrier <ca...@sl...> wrote:
> The FILLER entries are there for basic record keeping, because NTFS makes no guarantees that the runs will be stored in consecutive order. TSK adds the FILLER entries when it gets runs out of order and pops them out as it finds them.
From: Brian C. <ca...@sl...> - 2014-09-10 16:05:32
If all goes well, you'll never see them. The caller to the API never sees the attribute until it has been fully populated, and for good files all of the filler entries will have been pushed out. The only times that you will see them are if:

- The file system is corrupt and you don't have all of the run info. This can occur in NTFS if the run list is stored across multiple MFT entries and some of them have been re-used.
- There is a bug in TSK.

You won't need to wait. There is nothing to wait for.

On Sep 10, 2014, at 8:46 AM, Jon Stewart <jo...@li...> wrote:
> 1. Under what circumstances can you get a FILLER ATTR_RUN?
>
> 2. What can you do about it? How does one wait on TSK to go find the missing run?
From: Jon S. <jo...@li...> - 2014-09-10 16:15:28
Cool, thanks for clarifying.

Jon

On Wed, Sep 10, 2014 at 12:05 PM, Brian Carrier <ca...@sl...> wrote:
> If all goes well, you'll never see them. The caller to the API never sees the attribute until it has been fully populated, and for good files all of the filler entries will have been pushed out.
From: Brian C. <ca...@sl...> - 2014-09-10 16:06:19
What types of performance improvements are we talking about?

On Sep 10, 2014, at 7:16 AM, Simson Garfinkel <si...@ac...> wrote:
> You could sector-sort the files in the "\Users" and "\Documents and Settings" folders for improved performance.
From: Simson G. <si...@ac...> - 2014-09-10 16:08:39
Perhaps a 2x - 5x speedup.

On Sep 10, 2014, at 12:06 PM, Brian Carrier <ca...@sl...> wrote:
> What types of performance improvements are we talking about?