Thread: RE: [sleuthkit-developers] Re: IO Subsystem patch for fstools
Brought to you by: carrier
From: Paul B. <ba...@fo...> - 2004-02-19 10:30:08
Hi Michael.... This sounds very good and cool.... (I haven't looked at your patch yet though...).. But I just wanted to indicate that the combination of your IO Subsystem patch for fstools and my searchtools (Indexed Searching) patch creates a very powerful system.

The only thing really missing is a subsystem that makes it possible to "read" file formats on the image with a specific interpreter. That would enable us to "read" PDF files, PST files, etc...

If all these 3 are in place, I think Sleuthkit is a product that is more powerful than any of the other products I use...

Paul Bakker

-----Original Message-----
From: Michael Cohen [mailto:mic...@ne...]
Sent: Thursday, 19 February 2004 9:17
To: Brian Carrier
CC: sle...@li...
Subject: [sleuthkit-developers] Re: IO Subsystem patch for fstools

Hi Brian,
  I have started implementing the changes according to your suggestions, and it's coming along great. I quite like the method of extending the basic IO_INFO struct with more specific structs and then casting them back and forth. The code has a great OO feeling about it. Thanks for the pointers.

Attached is an incomplete patch just to check that I'm on the right track. Only the fls tool is working with the new system currently, although the rest should compile OK. Here is a summary of the changes:

1) fls uses the -i command-line parameter to choose the subsystem to use, and then gets an IO_INFO object by doing:

   io=io_open(io_subsys);

If no subsystem is specified, it uses "standard", which is the current default.

2) Options are parsed into the io object by calling:

   io_parse_options(io,argv[optind++]);

(Options are in option=value format; if an argument does not contain '=', we take it as a filename.)

3) Once we parse all of the options, we create an fs object using the io object:

   fs = fs_open(io, fstype);

This will initialise the fs->io parameter with the io subsystem object.

4) All filesystem code uses the io object to actually do the reading, e.g.:

   ext2fs->fs_info.io->read_random(ext2fs->fs_info.io, (FS_INFO *) ext2fs,
       (char *) gd, sizeof(ext2fs_gd), offs, "group descriptor");

The io object has a number of methods. For example:

   io->constructor
   io->read_random
   io->read_block

This has a cool object-oriented feel about it.

5) There is an array in fs_io.c which acts like a class and is initialised to produce new objects of all IO_INFO-derived types (i.e. all new io objects are basically copies of this struct with extra stuff appended). E.g.:

   static IO_INFO subsystems[] = {
     { "standard", "Standard Sleuthkit IO Subsystem", sizeof(IO_INFO_STD),
       &io_constructor, &free, &std_help, &std_initialiser, &std_read_block,
       &std_read_random, &std_open, &std_close},

Note that IO_INFO_STD is a derived object of IO_INFO:

   struct IO_INFO_STD {
     IO_INFO io;
     char *name;
     int fd;
   };

i.e. it has the same methods as IO_INFO, but extra attributes. All the other subsystems use this method to add extra attributes to the basic IO_INFO pointer, which is carried around the place as an IO_INFO (cast from IO_INFO_ADV etc.).

From a user perspective we can now do this:

- Read in a simple dd partition (compatibility with old fls):

   fls -r -f linux-ext2 honeypot.hda5.dd

- Read in a partition file from a hdd dd image (i.e. use an offset):

   fls -r -i advanced offset=100 honeypot.hda5.dd

- Read in a split dd file (after splitting with split):

   fls -r -i advanced /tmp/xa*

Or (same thing, just to emphasize the fact that multiple files can be specified on the same command line):

   fls -r -i advanced /tmp/xaa /tmp/xab /tmp/xac /tmp/xad

- Read in an sgziped file:

   fls -i sgzip -r -f linux-ext2 honeypot.hda5.dd.sgz

- List files in directory inode 62446:

   fls -i sgzip -r -f linux-ext2 honeypot.hda5.dd.sgz 62446

What's left to do:

As I mentioned, this is only an intermediate patch to check that I'm on the right track. These are the things that need to be finished:

1) Add a config file parser option to allow options to be passed from a config file, i.e.:

   fls -r -c config -i advanced honeypot.hda5.dd

where config is just a file with lines like:

   offset=1024

2) Change offset to be in blocks (and add a blocks keyword to override the fs default).

3) Update all the other tools other than fls to support the new syntax.

4) I have also been working on an Expert Witness subsystem. Expert Witness is the format used by Encase, FTK etc. I have a filter working atm to convert these files to straight dd, but I want to implement a subsystem so we can work on these files directly in Sleuthkit. This is coming very soon, maybe this weekend. This will obviously require lots of tender loving testing because I only have Encase ver 3 to play with.

Please let me know what you think about this. Also let me know if there are other things that still need to be completed regarding this patch.

Michael.
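The struct-embedding pattern Michael describes - a base IO_INFO "class" of function pointers, extended by placing it as the first member of a derived struct so pointers can be cast back and forth - can be sketched as below. This is a simplified illustration, not the actual patch code: the member set and `io_open_std` are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Base "class": a name plus method pointers. */
typedef struct IO_INFO {
    const char *name;
    int (*read_random)(struct IO_INFO *io, char *buf, int len, long offs);
} IO_INFO;

/* Derived type: IO_INFO must be the FIRST member, so a pointer to the
 * derived struct is also a valid pointer to its base. */
typedef struct IO_INFO_STD {
    IO_INFO io;
    int fd;
} IO_INFO_STD;

static int std_read_random(IO_INFO *io, char *buf, int len, long offs) {
    IO_INFO_STD *self = (IO_INFO_STD *)io;  /* downcast to the derived type */
    (void)self; (void)offs;
    memset(buf, 0, len);                    /* stand-in for pread(self->fd, ...) */
    return len;
}

IO_INFO *io_open_std(void) {
    IO_INFO_STD *s = calloc(1, sizeof(*s));
    s->io.name = "standard";
    s->io.read_random = std_read_random;
    s->fd = -1;                             /* would be set by the open method */
    return (IO_INFO *)s;  /* upcast: callers only ever see the base struct */
}
```

Callers hold an `IO_INFO *` and invoke methods through it; only the subsystem's own methods cast back down to reach the extra attributes.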
From: Paul B. <ba...@fo...> - 2004-02-19 12:21:57
Hi again...

> > But I just wanted to indicate that the combination of your IO Subsystem
> > patch for fstools and my searchtools (Indexed Searching) patch create a
> > system that is very powerful.
>
> Indeed, your indexing support looks very cool. I haven't played with it
> just yet though (gotta find some time :-)

It seems we got the same problem: time.... ;-) But I'm already making Searchtools (Indexed Searching) ready for your patch. Normally in raw index mode I just read the raw image file. I'm now updating Searchtools to use Sleuthkit image reading, so when your patch comes out only minor changes in my code are needed to enable indexing of split dd files or Encase images.....

> > The only thing really missing is a subsystem that makes it possible to
> > "read" fileformats on the image with a specific interpreter. That would
> > enable us to "read" PDF files, PST files, etc...
>
> I'm not sure I know what you mean; the IO subsystem is done at a very low
> level (well, at the IO level)... The interpretation of different files on
> the filesystem is surely the job of a higher level application?

Yes, sorry to confuse anybody... I meant that Sleuthkit as a whole should contain a generic way for accessing filetypes found on the images. At a higher level than the IO subsystem.. But indeed integrated with Sleuthkit. Otherwise one has to extract files from the image before they can be processed (for instance indexed (hint!)). Autopsy would benefit from that, as it would be possible to integrate FTK-like functionality to read PDF/PST files from the web interface. And it would make it possible to index files inside the image based on the text therein (also files inside ZIP files and such)..

Paul Bakker
From: Paul B. <ba...@fo...> - 2004-02-23 08:11:55
> Excellent.
>
> > 4) All filesystem code uses the io object to actually do the reading,
> > e.g.:
> >    ext2fs->fs_info.io->read_random(ext2fs->fs_info.io, (FS_INFO *)
> >        ext2fs, (char *) gd, sizeof(ext2fs_gd), offs, "group descriptor");
> >
> > The io object has a number of methods. For example:
> >    io->constructor
>
> What is the constructor used for?
>
> I quickly looked the changes over. Is there a need to pass FS_INFO to
> the read functions? I didn't see you using them and it would be nice
> if we could avoid doing that (so that we don't have to restrict
> ourselves to file system code).

Well, I do see advantages.... I already wanted to ask this...

The problem with the current code is that it is not possible to "read_random" an image efficiently, because it cannot check the current offset in the image.. This results in unnecessary seeks.. And seeks are very expensive if they come in millions....

For Indexed Searching it would be very handy if there were a generic fs_read_random() function. If this function would check the current offset in the image, and thus not seek if the reads were all in succession, that would be great...

Paul Bakker
From: Michael C. <mic...@ne...> - 2004-02-23 09:42:13
Hi Paul,

> Well I do see advantages.... I already wanted to ask this...
>
> The problem with the current code is that it is not possible to
> "read_random" an image efficiently because it cannot check the current
> offset in the image.. This results in unnecessary seeks.. And seeks are
> very expensive if they come in millions....

I agree - this is particularly bad if the underlying image is a compressed format like Encase or sgzip, because then each seek/read corresponds to a decompression of at least one block.

> For Indexed Searching it would be very handy if there were a generic
> fs_read_random() function.
>
> If this function would check the current offset in the image, and thus
> not seek if the reads were all in succession, that would be great...

This really depends on the specific subsystem; for example, when reading an Encase file you need to decompress at least one chunk for each seek, so if you read lots of little runs of data all over the file it's gonna run slow.

The solution to this problem, I think, is to implement some kind of caching in memory. A cache system can solve all those problems very efficiently, particularly for the case where you make lots of small reads very close together (i.e. no seeks). A simple cache (with a simple policy) can be implemented quite easily, I think, and will be effective for the scenario you are describing.

What kind of IO do you do for indexing? Is it very localised? If you were to cache a block into memory, what would be the optimal size of the block? (Say 1 mb, or more like 32kb?) If you were to cache 1 mb in memory, how many reads would you get out of it on average?

Michael
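A minimal version of the cache Michael proposes might look like this single-slot scheme. Every name here is hypothetical (`blk_cache`, `fetch_block`), and a real subsystem would size the slot to the underlying format's chunk (e.g. 32kb for Expert Witness); `fake_fetch` is a stand-in for the expensive read/decompress step.

```c
#include <assert.h>
#include <string.h>

#define CACHE_BLK 32768   /* e.g. one Expert Witness chunk */

/* Hypothetical single-slot cache: remember the last block fetched from
 * the (possibly compressed) image and serve small reads from it. */
typedef struct {
    long blk_start;   /* image offset of the cached block, -1 if empty */
    char data[CACHE_BLK];
    int (*fetch_block)(long blk_start, char *out);  /* expensive backend read */
} blk_cache;

/* Serve `len` bytes at `offs`, calling fetch_block only when the request
 * leaves the cached block. Reads spanning two blocks are not handled in
 * this sketch. */
static int cached_read(blk_cache *c, char *buf, int len, long offs) {
    long start = (offs / CACHE_BLK) * CACHE_BLK;
    if (offs + len > start + CACHE_BLK)
        return -1;
    if (c->blk_start != start) {            /* miss: one fetch/decompress */
        if (c->fetch_block(start, c->data) < 0)
            return -1;
        c->blk_start = start;
    }
    memcpy(buf, c->data + (offs - start), len);
    return len;
}

/* Stand-in backend for demonstration: "decompresses" a block whose bytes
 * are just the low 8 bits of their image offset. */
static int fake_fetch(long blk_start, char *out) {
    for (int i = 0; i < CACHE_BLK; i++)
        out[i] = (char)((blk_start + i) & 0xff);
    return 0;
}
```

With many small reads close together, only the first read in each 32kb window pays the decompression cost; the rest are memcpys.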
From: Brian C. <ca...@sl...> - 2004-02-23 14:31:30
On Feb 23, 2004, at 3:03 AM, Paul Bakker wrote:

>> I quickly looked the changes over. Is there a need to pass FS_INFO to
>> the read functions? I didn't see you using them and it would be nice
>> if we could avoid doing that (so that we don't have to restrict
>> ourselves to file system code).
>
> Well I do see advantages.... I already wanted to ask this...
>
> The problem with the current code is that it is not possible to
> "read_random" an image efficiently because it cannot check the current
> offset in the image.. This results in unnecessary seeks.. And seeks are
> very expensive if they come in millions....

We can easily fix that, though, with some data in an IMG_INFO struct, which is a more appropriate location than the FS_INFO structure. For the code that uses split or RAID images, we will need multiple copies of the current offset data.

Actually, I'm not even sure why fs_read_random doesn't use the fs->read_pos value before it does the seek.

brian
From: Paul B. <ba...@fo...> - 2004-02-23 11:21:09
Hi Michael,

> The solution to this problem, I think, is to implement some kind of
> caching in memory. A cache system can solve all those problems very
> efficiently, particularly for the case where you make lots of small
> reads very close together (i.e. no seeks). A simple cache (with a
> simple policy) can be implemented quite easily, I think, and will be
> effective for the scenario you are describing.
>
> What kind of IO do you do for indexing? Is it very localised? If you
> were to cache a block into memory, what would be the optimal size of
> the block? (Say 1 mb, or more like 32kb?) If you were to cache 1 mb
> in memory, how many reads would you get out of it on average?

I currently support two modes (as will the new release)...

The first is raw mode, meaning that all data on the disk is indexed as it is!.. That means that the whole disk is walked sequentially in (currently) 64k blocks... This can be enlarged if that would increase performance of the underlying subsystem..

The second is raw_fragment mode, meaning that all fragmented pieces of files are indexed in a similar manner as icat runs through them... I use both inode_walk and file_walk.. Thus this consists of more small reads. As fragmented parts usually comprise only a very small amount of the disk, this should not be used as an indication of access.. Especially the first mode (raw) is a real time/disk-access/processor hog... In its current form it does not use any seeks, as this greatly increases speed (almost double otherwise)...

Paul Bakker
From: Michael C. <mic...@ne...> - 2004-02-23 12:42:10
On Mon, 23 Feb 2004 10:12 pm, Paul Bakker wrote:
> Hi Michael,

Hi Paul,

> I currently support two modes (as will the new release)...
>
> The first is raw mode, meaning that all data on the disk is indexed
> as it is!.. That means that the whole disk is walked sequentially in
> (currently) 64k blocks... This can be enlarged if that would increase
> performance of the underlying subsystem..

Is it possible for you then to just malloc a buffer (say 1mb) and fill it sequentially, and then just index the buffer? Or will that complicate matters?

> The second is raw_fragment mode, meaning that all fragmented pieces
> of files are indexed in a similar manner as icat runs through them...
> I use both inode_walk and file_walk.. Thus this consists of more small
> reads.

Have a look at the dbtools patch, because David Collett has done something similar for flag. In his file_walk he is simply building a linked list of blocks in the file (without actually reading these blocks), and then in his print_blocks he is saving entire runs of blocks as unique entries.

So for example, suppose you have a file that's 30 blocks big (~120kb). While the file_walk might call the callback 30 times (once for each block), in the end David's print_blocks function will print a single entry for 30 consecutive blocks.

So the idea is that you might use this information to preallocate a buffer 30 blocks big and read a large chunk into it - then index that buffer. The result is that you need to do 30 times less reading on the actual file (in this example).

From experience, most files are not really fragmented, and at most I have seen large files fragmented into 2-3 parts. That's an average of 2-3 reads for large files, and a single read for small files - not too expensive at all. (Contrast this with reading the block on each callback: you will need to read every 4kb in every file, or up to several hundred times for each large file.)

I just ran icat under gdb again to confirm what I'm saying here. This is the result:

   (gdb) b icat_action
   (gdb) r -f linux-ext2 /var/tmp/honeypot.hda1.dd 16 > /tmp/vmlinuz

   Breakpoint 1, icat_action (fs=0x8064e18, addr=2080, buf=0x8065c48 ... ,
       size=1024, flags=1028, ptr=0x805d934 "")

The size is the size that icat writes in every call to icat_action, and it seems to always be 1024 here (block size). So the icat_action callback is called for every single block.

The other problem I can think about (again, I haven't seen your code yet, so I'm sorry if I'm talking crap here) is that if you only do the indexing on each buffer returned by the file_walk callback, then wouldn't you be missing words that happen to be cut by the block boundary? I.e. half a word will be on one block and the other half on the next block? This problem will be alleviated to some extent by indexing larger buffers than blocksizes.

Another suggestion... The way we currently have string indexing done in flag (it's mostly in cvs and not quite finished, but we should have it finished soon :-) is by using David's dbtool to extract _all_ the meta information about the image into the database - this includes all the inodes, files, and the blocks these files occupy (in other words, file_walk and inode_walk) - we do not read the blocks themselves, we just note where they are. We then index the entire image file, storing all the offsets for the strings in the database. Then it's quite easy to tell if a string is in an allocated file, and exactly which file (and inode) it's in. We can extract entire files by simply reading the complete block run (as I described above). The result is that we don't really seek very much; we seek a bit in the dbtool to pull out all the meta data, but then we just read sequentially for indexing - and very large reads at that (I think about 1mb buffers).

That said, indexing is a tough job and it does take a long time... it's inescapable. I'm interested in your indexing implementation, because the database implementation requires a copy of all the strings to live in the db, which basically blows the db up to about 1/3-1/2 the size of the (uncompressed) image. This is not really practical.

> As fragmented parts usually comprise only a very small amount of the
> disk, this should not be used as an indication of access.. Especially
> the first mode (raw) is a real time/disk-access/processor hog... In
> its current form it does not use any seeks, as this greatly increases
> speed (almost double otherwise)...

Caching can certainly help here. Although if you can restructure your code so that you do few big reads rather than lots of small reads, it would alleviate the need for caching.

This is especially important when you think about the possibility of your code directly operating on Encase evidence files or compressed volumes, where in that case the major cost is the decompression overhead. Remember that with Encase the minimum buffer is 32kb, so even if you wanted to read 1 byte, it will still need to decompress the whole 32kb chunk to give you that byte - very expensive. In that case the cost of seeking is negligible relative to the cost of decompression.

Michael.
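The run-coalescing Michael attributes to David's print_blocks reduces to a small fold over the block addresses the walk callbacks deliver. This sketch uses made-up names (`blk_run`, `coalesce`); it only illustrates the idea of turning per-block callbacks into a few large reads.

```c
#include <assert.h>
#include <stddef.h>

/* A run of `len` consecutive blocks starting at block address `start`. */
typedef struct { long start; long len; } blk_run;

/* Fold block addresses (in file order, as a file_walk-style callback
 * would deliver them) into runs; returns the number of runs written to
 * `runs` (capacity `max`). A 30-block unfragmented file collapses to a
 * single run, i.e. one large read instead of 30 small ones. */
static size_t coalesce(const long *addrs, size_t n, blk_run *runs, size_t max) {
    size_t nruns = 0;
    for (size_t i = 0; i < n; i++) {
        if (nruns > 0 && runs[nruns-1].start + runs[nruns-1].len == addrs[i]) {
            runs[nruns-1].len++;            /* extends the current run */
        } else if (nruns < max) {
            runs[nruns].start = addrs[i];   /* non-consecutive: new run */
            runs[nruns].len = 1;
            nruns++;
        }
    }
    return nruns;
}
```

Since, as noted above, large files are typically fragmented into only 2-3 parts, almost every file ends up as one or two runs.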
From: Paul B. <ba...@fo...> - 2004-02-23 13:26:45
Hi Michael,

> > I currently support two modes (as will the new release)...
> >
> > The first is raw mode, meaning that all data on the disk is indexed
> > as it is!.. That means that the whole disk is walked sequentially in
> > (currently) 64k blocks... This can be enlarged if that would increase
> > performance of the underlying subsystem..
>
> Is it possible for you then to just malloc a buffer (say 1mb) and fill
> it sequentially, and then just index the buffer? Or will that
> complicate matters?

That's what I do.. (only with a 64kb buffer)... But the current read_random will do a seek every time.. And that is not good... (time/seek-wise)... The size is adaptable and will not change the concept.... Only the implementation of the current functions does not allow me to read the data of an image without them performing any seeks..

> So for example, suppose you have a file that's 30 blocks big (~120kb).
> While the file_walk might call the callback 30 times (once for each
> block), in the end David's print_blocks function will print a single
> entry for 30 consecutive blocks.

OK, this might be handy, but will only change how many times the callback is called.. No data in the blocks is read on calling the callback (because of the FS_FLAG_AONLY flag).

> From experience, most files are not really fragmented, and at most I
> have seen large files fragmented into 2-3 parts. That's an average of
> 2-3 reads for large files, and a single read for small files - not too
> expensive at all. (Contrast this with reading the block on each
> callback: you will need to read every 4kb in every file, or up to
> several hundred times for each large file.)

See above (this does not happen...).... I only read the last 10 bytes from the first block before the fragment and the first 10 bytes from the block after the fragment.

> I just ran icat under gdb again to confirm what I'm saying here. This
> is the result:
>
>    (gdb) b icat_action
>    (gdb) r -f linux-ext2 /var/tmp/honeypot.hda1.dd 16 > /tmp/vmlinuz
>
>    Breakpoint 1, icat_action (fs=0x8064e18, addr=2080, buf=0x8065c48 ... ,
>        size=1024, flags=1028, ptr=0x805d934 "")
>
> The size is the size that icat writes in every call to icat_action,
> and it seems to always be 1024 here (block size). So the icat_action
> callback is called for every single block.

See above.. icat does read every block, but the FS_FLAG_AONLY flag makes it NOT read the data in the block, so no seeks will happen there.

> The other problem I can think about (again, I haven't seen your code
> yet, so I'm sorry if I'm talking crap here) is that if you only do the
> indexing on each buffer returned by the file_walk callback, then
> wouldn't you be missing words that happen to be cut by the block
> boundary? I.e. half a word will be on one block and the other half on
> the next block? This problem will be alleviated to some extent by
> indexing larger buffers than blocksizes.

You're talking crap! ;-)) Joking..

No Michael.. This will not happen.. The raw_fragment mode only indexes strings in the fragmented part (thus 10 bytes before and 10 bytes after the fragmented part) (or 25... depends.. hehe).

The raw mode (the real mode) will use the 64kb blocks in a "walking buffer" kind of way... Every time a new block is loaded, the last xx (25) bytes of the old block will be prepended and also indexed... That way no data will ever get missed...

<Snipped a part about the flag implementation>

> That said, indexing is a tough job and it does take a long time... it's
> inescapable. I'm interested in your indexing implementation, because
> the database implementation requires a copy of all the strings to live
> in the db, which basically blows the db up to about 1/3-1/2 the size of
> the (uncompressed) image. This is not really practical.

Well, that could be called small.. It all depends on the text elements inside the image file.. If an image contains only text files, an index can even become larger than the image itself. For the size part, I think that my fileformat will result in smaller files than a database, as the format is optimized for containing index trees.. I have reviewed the formats inside (as did my brother) and we have squeezed the format even smaller in the upcoming release..

<Snipped the rest>

Paul Bakker
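The "walking buffer" Paul describes can be sketched as follows: scan the image chunk by chunk, carrying the last OVERLAP bytes of each chunk into the next buffer so strings straddling a chunk boundary are still seen whole. `walk_count` is a hypothetical stand-in that counts substring hits instead of feeding an indexer; the 25-byte overlap matches the figure Paul mentions.

```c
#include <assert.h>
#include <string.h>

#define OVERLAP  25
#define MAXCHUNK 65536   /* `chunk` must not exceed this */

/* Count occurrences of `needle` in the first `n` bytes of `img`,
 * reading `chunk` bytes at a time. The tail of each chunk is carried
 * into the next buffer, so matches cut by a chunk boundary are found;
 * a match is only counted when it extends into the new data, so the
 * carried tail never double-counts. Text-only sketch (uses strstr). */
static int walk_count(const char *img, size_t n, size_t chunk,
                      const char *needle) {
    char buf[OVERLAP + MAXCHUNK + 1];
    size_t nl = strlen(needle), tail = 0;
    int count = 0;

    for (size_t pos = 0; pos < n; pos += chunk) {
        size_t len = n - pos < chunk ? n - pos : chunk;
        memcpy(buf + tail, img + pos, len);     /* new chunk after the tail */
        buf[tail + len] = '\0';
        for (char *p = buf; (p = strstr(p, needle)) != NULL; p++)
            if ((size_t)(p - buf) + nl > tail)  /* reaches into new data */
                count++;
        tail = len < OVERLAP ? len : OVERLAP;   /* carry the tail forward */
        memcpy(buf, img + pos + len - tail, tail);
    }
    return count;
}
```

A real indexer would index each buffer rather than count matches, but the boundary-handling is the same: no string shorter than OVERLAP can ever be missed.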
From: Michael C. <mic...@ne...> - 2004-02-23 14:20:21
> Hi Michael,

Hi Paul,

> That's what I do.. (only with a 64kb buffer)... But the current
> read_random will do a seek every time.. And that is not good...
> (time/seek-wise)...

Paul, I'm not sure if I understand you right here - you claim that sequential reads are slower than a seek followed by a read to the same position? I wasn't sure about this, and so I tested it (reading 100-byte lots out of 1gb so there is no disk cache):

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/types.h>
   #include <unistd.h>
   #include <fcntl.h>

   int main() {
     int i;
     int fd;
     char buf[100];

     fd=open("/var/tmp/honeypot.hda5.dd",O_RDONLY);
     for(i=0;i<1000000000;i+=100) {
       lseek(fd,i,SEEK_SET);
       if(read(fd,buf,100)<100) {
         printf("Read failed");
       }
     }
     return 0;
   }

First I ran with the seek in there (the seek doesn't really do anything - it's only seeking to the same spot it would be in anyway - which I think is what you mean in this situation):

   $ time ./test
   real 0m27.228s
   user 0m2.060s
   sys 0m9.250s

And without the seek (just commented out):

   $ time ./test
   real 0m27.000s
   user 0m1.340s
   sys 0m8.770s

I don't find any difference, really???? I thought the only overhead in a seek is a system call, because the kernel would already have that in the disk cache and doesn't really need to seek the physical disk at all. Maybe it's an OS thing? I'm using kernel 2.4.21.

> The size is adaptable and will not change the concept.... Only the
> implementation of the current functions does not allow me to read the
> data of an image without them performing any seeks..
>
> > So for example, suppose you have a file that's 30 blocks big
> > (~120kb). While the file_walk might call the callback 30 times
> > (once for each block), in the end David's print_blocks function
> > will print a single entry for 30 consecutive blocks.
>
> OK, this might be handy, but will only change how many times the
> callback is called.. No data in the blocks is read on calling
> the callback (because of the FS_FLAG_AONLY flag).

Cool, so when do you actually do the reading of the blocks? Or do you just use the file_walk and inode_walk to find out if a string is in a file or out of the file, without reading any blocks?

> See above.. icat does read every block, but the FS_FLAG_AONLY flag
> makes it NOT read the data in the block, so no seeks will happen
> there.
>
> > The other problem I can think about (again, I haven't seen your code
> > yet, so I'm sorry if I'm talking crap here) is that if you only do
> > the indexing on each buffer returned by the file_walk callback, then
> > wouldn't you be missing words that happen to be cut by the block
> > boundary? I.e. half a word will be on one block and the other half
> > on the next block? This problem will be alleviated to some extent by
> > indexing larger buffers than blocksizes.
>
> You're talking crap! ;-)) Joking..
> No Michael.. This will not happen.. The raw_fragment mode only
> indexes strings in the fragmented part (thus 10 bytes before and
> 10 bytes after the fragmented part) (or 25... depends.. hehe).

Cool, that's great!!!

> The raw mode (the real mode) will use the 64kb blocks in a
> "walking buffer" kind of way... Every time a new block is loaded,
> the last xx (25) bytes of the old block will be prepended and also
> indexed... That way no data will ever get missed...

That's great. I also noticed (I only have the current version which is on the web site - without all the bells and whistles) that the buffer is user-settable, so if seeking proves to be too much of a problem, users can just set the rolling buffer to be really large.

> > That said, indexing is a tough job and it does take a long time...
> > it's inescapable. I'm interested in your indexing implementation,
> > because the database implementation requires a copy of all the
> > strings to live in the db, which basically blows the db up to about
> > 1/3-1/2 the size of the (uncompressed) image. This is not really
> > practical.
>
> Well, that could be called small.. It all depends on the text elements
> inside the image file.. If an image contains only text files, an index
> can even become larger than the image itself.
> For the size part, I think that my fileformat will result in smaller
> files than a database, as the format is optimized for containing
> index trees.. I have reviewed the formats inside (as did my brother)
> and we have squeezed the format even smaller in the upcoming release..

Cool, that sounds very promising. I am looking forward to seeing the next release; in the meantime I shall play with the current release. Are there many large changes in the new release?

Michael.
From: Paul B. <ba...@fo...> - 2004-02-23 15:02:43
> Actually, I'm not even sure why fs_read_random doesn't use the
> fs->read_pos value before it does the seek.

My thoughts exactly ;-)
From: Paul B. <ba...@fo...> - 2004-02-23 15:10:31
Hi Michael, (again! ;-) We should start to call on the phone! ;-))

<Snip about seek performance>

> I don't find any difference, really???? I thought the only overhead in
> a seek is a system call, because the kernel would already have that in
> the disk cache and doesn't really need to seek the physical disk at
> all. Maybe it's an OS thing? I'm using kernel 2.4.21.

You're right.. It is not as bad as it used to be... (partly because of the way my code now works)... I will implement everything using fs_read_random ;-).. Problem solved! hehe...

> Cool, so when do you actually do the reading of the blocks? Or do you
> just use the file_walk and inode_walk to find out if a string is in a
> file or out of the file, without reading any blocks?

See below!... All actual reading is done if I find a real fragment (a fragment is two non-sequential blocks).

> > The raw mode (the real mode) will use the 64kb blocks in a
> > "walking buffer" kind of way... Every time a new block is loaded,
> > the last xx (25) bytes of the old block will be prepended and also
> > indexed... That way no data will ever get missed...
>
> That's great. I also noticed (I only have the current version which is
> on the web site - without all the bells and whistles) that the buffer
> is user-settable, so if seeking proves to be too much of a problem,
> users can just set the rolling buffer to be really large.

Indeed ;-))... The default will be larger if that proves to be better!...

> Cool, that sounds very promising. I am looking forward to seeing the
> next release; in the meantime I shall play with the current release.
> Are there many large changes in the new release?

The largest change is the support for fragmented strings (strings located on two non-sequential blocks). Furthermore, the internal format has changed so almost twice the amount of "data" can be stored in memory... On disk only a small 15% increase has been booked (storage-wise), but because more can be stored in memory, fewer redundant parts are stored, so the total profit on disk storage can be around 33%..

Storage has also changed in the way that you only have to specify a directory and the rest will be handled (so no more specifying a config file and an index file (or multiple index files))...

Version support to recognize older index files and not blindly use them.

A few handy tools for checking index files and such....

Owh.. And I'm currently busy changing my code to use the fstools fs_read_random() function and not use libc's fread() ;-)

Paul Bakker
From: Michael C. <mic...@ne...> - 2004-02-19 10:53:02
On Thu, 19 Feb 2004 09:24 pm, Paul Bakker wrote:
> Hi Michael....

Hi Paul,

> This sounds very good and cool.... (I haven't looked at your patch yet
> though...)..

Thanks...

> But I just wanted to indicate that the combination of your IO Subsystem
> patch for fstools and my searchtools (Indexed Searching) patch create a
> system that is very powerful.

Indeed, your indexing support looks very cool. I haven't played with it just yet though (gotta find some time :-)

> The only thing really missing is a subsystem that makes it possible to
> "read" fileformats on the image with a specific interpreter. That would
> enable us to "read" PDF files, PST files, etc...

I'm not sure I know what you mean; the IO subsystem is done at a very low level (well, at the IO level)... The interpretation of different files on the filesystem is surely the job of a higher level application? For example in flag (http://sourceforge.net/projects/pyflag/), we are using exgrep (which is similar, I gather, to foremost) to extract files from the image and then use magic (and the NSRL) to classify those and do some post-processing. The GUI is then able to use the correct facility for displaying those images (usually by setting the correct mime type and asking the browser to display it, but not necessarily).

If you want to index the contents of binary files (say zip files or gziped files), maybe the best place to do so is by postprocessing at the higher level application?

I am also working on reimplementing exgrep to use a python file-like object created using the proposed sleuthkit io subsystem. This way we can use exgrep to extract files from any type of image. For example, we can find deleted and otherwise un-recoverable images from an encase image etc. It would be cool if higher level programs (like autopsy or flag) could operate directly on the io subsystem for other file-like operations (like running foremost, indexing, whatever).

To this end I am working on a swig interface for this io subsystem, so we could use perl or python to directly access all those images.

> If all these 3 are in place, I think sleuthkit is a product that is more
> powerful than any of the other products I use...

I concur with you. I had a bit of a play with Encase and there is much room for Encase to improve before it could be usable. (Although, as I mentioned, I only had Encase v3; maybe 4 is better.)

Michael.