Re: [sleuthkit-developers] Re: IO Subsystem patch for fstools
From: Michael C. <mic...@ne...> - 2004-02-23 14:20:21
> Hi Michael, Hi Paul,
>
> That's what I do.. (Only with a 64kb buffer)... But the current
> read_random will do a seek every time.. And that is not good...
> (Time/Seekwise)...

Paul, I'm not sure if I understand you right here - you claim that
sequential reads are slower than a seek followed by a read to the same
position? I wasn't sure about this, so I tested it (reading 100-byte
lots out of 1 GB so there is no disk cache):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <unistd.h>
  #include <fcntl.h>

  int main() {
      long i;
      int fd;
      char buf[100];

      fd = open("/var/tmp/honeypot.hda5.dd", O_RDONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      for (i = 0; i < 1000000000; i += 100) {
          lseek(fd, i, SEEK_SET);
          if (read(fd, buf, 100) < 100)
              printf("Read failed\n");
      }
      close(fd);
      return 0;
  }

First I ran it with the seek in there (the seek doesn't really do
anything - it only seeks to the spot the file offset would be at
anyway, which I think is what you mean in this situation):

  $ time ./test
  real    0m27.228s
  user    0m2.060s
  sys     0m9.250s

And without the seek (just commented out):

  $ time ./test
  real    0m27.000s
  user    0m1.340s
  sys     0m8.770s

I don't find any real difference. I thought the only overhead in a
seek is the system call, because the kernel would already have the
data in the disk cache and doesn't really need to seek the physical
disk at all. Maybe it's an OS thing? I'm using kernel 2.4.21.

> The size is adaptable and will not change the concept.... Only the
> implementation of the current functions does not allow me to read
> the data of an image without them performing any seeks..
>
> > So for example suppose you have a file that's 30 blocks big
> > (~120kb). While the file_walk might call the callback 30 times
> > (once for each block), in the end David's print_blocks function
> > will print a single entry for 30 consecutive blocks.
>
> OK this might be handy, but will only change how many times the
> callback is called.. No data in the blocks is read on calling
> the callback (Because of the FS_FLAG_AONLY flag).
Cool, so when do you actually do the reading of the blocks? Or do you
just use file_walk and inode_walk to find out whether a string is in a
file or not, without reading any blocks?

> See above.. Icat does read every block, but the FS_FLAG_AONLY flag
> makes it NOT read the data in the block, so no seeks will happen
> there.
>
> > The other problem I can think about (again, I haven't seen your
> > code yet so I'm sorry if I'm talking crap here) is that if you
> > only do the indexing on each buffer returned by the file_walk
> > callback, then wouldn't you be missing words that happen to be
> > cut by the block boundary? I.e. half a word will be on one block
> > and the other half on the next block? This problem will be
> > alleviated to some extent by indexing larger buffers than
> > blocksizes.
>
> You're talking crap! ;-)) Joking..
> No Michael.. This will not happen.. The raw_fragment mode only
> indexes strings in the fragmented part (Thus 10 bytes before and
> 10 bytes after the fragmented part) (Or 25... Depends.. hehe)

Cool, that's great!!!

> The raw mode (The real mode) will use the 64kb blocks in a
> "walking buffer" kind of way... Every time a new block is loaded,
> the last xx (25) bytes of the old block will be prepended and also
> indexed... That way no data will ever get missed...

That's great. I also noticed (I only have the current version which is
on the web site - without all the bells and whistles) that the buffer
size is user settable, so if seeking proves to be too much of a
problem, users can just set the rolling buffer to be really large.

> > That said, indexing is a tough job and it does take a long
> > time... it's inescapable. I'm interested in your indexing
> > implementation, because the database implementation requires a
> > copy of all the strings to live in the db, which basically blows
> > the db up to about 1/3-1/2 the size of the (uncompressed) image.
> > This is not really practical.
> Well, that could be called small.. It all depends on the text
> elements inside the image file.. If an image contains only text
> files, an index can even become larger than the image itself.
> For the size part, I think that my file format will result in
> smaller files than a database, as the format is optimized for
> containing index trees.. I have reviewed the formats inside (As did
> my brother) and we have squeezed the format even smaller in the
> upcoming release..

Cool, that sounds very promising. I am looking forward to seeing the
next release; in the meantime I shall play with the current release.
Are there many large changes in the new release?

Michael.