RE: [sleuthkit-developers] Re: IO Subsystem patch for fstools

Hi Michael,

> > I currently support two modes (As will the new release)...
> >
> > The first is raw mode, meaning that all data on the disk is indexed
> > as it is!.. That means that the whole disk is walked sequentially in
> > (currently) 64k blocks... This can be enlarged if that would increase
> > performance of the underlying subsystem..
> Is it possible for you then to just malloc a buffer (say 1mb) and fill it
> sequentially, and then just index the buffer? Or will that complicate
> matters?
That's what I do.. (Only with a 64kb buffer)... But the current read_random
will do a seek every time.. And that is not good... (Time/Seekwise)...

The size is adaptable and will not change the concept.... It is only the
implementation of the current functions that does not allow me to read the
data of an image without them performing seeks..

> So for example suppose you have a file that's 30 blocks big (~120kb).
> While the file_walk might call the callback 30 times (once for each
> block), in the end David's print_blocks function will print a single
> entry for 30 consecutive blocks.
OK this might be handy, but will only change how many times the
callback is called.. No data in the blocks is read on calling
the callback (Because of the FS_FLAG_AONLY flag).
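
As a rough illustration of the coalescing idea quoted above (this is not
David's print_blocks code; the function name and the (start, count) run
format are assumptions made for the sketch), block addresses handed to a
callback one at a time can be merged into runs of consecutive blocks:

```python
def coalesce_blocks(addrs):
    """Merge block addresses (in callback order) into (start, count) runs."""
    runs = []
    for addr in addrs:
        # Extend the current run if this block directly follows it.
        if runs and addr == runs[-1][0] + runs[-1][1]:
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            runs.append((addr, 1))
    return runs


# A 30-block file stored contiguously collapses to a single run.
print(coalesce_blocks(range(2080, 2110)))  # [(2080, 30)]
```

Only addresses are needed for this, so it works even when the walk does
not read the block data.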

> From experience, most files are not really fragmented, and at most I have
> seen large files fragmented into 2-3 parts. That's an average of 2-3 reads
> for large files, and a single read for small files - not too expensive at
> all. (Contrast this with reading the block on each callback: you will need
> to read every 4kb in every file, or up to several hundred times for each
> large file).
See above (This does not happen...).... I only read the last
10 bytes from the first block before the fragment and first
10 bytes from the block after the fragment.
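
A minimal sketch of that boundary read, under some assumptions: runs are
(start_block, count) pairs in file order, the image is a seekable
file-like object, and the helper names are made up for the example (they
are not part of fstools or the indexer):

```python
import io


def boundary_windows(runs, block_size, context=10):
    """Yield ((offset, length), (offset, length)) per fragment boundary."""
    for (start_a, count_a), (start_b, _) in zip(runs, runs[1:]):
        end_a = (start_a + count_a) * block_size  # byte just past run A
        yield (end_a - context, context), (start_b * block_size, context)


def read_boundaries(image, runs, block_size, context=10):
    """Return tail+head bytes for every place where the file is fragmented."""
    joined = []
    for (off_a, n_a), (off_b, n_b) in boundary_windows(runs, block_size, context):
        image.seek(off_a)
        tail = image.read(n_a)   # last bytes before the fragment gap
        image.seek(off_b)
        head = image.read(n_b)   # first bytes after the fragment gap
        joined.append(tail + head)
    return joined


# Toy image: 16-byte blocks, file occupies blocks 0-1 and then block 5.
image = io.BytesIO(bytes(range(96)))
windows = read_boundaries(image, [(0, 2), (5, 1)], block_size=16)
```

Each returned window is small (2 x context bytes), so an unfragmented file
costs no reads at all and a file in 2-3 fragments costs only a few.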

> I just ran icat under gdb again to confirm what I'm saying here. This is
> the result:
> (gdb) b icat_action
> (gdb) r -f linux-ext2 /var/tmp/honeypot.hda1.dd 16 > /tmp/vmlinuz
>
> Breakpoint 1, icat_action (fs=0x8064e18, addr=2080,
>     buf=0x8065c48 ... , size=1024, flags=1028, ptr=0x805d934 "")
>
> The size is the size that icat writes in every call to icat_action, and it
> seems to be always 1024 here (block size). So the icat_action callback is
> called for every single block.
See above.. Icat does read every block, but the FS_FLAG_AONLY flag
makes it NOT read the data in the block, so no seeks will happen
there.

> The other problem I can think about (again, I haven't seen your code yet so
> I'm sorry if I'm talking crap here), is that if you only do the indexing on
> each buffer returned by the file_walk callback, then wouldn't you be missing
> words that happen to be cut by the block boundary? i.e. half a word will be
> on one block and the other half on the next block? This problem will be
> alleviated to some extent by indexing larger buffers than blocksizes.
You're talking crap! ;-)) Joking..
No Michael.. This will not happen.. The raw_fragment mode only
indexes strings in the fragmented part (Thus 10 bytes before and
10 bytes after the fragmented part) (Or 25... Depends.. hehe)

The raw mode (The real mode) will use the 64kb blocks in a
"walking buffer" kind of way... Every time a new block is loaded,
the last xx (25) bytes of the old block will be prepended and also
indexed... That way no data will ever get missed...
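
To make the "walking buffer" concrete, here is a small sketch (not the
indexer's real code; find_all and its parameters are invented for the
example). It reads a stream strictly sequentially, carries the last
`overlap` bytes of the previous buffer in front of each new chunk, and
reports every offset of a search term, including hits that straddle a
chunk boundary:

```python
import io


def find_all(stream, needle, chunk_size=64 * 1024, overlap=25):
    """Return image offsets of `needle`, reading sequentially with no seeks."""
    if not needle or len(needle) > overlap + 1:
        raise ValueError("needle must be 1..overlap+1 bytes long")
    hits = []
    tail = b""   # bytes carried over from the previous buffer
    base = 0     # image offset of the first byte of `tail`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = buf.find(needle)
        while pos != -1:
            # Hits lying entirely inside `tail` were reported last round.
            if pos + len(needle) > len(tail):
                hits.append(base + pos)
            pos = buf.find(needle, pos + 1)
        keep = min(overlap, len(buf))
        base += len(buf) - keep
        tail = buf[len(buf) - keep:]
    return hits


# "needle" at offset 10 is split by the 13-byte chunk boundary.
data = b"a" * 10 + b"needle" + b"b" * 10 + b"needle"
print(find_all(io.BytesIO(data), b"needle", chunk_size=13))  # [10, 26]
```

A term of up to overlap + 1 bytes can never be missed this way, because at
most `overlap` of its bytes can lie in the old chunk.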

<Snipped a part about the flag implementation>

> That said, indexing is a tough job and it does take a long time... it's
> inescapable. I'm interested in your indexing implementation, because the
> database implementation requires a copy of all the strings to live in the
> db, which basically blows the db up to about 1/3-1/2 the size of the
> (uncompressed) image. This is not really practical.
Well that could be called small.. It all depends on the text elements
inside the image file.. If an image contains only text files, an index
can even become larger than the image itself.
For the size part, I think that my file format will result in smaller
files than a database, as the format is optimized for containing
index trees.. I have reviewed the formats inside (As did my brother)
and we have squeezed the format even smaller in the upcoming release..

<Snipped the rest>

Paul Bakker