sleuthkit-developers Mailing List for The Sleuth Kit (Page 40)
From: Michael C. <mic...@ne...> - 2004-02-23 12:42:10
On Mon, 23 Feb 2004 10:12 pm, Paul Bakker wrote:
> Hi Michael,

Hi Paul,

> I currently support two modes (as will the new release)...
>
> The first is raw mode, meaning that all data on the disk is indexed as it is!.. That means that the whole disk is walked sequentially in (currently) 64k blocks... This can be enlarged if that would increase performance of the underlying subsystem..

Is it possible for you then to just malloc a buffer (say 1mb) and fill it sequentially, and then just index the buffer? Or will that complicate matters?

> The second is raw_fragment mode, meaning that all fragmented pieces of files are indexed in a similar manner as icat runs through them... I use both inode_walk and file_walk.. Thus this consists of more small reads.

Have a look at the dbtools patch, because David Collett has done something similar for flag. In his file walk he is simply building a linked list of blocks in the file (without actually reading these blocks), and then in his print_blocks he is saving entire runs of blocks as unique entries.

So for example, suppose you have a file that's 30 blocks big (~120kb). While the file_walk might call the callback 30 times (once for each block), in the end David's print_blocks function will print a single entry for 30 consecutive blocks.

So the idea is that you might use this information to preallocate a buffer 30 blocks big and read a large chunk into it, then index that buffer. The result is that you need to do 30 times less reading on the actual file (in this example).

From experience, most files are not really fragmented, and at most I have seen large files fragmented into 2-3 parts. That's an average of 2-3 reads for large files, and a single read for small files - not too expensive at all. (Contrast this with reading the block on each callback: you will need to do a read every 4kb in every file, or up to several hundred times for each large file.)

I just ran icat under gdb again to confirm what I'm saying here. This is the result:

(gdb) b icat_action
(gdb) r -f linux-ext2 /var/tmp/honeypot.hda1.dd 16 > /tmp/vmlinuz
Breakpoint 1, icat_action (fs=0x8064e18, addr=2080, buf=0x8065c48 ..., size=1024, flags=1028, ptr=0x805d934 "")

The size is the size that icat writes in every call to icat_action, and it seems to always be 1024 here (the block size). So the icat_action callback is called for every single block.

The other problem I can think of (again, I haven't seen your code yet, so I'm sorry if I'm talking crap here) is that if you only do the indexing on each buffer returned by the file_walk callback, then wouldn't you be missing words that happen to be cut by a block boundary? I.e. half a word will be on one block and the other half on the next block? This problem will be alleviated to some extent by indexing larger buffers than block sizes.

Another suggestion... The way we currently have string indexing done in flag (it's mostly in CVS and not quite finished, but we should have it finished soon :-) is by using David's dbtool to extract _all_ the meta information about the image into the database - this includes all the inodes, files, and the blocks these files occupy (in other words, file_walk and inode_walk). We do not read the blocks themselves, we just note where they are. We then index the entire image file, storing all the offsets for the strings in the database.

Then it's quite easy to tell if a string is in an allocated file, and exactly which file (and inode) it's in. We can extract entire files by simply reading the complete block run (as I described above). The result is that we don't really seek very much; we seek a bit in the dbtool to pull out all the meta data, but then we just read sequentially for indexing - and very large reads at that (I think about 1mb buffers).

That said, indexing is a tough job and it does take a long time... it's inescapable. I'm interested in your indexing implementation, because the database implementation requires a copy of all the strings to live in the db, which basically blows the db up to about 1/3-1/2 the size of the (uncompressed) image. This is not really practical.

> As fragmented parts usually comprise only a very small amount of the disk, this should not be used as an indication of access.. Especially the first mode (raw) is a real time/disk access/processor roughy... In its current form it does not use any seeks, as this greatly increases speed (almost double otherwise)...

Caching can certainly help here. Although if you can restructure your code so that you do few big reads rather than lots of small reads, it would alleviate the need for caching.

This is especially important when you think about the possibility of your code directly operating on EnCase evidence files or compressed volumes, where the major cost is the decompression overhead. Remember that with EnCase the minimum buffer is 32kb, so even if you wanted to read 1 byte, it will still need to decompress the whole 32kb chunk to give you that byte - very expensive. In that case the cost of seeking is negligible relative to the cost of decompression.

Michael.
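As an illustrative sketch of the run-collection idea described above (the helper and field names here are hypothetical; the real dbtool code differs), a file_walk-style callback could record each block address and merge consecutive addresses into runs, so that each run can later be read with one large read instead of one read per block:

    #include <stdlib.h>

    /* Hypothetical run descriptor: a contiguous range of blocks in a file. */
    struct block_run {
        unsigned long start;        /* first block address of the run */
        unsigned long count;        /* number of consecutive blocks   */
        struct block_run *next;
    };

    /* Called once per block by a file_walk-style iterator.  Instead of
     * reading the block, extend the current run or start a new one. */
    static void note_block(struct block_run **runs, unsigned long addr)
    {
        struct block_run *r = *runs;

        if (r && addr == r->start + r->count) {
            r->count++;                      /* consecutive: extend the run */
            return;
        }
        r = malloc(sizeof(*r));              /* gap: start a new run */
        if (!r)
            return;
        r->start = addr;
        r->count = 1;
        r->next  = *runs;
        *runs    = r;
    }

Each accumulated run can then be read with a single read of count * block_size bytes and handed to the indexer, so an unfragmented file costs one read no matter how many blocks it spans.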
From: Paul B. <ba...@fo...> - 2004-02-23 11:21:09
Hi Michael,

> The solution to this problem, I think, is to implement some kind of caching in memory. A cache system can solve all those problems very efficiently, particularly for the case where you make lots of small reads, very close together (i.e. no seeks). A simple cache (with a simple policy) can be implemented quite easily, I think, and will be effective for the scenario you are describing.
>
> What kind of IO do you do for indexing? Is it very localised? If you were to cache a block into memory, what would be the optimal size of the block? (Say 1 mb, or more like 32kb?) If you were to cache 1 mb in memory, how many reads would you get out of it on average?

I currently support two modes (as will the new release)...

The first is raw mode, meaning that all data on the disk is indexed as it is!.. That means that the whole disk is walked sequentially in (currently) 64k blocks... This can be enlarged if that would increase performance of the underlying subsystem..

The second is raw_fragment mode, meaning that all fragmented pieces of files are indexed in a similar manner as icat runs through them... I use both inode_walk and file_walk.. Thus this consists of more small reads.

As fragmented parts usually comprise only a very small amount of the disk, this should not be used as an indication of access.. Especially the first mode (raw) is a real time/disk access/processor roughy... In its current form it does not use any seeks, as this greatly increases speed (almost double otherwise)...

Paul Bakker
From: Michael C. <mic...@ne...> - 2004-02-23 09:42:13
Hi Paul,

> Well I do see advantages.... I already wanted to ask this...
>
> The problem with the current code is that it is not possible to "read_random" an image efficiently because it cannot check the current offset in the image.. This results in unnecessary seeks.. And seeks are very expensive if they come in millions....

I agree - this is particularly bad if the underlying image is a compressed format like EnCase or sgzip, because then each seek/read corresponds to a decompression of at least one block.

> For Indexed Searching it would be very handy if there would come either: a generic fs_read_random() function.
>
> If this function would check for the current offset in the image and thus not seek if the reads were all in succession, this would be great...

This really depends on the specific subsystem; for example, when reading an EnCase file you need to decompress at least one chunk for each seek, so if you read lots of little runs of data all over the file it's going to run slowly.

The solution to this problem, I think, is to implement some kind of caching in memory. A cache system can solve all those problems very efficiently, particularly for the case where you make lots of small reads very close together (i.e. no seeks). A simple cache (with a simple policy) can be implemented quite easily, I think, and will be effective for the scenario you are describing.

What kind of IO do you do for indexing? Is it very localised? If you were to cache a block into memory, what would be the optimal size of the block? (Say 1 mb, or more like 32kb?) If you were to cache 1 mb in memory, how many reads would you get out of it on average?

Michael
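A minimal sketch of the kind of single-block cache being proposed here (hypothetical names only, not the Sleuth Kit API): keep the last loaded block in memory and satisfy any read that falls inside it without touching the underlying image, so runs of small nearby reads trigger only one expensive fill per block.

    #include <string.h>
    #include <sys/types.h>

    #define CACHE_BLOCK 65536                 /* cache granularity: 64kb */

    struct read_cache {
        char   data[CACHE_BLOCK];
        off_t  base;                          /* image offset of cached block */
        int    valid;
        /* fill_block: reads CACHE_BLOCK bytes at 'base' from the image
         * (dd file, sgzip, EnCase, ...) into 'data'. Supplied by the subsystem. */
        int  (*fill_block)(struct read_cache *c, off_t base);
    };

    /* Serve 'len' bytes at image offset 'off'; refill the cache only when
     * the request leaves the currently cached block. */
    ssize_t cached_read(struct read_cache *c, char *buf, size_t len, off_t off)
    {
        size_t done = 0;

        while (done < len) {
            off_t base = (off + done) / CACHE_BLOCK * CACHE_BLOCK;
            if (!c->valid || c->base != base) {
                if (c->fill_block(c, base) < 0)
                    return -1;
                c->base  = base;
                c->valid = 1;
            }
            size_t in_blk = CACHE_BLOCK - (size_t)((off + done) - base);
            size_t n = in_blk < len - done ? in_blk : len - done;
            memcpy(buf + done, c->data + ((off + done) - base), n);
            done += n;
        }
        return done;
    }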
From: Paul B. <ba...@fo...> - 2004-02-23 08:11:55
> Excellent.
>
> > 4) All filesystem code uses the io object to actually do the reading. e.g.:
> > ext2fs->fs_info.io->read_random(ext2fs->fs_info.io, (FS_INFO *) ext2fs, (char *) gd, sizeof(ext2fs_gd), offs, "group descriptor");
> >
> > the io object has a number of methods. For example:
> > io->constructor
>
> What is the constructor used for?
>
> I quickly looked the changes over. Is there a need to pass FS_INFO to the read functions? I didn't see you using them and it would be nice if we could avoid doing that (so that we don't have to restrict ourselves to file system code).

Well, I do see advantages.... I already wanted to ask this...

The problem with the current code is that it is not possible to "read_random" an image efficiently because it cannot check the current offset in the image.. This results in unnecessary seeks.. And seeks are very expensive if they come in millions....

For Indexed Searching it would be very handy if there would come either: a generic fs_read_random() function.

If this function would check for the current offset in the image and thus not seek if the reads were all in succession, this would be great...

Paul Bakker
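A sketch of the seek-avoidance idea being requested (illustrative only; this is not the interface that was eventually adopted): track the file position that the last read left behind and only call lseek() when the requested offset actually differs from it.

    #include <unistd.h>
    #include <sys/types.h>

    struct seq_reader {
        int   fd;       /* image file descriptor            */
        off_t cur;      /* offset the descriptor is at now  */
    };

    /* read_random that skips the lseek() when reads arrive in succession. */
    ssize_t seq_read_random(struct seq_reader *r, char *buf, size_t len, off_t off)
    {
        if (off != r->cur) {                   /* only seek on a real jump */
            if (lseek(r->fd, off, SEEK_SET) < 0)
                return -1;
            r->cur = off;
        }
        ssize_t n = read(r->fd, buf, len);
        if (n > 0)
            r->cur += n;
        return n;
    }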
From: Brian C. <ca...@sl...> - 2004-02-22 23:28:18
Excellent.

> 4) All filesystem code uses the io object to actually do the reading. e.g.:
> ext2fs->fs_info.io->read_random(ext2fs->fs_info.io, (FS_INFO *) ext2fs, (char *) gd, sizeof(ext2fs_gd), offs, "group descriptor");
>
> the io object has a number of methods. For example:
> io->constructor

What is the constructor used for?

I quickly looked the changes over. Is there a need to pass FS_INFO to the read functions? I didn't see you using them, and it would be nice if we could avoid doing that (so that we don't have to restrict ourselves to file system code).

thanks,
brian
From: Paul B. <ba...@fo...> - 2004-02-19 12:21:57
Hi again...

> > But I just wanted to indicate that the combination of your IO Subsystem patch for fstool and my searchtools (Indexed Searching) patch create a system that is very powerful.
> Indeed, your indexing support looks very cool. I haven't played with it just yet though (gotta find some time :-)

It seems we have the same problem: time.... ;-)

But I'm already making Searchtools (Indexed Searching) ready for your patch. Normally in raw index mode I just read the raw image file. I'm now updating Searchtools to use Sleuthkit image reading, so when your patch comes out only minor changes in my code are needed to enable indexing of split dd files or EnCase images.....

> > The only thing really missing is a subsystem that makes it possible to "read" fileformats on the image with a specific interpreter. That would enable us to "read" PDF files, PST files, etc...
> I'm not sure I know what you mean, the IO subsystem is done at a very low level (well, at the IO level)... The interpretation of different files on the filesystem is surely the job of a higher level application?

Yes, sorry to confuse anybody... I meant that Sleuthkit as a whole should contain a generic way of accessing filetypes found on the images - at a higher level than the IO subsystem, but still integrated with Sleuthkit. Otherwise one has to extract files from the image before they can be processed (for instance, indexed (hint!)).

Autopsy would benefit from that, as it would be possible to integrate FTK-like functionality to read PDF/PST files from the web interface. And it would make it possible to index files inside the image based on the text therein (also files inside ZIP files and such)..

Paul Bakker
From: Michael C. <mic...@ne...> - 2004-02-19 10:53:02
On Thu, 19 Feb 2004 09:24 pm, Paul Bakker wrote:
> Hi Michael....

Hi Paul,

> This sounds very good and cool.... (I haven't looked at your patch yet though...)..

Thanks...

> But I just wanted to indicate that the combination of your IO Subsystem patch for fstool and my searchtools (Indexed Searching) patch create a system that is very powerful.

Indeed, your indexing support looks very cool. I haven't played with it just yet though (gotta find some time :-)

> The only thing really missing is a subsystem that makes it possible to "read" fileformats on the image with a specific interpreter. That would enable us to "read" PDF files, PST files, etc...

I'm not sure I know what you mean; the IO subsystem is done at a very low level (well, at the IO level)... The interpretation of different files on the filesystem is surely the job of a higher level application?

For example in flag (http://sourceforge.net/projects/pyflag/), we are using exgrep (which is similar, I gather, to foreman) to extract files from the image and then use magic (and NSRL) to classify those and do some post processing. The GUI is then able to use the correct facility for displaying those images (usually by setting the correct mime type and asking the browser to display it, but not necessarily).

If you want to index the contents of binary files (say zip files or gzipped files), maybe the best place to do so is by postprocessing at the higher level application?

I am also working on reimplementing exgrep to use a python file-like object created using the proposed sleuthkit io subsystem. This way we can use exgrep to extract files from any type of image. For example, we can find deleted and otherwise un-recoverable images from an encase image etc.

It would be cool if higher level programs (like autopsy or flag) could operate directly on the io subsystem for other file-like operations (like running foreman, indexing, whatever). To this end I am working on a swig interface for this io subsystem, so we could use perl or python to directly access all those images.

> If all these 3 are in place, I think sleuthkit is a product that is more powerful than any of the other products I use...

I concur with you. I had a bit of a play with encase and there is much room for encase to improve before it could be usable. (Although, as I mentioned, I only had encase v 3; maybe 4 is better.)

Michael.
From: Paul B. <ba...@fo...> - 2004-02-19 10:30:08
Hi Michael....

This sounds very good and cool.... (I haven't looked at your patch yet though...)..

But I just wanted to indicate that the combination of your IO Subsystem patch for fstool and my searchtools (Indexed Searching) patch create a system that is very powerful.

The only thing really missing is a subsystem that makes it possible to "read" fileformats on the image with a specific interpreter. That would enable us to "read" PDF files, PST files, etc...

If all these 3 are in place, I think sleuthkit is a product that is more powerful than any of the other products I use...

Paul Bakker

-----Original message-----
From: Michael Cohen [mailto:mic...@ne...]
Sent: Thursday 19 February 2004 9:17
To: Brian Carrier
CC: sle...@li...
Subject: [sleuthkit-developers] Re: IO Subsystem patch for fstools

> [Michael's "IO Subsystem patch for fstools" message was quoted here in full; it appears as the next post below, dated 2004-02-19 10:15:25.]
From: Michael C. <mic...@ne...> - 2004-02-19 10:15:25
Hi Brian,

I have started implementing the changes according to your suggestions, and it's coming along great. I quite like the method of extending the basic IO_INFO struct with more specific structs and then casting them back and forth. The code has a great OO feeling about it. Thanks for the pointers.

Attached is an incomplete patch, just to check that I'm on the right track. Only the fls tool is working with the new system currently, although it should compile ok. Here is a summary of the changes:

1) fls uses the -i commandline parameter to choose the subsystem to use, and then gets an IO_INFO object by doing an:

io = io_open(io_subsys);

If no subsystem is specified, it uses "standard", which is the current default.

2) options are parsed into the io object by calling:

io_parse_options(io, argv[optind++]);

(Options are in option=value format, or else if they do not contain '=', we take it as a filename.)

3) Once we parse all of the options, we create an fs object using the io object:

fs = fs_open(io, fstype);

This will initialise an fs->io parameter with the io subsystem object.

4) All filesystem code uses the io object to actually do the reading. e.g.:

ext2fs->fs_info.io->read_random(ext2fs->fs_info.io, (FS_INFO *) ext2fs, (char *) gd, sizeof(ext2fs_gd), offs, "group descriptor");

The io object has a number of methods. For example:

io->constructor
io->read_random
io->read_block

This has a cool object oriented feel about it.

5) There is an array in fs_io.c which acts like a class and is initialised to produce new objects of all IO_INFO derived types (i.e. all new io objects are basically copies of this struct with extra stuff appended). e.g.

static IO_INFO subsystems[] = {
  { "standard", "Standard Sleuthkit IO Subsystem", sizeof(IO_INFO_STD),
    &io_constructor, &free, &std_help, &std_initialiser, &std_read_block,
    &std_read_random, &std_open, &std_close},

Note that IO_INFO_STD is a derived object of IO_INFO:

struct IO_INFO_STD {
  IO_INFO io;
  char *name;
  int fd;
};

i.e. it has the same methods as IO_INFO, but extra attributes. All the other subsystems use this method to add extra attributes to the basic IO_INFO pointer, which is carried around the place as an IO_INFO (cast from IO_INFO_ADV etc).

From a user perspective, we can now do this:

- Read in a simple dd partition (compatibility with old fls):
  fls -r -f linux-ext2 honeypot.hda5.dd

- Read in a partition file from a hdd dd image (i.e. use an offset):
  fls -r -i advanced offset=100 honeypot.hda5.dd

- Read in a split dd file (after splitting with split):
  fls -r -i advanced /tmp/xa*
  Or (same thing, just to emphasize the fact that multiple files can be specified on the same command line):
  fls -r -i advanced /tmp/xaa /tmp/xab /tmp/xac /tmp/xad

- Read in an sgzipped file:
  fls -i sgzip -r -f linux-ext2 honeypot.hda5.dd.sgz

- List files in directory inode 62446:
  fls -i sgzip -r -f linux-ext2 honeypot.hda5.dd.sgz 62446

What's left to do:

As I mentioned, this is only an intermediate patch to check that I'm on the right track. These are the things that need to be finished:

1) Add a config file parser option to allow options to be passed from a config file, i.e.:
  fls -r -c config -i advanced honeypot.hda5.dd
  where config is just a file with lines like:
  offset=1024

2) Change offset to be in blocks (and add a blocks keyword to override the fs default).

3) Update all the other tools other than fls to support the new syntax.

4) I have also been working on an expert witness subsystem. Expert witness is the format used by encase, ftk etc. I have a filter working atm to convert these files to straight dd, but I want to implement a subsystem so we can work on these files directly in sleuthkit. This is coming very soon, maybe this weekend. This will obviously require lots of tender loving testing, because I only have encase ver 3 to play with.

Please let me know what you think about this. Also let me know if there are other things that still need to be completed regarding this patch.

Michael.
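To illustrate the derivation-by-embedding pattern described in point 5 above (a sketch only; the struct and function signatures here are simplified stand-ins, not the exact patch API), a subsystem callback receives the generic IO_INFO pointer and casts it back to its own derived type to reach subsystem-specific fields such as the file descriptor:

    #include <unistd.h>
    #include <sys/types.h>

    /* Simplified stand-ins for the structs described in the patch summary. */
    typedef struct IO_INFO {
        const char *name;
        ssize_t (*read_random)(struct IO_INFO *io, char *buf,
                               size_t len, off_t offset, const char *comment);
    } IO_INFO;

    typedef struct IO_INFO_STD {
        IO_INFO io;          /* base object must come first so casts work */
        char   *fname;
        int     fd;
    } IO_INFO_STD;

    /* The "standard" subsystem's read_random: cast the base pointer back to
     * the derived type to get at the file descriptor, then pread(). */
    static ssize_t std_read_random(IO_INFO *io, char *buf, size_t len,
                                   off_t offset, const char *comment)
    {
        IO_INFO_STD *self = (IO_INFO_STD *) io;  /* safe: io is the first member */
        (void) comment;                          /* used only for error messages */
        return pread(self->fd, buf, len, offset);
    }

Filesystem code only ever sees the IO_INFO pointer and calls io->read_random(...), so adding a new image format amounts to adding one more derived struct and one more entry in the subsystems[] table.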
From: Brian C. <ca...@sl...> - 2004-02-19 02:29:00
On Feb 18, 2004, at 1:58 PM, Epsilon wrote:
> I'm getting a very large (>500 MB) file when using the -s option with icat when I should be getting a file that's around 40 KB. I'm using sleuthkit-1.67. Anyone else seeing this?

Wow. What file system type? Can you send the output of running 'istat' on it?

brian
From: Epsilon <ep...@ya...> - 2004-02-18 20:03:21
I'm getting a very large (>500 MB) file when using the -s option with icat when I should be getting a file that's around 40 KB. I'm using sleuthkit-1.67. Anyone else seeing this?
From: Brian C. <ca...@sl...> - 2004-02-10 06:46:25
> This release has many improvements, but most importantly it now actively uses the "icat.c" and "ils.c" files (at least the code of them) and the fs_tools.a library..
>
> This means that if a lot of things will change in the sleuthkit code, that this will affect the indexed searching patch.... So I would like to know if that will happen up front, if that is possible....

There shouldn't be major changes. You may need to call a function to open the image, and the arguments to fs_open may change, but that is about it. The core API will remain the same. I would like to see if we can incorporate the indexing code into the redesign.

> The patches made by Pepijn Vissers and me have gotten their own webpage and can be found on http://www.brainspark.nl/?show=tools_sleuthkit (This link will also be placed on the download page of sleuthkit and autopsy..)...

It's up there. I added it at some point during the past couple of weeks.

thanks,
brian
From: Brian C. <ca...@sl...> - 2004-02-10 06:33:39
On Feb 9, 2004, at 3:44 AM, Michael Cohen wrote:
>> Yea, but autopsy or flag need to store all of those options as well in their own configuration file so that they can pass them to the fstools. It would seem more flexible if the sleuth kit had a configuration file for the image, which could then be used by any GUI (including autopsy, flag, rex etc.)
> The problem with this approach is that the fstools are then too integrated with the GUI, especially if the configuration file becomes so complex that you really need a GUI to make one. In that case you can't just use them by themselves. Is that something we are prepared to live with? Or is a design goal to make small self contained tools that may be used from the command line?

I guess that depends on how complex the config file is. RAID systems are obviously the most complex. Is there a way that we can make the config file fairly simple and have the maps built into the sleuth kit instead of the GUI? That will be faster to load as well if it just has to read a few parameters from the config file instead of a full mapping.

>> Sure. For this to work with the Sleuth Kit though, there must be the ability to create the configurations in the sleuth kit. If the only way to create the map is in flag, then that doesn't do autopsy any good or future interfaces and it doesn't make sense to replicate the stuff in each gui.
>
> That's true. However, a raid map must be generated internally anyway in order to reassemble the individual raid implementations (e.g. lvm, linux raid, etc). Perhaps we can have different IO subsystems which all they do is generate a generic map and then call the generic raid implementation? So for example, say we have a generic raid io subsystem as described above that takes a raid map as input; then we have another subsystem called lvm, for example, which accepts a bunch of lvm specific parameters and then generates a raid map and calls the generic raid io subsystem. This way autopsy doesn't need to be able to build a generic map in the gui, but one will be built automatically as required. If the user works out a way to build a raid map by some other means (i.e. some other GUI, by hand, or whatever), they can still use the generic raid implementation.

I would rather let each of the RAID types have their own data structures and read functions. That seems much more simple, scalable and efficient to run. It seems that making a map for every type of RAID system is like trying to make a mapping for all file systems so that we can use generic code for processing. Maybe we can have a type for generic RAID, but for Windows LDM or the Linux RAID configurations where the data structures and layout are known, I would rather have standard code. That is much easier to audit as well. If someone wants to review what is going on, having to audit a raid map would be a major pain.

>>> Maybe it would make more sense to populate the IO_INFO structure inside the FS_INFO structure?
>>
>> I would rather not. I would prefer to keep the file system code separate from the image format code. In fact, I would even consider making all of this image stuff its own library, imgtools maybe. It seems much more logical to call the file system processing code with the filled in IO_INFO structure and let it read from it. The file system code would never touch any of the file descriptors, it would just call the read functions. This also allows the 'mm...' tools to use the image formats and any other future tools, such as memory images that are split or saved in another tool's proprietary format.
>
> Just to clarify what you are saying... Are you proposing to make the io_subsystem and file system code into separate libraries, and then the individual tools (e.g. fls) would open the subsystem, initialise it, and then call the file system code giving it a filled in IO_INFO structure? If I understood your comment right, it sounds great.
>
> So the IO_INFO structure will contain function pointers to the read_random and read_block which will be initialised by the constructor, and the fs code would just call those methods? Sounds great:

Yea. That seems to scale the best.

> FS_INFO *
> ext2fs_open(const char *name, unsigned char ftype)
>
> Changes to:
>
> FS_INFO *
> ext2fs_open(IO_INFO *io)
>
> (BTW do you think that ftype is a little redundant here?

Yes and no. For most cases it is, but for FAT, where the user can force it to be FAT12, FAT16 or FAT32, it is needed. If we do auto detection though, then we can probably scrap it.

> and a little off topic, it would be nicer if the *fs_open routines returned NULL if they couldn't find the filesystem rather than error out, because then you could cycle over all filesystem decoders until one worked rather than demanding the user specify the -f parameter all the time. You could use -f to override the automatic detection)

Yea, that is a good idea. These types of changes are the ones that I want to examine when The Sleuth Kit gets a makeover in the next few months. Auto detection of file systems would be very nice.

>> I would lean towards the way that FS_INFO is structured. There would be a few basic items in IO_INFO, such as the function pointers and maybe the maximum size of the image. Then there are image specific structures that have their needed values. For example, the structure for split images may have an array of file descriptors and a structure with the sizes of each split image. The normal image structure may just have one file descriptor. Actually, maybe this whole thing is better called IMG_INFO instead of IO_INFO.
>
> That sounds great, we could cast a void* to achieve this, and then each io subsystem makes its own pointer and casts to void*:
>
> struct IMG_INFO {
>   common fields
>   ....
>   function pointers....
>   void *data;
> }

To keep the fstools and imgtools code consistent we should either change the fstools structures to use the void *, or we can just use the method that they use. The file system structures, NTFS_INFO for example, are defined like:

struct NTFS_INFO {
  FS_INFO fs_info;
  int blah;
  int boo;
}

The code casts the pointers as both FS_INFO and NTFS_INFO. So, the argument to an API function takes the FS_INFO structure, but the ntfs code just casts it to an NTFS_INFO. Both do the job, but I would like to be consistent.

> and maybe the multipart reassembly has:
>
> struct part {
>   char *filename;
>   struct part *next;
> }

Yea, and we can include a file descriptor that is only opened when that file is needed and isn't closed until the end. We'll also need a size field in there somehow as well, but we can figure out the details later.

thanks,
brian
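As a sketch of the split-image bookkeeping discussed at the end of this message (field and function names here are hypothetical; the real imgtools layout was still being decided), each part carries its filename, its size, and a lazily opened descriptor, and a read first walks the chain to find which part covers a given logical image offset:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    struct part {
        char        *filename;
        off_t        size;       /* bytes contributed by this part    */
        int          fd;         /* -1 until the part is first needed */
        struct part *next;
    };

    /* Read 'len' bytes starting at logical image offset 'off' from a chain
     * of split-image parts, opening each part only when it is first touched. */
    ssize_t split_read(struct part *parts, char *buf, size_t len, off_t off)
    {
        size_t done = 0;

        for (struct part *p = parts; p && done < len; p = p->next) {
            if (off >= p->size) {            /* request starts past this part */
                off -= p->size;
                continue;
            }
            if (p->fd < 0 && (p->fd = open(p->filename, O_RDONLY)) < 0)
                return -1;

            size_t avail = p->size - off;
            size_t want  = avail < len - done ? avail : len - done;
            ssize_t n = pread(p->fd, buf + done, want, off);
            if (n < 0)
                return -1;
            done += n;
            off = 0;                         /* continue at the start of the next part */
        }
        return done;
    }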
From: Paul B. <ba...@fo...> - 2004-02-09 09:16:43
Hi everybody...

Long time no see..... had a busy time....

Before all code is garbled up again ;-).. I wanted to let everybody know that within a month the third release of indexed searching will be released...

This release has many improvements, but most importantly it now actively uses the "icat.c" and "ils.c" files (at least the code of them) and the fs_tools.a library..

This means that if a lot of things will change in the sleuthkit code, that this will affect the indexed searching patch.... So I would like to know if that will happen up front, if that is possible....

The patches made by Pepijn Vissers and me have gotten their own webpage and can be found on http://www.brainspark.nl/?show=tools_sleuthkit (This link will also be placed on the download page of sleuthkit and autopsy..)...

I hope to release the third version soon.. I will post a full "new feature list" here if I do..

Paul Bakker

> -----Original message-----
> From: Brian Carrier [mailto:ca...@sl...]
> Sent: Wednesday 4 February 2004 16:14
> To: Dave
> CC: Sleuthkit
> Subject: Re: [sleuthkit-developers] Sleuthkit -> database patch
>
> Wow again! All of these projects that I have been thinking about doing are getting done! Thanks.
>
> As an FYI, after autopsy gets its redesign finished, I had been meaning to re-examine The Sleuth Kit. One of the things that I wanted to change was the output of tools such as 'ils' and 'fls' so that they could be more useful and more easily processed. Much of the output is still legacy from the TCT design. For example, I'm not sure if I have ever used the default output of 'ils'. So, the results from this work will be useful when figuring out the best format options and what the important data is in the output.
>
> I'll add pointers to the archive with this patch and the IO subsystem patch from the downloads page.
>
> thanks,
> brian
>
> On Feb 4, 2004, at 6:02 AM, Dave wrote:
>
> > Hi all,
> > Attached is a patch to sleuthkit to output sleuthkit filesystem data as SQL statements for entry into a database.
> >
> > Background:
> > Sleuthkit fstools output are not easily machine-readable, and as such not well suited for use by front-end gui applications. A better approach is to analyse the filesystem in one pass and store all the filesystem data (about files, inodes, blocks etc) in a database system for the gui analysis program to query at will.
From: Michael C. <mic...@ne...> - 2004-02-09 08:44:47
> Yea, but autopsy or flag need to store all of those options as well in their own configuration file so that they can pass them to the fstools. It would seem more flexible if the sleuth kit had a configuration file for the image, which could then be used by any GUI (including autopsy, flag, rex etc.)

The problem with this approach is that the fstools are then too integrated with the GUI, especially if the configuration file becomes so complex that you really need a GUI to make one. In that case you can't just use them by themselves. Is that something we are prepared to live with? Or is a design goal to make small self contained tools that may be used from the command line?

I agree that fstools by themselves are probably not all that useful without having some sort of GUI. So perhaps we just live with an increased level of complexity for the fstools, in favour of better integration into larger GUIs?

The other problem that may arise from trying to make fstools integrate with the GUIs' configuration files is that different GUIs store configuration in different ways; for example, flag stores everything in the database (not even in a file), so having a single configuration file format is a little clunky.

It's not so bad currently for flag, since we are currently using the database patch that was posted on the list a little while ago to dump out all the data from the image, and we never really use the individual tools like ils, fls, icat etc. So it won't be too hard to simply write out a conf file for each image. However, I'm just thinking of the old version of flag where we did shell out to these tools basically for each file in the filesystem; the cost of parsing a huge config file for each invocation of icat would be tremendous, I would imagine.

> If we are going to start discussing configuration files for the fstools (which we both agree are required for at least RAID), then I would rather make them general enough so that they can be used for other formats besides RAID. I would even like to have these include the file system type, mounting point, and hashes of each partition. Basically the stuff that the other tools include in some proprietary format with the image, we would put in a separate text file.

That's a great idea (accepting that the level of complexity of the fstools is increased). I would vote for using an xml config file format, since it's standard and easy to deal with and we don't have to write a parser. The downside is that we increase the program dependency by requiring libxml2 to be present. Alternatively we could write some yacc/lex parser, but we then need to discuss a good format which will be sufficient for autopsy and allow future growth.

> Sure. For this to work with the Sleuth Kit though, there must be the ability to create the configurations in the sleuth kit. If the only way to create the map is in flag, then that doesn't do autopsy any good or future interfaces and it doesn't make sense to replicate the stuff in each gui.

That's true. However, a raid map must be generated internally anyway in order to reassemble the individual raid implementations (e.g. lvm, linux raid, etc). Perhaps we can have different IO subsystems which all they do is generate a generic map and then call the generic raid implementation? So for example, say we have a generic raid io subsystem as described above that takes a raid map as input; then we have another subsystem called lvm, for example, which accepts a bunch of lvm specific parameters and then generates a raid map and calls the generic raid io subsystem. This way autopsy doesn't need to be able to build a generic map in the gui, but one will be built automatically as required. If the user works out a way to build a raid map by some other means (i.e. some other GUI, by hand, or whatever), they can still use the generic raid implementation.

> I'm still not convinced that we need so many options on the command line. The only case that I can see where all of the command line options are beneficial is for a live analysis where you don't want to write to the disk. But, in that case I don't see why you would need to use any of these complex image formats because you will have access to the raw device corresponding to the partition.

That's true, and if you have access to the raw device you would not need extra options or more complex io subsystems.

> Is there a specific reason with flag that command line options are easier?

No reason currently, because we have our own program (dbtool) written using the fstools library (as is seen in the patch Dave submitted). I'm just thinking about the way it used to work by shelling out. Maybe a better way is to simply document the fstools library and define a clear interface (with a proper shared library), and then people would be expected to use the library rather than shell out to the tools all the time.

>> Maybe it would make more sense to populate the IO_INFO structure inside the FS_INFO structure?
>
> I would rather not. I would prefer to keep the file system code separate from the image format code. In fact, I would even consider making all of this image stuff its own library, imgtools maybe. It seems much more logical to call the file system processing code with the filled in IO_INFO structure and let it read from it. The file system code would never touch any of the file descriptors, it would just call the read functions. This also allows the 'mm...' tools to use the image formats and any other future tools, such as memory images that are split or saved in another tool's proprietary format.

Just to clarify what you are saying... Are you proposing to make the io_subsystem and file system code into separate libraries, and then the individual tools (e.g. fls) would open the subsystem, initialise it, and then call the file system code giving it a filled in IO_INFO structure? If I understood your comment right, it sounds great.

So the IO_INFO structure will contain function pointers to the read_random and read_block which will be initialised by the constructor, and the fs code would just call those methods? Sounds great:

FS_INFO *
ext2fs_open(const char *name, unsigned char ftype)

changes to:

FS_INFO *
ext2fs_open(IO_INFO *io)

(BTW do you think that ftype is a little redundant here? And, a little off topic, it would be nicer if the *fs_open routines returned NULL if they couldn't find the filesystem rather than error out, because then you could cycle over all filesystem decoders until one worked rather than demanding the user specify the -f parameter all the time. You could use -f to override the automatic detection.)

> I would lean towards the way that FS_INFO is structured. There would be a few basic items in IO_INFO, such as the function pointers and maybe the maximum size of the image. Then there are image specific structures that have their needed values. For example, the structure for split images may have an array of file descriptors and a structure with the sizes of each split image. The normal image structure may just have one file descriptor. Actually, maybe this whole thing is better called IMG_INFO instead of IO_INFO.

That sounds great. We could cast a void* to achieve this, and then each io subsystem makes its own pointer and casts to void*:

struct IMG_INFO {
  common fields
  ....
  function pointers....
  void *data;
}

and maybe the multipart reassembly has:

struct part {
  char *filename;
  struct part *next;
}

So we initialise as:

IMG_INFO *img;
img->data = (void *) part_list;

While the raid one is totally different:

struct raid {
  whatever,
  ... more stuff,
}

The advantage of this option is that the IMG_INFO struct doesn't need to know about each subsystem.

> In the imgtools collection, we could actually have a tool that converts the proprietary image formats to a raw image.

That could be a new standalone tool which chooses the right io-subsystem and dumps a dd image out. It would be useful in the case of raid.

> Very cool. I had never seen sgzip before. I guess it isn't as much of a pain as I thought :)

It only took a day or so to write sgzip for this purpose, and I thought it would be useful in general for any application needing quick seeking in a compressed file. The library is now available in general on sf:
http://sourceforge.net/project/showfiles.php?group_id=100803

>> Do you have any idea how you would read in encase files? I didn't get the chance to ever use it, so I don't know how complex the file format is, but there is nothing I can find on the net re the format.
>
> Check out asrdata.com. Somewhere on there is a link to the expert witness format.

Thanks for that; the format looks remarkably similar to sgzip, except with some extra meta data stuck in there. It should be easy to write a library to access this. I just need to get a small example encase image to play with.

> I apologize if I am being a pain with some of these details, but after having to redesign autopsy because of a bad initial design, I want to make sure we add this new functionality the right way.

I think the discussion is very constructive so far. I was initially expecting a small change, but it looks like there is a need now to do a larger reorganization of code. It's going to pay off in the long run, I expect.

cheers
Michael.
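A compact sketch of the void-pointer alternative being floated here (illustrative names only; the eventual imgtools structures were still under discussion): the generic IMG_INFO stays ignorant of every subsystem, and each backend hangs its private state off the data field.

    #include <stdlib.h>
    #include <sys/types.h>

    typedef struct IMG_INFO {
        off_t    size;                                  /* total image size        */
        ssize_t (*read_random)(struct IMG_INFO *img,    /* subsystem entry point   */
                               char *buf, size_t len, off_t off);
        void    *data;                                  /* subsystem private state */
    } IMG_INFO;

    /* Private state for a hypothetical split-image backend. */
    struct split_state {
        char **filenames;
        int    nparts;
    };

    static ssize_t split_read_random(IMG_INFO *img, char *buf, size_t len, off_t off)
    {
        struct split_state *s = img->data;   /* recover the backend's own state */
        (void) s; (void) buf; (void) len; (void) off;
        /* ... locate the part containing 'off' and pread() from it ... */
        return -1;                           /* left as a stub in this sketch */
    }

    IMG_INFO *split_open(char **filenames, int nparts)
    {
        IMG_INFO *img = calloc(1, sizeof(*img));
        struct split_state *s = calloc(1, sizeof(*s));
        s->filenames = filenames;
        s->nparts    = nparts;
        img->read_random = split_read_random;
        img->data        = s;
        return img;
    }

Compare this with the FS_INFO-embedding style (NTFS_INFO carrying fs_info as its first member), which reaches the same dispatch through casting rather than a void pointer; that is exactly the consistency choice being debated in these messages.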
From: Brian C. <ca...@sl...> - 2004-02-09 06:04:45
On Feb 8, 2004, at 10:40 AM, Matthias Hofherr wrote:
> What about tools which are used by both blackhats and whitehats? Where would you place, e.g., nmap, packit ... ?

I have no clue. I think we need to group them based on core functionality, not on historical associations. Therefore, nmap would go in the same category as all port scanners, even the nice windows GUI ones. I didn't add them to a category because I wasn't sure if there should be a network utilities category, and if there was such a category, what its requirements would be. I am unsure if port scanners are an attack security tool or a general network tool. I'm not sure where sniffers fit either. I would say that packit is an attack tool, so it goes into the security-attack category.

The categories can't reflect the intent of an installation or execution. Port scanners that have been customized to search for specific services and launch attacks, or that create config files that can be used for attacks, have been designed to attack and would therefore go into the attack category; that is considered different than nmap.

After thinking about this: when these searches are conducted on the hard disk, we are looking for tools and files that serve a certain function. If we are looking at a server intrusion case, we want to know about all tools that could have played a role, regardless of whether it is nmap or netcat or the network utilities program that comes with OS X.

Maybe subcategories are a good idea. For example, there may be a general network utilities category. You can select it as either all good or all bad, or you can select the state of each subcategory (host scanners, port scanners, sniffers). Any of these utilities that has been customized for attacking will be placed in the security attack category.

> In which category would you place child-porn?

It falls in the 'Multimedia Files' category because it is a graphical image file. Child porn is such a unique and common case, though, that I think it warrants a subcategory or a related multimedia category. This is tough!

As a test for any taxonomy that we come up with, it would be useful if we could map the existing application types in the NSRL to them.

thanks,
brian
From: Brian C. <ca...@sl...> - 2004-02-09 05:35:26
On Feb 7, 2004, at 7:52 AM, Michael Cohen wrote:
> Although I would normally agree with you that configuration files can contain much more information and therefore most parameters should go there, in this case it would be a pain to have all these options passed in through a configuration file. This is primarily because fstools are most often used as a backend to larger programs (like flag and autopsy), and so it's much easier to shell out to them, even if they have 20 command line arguments, than to have to write a config file and then invoke them with it.

Yea, but autopsy or flag need to store all of those options as well in their own configuration file so that they can pass them to the fstools. It would seem more flexible if the sleuth kit had a configuration file for the image, which could then be used by any GUI (including autopsy, flag, rex etc.)

If we are going to start discussing configuration files for the fstools (which we both agree are required for at least RAID), then I would rather make them general enough so that they can be used for other formats besides RAID. I would even like to have these include the file system type, mounting point, and hashes of each partition. Basically the stuff that the other tools include in some proprietary format with the image, we would put in a separate text file.

>> Oh ok. I think that it will be very hard to create such a configuration file. To create the file, you will need to know which VM / RAID system is being used. I think it would be much easier to have a subsystem for each VM / RAID type and then the only thing that needs to be specified in the configuration file is the options for that type. For example, if the Linux LVM were used, then you may need to only specify the disk ordering and the block size. When reading from the image, the lvm-split-read() function would be used.
>
> The raid map is easily assembled using the gui in flag. This is not a problem. If the investigator knows which raid implementation it is, they can look up a library of such maps and use those, but the io subsystem needs to handle arbitrary reassembly maps.
>
> Maybe we can implement such a library in fs_io, so that you could invoke the same raid subsystem, and pass it a parameter, say "type", naming which raid type it should use, and it can look up the map by itself?

Sure. For this to work with the Sleuth Kit though, there must be the ability to create the configurations in the sleuth kit. If the only way to create the map is in flag, then that doesn't do autopsy any good or future interfaces and it doesn't make sense to replicate the stuff in each gui.

>> I have no problems changing all of the files. If we are going to add this functionality, I would rather do it right the first time.
> Great. What you propose is a much better way. We just need to resolve the option issue because these options need to be present inside such an IO_INFO struct that will be passed around the place. Since these options can be arbitrary and they only make sense for the subsystem itself, what do you think about my original suggestion of having a single void *data field in the IO_INFO struct and letting each subsystem put their stuff in there?

I'm still not convinced that we need so many options on the command line. The only case that I can see where all of the command line options are beneficial is for a live analysis where you don't want to write to the disk. But, in that case I don't see why you would need to use any of these complex image formats, because you will have access to the raw device corresponding to the partition.

Is there a specific reason with flag that command line options are easier?

>> Actually, I guess we just need one io_open() function because fls.c and similar files will not know if the file is a config file or an image file. io_open would have a char ** to list the image files or config file, a type field for the type of image format, and an offset value. It would then fill in the IO_INFO structure and return it, which would be passed to fs_open().
> Maybe it would make more sense to populate the IO_INFO structure inside the FS_INFO structure?

I would rather not. I would prefer to keep the file system code separate from the image format code. In fact, I would even consider making all of this image stuff its own library, imgtools maybe. It seems much more logical to call the file system processing code with the filled in IO_INFO structure and let it read from it. The file system code would never touch any of the file descriptors, it would just call the read functions. This also allows the 'mm...' tools to use the image formats and any other future tools, such as memory images that are split or saved in another tool's proprietary format.

> Next we just need to agree on what to put in struct IO_INFO.

I would lean towards the way that FS_INFO is structured. There would be a few basic items in IO_INFO, such as the function pointers and maybe the maximum size of the image. Then there are image specific structures that have their needed values. For example, the structure for split images may have an array of file descriptors and a structure with the sizes of each split image. The normal image structure may just have one file descriptor. Actually, maybe this whole thing is better called IMG_INFO instead of IO_INFO.

In the imgtools collection, we could actually have a tool that converts the proprietary image formats to a raw image.

Compression would be a major pain. Split, EnCase, and some of the RAID systems seem much easier.

> Please see the patch I just submitted re compression.

Very cool. I had never seen sgzip before. I guess it isn't as much of a pain as I thought :)

> Do you have any idea how you would read in encase files? I didn't get the chance to ever use it, so I don't know how complex the file format is, but there is nothing I can find on the net re the format.

Check out asrdata.com. Somewhere on there is a link to the expert witness format.

I apologize if I am being a pain with some of these details, but after having to redesign autopsy because of a bad initial design, I want to make sure we add this new functionality the right way.

thanks,
brian
From: Matthias H. <mat...@mh...> - 2004-02-08 15:40:11
Hi Brian,

Brian Carrier said:
[...]
> Security - Prevention
> - Tools and files that are used to secure a system from attack
>   - anti-virus
>   - personal firewalls
>   - IDS
>
> Security - Attack
> - Tools and files that are used to cause a security incident
>   - exploits
>   - attack tools
>   - DDoS tools
>   - viruses
> - Tools and files that are used to remove evidence of incident
>   - log cleaner
>   - evidence eliminator
> - Tools and files that are used to allow access to a compromised system
>   - rootkits
[...]

What about tools which are used by both blackhats and whitehats? Where would you place, e.g., nmap, packit ... ?

In which category would you place child-porn?

Regards,
Matthias
From: Michael C. <mic...@ne...> - 2004-02-08 01:44:41
|
OK, I'll try sending this again; the first time it was blocked by the list server for being too big, so I compressed the patch. Hopefully this is OK. (There are also some small changes since the original patch, namely that decompress now works with sgzip.) Hi list members, Thank you for the positive response to the recently proposed IO subsystem patch for sleuthkit. Although discussions are still underway as to how exactly this should be integrated into the sk (the way it is done now may not fit very well into the overall architecture), I have been experimenting with adding more subsystems besides the existing "advanced" subsystem. The latest addition is an sgzip io subsystem. Background: In today's forensic work one regularly needs to work on very large hard disks; many new systems are sold with 80+GB drives, so even the most trivial forensic analysis workstation needs to be able to accommodate huge dd images. This is obviously not practical. Secondly, storage and archiving of such images is difficult, which is why most people compress their images for storage. Currently people need to uncompress their images before the sleuthkit can deal with them, which is a major pain in the proverbial. Solution: The previous io-subsystem patch allows easy integration of different io decoders for dealing with image files of various formats. This current version (a cumulative patch replacing the previous one) implements an interface to the sgzip library (also supplied in the patch). sgzip (seekable gzip) is a file format for fast, seekable compressed images. The problem with a regular gzip file is that one cannot seek in it in a reasonable time (typically a seek involves decompressing the whole file). sgzip allows very fast seeks at a very small loss of compression (up to about 2%). Interested developers can read the details of the sgzip file format in sgzlib.h. Basically, after applying the patch to sleuthkit-1.67 and compiling normally, you will get a new binary in bin/ called sgzip. It is very similar to gzip. You use it like this: bash$ sgzip honeypot.hda8.dd This will create a file called honeypot.hda8.dd.sgz. Unlike gzip it does not unlink the original file, but it will overwrite an existing honeypot.hda8.dd.sgz. (You can also pipe into sgzip, just like gzip. Useful for grabbing images from netcat straight into sgzip.) The filesize of the new compressed image is a little larger than the corresponding gzip file, because it carries more indexes and the compression is suboptimal when the stream is broken across several blocks, but the difference is too small to care about. You can use the sleuthkit to work directly with this new compressed file by selecting the right subsystem: bash$ fls -r -i sgzip -f linux-ext2 honeypot.hda8.dd.sgz That's it. Complex, hey? The sgzip subsystem also takes an offset argument, so we can compress a whole hdd image and then work with individual partitions. Performance: For detailed documentation of the sgzlib implementation, read the sgzlib.h file and the source code. Suffice it to say that sgzip works by breaking the uncompressed file into blocks and compressing each block separately. Then, when we need to decompress a random bit of data, we find the right block and decompress only it. Hence, if we want to read very small runs of data we are better off with smaller blocks, so that we avoid decompressing data we don't need. 
The user has control over the block size when compressing the file by using the -B argument to sgzip. Smaller blocks mean a faster sleuthkit, but less efficient compression. For example, I took the honeypot.hda8.dd.sgz produced above (btw, this is a disk from the honeynet forensic challenge) and did some benchmarking: Timing the uncompressed version (filesize=272,465,920=272MB): bash$ time fls -r -f linux-ext2 honeypot.hda8.dd > /dev/null real 0m0.050s user 0m0.010s sys 0m0.040s Hardly any time at all... Now I compressed the file with a blocksize of 100kb using -B 100000 as the argument to sgzip (filesize=25,237,474=25MB): bash$ time fls -r -i sgzip -f linux-ext2 honeypot.hda8.dd.sgz > /dev/null real 0m12.600s user 0m9.590s sys 0m2.560s And with the default blocksize (which is 512kb) (filesize=25,008,704=25MB): real 1m32.119s user 1m10.190s sys 0m12.870s Just for comparison, the size of a normal gzip file is 24,968,226 bytes. So the sgzip file with a 100kb blocksize is 1.07% bigger, and the speed is still acceptable and about 8-10 times faster than with the default 512kb blocksize. But this is probably because fls reads lots of very small runs of data scattered all over the whole disk. Just to be ridiculous, I repeated the test with a block size of 10kb: real 0m1.421s user 0m1.310s sys 0m0.000s However, the filesize has now increased dramatically to 27,435,391 bytes (27MB), which is about 10% bigger than pure gzip. That may be OK in some circumstances, but remember that this particular hdd is mostly empty, so it compresses very well. I expect the size expansion to be more noticeable on a fuller disk. Future developments: - I am planning on writing a Python module for sgzip (probably just using SWIG to access the C library). If people are interested I might look at how you make Perl modules from SWIG. (I don't do much Perl nowadays, since flag has been rewritten in Python.) |
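For readers who have not opened sgzlib.h, the seek strategy being benchmarked here can be reduced to a little index arithmetic: the image is cut into fixed-size blocks, each block is compressed on its own, and a read only has to inflate the blocks it overlaps. The sketch below models just that arithmetic under assumed names; the real on-disk format and API live in sgzlib.h.

#include <stdio.h>
#include <sys/types.h>

/* Simplified model of a seekable-gzip index; names are illustrative,
 * not the sgzlib API.  Each 'blocksize' bytes of the uncompressed image
 * are deflated independently, and the index records where each
 * compressed block starts in the .sgz file. */
struct sgz_index {
    size_t  blocksize;            /* uncompressed bytes per block (-B)   */
    size_t  num_blocks;
    off_t  *block_offset;         /* compressed offset of block i        */
};

/* A random read at 'offset' only needs the blocks it overlaps, so the
 * cost of a small read is one block, not the whole image.  A smaller
 * blocksize therefore means less wasted decompression per seek, at the
 * price of a slightly larger file. */
void sgz_blocks_for_read(const struct sgz_index *idx, off_t offset,
                         size_t len, size_t *first, size_t *last)
{
    *first = offset / idx->blocksize;
    *last  = (offset + len - 1) / idx->blocksize;
}

int main(void)
{
    /* roughly the 272MB honeypot image with 100kb blocks */
    struct sgz_index idx = { 100000, 2725, NULL };
    size_t first, last;

    sgz_blocks_for_read(&idx, 52428800, 1024, &first, &last);
    printf("a 1kb read at 50MB touches blocks %zu..%zu\n", first, last);
    return 0;
}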
From: Michael C. <mic...@ne...> - 2004-02-07 13:33:25
|
Hi Brian, > Ok. I was under the impression that you wanted to have a configuration > file for any of the more complex subsystems and therefore the options > could be specified there. The offset is the only variable in the > process that may change between executions (i.e. accessing a different > partition) and doesn't make sense to be in a config file. Unless the > config file allowed you to assign names to offsets. For example, > assign the name 'part1' to sector offset 63 and then you could use 'fls > -f ntfs -o part1 image.dd'. Although i would normally agree with you that configuration files can contain much more information and therefore most parameters should go there, in this case it would be a pain to have all these options passed in through a configuration file. This is primarily because fstools are most often used as a backend to larger programs (like flag and autopsy), and so its much easier to shell out to them even if they have 20 command line arguements, than have to write a config file and then invoke them with it. > That is why I was assuming that a configuration file would be used for > complex situations. What does the blocksize value do for a split > image? It seems that only the RAID / VM configurations need complex > options. The split mode (or EnCase if that happens in the future) can > be done w/out options. To keep the system truely generic its useful to allow arbitrary options to be passed to the subsystem, rather than trying to incorporate within the current framework, its difficult to imagine what parameters will be required for some arbitrary subsystem someone might come up with in the future. > I would rather force complex configurations to configuration files. > The command line options for the sleuth kit are already too numerous > and it will make using Autopsy easier if the config file can be > referenced instead of having to load up the command line every time. This makes sense if the config file was the same (not changing) for a single case for example, or a set of images. So the config file only gets written once and then continually referenced from invokation to invokation, rather than have to rewrite the file each time. > Oh. I was thinking that the configuration file would have an entry > that identified which IO subsystem to use. For example, a line that > says: > > image_format = "split" > or > image_format = "lvm-splice" There might be a misunderstanding with the raid stuff. What i am writing is a generic raid subsystem that reassembles any raid 5 implementation without having a clue about its type/brand etc. The only reason it prob should have a config file is because its so huge. (the raid map is big). > Oh ok. I think that it will be very hard to create such a > configuration file. To create the file, you will need to know which VM > / RAID system is being used. I think it would be much easier to have a > subsystem for each VM / RAID type and then the only thing that needs to > be specified in the configuration file is the options for that type. > For example, if the Linux LVM were used, then you may need to only > specify the disk ordering and the block size. When reading from the > image, the lvm-split-read() function would be used. The raid map is easily assembled using the gui in flag. This is not a problem. If the investigator knows which raid implementation it is, they can lookup a library of such maps and use those, but the io subsystem needs to handle arbitrary reassembly maps. 
Maybe we can implement such a library in fs_io, so that you could invoke the same raid subsystem, and pass it a parameter say "type" naming which raid type it should use and it can look up the map by itself? > I have no problems changing all of the files. If we are going to add > this functionality, I would rather do it right the first time. Great. What you propose is a much better way. We just need to resolve the option issue because these options need to be present inside such an IO_INFO struct that will be passed around the place. Since these options can be arbitrary and they only make sense for the subsystem itself, what do you think about my original suggestion of having a single void *data field in the IO_INFO struct and letting each subsystem put they stuff in there? > Actually, I guess we just need one io_open() function because fls.c and > similar files will not know if the file is a config file or an image > file. io_open would have a char ** to list the image files or config > file, a type field for the type of image format, and an offset value. > It would then fill in the IO_INFO structure and return it, which would > be passed to fs_open(). Maybe it would make more sense to populate the IO_INFO structure inside the FS_INFO structure? so instead of having a fs->fd, we have a fs->io_info. Makes more sense since with io_subsystems, the FS_INFO structure doesnt really deal with file descriptors anyway (the filesystem code never directly interacts with fds). So in the filesystem code you would have: if ((fs->io_info = io_open(name, O_RDONLY)) < 0) ... and then, subsequently: fs_read_random(fs->io_info,(char *)ext2fs->fs,len,EXT2FS_SBOFF,"Checking for EXT2FS"); .... Next we just need to agree on what to put in struct IO_INFO. > Compression would be a major pain. Split, EnCase, and some of the RAID > systems seem much easier. Please see the patch i just submitted re compression. Do you have any idea how you would read in encase files? I didnt get the chance to ever use it so i dont know how complex the file format is but there is nothing i can find on the net re the format. > thanks, > brian Thanks, Michael. |
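A compact sketch of the void *data idea proposed above, under assumed names: the generic IO_INFO carries only function pointers plus an opaque pointer, and each subsystem interprets that pointer however it likes. None of these identifiers are from the actual patch.

#include <sys/types.h>

/* Sketch of the void *data proposal; all names are assumptions, not
 * committed Sleuth Kit interfaces. */
typedef struct IO_INFO IO_INFO;
struct IO_INFO {
    const char *name;                  /* "standard", "advanced", "sgzip", ... */
    ssize_t (*read_random)(IO_INFO *, char *, size_t, off_t);
    int     (*close)(IO_INFO *);
    void    *data;                     /* subsystem-private options and state  */
};

/* What the "advanced" subsystem might hang off the data pointer: */
struct adv_data {
    off_t  offset;                     /* offset= option, in bytes        */
    int    num_files;                  /* number of file= parts, in order */
    int   *fds;                        /* one open descriptor per part    */
};

ssize_t adv_read_random(IO_INFO *io, char *buf, size_t len, off_t off)
{
    struct adv_data *d = io->data;     /* only this subsystem knows the layout */
    /* ... find the part containing (d->offset + off) and read from it ... */
    (void)d; (void)buf; (void)off;
    return (ssize_t)len;
}

The trade-off being debated is visible here: the opaque pointer keeps fs_open() and the fstools ignorant of subsystem options, at the cost of the core never being able to inspect or validate them.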
From: Brian C. <ca...@sl...> - 2004-02-04 15:14:30
|
Wow again! All of these projects that I have been thinking about doing are getting done! Thanks. As an FYI, after autopsy gets its redesign finished, I had been meaning to re-examine The Sleuth Kit. One of the things that I wanted to change was the output of tools such as 'ils' and 'fls' so that they could be more useful and more easily processed. Much of the output is still legacy from the TCT design. For example, I'm not sure if I have ever used the default output of 'ils'. So, the results from this work will be useful when figuring out the best format options and what the important data in the output is. I'll add pointers to the archive with this patch and the IO subsystem patch from the downloads page. thanks, brian On Feb 4, 2004, at 6:02 AM, Dave wrote: > Hi all, > Attached is a patch to sleuthkit to output sleuthkit filesystem data as > SQL statements for entry into a database. > > Background: > Sleuthkit fstools output is not easily machine-readable, and as such > not well suited for use by front-end gui applications. A better > approach > is to analyse the filesystem in one pass and store all the filesystem > data (about files, inodes, blocks etc) in a database system for the gui > analysis program to query at will. > |
From: Brian C. <ca...@sl...> - 2004-02-04 15:03:20
|
>> My original plan was to use the '-o' flag to specify the sector offset >> for the file system. I figured sectors would be easier than bytes >> because mmls and fdisk give you the values in sectors and almost every >> disk uses a 512-byte sector. This also allows people to use the >> offset >> value without the '-i' setting. > > Great idea. Sectors would be much more useful than straight bytes. The > idea is > that each subsystem may choose to implement its logical-physical > mapping > however makes sense for it. And therefore would need different > parameters > most conveniently denoted by name. So rather than waste a whole > option -o on > just an offset, maybe we could use -o to specify a number of subsystem > dependant options. Ok. I was under the impression that you wanted to have a configuration file for any of the more complex subsystems and therefore the options could be specified there. The offset is the only variable in the process that may change between executions (i.e. accessing a different partition) and doesn't make sense to be in a config file. Unless the config file allowed you to assign names to offsets. For example, assign the name 'part1' to sector offset 63 and then you could use 'fls -f ntfs -o part1 image.dd'. >> >> I would actually say that '-i' should only have the type and no other >> options. If multiple files are needed (splitting and RAID), then they >> should be appended to the end of the command. For example, to look at >> the file system at offset sector 12345, the following could be used >> (names are made up): > > That is indeed a good suggestion. It needs more careful manipulation > of the > getopts in the client program but it should work. The only trouble is > that > the parameters to the subsystem can be arbitrary- subsystem specific > ones, so > for example maybe for split image: > > fls -f linux-ext3 -i split -o offset=12345 blocksize=512 file1.dd > file2.dd > > and just in case you wanted to have a file called offset or blocksize, > you > could use a qualifier called file= in front of it like: > > fls -f linux-ext3 -i split -o offset=12345 blocksize=512 file1.dd > file=offset > > but without a qualifier, its just interpreted as a filename. Similarly > for the > truely lazy user if the subsystem specific option parser sees an option > consisting just a number, it takes that as the offset, then you dont > need to > qulify offset by using a keywork. That is why I was assuming that a configuration file would be used for complex situations. What does the blocksize value do for a split image? It seems that only the RAID / VM configurations need complex options. The split mode (or EnCase if that happens in the future) can be done w/out options. I would rather force complex configurations to configuration files. The command line options for the sleuth kit are already too numerous and it will make using Autopsy easier if the config file can be referenced instead of having to load up the command line every time. > >> It would also be useful if the config file format that you are >> developing for the RAID images could be used for the split images. > > It can, but the algorithm for the raid reconstruction is more complex, > and > performance would suffer if the same subsystem was used all around. The > format (not finalised yet...) is something like: Oh. I was thinking that the configuration file would have an entry that identified which IO subsystem to use. For example, a line that says: image_format = "split" or image_format = "lvm-splice" > one per line. 
A slot is the logical position within the raid period > where the > block should be taken from. example: > > 1,1 > 2,1 > 1,2 > 2,2 [....] > I guess the file may not be that human readable, because we use flag to > generate it automatically. I really didnt want to have to use more > advanced > lex/yacc for this. What do you think? Oh ok. I think that it will be very hard to create such a configuration file. To create the file, you will need to know which VM / RAID system is being used. I think it would be much easier to have a subsystem for each VM / RAID type and then the only thing that needs to be specified in the configuration file is the options for that type. For example, if the Linux LVM were used, then you may need to only specify the disk ordering and the block size. When reading from the image, the lvm-split-read() function would be used. >> To keep the subsystem design similar to what currently exists, have >> you >> thought about the following: >> >> A new data structure IO_INFO and before fs_open is run, the io_open() >> function is run with either the image lists or the config file etc and >> the offset. There would probably have to be one for >> io_open_files(char >> **) and io_open_config(char *). >> >> The IO_INFO structure is filed in with io_open and the needed read >> functions are mapped (like file_walk etc are now in FS_INFO). >> >> The fs_open() function gets the IO_INFO structure passed to it and the >> fs_open() no longer needs to do the open() system call on the images. >> It just checks the magic value and fills in FS_INFO. Any >> read_random() function in the file system code turns into >> fs_info->io->read_random(...). > > This is an alternative design - the advantage with your method is that > you > could potentially have a number of different subsystems in use at the > same > time in the same program, while my subsystem design keeps subsystem > data as > static so its program wide. I just didnt really want to change all the > read_random functions throughout the code (it would mean bigger > changes in > the architecture because almost every file will be touched many > times.). I have no problems changing all of the files. If we are going to add this functionality, I would rather do it right the first time. > I still think that it would be more useful to allow each subsystem to > manage > its own options, rather than trying to second guess all the options in > advance and stick them into the io_info struct. So for example rather > than > have the io_info struct have one entry for io_open_files(char **) and > io_open_config(char *), maybe we can just have an entry for void > *data, and a > single io_open(void *data), and allow the subsystem to set that to > whatever > configuration parameters make sense for it - the single file option > might > attach a char * in the data pointer, while the multifile stuff might > attach a > char **. The raid subsystem might attach a preparse linked list of its > raid > map so it can work off that. whatever makes sense. Actually, I guess we just need one io_open() function because fls.c and similar files will not know if the file is a config file or an image file. io_open would have a char ** to list the image files or config file, a type field for the type of image format, and an offset value. It would then fill in the IO_INFO structure and return it, which would be passed to fs_open(). 
> A couple more types of IO subsystem I just thought of are an EnCase > file format subsystem (allows you to read standard EnCase files with the sk) > and a compressed file subsystem (allows working directly off compressed > files). I have no idea how difficult it would be to actually implement those, > but they look promising. Compression would be a major pain. Split, EnCase, and some of the RAID systems seem much easier. thanks, brian |
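As an illustration of the reassembly map quoted earlier in this message (one pair per line, one number naming the disk and the other the slot within the RAID period), parsing such a file takes only a few lines. The sketch below assumes a disk,slot ordering and a file name of raid.map, neither of which is specified in the thread.

#include <stdio.h>

/* Reads "a,b" pairs, one per line, as in the map quoted above.  The
 * thread does not spell out which field is the disk and which the slot,
 * so this sketch assumes disk,slot; the file name raid.map is also just
 * an assumption.  A real subsystem would build its reassembly table
 * from these pairs instead of printing them. */
int main(void)
{
    FILE *map = fopen("raid.map", "r");
    int disk, slot, logical = 0;

    if (map == NULL) {
        perror("raid.map");
        return 1;
    }
    while (fscanf(map, "%d,%d", &disk, &slot) == 2) {
        printf("logical block %d of each period comes from disk %d, slot %d\n",
               logical++, disk, slot);
    }
    fclose(map);
    return 0;
}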
From: Dave <jg...@da...> - 2004-02-04 11:02:43
|
Hi all, Attached is a patch to sleuthkit to output sleuthkit filesystem data as SQL statements for entry into a database. Background: Sleuthkit fstools output are not easily machine-readable, and as such not well suited for use by front-end gui applications. A better approach is to analyse the filesystem in one pass and store all the filesystem data (about files, inodes, blocks etc) in a database system for the gui analysis program to query at will. Solution: The attached patch creates a new executable (dbtool) which basically performs the same tasks as fsstat, fls (with -r) and istat(for each inode). For those familiar with the code, it loads an image, prints the data found in FS_INFO, performs a "dent_walk", then performs an "inode_walk" and for each inode performs a "file_walk". At each stage the callback prints SQL statements which populate a database. Once the program is run, the database contains all the data about the filesystem and SQL queries can then be constructed by the frontend program to perform tasks such as timelining etc. The patch consists of two new files, dbtool.c and Makefile.dbtool, and a very small patch to fatfs_dent.c to make it print long directory names rather than short ones. dbtool can be compiled as follows: cd sleuthkit-1.67 patch -p1 < ../sleuthkit-dbtool make cd src/fstools make -f Makefile.dbtool An example of dbtool usage and output is also attached. I have also created a python module for accessing the data stored in the database which provides a "file-like" interface to perform 'open', 'read' and 'seek' operations on files within the dd image. This is how our forensics application (flag) accesses the data in the database. I can post this if anyone is interested. There will be a new release of "flag" in the very near future which incorporates this work. I welcome your comments and suggestions, Thanks, David Collett |
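To make the dent_walk approach concrete: the callbacks dbtool registers simply print INSERT statements instead of the human-readable lines fls prints. The stripped-down sketch below shows that pattern with an invented struct, table, and column layout; it is not the schema or code from the attached patch.

#include <stdio.h>

/* Stand-in for a directory-entry walk callback: the real dbtool callback
 * receives FS_INFO and an FS_DENT from dent_walk; the struct, table and
 * column names below are invented for illustration only. */
struct example_dent {
    unsigned long inode;
    const char   *path;
    const char   *name;
};

unsigned char print_dent_sql(const struct example_dent *d)
{
    /* Emit an INSERT instead of the human-readable line fls would print,
     * so a front end can query the data later (timelines, etc.). */
    printf("INSERT INTO file (inode, path, name) VALUES (%lu, '%s', '%s');\n",
           d->inode, d->path, d->name);
    return 0;                      /* "continue walking" in the real API */
}

int main(void)
{
    struct example_dent d = { 16, "/boot/", "vmlinuz" };
    print_dent_sql(&d);
    return 0;
}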
From: Brian C. <ca...@sl...> - 2004-02-04 05:50:45
|
Wow! This looks great! I have been meaning to incorporate the offset option for quite a while, but this is much more involved. I don't have time to look at the code in detail right now, but I have some comments from your email and a quick skim of the code. My original plan was to use the '-o' flag to specify the sector offset for the file system. I figured sectors would be easier than bytes because mmls and fdisk give you the values in sectors and almost every disk uses a 512-byte sector. This also allows people to use the offset value without the '-i' setting. I like the idea of the '-i' because it is like specifying the image type, whereas -f is specifying the file system type. I hadn't thought about getting this advanced, but it looks good. I would actually say that '-i' should only have the type and no other options. If multiple files are needed (splitting and RAID), then they should be appended to the end of the command. For example, to look at the file system at offset sector 12345, the following could be used (names are made up): Normal full image: fls -f linux-ext3 -o 12345 file1.dd or fls -f linux-ext3 -i single -o 12345 file1.dd Split Image: fls -f linux-ext3 -i split -o 12345 file1.dd file2.dd LVM RAID Image: fls -f linux-ext3 -i lvm -o 12345 lvm-config.dat MS LDM Spanning Image fls -f ntfs -i ldm-span -o 12345 ldm-config.dat It would also be useful if the config file format that you are developing for the RAID images could be used for the split images. To keep the subsystem design similar to what currently exists, have you thought about the following: A new data structure IO_INFO and before fs_open is run, the io_open() function is run with either the image lists or the config file etc and the offset. There would probably have to be one for io_open_files(char **) and io_open_config(char *). The IO_INFO structure is filed in with io_open and the needed read functions are mapped (like file_walk etc are now in FS_INFO). The fs_open() function gets the IO_INFO structure passed to it and the fs_open() no longer needs to do the open() system call on the images. It just checks the magic value and fills in FS_INFO. Any read_random() function in the file system code turns into fs_info->io->read_random(...). This looks great! brian On Feb 3, 2004, at 8:47 AM, Michael Cohen wrote: > Dear List, > Please accept this patch to the sleuthkit to implement a pluggable > IO > subsystem for the fstools. (patch against 1.67, fstools directory). > > Background > Quite often users are supplied with dd images that do not > immediately work > with sleuthkit. Two notable examples are: > - when a dd image was taken of the hdd - in this case users have to > use > sfdisk to work out the partition offsets and then use dd with > appropriate > skip parameters to extract each partition, before being able to use the > sleuthkit. This is because the sk expects to have a dd image of a > partition > (i.e. filesystem starts at offset 0 in the image file. This is not > always the > case). > - Sometimes images are split into smaller sizes for example in order > to burn > to cd/dvd etc. This means that images need to be stuck together before > analysis potentially wasting time and space. > > It would be nice if one could use the images directly - without > needing to > do creative dd manipulations. > > Solution > This patch implements a modular io subsystem approach - all > filesystem > operations within the sk are made to use this subsystem, and the user > can > choose the subsystem they want. 
The subsystem is responsible to > seeking into > the file and extracting data out of the dd image - how that is > implemented is > completely abstracted from the point of view of the fstools. > > The user can choose the subsystem to be used by the -i (io subsystem) > command line switch. Then a list of arguments can be passed to the > subsystem > to initialise it correctly. Once that is done, the regular sk calls > can be > made (e.g. fs_open etc). The io subsystem will take care of the > specifics of > implementation. > > This patch includes 2 subsystem modules: simple and advanced. The > simple > module is exactly the same as the old sk, while the advanced module > allows > for specifying offsets into the dd file, as well as multiple dd files > in > sequence. > > Example: > As an example the fls and icat tools were modified to support the new > sub > system, more tools will be converted tomorrow once i get some sleep. > Example > of how to seek into a partition within a disk dd: > > fls -i advanced -o offset=524288 -f linux-ext2 test.dd > > This selects the advanced io subsystem and passes it the offset option > specifying 1024 blocks of 512 bytes. > > Now we can split the dd image across multiple files (maybe using the > split > utility), and still analyse them at once: > > fls -i advanced -o offset=524288,file=xaa,file=xab,file=xac,file=xad > -f > linux-ext2 xae > > Note that xae (the last part of the image will be appened to the list > of > parts automatically). Also note that all the options in -o are passed > as one > parameter to the subsystem which then parses them into the relevant > arguements. > > If the subsystems name is not found, the subsystem will list all > known > subsystems: > > bash# fls -i help -f linux-ext2 test.dd > > Available Subsystems: > > standard - Standard Sleuthkit IO Subsystem > advanced - Advanced Sleuthkit IO Subsystem > fls: Could not set io subsystem help > > To get more help about the options available, try setting an option > which > is not supported: > > bash# fls -i advanced -o help -f linux-ext2 test.dd > > option help not recognised > > Advanced io subsystem options > > offset=bytes Number of bytes to seek to in the > image file. > Useful if there is some extra data at the start of the dd image (e.g. > partition table/other partitions > file=filename Filename to use for split files. If > your dd > image is split across many files, specify this parameter in the order > required as many times as needed for seemless integration > > Future work: > I am in the process of implementing a raid reassembly functionality. > I.e. > given a raid reconstruction map (a file telling sk the order in which > raid > blocks go together) and a list of dd images of individual drives, the > io > subsystem will transparently reassemble the logical data. I have a > working > prototype so i know its possible. The abstracted io subsystem concept > will be > very handy for that. > <fstools_diff> |
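For anyone experimenting with the patch, the -o string for the advanced subsystem is just a comma-separated list of key=value pairs, so a parser for it can be as small as the sketch below. The behaviour shown follows the description above; the patch's own parser may differ.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parses an option string of the shape used by the advanced subsystem,
 * e.g. "offset=524288,file=xaa,file=xab,file=xac".  Modelled on the
 * description in the announcement above, not copied from the patch. */
int main(void)
{
    char opts[] = "offset=524288,file=xaa,file=xab,file=xac";
    long offset = 0;
    char *tok;

    for (tok = strtok(opts, ","); tok != NULL; tok = strtok(NULL, ",")) {
        if (strncmp(tok, "offset=", 7) == 0)
            offset = strtol(tok + 7, NULL, 10);
        else if (strncmp(tok, "file=", 5) == 0)
            printf("image part: %s\n", tok + 5);
        else
            fprintf(stderr, "option %s not recognised\n", tok);
    }
    printf("seek %ld bytes into the reassembled image before fs_open\n", offset);
    return 0;
}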