Re: [sleuthkit-users] Slow Add Image Process Cause
From: Luís F. N. <lfc...@gm...> - 2014-05-02 16:38:04
Correcting my last email: the test was run with the indexes AND Brian's fix.
Then I removed the index patch, and loadDb took the same 1 hour to finish
with only Brian's fix. So the index patch did not help improve the database
look up for parent_id.
Sorry for the mistake,
Nassif
2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
> I tested loadDb with a patch that creates indexes on meta_addr and fs_obj_id. The
> image with 433.321 files, previously taking 2h45min to load, now takes 1h
> to finish loadDb with the indexes. That is a good speed up, but with the
> database parent_id look up completely disabled, it takes only 7min. Is
> there anything else we can do to improve the parent_id database look up?
>
> Regards,
> Nassif
>
>
> 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>
> Ok, tested on 2 images. The fix resolved a lot of misses:
>>
>> ntfs image w/ 127.408 files: from 19.558 to 6.511 misses
>> ntfs image w/ 433.321 files: from 182.256 to 19.908 misses
>>
>> I also think creating indexes on tsk_files(meta_addr) and
>> tsk_files(fs_obj_id) could help improve the database look up for those
>> deleted files not found in the local cache. What do you think? The database
>> look up seems too slow, as described in my first email.
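For reference, the proposed indexes can be sketched against a simplified stand-in for the tsk_files schema (table and column names as used in this thread; the real TSK schema has more columns), e.g. with Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in for the real tsk_files schema.
con.execute("CREATE TABLE tsk_files (obj_id INTEGER, fs_obj_id INTEGER,"
            " meta_addr INTEGER, name TEXT)")

# The indexes proposed above; without them, each cache miss costs a
# full scan of tsk_files.
con.execute("CREATE INDEX idx_meta_addr ON tsk_files(meta_addr)")
con.execute("CREATE INDEX idx_fs_obj_id ON tsk_files(fs_obj_id)")

# EXPLAIN QUERY PLAN shows whether the parent look up can use an index.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT obj_id FROM tsk_files"
    " WHERE meta_addr = ? AND fs_obj_id = ?", (5, 1)).fetchall()
print(plan)  # the plan detail should say SEARCH ... USING INDEX ...
```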
>>
>> Thank you for taking a look so quickly.
>> Nassif
>>
>>
>> 2014-05-01 23:47 GMT-03:00 Brian Carrier <ca...@sl...>:
>>
>> Well that was an easy and embarrassing fix:
>>>
>>> if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) {
>>> -    seq = fs_file->name->meta_seq;
>>> +    seq = fs_file->name->par_seq;
>>> }
>>>
>>> It turns out we've been having a lot of cache misses because of this
>>> stupid bug. Can you replace that line and see if it helps? It certainly
>>> did on my test image.
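A toy model of the cache (illustrative names, not the real TskDbSqlite code) shows why the one-character change matters: the cache is keyed on the parent's address and sequence number, so looking it up with the file's own meta_seq misses whenever the two sequence numbers differ:

```python
# Toy model of the parent-ID cache (names are illustrative, not the real TSK API).
cache = {}

def add_dir(par_addr, par_seq, obj_id):
    cache[(par_addr, par_seq)] = obj_id

def find_par_obj_id(par_addr, seq_used):
    # Returns None on a cache miss (the slow database path in loadDb).
    return cache.get((par_addr, seq_used))

add_dir(par_addr=5, par_seq=3, obj_id=42)   # parent dir cached under sequence 3

# A child file whose own meta_seq is 7, while its parent's par_seq is 3:
meta_seq, par_seq = 7, 3

buggy = find_par_obj_id(5, meta_seq)   # old code: keyed on the file's own sequence
fixed = find_par_obj_id(5, par_seq)    # fixed code: keyed on the parent's sequence
print(buggy, fixed)  # None 42 -> the buggy key misses, the fixed key hits
```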
>>>
>>> thanks,
>>> brian
>>>
>>>
>>> On May 1, 2014, at 10:24 PM, Brian Carrier <ca...@sl...>
>>> wrote:
>>>
>>> > Thanks for the tests. I wonder if it has to do with an incorrect
>>> sequence number. NTFS increments the sequence number each time a file is
>>> re-allocated. Deleted orphan files could be getting misses. I'll add some
>>> logging on my system and see what kind of misses I get.
>>> >
>>> > brian
>>> >
>>> > On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <lfc...@gm...>
>>> wrote:
>>> >
>>> >> Ok, tests 1 and 3 done. I do not have the sleuthkit code in an IDE,
>>> so I did not use breakpoints. Instead, I changed TskDbSqlite::findParObjId()
>>> to return the parent_meta_addr when it is not found and to return 1 when it is
>>> found in the cache map.
>>> >>
>>> >> Running queries on the generated sqlite, there were 19.558 cache
>>> misses on an image with 3 ntfs partitions and 127.408 files. I confirmed
>>> that many of the parent_meta_addr missed from the cache (now stored in
>>> tsk_objects.par_obj_id) are present in tsk_files.meta_addr. The complete paths
>>> corresponding to these meta_addr are parents of the files whose
>>> processing did not find them in the cache.
>>> >>
>>> >> Other tests resulted in:
>>> >> 182.256 cache misses from 433.321 files (ntfs)
>>> >> 892.359 misses from 1.811.393 files (ntfs)
>>> >> 169.819 misses from 3.177.917 files (hfs)
>>> >>
>>> >> Luis Nassif
>>> >>
>>> >>
>>> >>
>>> >> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>>> >> Forgot to mention: we are using sleuthkit 4.1.3
>>> >>
>>> >> On 01/05/2014 16:09, "Luís Filipe Nassif" <lfc...@gm...>
>>> wrote:
>>> >>
>>> >> Hi Brian,
>>> >>
>>> >> The 3 cases above were ntfs. I also tested with hfs and canceled
>>> loaddb after 1 day. The modified version finished after 8 hours and added
>>> about 3 million entries. We will try to run the tests you have suggested.
>>> >>
>>> >> On 01/05/2014 15:48, "Brian Carrier" <ca...@sl...>
>>> wrote:
>>> >> Hi Luis,
>>> >>
>>> >> What kind of file system was it? I fixed a bug a little while ago in
>>> that code for HFS file systems that resulted in a lot of cache misses.
>>> >>
>>> >> In theory, everything should be cached. It sounds like a bug if you
>>> are getting so many misses. The basic idea of this code is that everything
>>> in the DB gets assigned a unique object ID and we make associations between
>>> files and their parent folder's unique ID.
>>> >>
>>> >> Since you seem to be comfortable running the code under a debugger, can you
>>> set a breakpoint for when the miss happens and:
>>> >> 1) Determine the path of the file that was being added to the DB and
>>> the parent address that it was trying to find.
>>> >> 2) Use the 'ffind' TSK tool to then map that parent address to a
>>> path. Is it a subset of the path from #1?
>>> >> 3) Open the DB in a SQLite tool and do something like this:
>>> >>
>>> >> SELECT * from tsk_files where meta_addr == META_ADDR_FROM_ABOVE
>>> >>
>>> >> Is it in the DB?
>>> >>
>>> >> Thanks!
>>> >>
>>> >> brian
>>> >>
>>> >>
>>> >> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif <lfc...@gm...>
>>> wrote:
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>> We have investigated a bit why the add image process is so slow in
>>> some cases. The add image process time seems to be quadratic in the
>>> number of files in the image.
>>> >>>
>>> >>> We detected that the function TskDbSqlite::findParObjId(), in
>>> db_sqlite.cpp, is not finding the parent_meta_addr -> parent_file_id
>>> mapping in the local cache for a lot of files, causing it to search for the
>>> mapping in the database (we are not sure whether that is a non-indexed search).
>>> >>>
>>> >>> For testing purposes, we added a "return 1;" line right after the
>>> cache look up, disabling the database look up, and this resulted in great
>>> speed ups:
>>> >>>
>>> >>> number of files / default load_db time / patched load_db time
>>> >>> ~80.000 / 20min / 2min
>>> >>> ~300.000 / 3h / 7min
>>> >>> ~700.000 / 48h / 27min
>>> >>>
>>> >>> We wonder if it is possible to store all par_meta_addr -> par_id
>>> mappings in the local cache (better) or to do an improved (indexed?) search
>>> for the mapping in the database. We think that someone with more knowledge
>>> of the load_db code could help a lot here.
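The first option, caching everything, amounts to building the whole mapping in one pass instead of issuing a SELECT per miss; a minimal sketch (simplified schema, illustrative data) with Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in for the real tsk_files schema.
con.execute("CREATE TABLE tsk_files (obj_id INTEGER, fs_obj_id INTEGER,"
            " meta_addr INTEGER)")
rows = [(obj_id, 1, 100 + obj_id) for obj_id in range(5)]
con.executemany("INSERT INTO tsk_files VALUES (?, ?, ?)", rows)

# One pass over the table builds the full (fs, meta_addr) -> obj_id map,
# replacing a per-file SELECT with a dictionary look up.
par_cache = {
    (fs_obj_id, meta_addr): obj_id
    for obj_id, fs_obj_id, meta_addr in con.execute(
        "SELECT obj_id, fs_obj_id, meta_addr FROM tsk_files")
}

print(par_cache[(1, 103)])  # 3
```

The trade-off is memory: the map holds one entry per file in the image, which for the multi-million-file images above may or may not be acceptable.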
>>> >>>
>>> >>> ------------------------------------------------------------------------------
>>> >>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>>> >>> Instantly run your Selenium tests across 300+ browser/OS combos. Get
>>> >>> unparalleled scalability from the best Selenium testing platform available.
>>> >>> Simple to use. Nothing to install. Get started now for free."
>>> >>> http://p.sf.net/sfu/SauceLabs
>>> >>> _______________________________________________
>>> >>> sleuthkit-users mailing list
>>> >>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
>>> >>> http://www.sleuthkit.org
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>>
>>>
>>
>