Re: [sleuthkit-users] Slow Add Image Process Cause
From: Luís F. N. <lfc...@gm...> - 2014-05-07 12:24:10
I have done one last test, because it was very strange to me that indexing
meta_addr and fs_obj_id had not improved the parent_id lookup. We suspected
that the indexes were not being used by sqlite, maybe because the data is not
committed before the add image process finishes (I am not a sqlite expert; is
that possible?). So we inserted a commit for every 5,000 files added to the
database. The add image process time decreased from 1 hour to 30 min, so we
think that the indexes were indeed not being used before. Why does the add
image process not commit the data while it is being added to the database?

Nassif

2014-05-02 13:37 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> Fixing my last email: the test was run with the indexes AND Brian's fix.
> Then I removed the index patch and loadDb took the same 1 hour to finish
> with only Brian's fix. So the index patch did not help the database
> look up for parent_id.
>
> Sorry for the mistake,
> Nassif
>
> 2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>
>> I tested loadDb with a patch that creates indexes on meta_addr and
>> fs_obj_id. The image with 433,321 files, previously taking 2h45min to
>> load, now takes 1 hour to finish loadDb with the indexes. That is a good
>> speed up, but with the database parent_id look up completely disabled it
>> only takes 7 min. Is there anything else we can do to improve the
>> parent_id database look up?
>>
>> Regards,
>> Nassif
>>
>> 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>>
>>> Ok, tested on 2 images. The fix resolved a lot of misses:
>>>
>>> ntfs image w/ 127,408 files: from 19,558 to 6,511 misses
>>> ntfs image w/ 433,321 files: from 182,256 to 19,908 misses
>>>
>>> I also think creating an index on tsk_files(meta_addr) and
>>> tsk_files(fs_obj_id) could help the database look up for those deleted
>>> files not found in the local cache. What do you think? The database
>>> look up seems too slow, as described in my first email.
>>>
>>> Thank you for taking a look so quickly.
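The batched-commit pattern described above (committing every 5,000 rows
instead of once at the end of the whole run) can be sketched as follows.
This is an illustrative Python sqlite3 sketch with a made-up table, not the
actual TSK loadDb code; the table name, schema, and row count are invented
for the demo.

```python
import sqlite3

# Illustrative sketch: commit in batches of 5,000 rows rather than holding
# one transaction open for the entire add-image run. Table and schema are
# hypothetical, not the real TSK database schema.
BATCH = 5000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (obj_id INTEGER PRIMARY KEY, meta_addr INTEGER)")

pending = 0
for i in range(12000):                  # stand-in for files found in the image
    conn.execute("INSERT INTO files (meta_addr) VALUES (?)", (i,))
    pending += 1
    if pending >= BATCH:
        conn.commit()                   # flush this batch to the database
        pending = 0
conn.commit()                           # flush the final partial batch

print(conn.execute("SELECT COUNT(*) FROM files").fetchone()[0])  # 12000
```

The trade-off is the usual one: each commit costs a sync, but committed data
becomes visible to subsequent reads, which is what the thread suspects the
index lookups needed.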
>>> Nassif
>>>
>>> 2014-05-01 23:47 GMT-03:00 Brian Carrier <ca...@sl...>:
>>>
>>>> Well, that was an easy and embarrassing fix:
>>>>
>>>>     if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) {
>>>> -       seq = fs_file->name->meta_seq;
>>>> +       seq = fs_file->name->par_seq;
>>>>     }
>>>>
>>>> Turns out we've been having a lot of cache misses because of this
>>>> stupid bug. Can you replace that line and see if it helps? It
>>>> certainly did on my test image.
>>>>
>>>> thanks,
>>>> brian
>>>>
>>>> On May 1, 2014, at 10:24 PM, Brian Carrier <ca...@sl...> wrote:
>>>>
>>>> > Thanks for the tests. I wonder if it has to do with an incorrect
>>>> > sequence number. NTFS increments the sequence number each time a
>>>> > file is re-allocated. Deleted orphan files could be getting misses.
>>>> > I'll add some logging on my system and see what kind of misses I get.
>>>> >
>>>> > brian
>>>> >
>>>> > On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>> >
>>>> >> Ok, tests 1 and 3 done. I do not have the sleuthkit code inside an
>>>> >> IDE, so I did not use breakpoints. Instead, I changed
>>>> >> TskDbSqlite::findParObjId() to return the parent_meta_addr when it
>>>> >> is not found and to return 1 when it is found in the cache map.
>>>> >>
>>>> >> Performing queries on the generated sqlite, there were 19,558 cache
>>>> >> misses from an image with 3 ntfs partitions and 127,408 files. I
>>>> >> confirmed that many parent_meta_addr missed from the cache (now
>>>> >> stored in tsk_objects.par_obj_id) are present in
>>>> >> tsk_files.meta_addr. The complete paths corresponding to these
>>>> >> meta_addr are parents of the files whose processing did not find
>>>> >> them in the cache.
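The sequence-number bug Brian fixes above can be illustrated with a toy model
of the parent cache (hypothetical code; the real cache lives in TskDbSqlite
in db_sqlite.cpp). The cache is keyed by the parent's address plus a sequence
number, so querying it with the file's own sequence number (meta_seq) instead
of the parent's (par_seq) misses whenever the two differ. All names and
values below are invented for illustration.

```python
# Toy model of a parent-ID cache keyed by (parent meta address, sequence).
# Entries and addresses are made up; NTFS bumps a file's sequence number on
# each re-allocation, so a file's own seq and its parent's seq can differ.
parent_cache = {
    # (par_meta_addr, par_seq) -> parent object ID
    (5, 2): 101,
    (7, 1): 102,
}

def find_par_obj_id(par_addr, seq):
    """Return the cached parent object ID, or None on a cache miss
    (the real code falls back to a slow database query in that case)."""
    return parent_cache.get((par_addr, seq))

# A file in directory addr 5 (par_seq 2) whose own meta_seq happens to be 6:
par_addr, par_seq, meta_seq = 5, 2, 6

print(find_par_obj_id(par_addr, meta_seq))  # None: wrong key, cache miss
print(find_par_obj_id(par_addr, par_seq))   # 101: the fixed lookup hits
```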
>>>> >>
>>>> >> Other tests resulted in:
>>>> >> 182,256 cache misses from 433,321 files (ntfs)
>>>> >> 892,359 misses from 1,811,393 files (ntfs)
>>>> >> 169,819 misses from 3,177,917 files (hfs)
>>>> >>
>>>> >> Luis Nassif
>>>> >>
>>>> >> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>>>> >>
>>>> >> Forgot to mention: we are using sleuthkit 4.1.3
>>>> >>
>>>> >> On 01/05/2014 16:09, "Luís Filipe Nassif" <lfc...@gm...> wrote:
>>>> >>
>>>> >> Hi Brian,
>>>> >>
>>>> >> The 3 cases above were ntfs. I also tested with hfs and canceled
>>>> >> loaddb after 1 day. The modified version finished after 8 hours and
>>>> >> added about 3 million entries. We will try to do the tests you
>>>> >> suggested.
>>>> >>
>>>> >> On 01/05/2014 15:48, "Brian Carrier" <ca...@sl...> wrote:
>>>> >>
>>>> >> Hi Luis,
>>>> >>
>>>> >> What kind of file system was it? I fixed a bug a little while ago
>>>> >> in that code for HFS file systems that resulted in a lot of cache
>>>> >> misses.
>>>> >>
>>>> >> In theory, everything should be cached. It sounds like a bug if you
>>>> >> are getting so many misses. The basic idea of this code is that
>>>> >> everything in the DB gets assigned a unique object ID and we make
>>>> >> associations between files and their parent folder's unique ID.
>>>> >>
>>>> >> Since you seem to be comfortable with a debugger in the code, can
>>>> >> you set a breakpoint for when the miss happens and:
>>>> >>
>>>> >> 1) Determine the path of the file that was being added to the DB
>>>> >> and the parent address that was trying to be found.
>>>> >> 2) Use the 'ffind' TSK tool to map that parent address to a path.
>>>> >> Is it a subset of the path from #1?
>>>> >> 3) Open the DB in a SQLite tool and do something like this:
>>>> >>
>>>> >> SELECT * from tsk_files where meta_addr == META_ADDR_FROM_ABOVE
>>>> >>
>>>> >> Is it in the DB?
>>>> >>
>>>> >> Thanks!
>>>> >>
>>>> >> brian
>>>> >>
>>>> >> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>> >>
>>>> >>> Hi,
>>>> >>>
>>>> >>> We have investigated a bit why the add image process is so slow in
>>>> >>> some cases. The add image process time seems to be quadratic in
>>>> >>> the number of files in the image.
>>>> >>>
>>>> >>> We detected that the function TskDbSqlite::findParObjId(), in
>>>> >>> db_sqlite.cpp, is not finding the parent_meta_addr ->
>>>> >>> parent_file_id mapping in the local cache for a lot of files,
>>>> >>> causing it to search for the mapping in the database (not sure if
>>>> >>> it is a non-indexed search?)
>>>> >>>
>>>> >>> For testing purposes, we added a "return 1;" line right after the
>>>> >>> cache look up, disabling the database look up, and this resulted
>>>> >>> in great speed ups:
>>>> >>>
>>>> >>> number of files / default load_db time / patched load_db time
>>>> >>> ~80,000 / 20 min / 2 min
>>>> >>> ~300,000 / 3 h / 7 min
>>>> >>> ~700,000 / 48 h / 27 min
>>>> >>>
>>>> >>> We wonder if it is possible to store all par_meta_addr -> par_id
>>>> >>> mappings in the local cache (better), or to do an improved
>>>> >>> (indexed?) search for the mapping in the database. We think that
>>>> >>> someone with more knowledge of the load_db code could help a lot
>>>> >>> here.
>>>> >>>
>>>> ------------------------------------------------------------------------------
>>>> >>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>>>> >>> Instantly run your Selenium tests across 300+ browser/OS combos. Get
>>>> >>> unparalleled scalability from the best Selenium testing platform
>>>> >>> available. Simple to use. Nothing to install. Get started now for free."
>>>> >>> http://p.sf.net/sfu/SauceLabs
>>>> >>> _______________________________________________
>>>> >>> sleuthkit-users mailing list
>>>> >>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
>>>> >>> http://www.sleuthkit.org