Re: [sleuthkit-users] Slow Add Image Process Cause
From: Brian C. <ca...@sl...> - 2014-05-14 19:23:31
Hi Guthrie,

As I just mentioned in the last e-mail on this topic, we weren't missing data. It's just that we had long cycle times to insert rows into the database, because the cache (which is supposed to improve performance) had some bugs, so the potential performance benefits were lost.

thanks,
brian

On May 14, 2014, at 2:16 PM, Guthrie Quill <qu...@po...> wrote:

> Brian, Jason et al,
>
> I have followed the thread about the unusually long process times
> (which I experienced myself).
> Can you give some guidance on whether TSK/Autopsy is reliable for case
> work? I am concerned about potentially missing large numbers of files.
> I am a big advocate for FOSS in general (and Autopsy in particular)
> in my shop. I want to be able to defend its efficacy.
>
> Sent from my iPhone
>
>> On May 13, 2014, at 18:08, Brian Carrier <ca...@sl...> wrote:
>>
>> Hi Luis,
>>
>> Yes, a couple of my regression tests found some strange results with that fix. I pushed a new one up that is less strict on the filtering of the deleted files. There is still one issue I need to work out, though.
>>
>> thanks,
>> brian
>>
>>> On May 13, 2014, at 5:37 PM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>
>>> Forgot to mention: loaddb is now adding a different number of entries compared to the previous version. Is that OK?
>>>
>>> Thank you,
>>> Nassif
>>>
>>> On 13/05/2014 18:19, "Luís Filipe Nassif" <lfc...@gm...> wrote:
>>>
>>> Excellent, Brian! All misses were resolved. Tested on the 3 NTFS images above and 1 HFS image with 3 million entries.
>>>
>>> Also, images that took hours or days to finish now take minutes with both patches!
>>>
>>> Thank you very much for addressing this issue.
>>> Nassif
>>>
>>> Nope, this one from earlier today:
>>>
>>> https://github.com/sleuthkit/sleuthkit/commit/f0672805c18a634ffffeff1f39f793501ddb7702
>>>
>>>> On May 12, 2014, at 3:23 PM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>>
>>>> Hi Brian,
>>>>
>>>> Do you mean this fix?
>>>> https://github.com/sleuthkit/sleuthkit/commit/7b257b6c8252f9e9a7202990710e3a0ef31bf6b7
>>>>
>>>> It resolved a lot of the misses, as I described before (05/02), but I am still getting thousands of misses. So I took a look at 2 generated SQLite databases and discovered that the remaining misses were from deleted folders. I think deleted folders are not being stored in the local cache...
>>>>
>>>> 2014-05-12 13:55 GMT-03:00 Brian Carrier <ca...@sl...>:
>>>>
>>>> Hi Luis,
>>>>
>>>> The develop branch on GitHub has a fix that removed the remaining misses on my test image, but I had far fewer than you did. Can you pull it and try that one?
>>>>
>>>> thanks,
>>>> brian
>>>>
>>>>> On May 8, 2014, at 9:57 AM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>>>
>>>>> Hi Brian and Simson,
>>>>>
>>>>> I have run a number of tests since yesterday. I did not restart the computer between the tests, because I think it is better to rely on the OS I/O cache and focus on CPU processing time, without I/O interference, except for the first run. I used another computer, faster than before.
>>>>> Results below:
>>>>>
>>>>> NTFS image w/ 127,408 files                        run 1      run 2      run 3      run 4
>>>>> no patch (only Brian's fix)                        3min 27s   3min 3s    3min 3s    3min 2s
>>>>> disabling database parent_id lookup                35s        11s        11s        12s
>>>>> index on meta_addr, fs_obj_id; commit per 5,000    4min 11s   3min 48s   3min 48s   3min 47s
>>>>>
>>>>> NTFS image w/ 216,693 files                        run 1      run 2      run 3      run 4
>>>>> no patch (only Brian's fix)                        5min 53s   4min 59s   4min 58s   5min 2s
>>>>> disabling database parent_id lookup                2min 8s    1min 21s   1min 21s   1min 21s
>>>>> index on meta_addr, fs_obj_id; commit per 5,000    6min 38s   5min 46s   5min 43s   5min 43s
>>>>>
>>>>> NTFS image w/ 433,321 files                        run 1      run 2      run 3      run 4
>>>>> no patch (only Brian's fix)                        21min 38s  21min 40s  21min 10s  21min 10s
>>>>> disabling database parent_id lookup                3min 59s   2min 47s   2min 47s   2min 47s
>>>>> index on meta_addr, fs_obj_id; commit per 5,000    (not run, based on previous results)
>>>>>
>>>>> So, Brian was right: the commits increased processing time. And as you can see, it would be great if we could eliminate the misses and remove the database parent_id lookup.
>>>>>
>>>>> With that in mind, I took a look at one of the SQLite databases, and I think I discovered the cause of a lot of the misses (maybe all of them). The misses happened with deleted files inside deleted folders. I think deleted folders may not be being stored in the local cache.
>>>>>
>>>>> Regards,
>>>>> Nassif
>>>>>
>>>>> 2014-05-07 10:53 GMT-03:00 Brian Carrier <ca...@sl...>:
>>>>>
>>>>> Hi Nassif,
>>>>>
>>>>> As Simson mentioned, the current setup was intended to be the fastest: frequent commits take longer, and more indexes make commits take longer. This is the only process that we know of that does this type of lookup and would use those indexes.
>>>>>
>>>>> The bigger question for me is why we are getting these cache misses, and I need to spend some more time with more images to find out. The lookup is to find the ID of the parent, and we process from the root directory down.
>>>>> So, in theory, we have already processed the parent folder before its children and it should be in the cache. We need to figure out why the parent isn't in the cache...
>>>>>
>>>>> brian
>>>>>
>>>>>> On May 7, 2014, at 8:24 AM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>>>>
>>>>>> I have done one last test, because it was very strange to me that indexing meta_addr and fs_obj_id did not improve the parent_id lookup. We suspected that the indexes were not being used by SQLite, maybe because none of the data is committed before the add image process finishes (I am not a SQLite expert; is that possible?). So we inserted a commit for each 5,000 files added to the database. The add image process time decreased from 1 hour to 30 min, so we think the indexes were not being used before.
>>>>>>
>>>>>> Why does the add image process not commit the data while it is being added to the database?
>>>>>>
>>>>>> Nassif
>>>>>>
>>>>>> 2014-05-02 13:37 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>>>>>>
>>>>>> Correcting my last email: that test was run with the indexes AND Brian's fix. I then removed the index patch, and loadDb took the same 1 hour to finish with only Brian's fix. So the index patch did not help improve the database lookup for parent_id.
>>>>>>
>>>>>> Sorry for the mistake,
>>>>>> Nassif
>>>>>>
>>>>>> 2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>>>>>>
>>>>>> I tested loadDb with a patch that creates indexes on meta_addr and fs_obj_id. The image with 433,321 files, previously taking 2h45min to load, now takes 1h to finish loadDb with the indexes. That is a good speedup, but with the database parent_id lookup completely disabled it takes only 7 min to finish. Is there anything else we can do to improve the parent_id database lookup?
>>>>>>
>>>>>> Regards,
>>>>>> Nassif
>>>>>>
>>>>>> 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>>>>>>
>>>>>> OK, tested on 2 images.
>>>>>> Fix resolved a lot of misses:
>>>>>>
>>>>>> NTFS image w/ 127,408 files: from 19,558 down to 6,511 misses
>>>>>> NTFS image w/ 433,321 files: from 182,256 down to 19,908 misses
>>>>>>
>>>>>> I also think creating an index on tsk_files(meta_addr) and tsk_files(fs_obj_id) could help improve the database lookup for those deleted files not found in the local cache. What do you think? The database lookup seems too slow, as described in my first email.
>>>>>>
>>>>>> Thank you for taking a look so quickly.
>>>>>> Nassif
>>>>>>
>>>>>> 2014-05-01 23:47 GMT-03:00 Brian Carrier <ca...@sl...>:
>>>>>>
>>>>>> Well, that was an easy and embarrassing fix:
>>>>>>
>>>>>>     if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) {
>>>>>> -       seq = fs_file->name->meta_seq;
>>>>>> +       seq = fs_file->name->par_seq;
>>>>>>     }
>>>>>>
>>>>>> It turns out we've been having a lot of cache misses because of this stupid bug. Can you replace that line and see if it helps? It certainly did on my test image.
>>>>>>
>>>>>> thanks,
>>>>>> brian
>>>>>>
>>>>>>> On May 1, 2014, at 10:24 PM, Brian Carrier <ca...@sl...> wrote:
>>>>>>>
>>>>>>> Thanks for the tests. I wonder if it has to do with an incorrect sequence number. NTFS increments the sequence number each time a file is re-allocated, so deleted orphan files could be getting misses. I'll add some logging on my system and see what kind of misses I get.
>>>>>>>
>>>>>>> brian
>>>>>>>
>>>>>>>> On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>>>>>>
>>>>>>>> OK, tests 1 and 3 done. I do not have the sleuthkit code inside an IDE, so I did not use breakpoints. Instead, I changed TskDbSqlite::findParObjId() to return the parent_meta_addr when it is not found in the cache map, and return 1 when it is found.
>>>>>>>>
>>>>>>>> Performing queries on the generated SQLite database, there were 19,558 cache misses from an image with 3 NTFS partitions and 127,408 files.
>>>>>>>> I confirmed that many of the parent_meta_addr values missed in the cache (now stored in tsk_objects.par_obj_id) are present in tsk_files.meta_addr. The complete paths corresponding to those meta_addr values are the parents of the files whose processing failed to find them in the cache.
>>>>>>>>
>>>>>>>> Other tests resulted in:
>>>>>>>> 182,256 cache misses from 433,321 files (NTFS)
>>>>>>>> 892,359 misses from 1,811,393 files (NTFS)
>>>>>>>> 169,819 misses from 3,177,917 files (HFS)
>>>>>>>>
>>>>>>>> Luis Nassif
>>>>>>>>
>>>>>>>> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>>>>>>>>
>>>>>>>> Forgot to mention: we are using sleuthkit 4.1.3.
>>>>>>>>
>>>>>>>> On 01/05/2014 16:09, "Luís Filipe Nassif" <lfc...@gm...> wrote:
>>>>>>>>
>>>>>>>> Hi Brian,
>>>>>>>>
>>>>>>>> The 3 cases above were NTFS. I also tested with HFS and canceled loaddb after 1 day; the modified version finished after 8 hours and added about 3 million entries. We will try to do the tests you have suggested.
>>>>>>>>
>>>>>>>> On 01/05/2014 15:48, "Brian Carrier" <ca...@sl...> wrote:
>>>>>>>>
>>>>>>>> Hi Luis,
>>>>>>>>
>>>>>>>> What kind of file system was it? I fixed a bug a little while ago in that code for HFS file systems that resulted in a lot of cache misses.
>>>>>>>>
>>>>>>>> In theory, everything should be cached; it sounds like a bug if you are getting so many misses. The basic idea of this code is that everything in the DB gets assigned a unique object ID, and we make associations between files and their parent folder's unique ID.
>>>>>>>>
>>>>>>>> Since you seem to be comfortable with a debugger in the code, can you set a breakpoint for when the miss happens and:
>>>>>>>>
>>>>>>>> 1) Determine the path of the file that was being added to the DB and the parent address that was trying to be found.
>>>>>>>> 2) Use the 'ffind' TSK tool to map that parent address to a path. Is it a subset of the path from #1?
>>>>>>>> 3) Open the DB in a SQLite tool and run something like this:
>>>>>>>>
>>>>>>>>     SELECT * FROM tsk_files WHERE meta_addr == META_ADDR_FROM_ABOVE
>>>>>>>>
>>>>>>>> Is it in the DB?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> brian
>>>>>>>>
>>>>>>>>> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif <lfc...@gm...> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We have investigated a bit why the add image process is so slow in some cases. The add image process time seems to be quadratic in the number of files in the image.
>>>>>>>>>
>>>>>>>>> We detected that the function TskDbSqlite::findParObjId(), in db_sqlite.cpp, is not finding the parent_meta_addr -> parent_file_id mapping in the local cache for a lot of files, causing it to search for the mapping in the database (we are not sure whether that is a non-indexed search).
>>>>>>>>>
>>>>>>>>> For testing purposes, we added a "return 1;" line right after the cache lookup, disabling the database lookup, and this resulted in great speedups:
>>>>>>>>>
>>>>>>>>> number of files   default load_db time   patched load_db time
>>>>>>>>> ~80,000           20min                  2min
>>>>>>>>> ~300,000          3h                     7min
>>>>>>>>> ~700,000          48h                    27min
>>>>>>>>>
>>>>>>>>> We wonder if it is possible to store all par_meta_addr -> par_id mappings in the local cache (better), or to do an improved (indexed?) search for the mapping in the database. We think that someone with more knowledge of the load_db code could help a lot here.
>>>>>>>>> _______________________________________________
>>>>>>>>> sleuthkit-users mailing list
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
>>>>>>>>> http://www.sleuthkit.org