Re: [sleuthkit-users] Slow Add Image Process Cause

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi Brian,

Pushed latest version of develop. Tested with the same 3 ntfs images and 1
hfs w/ 3million entries. No misses found and loadDd finished very fast.

Thanks,
Luis Nassif

2014-05-14 16:20 GMT-03:00 Brian Carrier <ca...@sl...>:

> Hi Luis,
>
> Pull the latest version of develop.   I'm not seeing any cache misses on
> any images and the regression tests are all good.
>
> For reference on what has gone on in this thread:
> - NTFS has a sequence value that it increments each time that an MFT entry
> is re-used.
> - TSK (and other tools) used to ignore this value, but that changed about
> a year ago when a law enforcement person showed me some data where half of
> the tools showed an important file in one folder and another half showed it
> in a different folder.  The difference was because some tools were ignoring
> the sequence and putting the file in a less than accurate place.
> - The tsk_loaddb tool (which is what Autopsy basically uses to make the
> SQLite Database) has a caching mechanism to find information about
> previously analyzed files so that it doesn't have to query the database
> each time, which is slow.
> - This exercise brought to light two bugs.  One was that the wrong ID was
> being used in the cache lookup and it never hit.  The other is even after
> that was fixed, some deleted files were still not being found in the cache
> because the wrong sequence value was being used.
> - In both cases, we weren't getting wrong results.  We were simply being
> inefficient and that is why the ingest took longer.
>
> thanks,
> brian
>
>
> On May 13, 2014, at 6:08 PM, Brian Carrier <ca...@sl...> wrote:
>
> > Hi Luis,
> >
> > Yea,  A couple of my regression tests found some strange results with
> that fix.  I pushed a new one up that is less strict on the filtering of
> the deleted files. There is still one issue I need to work out though.
> >
> > thanks,
> > brian
> >
> >
> >
> > On May 13, 2014, at 5:37 PM, Luís Filipe Nassif <lfc...@gm...>
> wrote:
> >
> >> Forgot to mention, now loaddb is adding a different number of entries
> compared to previous version. Is it ok?
> >>
> >> Thank you
> >> Nassif
> >>
> >> Em 13/05/2014 18:19, "Luís Filipe Nassif" <lfc...@gm...>
> escreveu:
> >> Excelent, Brian! All misses were resolved. Tested in the 3 ntfs images
> above and 1 hfs with 3 million entries.
> >>
> >> Also, images taking hours or days to finish now takes minutes with both
> patches!
> >>
> >> Thank you very much for addressing this issue.
> >> Nassif
> >>
> >> Nope, this one from earlier today:
> >>
> >>
> https://github.com/sleuthkit/sleuthkit/commit/f0672805c18a634ffffeff1f39f793501ddb7702
> >>
> >>
> >> On May 12, 2014, at 3:23 PM, Luís Filipe Nassif <lfc...@gm...>
> wrote:
> >>
> >>> Hi Brian,
> >>>
> >>> Do you mean this fix?
> https://github.com/sleuthkit/sleuthkit/commit/7b257b6c8252f9e9a7202990710e3a0ef31bf6b7
> >>>
> >>> It resolved a lot of the misses as i described before (05/02). But i
> am still getting thousands of misses. So I took a look at 2 generated
> sqlites and discovered that the remaining misses were from deleted folders.
> I think that deleted folders are not being stored into local cache...
> >>>
> >>>
> >>>
> >>>
> >>> 2014-05-12 13:55 GMT-03:00 Brian Carrier <ca...@sl...>:
> >>> Hi Luis,
> >>>
> >>> The develop branch on github has a fix that removed the remaining
> misses on my test image, but I had far fewer than you did.   Can you pull
> it and try that one?
> >>>
> >>> thanks,
> >>> brian
> >>>
> >>>
> >>>
> >>>
> >>> On May 8, 2014, at 9:57 AM, Luís Filipe Nassif <lfc...@gm...>
> wrote:
> >>>
> >>>> Hi Brian and Simson,
> >>>>
> >>>> I have done a number of tests since yesterday. I have not restarted
> the computer between the tests, because i think it would be better to use
> the OS IO cache to focus on CPU processing time, without io interference,
> except for the first run. I used another computer, faster than before.
> Results below:
> >>>>
> >>>> ntfs image w/ 127.408 files   1       2       3       4
> >>>> no patch (only Brian's fix)   3min 27s        3min 3s 3min 3s 3min 2s
> >>>> disabling database parent_id look up  35s     11s     11s     12s
> >>>> index on meta_addr,fs_obj_id and commit for each 5000 files   4min
> 11s        3min 48s        3min 48s        3min 47s
> >>>>
> >>>>
> >>>> ntfs image w/ 216.693 files   1       2       3       4
> >>>> no patch (only Brian's fix)   5min 53s        4min 59s        4min
> 58s        5min 2s
> >>>> disabling database parent_id look up  2min 8s 1min 21s        1min
> 21s        1min 21s
> >>>> index on meta_addr,fs_obj_id and commit for each 5000 files   6min
> 38s        5min 46s        5min 43s        5min 43s
> >>>>
> >>>>
> >>>> ntfs image w/ 433.321 files   1       2       3       4
> >>>> no patch (only Brian's fix)   21min 38s       21min 40s       21min
> 10s       21min 10s
> >>>> disabling database parent_id look up  3min 59s        2min 47s
>  2min 47s        2min 47s
> >>>> index on meta_addr,fs_obj_id and commit for each 5000 files   (not
> run based on previous results)
> >>>>
> >>>> So, Brian was right, the commits increased processing time. And as
> you can see, it would be great if it is possible to eliminate the misses
> and remove the database parent_id look up.
> >>>>
> >>>> With that in mind, I took a look at one of the sqlites, and I think I
> discovered the cause of a lot of the misses (maybe all of them). The misses
> happened with deleted files inside deleted folders. I think deleted folders
> can not being stored into local cache.
> >>>>
> >>>> Regards,
> >>>> Nassif
> >>>>
> >>>>
> >>>> 2014-05-07 10:53 GMT-03:00 Brian Carrier <ca...@sl...>:
> >>>> Hi Nassif,
> >>>>
> >>>> As Simson mentioned, the current setup was intended to be the
> fastest.  Doing frequent commits takes longer and more indexes makes
> commits take longer. This is the only process that we know about that does
> this type of lookup and would use those indexes.
> >>>>
> >>>> The bigger question for me is why we are getting these cache misses
> and I need to spend some more time with some more images to find out.  The
> lookup is to find the ID of the parent and we process from the root
> directory down.  So, in theory, we have already processed the parent folder
> before the children and it should be in the cache. We need to figure out
> why the parent isn't in the cache...
> >>>>
> >>>> brian
> >>>>
> >>>>
> >>>>
> >>>> On May 7, 2014, at 8:24 AM, Luís Filipe Nassif <lfc...@gm...>
> wrote:
> >>>>
> >>>>> I have done one last test, because it was very strange to me that
> indexing meta_addr and fs_obj_id had not improved the parent_id lookup. We
> suspected that the indexes were not being used by sqlite, maybe because the
> whole data is not commited before add image process finishes (i am not a
> sqlite expert, is it possible?). So we inserted a commit for each 5.000
> files added to database. The add image process time decreased from 1hour to
> 30min, so we think that the indexes were not being used.
> >>>>>
> >>>>> Why add image process do not commit the data while it is being added
> to database?
> >>>>>
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-02 13:37 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
> >>>>> Fixing my last email, the test was run with the indexes AND Brian's
> fix. Then I removed the index patch and loadDb took the same 1 hour to
> finish with only Brian's fix. So the index patch did not help improving
> database look up for parent_id.
> >>>>>
> >>>>> Sorry for mistake,
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
> >>>>>
> >>>>> I tested loadDb with a create index on meta_addr and fs_obj_id
> patch. The image with 433.321 files, previously taking 2h45min to load, now
> takes 1h to finish loadDb with the indexes. That is a good speed up, but
> completely disabling the database parent_id look up, it only takes 7min to
> finish. Is there another thing we can do to improve the parent_id database
> look up?
> >>>>>
> >>>>> Regards,
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
> >>>>>
> >>>>> Ok, tested in 2 images. Fix resolved a lot of misses:
> >>>>>
> >>>>> ntfs image w/ 127.408 files: from 19.558 to 6.511 misses
> >>>>> ntfs image w/ 433.321 files: from 182.256 to 19.908 misses
> >>>>>
> >>>>> I also think creating an index on tsk_files(meta_addr) and
> tsk_files(fs_obj_id) could help improving the database look up for those
> deleted files not found in local cache, what do you think? The database
> look up seems too slow, as described in my first email.
> >>>>>
> >>>>> Thank you for taking a look so quickly.
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-01 23:47 GMT-03:00 Brian Carrier <ca...@sl...>:
> >>>>>
> >>>>> Well that was an easy and embarrassing fix:
> >>>>>
> >>>>>     if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) {
> >>>>> -        seq = fs_file->name->meta_seq;
> >>>>> +        seq = fs_file->name->par_seq;
> >>>>>     }
> >>>>>
> >>>>> Turns out we've been having  a lot of cache misses because of this
> stupid bug. Can you replace that line and see if it helps.  It certainly
> did on my test image.
> >>>>>
> >>>>> thanks,
> >>>>> brian
> >>>>>
> >>>>>
> >>>>> On May 1, 2014, at 10:24 PM, Brian Carrier <ca...@sl...>
> wrote:
> >>>>>
> >>>>>> Thanks for the tests.  I wonder if it has to do with an incorrect
> sequence number. NTFS increments the sequence number each time a file is
> re-allocated. Deleted orphan files could be getting misses.  I'll add some
> logging on my system and see what kind of misses I get.
> >>>>>>
> >>>>>> brian
> >>>>>>
> >>>>>> On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <lfc...@gm...>
> wrote:
> >>>>>>
> >>>>>>> Ok, tests 1 and 3 done. I do not have sleuthkit code inside an
> ide, so did not use breakpoints. Instead, I changed
> TskDbSqlite::findParObjId() to return the parent_meta_addr when it is not
> found and return 1 when it is found in the cache map.
> >>>>>>>
> >>>>>>> Performing queries on the generated sqlite, there were 19.558
> cache misses from an image with 3 ntfs partitions and 127.408 files. I
> confirmed that many parent_meta_addr missed from cache (now stored in
> tsk_objects.par_obj_id) are into tsk_files.meta_addr. The complete paths
> corresponding to these meta_addr are parents of those files whose
> processing have not found them in cache.
> >>>>>>>
> >>>>>>> Other tests resulted in:
> >>>>>>> 182.256 cache misses from 433.321 files (ntfs)
> >>>>>>> 892.359 misses from 1.811.393 files (ntfs)
> >>>>>>> 169.819 misses from 3.177.917 files (hfs)
> >>>>>>>
> >>>>>>> Luis Nassif
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <lfc...@gm...
> >:
> >>>>>>> Forgot to mention: we are using sleuthkit 4.1.3
> >>>>>>>
> >>>>>>> Em 01/05/2014 16:09, "Luís Filipe Nassif" <lfc...@gm...>
> escreveu:
> >>>>>>>
> >>>>>>> Hi Brian,
> >>>>>>>
> >>>>>>> The 3 cases above were ntfs. I also tested with hfs and canceled
> loaddb after 1 day. The modified version finished after 8hours and added
> about 3 million entries. We will try to do the tests you have suggested.
> >>>>>>>
> >>>>>>> Em 01/05/2014 15:48, "Brian Carrier" <ca...@sl...>
> escreveu:
> >>>>>>> Hi Luis,
> >>>>>>>
> >>>>>>> What kind of file system was it? I fixed a bug a little while ago
> in that code for HFS file systems that resulted in a lot of cache misses.
> >>>>>>>
> >>>>>>> In theory, everything should be cached.  It sounds like a bug if
> you are getting so many misses.  The basic idea of this code is that
> everything in the DB gets assigned a unique object ID and we make
> associations between files and their parent folder's unique ID.
> >>>>>>>
> >>>>>>> Since you seem to be comfortable with a debugger in the code, can
> you set a breakpoint for when the miss happens and:
> >>>>>>> 1) Determine the path of the file that was being added to the DB
> and the parent address that was trying to be found.
> >>>>>>> 2) Use the 'ffind' TSK tool to then map that parent address to a
> path.  Is it a subset of the path from #1?
> >>>>>>> 3) Open the DB in a SQLite tool and do something like this:
> >>>>>>>
> >>>>>>> SELECT * from tsk_files where meta_addr == META_ADDR_FROM_ABOVE
> >>>>>>>
> >>>>>>> Is it in the DB?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> brian
> >>>>>>>
> >>>>>>>
> >>>>>>> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif <
> lfc...@gm...> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We have investigated a bit why the add image process is too slow
> in some cases. The add image process time seems to be quadratic with the
> number of files in the image.
> >>>>>>>>
> >>>>>>>> We detected that the function TskDbSqlite::findParObjId(), in
> db_sqlite.cpp, is not finding the parent_meta_addr -> parent_file_id
> mapping in the local cache for a lot of files, causing it to search for the
> mapping in the database (not sure if it is an non-indexed search?)
> >>>>>>>>
> >>>>>>>> For testing purposes, we added a "return 1;" line right after the
> cache look up, disabling the database look up, and this resulted in great
> speed ups:
> >>>>>>>>
> >>>>>>>> number of files / default load_db time / patched load_db time
> >>>>>>>> ~80.000 / 20min / 2min
> >>>>>>>> ~300.000 / 3h / 7min
> >>>>>>>> ~700.000 / 48h / 27min
> >>>>>>>>
> >>>>>>>> We wonder if it is possible to store all par_meta_addr -> par_id
> mappings into local cache (better) or doing an improved (indexed?) search
> for the mapping in the database. We think that someone with more knowledge
> of load_db code could help a lot here.
> >>>>>>>>
> ------------------------------------------------------------------------------
> >>>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For
> FREE
> >>>>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>  Get
> >>>>>>>> unparalleled scalability from the best Selenium testing platform
> available.
> >>>>>>>> Simple to use. Nothing to install. Get started now for free."
> >>>>>>>>
> http://p.sf.net/sfu/SauceLabs_______________________________________________
> >>>>>>>> sleuthkit-users mailing list
> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>>>>> http://www.sleuthkit.org
> >>>>>>>
> >>>>>>>
> >>>>>>>
> ------------------------------------------------------------------------------
> >>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For
> FREE
> >>>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>  Get
> >>>>>>> unparalleled scalability from the best Selenium testing platform
> available.
> >>>>>>> Simple to use. Nothing to install. Get started now for free."
> >>>>>>>
> http://p.sf.net/sfu/SauceLabs_______________________________________________
> >>>>>>> sleuthkit-users mailing list
> >>>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>>>> http://www.sleuthkit.org
> >>>>>>
> >>>>>>
> >>>>>>
> ------------------------------------------------------------------------------
> >>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For
> FREE
> >>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>  Get
> >>>>>> unparalleled scalability from the best Selenium testing platform
> available.
> >>>>>> Simple to use. Nothing to install. Get started now for free."
> >>>>>> http://p.sf.net/sfu/SauceLabs
> >>>>>> _______________________________________________
> >>>>>> sleuthkit-users mailing list
> >>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>>> http://www.sleuthkit.org
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> ------------------------------------------------------------------------------
> >>>>> Is your legacy SCM system holding you back? Join Perforce May 7 to
> find out:
> >>>>> &#149; 3 signs your SCM is hindering your productivity
> >>>>> &#149; Requirements for releasing software faster
> >>>>> &#149; Expert tips and advice for migrating your SCM now
> >>>>>
> http://p.sf.net/sfu/perforce_______________________________________________
> >>>>> sleuthkit-users mailing list
> >>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>> http://www.sleuthkit.org
> >>>>
> >>>>
> >>>>
> ------------------------------------------------------------------------------
> >>>> Is your legacy SCM system holding you back? Join Perforce May 7 to
> find out:
> >>>> &#149; 3 signs your SCM is hindering your productivity
> >>>> &#149; Requirements for releasing software faster
> >>>> &#149; Expert tips and advice for migrating your SCM now
> >>>>
> http://p.sf.net/sfu/perforce_______________________________________________
> >>>> sleuthkit-users mailing list
> >>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>> http://www.sleuthkit.org
> >>>
> >>>
> >>
> >>
> ------------------------------------------------------------------------------
> >> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
> >> Instantly run your Selenium tests across 300+ browser/OS combos.
> >> Get unparalleled scalability from the best Selenium testing platform
> available
> >> Simple to use. Nothing to install. Get started now for free."
> >>
> http://p.sf.net/sfu/SauceLabs_______________________________________________
> >> sleuthkit-users mailing list
> >> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >> http://www.sleuthkit.org
> >
> >
> >
> ------------------------------------------------------------------------------
> > "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
> > Instantly run your Selenium tests across 300+ browser/OS combos.
> > Get unparalleled scalability from the best Selenium testing platform
> available
> > Simple to use. Nothing to install. Get started now for free."
> > http://p.sf.net/sfu/SauceLabs
> > _______________________________________________
> > sleuthkit-users mailing list
> > https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> > http://www.sleuthkit.org
>
>