Re: [sleuthkit-users] Slow Add Image Process Cause
Brought to you by:
carrier
From: Luís F. N. <lfc...@gm...> - 2014-05-15 18:07:40
|
Hi Brian, Pushed latest version of develop. Tested with the same 3 ntfs images and 1 hfs w/ 3million entries. No misses found and loadDd finished very fast. Thanks, Luis Nassif 2014-05-14 16:20 GMT-03:00 Brian Carrier <ca...@sl...>: > Hi Luis, > > Pull the latest version of develop. I'm not seeing any cache misses on > any images and the regression tests are all good. > > For reference on what has gone on in this thread: > - NTFS has a sequence value that it increments each time that an MFT entry > is re-used. > - TSK (and other tools) used to ignore this value, but that changed about > a year ago when a law enforcement person showed me some data where half of > the tools showed an important file in one folder and another half showed it > in a different folder. The difference was because some tools were ignoring > the sequence and putting the file in a less than accurate place. > - The tsk_loaddb tool (which is what Autopsy basically uses to make the > SQLite Database) has a caching mechanism to find information about > previously analyzed files so that it doesn't have to query the database > each time, which is slow. > - This exercise brought to light two bugs. One was that the wrong ID was > being used in the cache lookup and it never hit. The other is even after > that was fixed, some deleted files were still not being found in the cache > because the wrong sequence value was being used. > - In both cases, we weren't getting wrong results. We were simply being > inefficient and that is why the ingest took longer. > > thanks, > brian > > > On May 13, 2014, at 6:08 PM, Brian Carrier <ca...@sl...> wrote: > > > Hi Luis, > > > > Yea, A couple of my regression tests found some strange results with > that fix. I pushed a new one up that is less strict on the filtering of > the deleted files. There is still one issue I need to work out though. > > > > thanks, > > brian > > > > > > > > On May 13, 2014, at 5:37 PM, Luís Filipe Nassif <lfc...@gm...> > wrote: > > > >> Forgot to mention, now loaddb is adding a different number of entries > compared to previous version. Is it ok? > >> > >> Thank you > >> Nassif > >> > >> Em 13/05/2014 18:19, "Luís Filipe Nassif" <lfc...@gm...> > escreveu: > >> Excelent, Brian! All misses were resolved. Tested in the 3 ntfs images > above and 1 hfs with 3 million entries. > >> > >> Also, images taking hours or days to finish now takes minutes with both > patches! > >> > >> Thank you very much for addressing this issue. > >> Nassif > >> > >> Nope, this one from earlier today: > >> > >> > https://github.com/sleuthkit/sleuthkit/commit/f0672805c18a634ffffeff1f39f793501ddb7702 > >> > >> > >> On May 12, 2014, at 3:23 PM, Luís Filipe Nassif <lfc...@gm...> > wrote: > >> > >>> Hi Brian, > >>> > >>> Do you mean this fix? > https://github.com/sleuthkit/sleuthkit/commit/7b257b6c8252f9e9a7202990710e3a0ef31bf6b7 > >>> > >>> It resolved a lot of the misses as i described before (05/02). But i > am still getting thousands of misses. So I took a look at 2 generated > sqlites and discovered that the remaining misses were from deleted folders. > I think that deleted folders are not being stored into local cache... > >>> > >>> > >>> > >>> > >>> 2014-05-12 13:55 GMT-03:00 Brian Carrier <ca...@sl...>: > >>> Hi Luis, > >>> > >>> The develop branch on github has a fix that removed the remaining > misses on my test image, but I had far fewer than you did. Can you pull > it and try that one? > >>> > >>> thanks, > >>> brian > >>> > >>> > >>> > >>> > >>> On May 8, 2014, at 9:57 AM, Luís Filipe Nassif <lfc...@gm...> > wrote: > >>> > >>>> Hi Brian and Simson, > >>>> > >>>> I have done a number of tests since yesterday. I have not restarted > the computer between the tests, because i think it would be better to use > the OS IO cache to focus on CPU processing time, without io interference, > except for the first run. I used another computer, faster than before. > Results below: > >>>> > >>>> ntfs image w/ 127.408 files 1 2 3 4 > >>>> no patch (only Brian's fix) 3min 27s 3min 3s 3min 3s 3min 2s > >>>> disabling database parent_id look up 35s 11s 11s 12s > >>>> index on meta_addr,fs_obj_id and commit for each 5000 files 4min > 11s 3min 48s 3min 48s 3min 47s > >>>> > >>>> > >>>> ntfs image w/ 216.693 files 1 2 3 4 > >>>> no patch (only Brian's fix) 5min 53s 4min 59s 4min > 58s 5min 2s > >>>> disabling database parent_id look up 2min 8s 1min 21s 1min > 21s 1min 21s > >>>> index on meta_addr,fs_obj_id and commit for each 5000 files 6min > 38s 5min 46s 5min 43s 5min 43s > >>>> > >>>> > >>>> ntfs image w/ 433.321 files 1 2 3 4 > >>>> no patch (only Brian's fix) 21min 38s 21min 40s 21min > 10s 21min 10s > >>>> disabling database parent_id look up 3min 59s 2min 47s > 2min 47s 2min 47s > >>>> index on meta_addr,fs_obj_id and commit for each 5000 files (not > run based on previous results) > >>>> > >>>> So, Brian was right, the commits increased processing time. And as > you can see, it would be great if it is possible to eliminate the misses > and remove the database parent_id look up. > >>>> > >>>> With that in mind, I took a look at one of the sqlites, and I think I > discovered the cause of a lot of the misses (maybe all of them). The misses > happened with deleted files inside deleted folders. I think deleted folders > can not being stored into local cache. > >>>> > >>>> Regards, > >>>> Nassif > >>>> > >>>> > >>>> 2014-05-07 10:53 GMT-03:00 Brian Carrier <ca...@sl...>: > >>>> Hi Nassif, > >>>> > >>>> As Simson mentioned, the current setup was intended to be the > fastest. Doing frequent commits takes longer and more indexes makes > commits take longer. This is the only process that we know about that does > this type of lookup and would use those indexes. > >>>> > >>>> The bigger question for me is why we are getting these cache misses > and I need to spend some more time with some more images to find out. The > lookup is to find the ID of the parent and we process from the root > directory down. So, in theory, we have already processed the parent folder > before the children and it should be in the cache. We need to figure out > why the parent isn't in the cache... > >>>> > >>>> brian > >>>> > >>>> > >>>> > >>>> On May 7, 2014, at 8:24 AM, Luís Filipe Nassif <lfc...@gm...> > wrote: > >>>> > >>>>> I have done one last test, because it was very strange to me that > indexing meta_addr and fs_obj_id had not improved the parent_id lookup. We > suspected that the indexes were not being used by sqlite, maybe because the > whole data is not commited before add image process finishes (i am not a > sqlite expert, is it possible?). So we inserted a commit for each 5.000 > files added to database. The add image process time decreased from 1hour to > 30min, so we think that the indexes were not being used. > >>>>> > >>>>> Why add image process do not commit the data while it is being added > to database? > >>>>> > >>>>> Nassif > >>>>> > >>>>> > >>>>> 2014-05-02 13:37 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>: > >>>>> Fixing my last email, the test was run with the indexes AND Brian's > fix. Then I removed the index patch and loadDb took the same 1 hour to > finish with only Brian's fix. So the index patch did not help improving > database look up for parent_id. > >>>>> > >>>>> Sorry for mistake, > >>>>> Nassif > >>>>> > >>>>> > >>>>> 2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>: > >>>>> > >>>>> I tested loadDb with a create index on meta_addr and fs_obj_id > patch. The image with 433.321 files, previously taking 2h45min to load, now > takes 1h to finish loadDb with the indexes. That is a good speed up, but > completely disabling the database parent_id look up, it only takes 7min to > finish. Is there another thing we can do to improve the parent_id database > look up? > >>>>> > >>>>> Regards, > >>>>> Nassif > >>>>> > >>>>> > >>>>> 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>: > >>>>> > >>>>> Ok, tested in 2 images. Fix resolved a lot of misses: > >>>>> > >>>>> ntfs image w/ 127.408 files: from 19.558 to 6.511 misses > >>>>> ntfs image w/ 433.321 files: from 182.256 to 19.908 misses > >>>>> > >>>>> I also think creating an index on tsk_files(meta_addr) and > tsk_files(fs_obj_id) could help improving the database look up for those > deleted files not found in local cache, what do you think? The database > look up seems too slow, as described in my first email. > >>>>> > >>>>> Thank you for taking a look so quickly. > >>>>> Nassif > >>>>> > >>>>> > >>>>> 2014-05-01 23:47 GMT-03:00 Brian Carrier <ca...@sl...>: > >>>>> > >>>>> Well that was an easy and embarrassing fix: > >>>>> > >>>>> if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) { > >>>>> - seq = fs_file->name->meta_seq; > >>>>> + seq = fs_file->name->par_seq; > >>>>> } > >>>>> > >>>>> Turns out we've been having a lot of cache misses because of this > stupid bug. Can you replace that line and see if it helps. It certainly > did on my test image. > >>>>> > >>>>> thanks, > >>>>> brian > >>>>> > >>>>> > >>>>> On May 1, 2014, at 10:24 PM, Brian Carrier <ca...@sl...> > wrote: > >>>>> > >>>>>> Thanks for the tests. I wonder if it has to do with an incorrect > sequence number. NTFS increments the sequence number each time a file is > re-allocated. Deleted orphan files could be getting misses. I'll add some > logging on my system and see what kind of misses I get. > >>>>>> > >>>>>> brian > >>>>>> > >>>>>> On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <lfc...@gm...> > wrote: > >>>>>> > >>>>>>> Ok, tests 1 and 3 done. I do not have sleuthkit code inside an > ide, so did not use breakpoints. Instead, I changed > TskDbSqlite::findParObjId() to return the parent_meta_addr when it is not > found and return 1 when it is found in the cache map. > >>>>>>> > >>>>>>> Performing queries on the generated sqlite, there were 19.558 > cache misses from an image with 3 ntfs partitions and 127.408 files. I > confirmed that many parent_meta_addr missed from cache (now stored in > tsk_objects.par_obj_id) are into tsk_files.meta_addr. The complete paths > corresponding to these meta_addr are parents of those files whose > processing have not found them in cache. > >>>>>>> > >>>>>>> Other tests resulted in: > >>>>>>> 182.256 cache misses from 433.321 files (ntfs) > >>>>>>> 892.359 misses from 1.811.393 files (ntfs) > >>>>>>> 169.819 misses from 3.177.917 files (hfs) > >>>>>>> > >>>>>>> Luis Nassif > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <lfc...@gm... > >: > >>>>>>> Forgot to mention: we are using sleuthkit 4.1.3 > >>>>>>> > >>>>>>> Em 01/05/2014 16:09, "Luís Filipe Nassif" <lfc...@gm...> > escreveu: > >>>>>>> > >>>>>>> Hi Brian, > >>>>>>> > >>>>>>> The 3 cases above were ntfs. I also tested with hfs and canceled > loaddb after 1 day. The modified version finished after 8hours and added > about 3 million entries. We will try to do the tests you have suggested. > >>>>>>> > >>>>>>> Em 01/05/2014 15:48, "Brian Carrier" <ca...@sl...> > escreveu: > >>>>>>> Hi Luis, > >>>>>>> > >>>>>>> What kind of file system was it? I fixed a bug a little while ago > in that code for HFS file systems that resulted in a lot of cache misses. > >>>>>>> > >>>>>>> In theory, everything should be cached. It sounds like a bug if > you are getting so many misses. The basic idea of this code is that > everything in the DB gets assigned a unique object ID and we make > associations between files and their parent folder's unique ID. > >>>>>>> > >>>>>>> Since you seem to be comfortable with a debugger in the code, can > you set a breakpoint for when the miss happens and: > >>>>>>> 1) Determine the path of the file that was being added to the DB > and the parent address that was trying to be found. > >>>>>>> 2) Use the 'ffind' TSK tool to then map that parent address to a > path. Is it a subset of the path from #1? > >>>>>>> 3) Open the DB in a SQLite tool and do something like this: > >>>>>>> > >>>>>>> SELECT * from tsk_files where meta_addr == META_ADDR_FROM_ABOVE > >>>>>>> > >>>>>>> Is it in the DB? > >>>>>>> > >>>>>>> Thanks! > >>>>>>> > >>>>>>> brian > >>>>>>> > >>>>>>> > >>>>>>> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif < > lfc...@gm...> wrote: > >>>>>>> > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> We have investigated a bit why the add image process is too slow > in some cases. The add image process time seems to be quadratic with the > number of files in the image. > >>>>>>>> > >>>>>>>> We detected that the function TskDbSqlite::findParObjId(), in > db_sqlite.cpp, is not finding the parent_meta_addr -> parent_file_id > mapping in the local cache for a lot of files, causing it to search for the > mapping in the database (not sure if it is an non-indexed search?) > >>>>>>>> > >>>>>>>> For testing purposes, we added a "return 1;" line right after the > cache look up, disabling the database look up, and this resulted in great > speed ups: > >>>>>>>> > >>>>>>>> number of files / default load_db time / patched load_db time > >>>>>>>> ~80.000 / 20min / 2min > >>>>>>>> ~300.000 / 3h / 7min > >>>>>>>> ~700.000 / 48h / 27min > >>>>>>>> > >>>>>>>> We wonder if it is possible to store all par_meta_addr -> par_id > mappings into local cache (better) or doing an improved (indexed?) search > for the mapping in the database. We think that someone with more knowledge > of load_db code could help a lot here. > >>>>>>>> > ------------------------------------------------------------------------------ > >>>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For > FREE > >>>>>>>> Instantly run your Selenium tests across 300+ browser/OS combos. > Get > >>>>>>>> unparalleled scalability from the best Selenium testing platform > available. > >>>>>>>> Simple to use. Nothing to install. Get started now for free." > >>>>>>>> > http://p.sf.net/sfu/SauceLabs_______________________________________________ > >>>>>>>> sleuthkit-users mailing list > >>>>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > >>>>>>>> http://www.sleuthkit.org > >>>>>>> > >>>>>>> > >>>>>>> > ------------------------------------------------------------------------------ > >>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For > FREE > >>>>>>> Instantly run your Selenium tests across 300+ browser/OS combos. > Get > >>>>>>> unparalleled scalability from the best Selenium testing platform > available. > >>>>>>> Simple to use. Nothing to install. Get started now for free." > >>>>>>> > http://p.sf.net/sfu/SauceLabs_______________________________________________ > >>>>>>> sleuthkit-users mailing list > >>>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > >>>>>>> http://www.sleuthkit.org > >>>>>> > >>>>>> > >>>>>> > ------------------------------------------------------------------------------ > >>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For > FREE > >>>>>> Instantly run your Selenium tests across 300+ browser/OS combos. > Get > >>>>>> unparalleled scalability from the best Selenium testing platform > available. > >>>>>> Simple to use. Nothing to install. Get started now for free." > >>>>>> http://p.sf.net/sfu/SauceLabs > >>>>>> _______________________________________________ > >>>>>> sleuthkit-users mailing list > >>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > >>>>>> http://www.sleuthkit.org > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > ------------------------------------------------------------------------------ > >>>>> Is your legacy SCM system holding you back? Join Perforce May 7 to > find out: > >>>>> • 3 signs your SCM is hindering your productivity > >>>>> • Requirements for releasing software faster > >>>>> • Expert tips and advice for migrating your SCM now > >>>>> > http://p.sf.net/sfu/perforce_______________________________________________ > >>>>> sleuthkit-users mailing list > >>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > >>>>> http://www.sleuthkit.org > >>>> > >>>> > >>>> > ------------------------------------------------------------------------------ > >>>> Is your legacy SCM system holding you back? Join Perforce May 7 to > find out: > >>>> • 3 signs your SCM is hindering your productivity > >>>> • Requirements for releasing software faster > >>>> • Expert tips and advice for migrating your SCM now > >>>> > http://p.sf.net/sfu/perforce_______________________________________________ > >>>> sleuthkit-users mailing list > >>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > >>>> http://www.sleuthkit.org > >>> > >>> > >> > >> > ------------------------------------------------------------------------------ > >> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE > >> Instantly run your Selenium tests across 300+ browser/OS combos. > >> Get unparalleled scalability from the best Selenium testing platform > available > >> Simple to use. Nothing to install. Get started now for free." > >> > http://p.sf.net/sfu/SauceLabs_______________________________________________ > >> sleuthkit-users mailing list > >> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > >> http://www.sleuthkit.org > > > > > > > ------------------------------------------------------------------------------ > > "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE > > Instantly run your Selenium tests across 300+ browser/OS combos. > > Get unparalleled scalability from the best Selenium testing platform > available > > Simple to use. Nothing to install. Get started now for free." > > http://p.sf.net/sfu/SauceLabs > > _______________________________________________ > > sleuthkit-users mailing list > > https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > > http://www.sleuthkit.org > > |