Re: [sleuthkit-users] Slow Add Image Process Cause
Brought to you by:
carrier
From: Luís F. N. <lfc...@gm...> - 2014-05-13 21:37:13
|
Forgot to mention, now loaddb is adding a different number of entries compared to previous version. Is it ok? Thank you Nassif Em 13/05/2014 18:19, "Luís Filipe Nassif" <lfc...@gm...> escreveu: > Excelent, Brian! All misses were resolved. Tested in the 3 ntfs images > above and 1 hfs with 3 million entries. > > Also, images taking hours or days to finish now takes minutes with both > patches! > > Thank you very much for addressing this issue. > Nassif > Nope, this one from earlier today: > > > https://github.com/sleuthkit/sleuthkit/commit/f0672805c18a634ffffeff1f39f793501ddb7702 > > > On May 12, 2014, at 3:23 PM, Luís Filipe Nassif <lfc...@gm...> > wrote: > > > Hi Brian, > > > > Do you mean this fix? > https://github.com/sleuthkit/sleuthkit/commit/7b257b6c8252f9e9a7202990710e3a0ef31bf6b7 > > > > It resolved a lot of the misses as i described before (05/02). But i am > still getting thousands of misses. So I took a look at 2 generated sqlites > and discovered that the remaining misses were from deleted folders. I think > that deleted folders are not being stored into local cache... > > > > > > > > > > 2014-05-12 13:55 GMT-03:00 Brian Carrier <ca...@sl...>: > > Hi Luis, > > > > The develop branch on github has a fix that removed the remaining misses > on my test image, but I had far fewer than you did. Can you pull it and > try that one? > > > > thanks, > > brian > > > > > > > > > > On May 8, 2014, at 9:57 AM, Luís Filipe Nassif <lfc...@gm...> > wrote: > > > > > Hi Brian and Simson, > > > > > > I have done a number of tests since yesterday. I have not restarted > the computer between the tests, because i think it would be better to use > the OS IO cache to focus on CPU processing time, without io interference, > except for the first run. I used another computer, faster than before. > Results below: > > > > > > ntfs image w/ 127.408 files 1 2 3 4 > > > no patch (only Brian's fix) 3min 27s 3min 3s 3min 3s 3min 2s > > > disabling database parent_id look up 35s 11s 11s 12s > > > index on meta_addr,fs_obj_id and commit for each 5000 files 4min 11s > 3min 48s 3min 48s 3min 47s > > > > > > > > > ntfs image w/ 216.693 files 1 2 3 4 > > > no patch (only Brian's fix) 5min 53s 4min 59s 4min 58s > 5min 2s > > > disabling database parent_id look up 2min 8s 1min 21s 1min 21s > 1min 21s > > > index on meta_addr,fs_obj_id and commit for each 5000 files 6min 38s > 5min 46s 5min 43s 5min 43s > > > > > > > > > ntfs image w/ 433.321 files 1 2 3 4 > > > no patch (only Brian's fix) 21min 38s 21min 40s 21min > 10s 21min 10s > > > disabling database parent_id look up 3min 59s 2min 47s > 2min 47s 2min 47s > > > index on meta_addr,fs_obj_id and commit for each 5000 files (not run > based on previous results) > > > > > > So, Brian was right, the commits increased processing time. And as you > can see, it would be great if it is possible to eliminate the misses and > remove the database parent_id look up. > > > > > > With that in mind, I took a look at one of the sqlites, and I think I > discovered the cause of a lot of the misses (maybe all of them). The misses > happened with deleted files inside deleted folders. I think deleted folders > can not being stored into local cache. > > > > > > Regards, > > > Nassif > > > > > > > > > 2014-05-07 10:53 GMT-03:00 Brian Carrier <ca...@sl...>: > > > Hi Nassif, > > > > > > As Simson mentioned, the current setup was intended to be the fastest. > Doing frequent commits takes longer and more indexes makes commits take > longer. This is the only process that we know about that does this type of > lookup and would use those indexes. > > > > > > The bigger question for me is why we are getting these cache misses > and I need to spend some more time with some more images to find out. The > lookup is to find the ID of the parent and we process from the root > directory down. So, in theory, we have already processed the parent folder > before the children and it should be in the cache. We need to figure out > why the parent isn't in the cache... > > > > > > brian > > > > > > > > > > > > On May 7, 2014, at 8:24 AM, Luís Filipe Nassif <lfc...@gm...> > wrote: > > > > > > > I have done one last test, because it was very strange to me that > indexing meta_addr and fs_obj_id had not improved the parent_id lookup. We > suspected that the indexes were not being used by sqlite, maybe because the > whole data is not commited before add image process finishes (i am not a > sqlite expert, is it possible?). So we inserted a commit for each 5.000 > files added to database. The add image process time decreased from 1hour to > 30min, so we think that the indexes were not being used. > > > > > > > > Why add image process do not commit the data while it is being added > to database? > > > > > > > > Nassif > > > > > > > > > > > > 2014-05-02 13:37 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>: > > > > Fixing my last email, the test was run with the indexes AND Brian's > fix. Then I removed the index patch and loadDb took the same 1 hour to > finish with only Brian's fix. So the index patch did not help improving > database look up for parent_id. > > > > > > > > Sorry for mistake, > > > > Nassif > > > > > > > > > > > > 2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>: > > > > > > > > I tested loadDb with a create index on meta_addr and fs_obj_id > patch. The image with 433.321 files, previously taking 2h45min to load, now > takes 1h to finish loadDb with the indexes. That is a good speed up, but > completely disabling the database parent_id look up, it only takes 7min to > finish. Is there another thing we can do to improve the parent_id database > look up? > > > > > > > > Regards, > > > > Nassif > > > > > > > > > > > > 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>: > > > > > > > > Ok, tested in 2 images. Fix resolved a lot of misses: > > > > > > > > ntfs image w/ 127.408 files: from 19.558 to 6.511 misses > > > > ntfs image w/ 433.321 files: from 182.256 to 19.908 misses > > > > > > > > I also think creating an index on tsk_files(meta_addr) and > tsk_files(fs_obj_id) could help improving the database look up for those > deleted files not found in local cache, what do you think? The database > look up seems too slow, as described in my first email. > > > > > > > > Thank you for taking a look so quickly. > > > > Nassif > > > > > > > > > > > > 2014-05-01 23:47 GMT-03:00 Brian Carrier <ca...@sl...>: > > > > > > > > Well that was an easy and embarrassing fix: > > > > > > > > if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) { > > > > - seq = fs_file->name->meta_seq; > > > > + seq = fs_file->name->par_seq; > > > > } > > > > > > > > Turns out we've been having a lot of cache misses because of this > stupid bug. Can you replace that line and see if it helps. It certainly > did on my test image. > > > > > > > > thanks, > > > > brian > > > > > > > > > > > > On May 1, 2014, at 10:24 PM, Brian Carrier <ca...@sl...> > wrote: > > > > > > > > > Thanks for the tests. I wonder if it has to do with an incorrect > sequence number. NTFS increments the sequence number each time a file is > re-allocated. Deleted orphan files could be getting misses. I'll add some > logging on my system and see what kind of misses I get. > > > > > > > > > > brian > > > > > > > > > > On May 1, 2014, at 8:39 PM, Luís Filipe Nassif < > lfc...@gm...> wrote: > > > > > > > > > >> Ok, tests 1 and 3 done. I do not have sleuthkit code inside an > ide, so did not use breakpoints. Instead, I changed > TskDbSqlite::findParObjId() to return the parent_meta_addr when it is not > found and return 1 when it is found in the cache map. > > > > >> > > > > >> Performing queries on the generated sqlite, there were 19.558 > cache misses from an image with 3 ntfs partitions and 127.408 files. I > confirmed that many parent_meta_addr missed from cache (now stored in > tsk_objects.par_obj_id) are into tsk_files.meta_addr. The complete paths > corresponding to these meta_addr are parents of those files whose > processing have not found them in cache. > > > > >> > > > > >> Other tests resulted in: > > > > >> 182.256 cache misses from 433.321 files (ntfs) > > > > >> 892.359 misses from 1.811.393 files (ntfs) > > > > >> 169.819 misses from 3.177.917 files (hfs) > > > > >> > > > > >> Luis Nassif > > > > >> > > > > >> > > > > >> > > > > >> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif < > lfc...@gm...>: > > > > >> Forgot to mention: we are using sleuthkit 4.1.3 > > > > >> > > > > >> Em 01/05/2014 16:09, "Luís Filipe Nassif" <lfc...@gm...> > escreveu: > > > > >> > > > > >> Hi Brian, > > > > >> > > > > >> The 3 cases above were ntfs. I also tested with hfs and canceled > loaddb after 1 day. The modified version finished after 8hours and added > about 3 million entries. We will try to do the tests you have suggested. > > > > >> > > > > >> Em 01/05/2014 15:48, "Brian Carrier" <ca...@sl...> > escreveu: > > > > >> Hi Luis, > > > > >> > > > > >> What kind of file system was it? I fixed a bug a little while ago > in that code for HFS file systems that resulted in a lot of cache misses. > > > > >> > > > > >> In theory, everything should be cached. It sounds like a bug if > you are getting so many misses. The basic idea of this code is that > everything in the DB gets assigned a unique object ID and we make > associations between files and their parent folder's unique ID. > > > > >> > > > > >> Since you seem to be comfortable with a debugger in the code, can > you set a breakpoint for when the miss happens and: > > > > >> 1) Determine the path of the file that was being added to the DB > and the parent address that was trying to be found. > > > > >> 2) Use the 'ffind' TSK tool to then map that parent address to a > path. Is it a subset of the path from #1? > > > > >> 3) Open the DB in a SQLite tool and do something like this: > > > > >> > > > > >> SELECT * from tsk_files where meta_addr == META_ADDR_FROM_ABOVE > > > > >> > > > > >> Is it in the DB? > > > > >> > > > > >> Thanks! > > > > >> > > > > >> brian > > > > >> > > > > >> > > > > >> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif < > lfc...@gm...> wrote: > > > > >> > > > > >>> Hi, > > > > >>> > > > > >>> We have investigated a bit why the add image process is too slow > in some cases. The add image process time seems to be quadratic with the > number of files in the image. > > > > >>> > > > > >>> We detected that the function TskDbSqlite::findParObjId(), in > db_sqlite.cpp, is not finding the parent_meta_addr -> parent_file_id > mapping in the local cache for a lot of files, causing it to search for the > mapping in the database (not sure if it is an non-indexed search?) > > > > >>> > > > > >>> For testing purposes, we added a "return 1;" line right after > the cache look up, disabling the database look up, and this resulted in > great speed ups: > > > > >>> > > > > >>> number of files / default load_db time / patched load_db time > > > > >>> ~80.000 / 20min / 2min > > > > >>> ~300.000 / 3h / 7min > > > > >>> ~700.000 / 48h / 27min > > > > >>> > > > > >>> We wonder if it is possible to store all par_meta_addr -> par_id > mappings into local cache (better) or doing an improved (indexed?) search > for the mapping in the database. We think that someone with more knowledge > of load_db code could help a lot here. > > > > >>> > ------------------------------------------------------------------------------ > > > > >>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - > For FREE > > > > >>> Instantly run your Selenium tests across 300+ browser/OS combos. > Get > > > > >>> unparalleled scalability from the best Selenium testing platform > available. > > > > >>> Simple to use. Nothing to install. Get started now for free." > > > > >>> > http://p.sf.net/sfu/SauceLabs_______________________________________________ > > > > >>> sleuthkit-users mailing list > > > > >>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > > > > >>> http://www.sleuthkit.org > > > > >> > > > > >> > > > > >> > ------------------------------------------------------------------------------ > > > > >> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For > FREE > > > > >> Instantly run your Selenium tests across 300+ browser/OS combos. > Get > > > > >> unparalleled scalability from the best Selenium testing platform > available. > > > > >> Simple to use. Nothing to install. Get started now for free." > > > > >> > http://p.sf.net/sfu/SauceLabs_______________________________________________ > > > > >> sleuthkit-users mailing list > > > > >> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > > > > >> http://www.sleuthkit.org > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > > > "Accelerate Dev Cycles with Automated Cross-Browser Testing - For > FREE > > > > > Instantly run your Selenium tests across 300+ browser/OS combos. > Get > > > > > unparalleled scalability from the best Selenium testing platform > available. > > > > > Simple to use. Nothing to install. Get started now for free." > > > > > http://p.sf.net/sfu/SauceLabs > > > > > _______________________________________________ > > > > > sleuthkit-users mailing list > > > > > https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > > > > > http://www.sleuthkit.org > > > > > > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > > Is your legacy SCM system holding you back? Join Perforce May 7 to > find out: > > > > • 3 signs your SCM is hindering your productivity > > > > • Requirements for releasing software faster > > > > • Expert tips and advice for migrating your SCM now > > > > > http://p.sf.net/sfu/perforce_______________________________________________ > > > > sleuthkit-users mailing list > > > > https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > > > > http://www.sleuthkit.org > > > > > > > > > > ------------------------------------------------------------------------------ > > > Is your legacy SCM system holding you back? Join Perforce May 7 to > find out: > > > • 3 signs your SCM is hindering your productivity > > > • Requirements for releasing software faster > > > • Expert tips and advice for migrating your SCM now > > > > http://p.sf.net/sfu/perforce_______________________________________________ > > > sleuthkit-users mailing list > > > https://lists.sourceforge.net/lists/listinfo/sleuthkit-users > > > http://www.sleuthkit.org > > > > > > |