Re: [sleuthkit-users] Slow Add Image Process Cause
From: Luís F. N. <lfc...@gm...> - 2014-05-02 16:38:04
Fixing my last email: the test was run with the indexes AND Brian's fix. I then removed the index patch, and loadDb took the same 1 hour to finish with only Brian's fix. So the index patch did not help the database lookup for parent_id. Sorry for the mistake.

Nassif

2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> I tested loadDb with a patch that creates indexes on meta_addr and
> fs_obj_id. The image with 433.321 files, which previously took 2h45min
> to load, now finishes loadDb in 1h with the indexes. That is a good
> speedup, but with the database parent_id lookup completely disabled, it
> takes only 7min to finish. Is there anything else we can do to improve
> the parent_id database lookup?
>
> Regards,
> Nassif

2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> Ok, tested on 2 images. The fix resolved a lot of misses:
>
> ntfs image w/ 127.408 files: from 19.558 to 6.511 misses
> ntfs image w/ 433.321 files: from 182.256 to 19.908 misses
>
> I also think creating indexes on tsk_files(meta_addr) and
> tsk_files(fs_obj_id) could help the database lookup for those deleted
> files not found in the local cache. What do you think? The database
> lookup seems too slow, as described in my first email.
>
> Thank you for taking a look so quickly.
> Nassif

2014-05-01 23:47 GMT-03:00 Brian Carrier <ca...@sl...>:

> Well, that was an easy and embarrassing fix:
>
>     if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) {
>     -    seq = fs_file->name->meta_seq;
>     +    seq = fs_file->name->par_seq;
>     }
>
> Turns out we've been having a lot of cache misses because of this
> stupid bug. Can you replace that line and see if it helps? It certainly
> did on my test image.
>
> thanks,
> brian

On May 1, 2014, at 10:24 PM, Brian Carrier <ca...@sl...> wrote:

> Thanks for the tests. I wonder if it has to do with an incorrect
> sequence number.
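Brian's one-line fix can be illustrated with a toy sketch (hypothetical names, not TSK code): the cache is keyed by the parent's (meta_addr, sequence) pair, so querying it with the file's own sequence number (meta_seq) instead of the parent's (par_seq) misses whenever the two differ, which is common for NTFS deleted/re-allocated entries.

```python
# Hypothetical sketch of the parent-ID cache in TskDbSqlite::findParObjId().
# The cache maps (parent_meta_addr, parent_seq) -> parent object ID.

cache = {}

def store_parent(par_meta_addr, par_seq, par_obj_id):
    # Called as each folder is added to the database.
    cache[(par_meta_addr, par_seq)] = par_obj_id

def find_par_obj_id(par_meta_addr, seq):
    # None means a cache miss, which falls back to a slow database query.
    return cache.get((par_meta_addr, seq))

# Parent directory: MFT entry 5, sequence 5, cached with object ID 42.
store_parent(5, 5, 42)

# A child file whose own meta_seq (1) differs from its par_seq (5):
meta_seq, par_seq = 1, 5

assert find_par_obj_id(5, meta_seq) is None  # buggy lookup key: cache miss
assert find_par_obj_id(5, par_seq) == 42     # fixed lookup key: cache hit
```

This matches the observed numbers: the fix does not eliminate all misses (orphan files whose parents were never added can still miss), but it removes the systematic ones.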
> NTFS increments the sequence number each time a file is re-allocated.
> Deleted orphan files could be getting misses. I'll add some logging on
> my system and see what kind of misses I get.
>
> brian

On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <lfc...@gm...> wrote:

> Ok, tests 1 and 3 are done. I do not have the sleuthkit code inside an
> IDE, so I did not use breakpoints. Instead, I changed
> TskDbSqlite::findParObjId() to return the parent_meta_addr when it is
> not found and to return 1 when it is found in the cache map.
>
> Querying the generated sqlite database, there were 19.558 cache misses
> from an image with 3 ntfs partitions and 127.408 files. I confirmed
> that many parent_meta_addr values missed from the cache (now stored in
> tsk_objects.par_obj_id) are present in tsk_files.meta_addr. The
> complete paths corresponding to these meta_addr values are the parents
> of the files whose processing did not find them in the cache.
>
> Other tests resulted in:
> 182.256 cache misses from 433.321 files (ntfs)
> 892.359 misses from 1.811.393 files (ntfs)
> 169.819 misses from 3.177.917 files (hfs)
>
> Luis Nassif

2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> Forgot to mention: we are using sleuthkit 4.1.3

On 01/05/2014 16:09, "Luís Filipe Nassif" <lfc...@gm...> wrote:

> Hi Brian,
>
> The 3 cases above were ntfs. I also tested with hfs and canceled loaddb
> after 1 day. The modified version finished after 8 hours and added
> about 3 million entries. We will try to do the tests you have
> suggested.

On 01/05/2014 15:48, "Brian Carrier" <ca...@sl...> wrote:

> Hi Luis,
>
> What kind of file system was it? I fixed a bug a little while ago in
> that code for HFS file systems that resulted in a lot of cache misses.
>
> In theory, everything should be cached.
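The caching scheme described here can be sketched roughly as follows (hypothetical names, not the actual TSK implementation): every object added to the database receives a unique object ID, and each file records its parent folder's ID, found through an in-memory map rather than a database query.

```python
# Minimal sketch of the object-ID / parent-association model.
next_obj_id = 0
parent_cache = {}                 # (meta_addr, seq) -> folder's object ID

def add_folder(meta_addr, seq):
    """Insert a folder: assign a unique ID and remember it for children."""
    global next_obj_id
    next_obj_id += 1
    parent_cache[(meta_addr, seq)] = next_obj_id
    return next_obj_id

def find_par_obj_id(par_meta_addr, par_seq):
    """A hit returns the parent's ID; None would force a slow DB query."""
    return parent_cache.get((par_meta_addr, par_seq))

root_id = add_folder(5, 1)        # e.g. an NTFS root directory, MFT entry 5
assert find_par_obj_id(5, 1) == root_id   # hit: no database lookup needed
```

Since folders are added before their children during a normal walk, lookups should nearly always hit; large miss counts therefore point to a keying bug rather than a missing-cache design.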
> It sounds like a bug if you are getting so many misses. The basic idea
> of this code is that everything in the DB gets assigned a unique object
> ID, and we make associations between files and their parent folder's
> unique ID.
>
> Since you seem to be comfortable with a debugger in the code, can you
> set a breakpoint for when the miss happens and:
>
> 1) Determine the path of the file that was being added to the DB and
> the parent address that it was trying to find.
> 2) Use the 'ffind' TSK tool to map that parent address to a path. Is it
> a subset of the path from #1?
> 3) Open the DB in a SQLite tool and run something like this:
>
>     SELECT * FROM tsk_files WHERE meta_addr == META_ADDR_FROM_ABOVE;
>
> Is it in the DB?
>
> Thanks!
>
> brian

On May 1, 2014, at 11:58 AM, Luís Filipe Nassif <lfc...@gm...> wrote:

> Hi,
>
> We have investigated a bit why the add image process is so slow in some
> cases. The add image process time seems to be quadratic in the number
> of files in the image.
>
> We found that the function TskDbSqlite::findParObjId(), in
> db_sqlite.cpp, is not finding the parent_meta_addr -> parent_file_id
> mapping in the local cache for a lot of files, causing it to search for
> the mapping in the database (we are not sure whether that search is
> non-indexed).
>
> For testing purposes, we added a "return 1;" line right after the cache
> lookup, disabling the database lookup, and this resulted in great
> speedups:
>
> number of files / default load_db time / patched load_db time
> ~80.000 / 20min / 2min
> ~300.000 / 3h / 7min
> ~700.000 / 48h / 27min
>
> We wonder if it is possible to store all par_meta_addr -> par_id
> mappings in the local cache (better) or to do an improved (indexed?)
> search for the mapping in the database.
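For illustration, the effect of such an index can be shown at the SQLite level with a simplified, hypothetical version of the tsk_files schema (here a single composite index stands in for the separate indexes suggested): the fallback lookup changes from a full table scan to an index search. Note that, per the correction at the top of the thread, this changes the query plan but still did not match the speed of avoiding the fallback query entirely.

```python
# Demonstration that an index turns the parent lookup into an indexed
# search, using a toy table loosely modeled on tsk_files.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tsk_files "
             "(obj_id INTEGER, fs_obj_id INTEGER, meta_addr INTEGER)")
conn.executemany("INSERT INTO tsk_files VALUES (?, ?, ?)",
                 [(i, 1, i * 7) for i in range(1000)])

def lookup_plan():
    # Last column of EXPLAIN QUERY PLAN output is the human-readable detail.
    row = conn.execute(
        "EXPLAIN QUERY PLAN SELECT obj_id FROM tsk_files "
        "WHERE fs_obj_id = 1 AND meta_addr = 77").fetchone()
    return row[-1]

before = lookup_plan()    # full table scan
conn.execute("CREATE INDEX idx_meta ON tsk_files(fs_obj_id, meta_addr)")
after = lookup_plan()     # search using idx_meta

assert "SCAN" in before.upper()
assert "idx_meta" in after
```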
> We think that someone with more knowledge of the load_db code could
> help a lot here.

------------------------------------------------------------------------------
"Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
Instantly run your Selenium tests across 300+ browser/OS combos. Get
unparalleled scalability from the best Selenium testing platform available.
Simple to use. Nothing to install. Get started now for free."
http://p.sf.net/sfu/SauceLabs
_______________________________________________
sleuthkit-users mailing list
https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
http://www.sleuthkit.org