Re: [sleuthkit-developers] Signature Detection Ingest Module
From: Luís F. N. <lfc...@gm...> - 2014-05-14 00:00:57
Has anyone taken a look at this? I think using Tika.detect(stream, filename) would improve Autopsy's file type detection.

Nassif

2014-04-28 20:38 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> An update: I have not built or tested the develop branch, but judging
> from the configuration file mismatch_config.xml of the FileExtMismatch
> module, Autopsy does not seem able to differentiate between the MS Office
> formats. If this is correct, I think using Tika detection from an
> inputStream would solve the issue.
>
> 2014-04-28 20:18 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>
>> Great news, Brian, thank you.
>>
>> I took a look at TikaFileTypeDetector and it uses only the first
>> 100 bytes of the file for detection. From the Tika.detect(byte[]) doc:
>>
>> "For best results at least a few kilobytes of the document data are
>> needed. See also the other detect() methods for better alternatives when
>> you have more than just the document prefix available for type detection."
>>
>> Tika's default when reading from a stream is currently 64KB, so it
>> can correctly detect things like "XML root elements after initial
>> comment and DTDs" (MimeTypes doc) and, IMHO, detection of zip based
>> types (ooxml, odf...), ole2 and the text detection heuristics would
>> work better.
>>
>> From my Tika experience, I think detection would be better using
>> Tika.detect(inputStream, fileName), so that Tika reads file bytes as
>> needed and uses the file name to refine the detection. In some cases Tika
>> will spool the entire stream to a temporary file for correct detection,
>> but in the general case it will read 64KB. I think reading only 100 bytes
>> instead of 64KB does not make a significant difference in time when
>> reading from a spinning magnetic drive with high latency, which is
>> commonly used for disk image storage.
>>
>> 2014-04-28 11:01 GMT-03:00 Brian Carrier <ca...@sl...>:
>>
>>> Yea, the 3.1 release (which is the develop branch on github) is using
>>> Tika's file type detection.
>>>
>>> On Apr 26, 2014, at 7:57 AM, Luís Filipe Nassif <lfc...@gm...>
>>> wrote:
>>>
>>> > Hi all,
>>> >
>>> > As I previously mentioned, I did not see a module like this in
>>> > Autopsy 3, but I read somewhere that it will be in Autopsy 3.1, right?
>>> > Solr, under the hood, uses Tika for this purpose (and the results are
>>> > great) before extracting text from files to index. I think explicitly
>>> > using Tika for detection would be good, so Autopsy could inform Solr
>>> > about the detected file MIME type instead of Solr re-detecting all
>>> > file signatures again. What do you think about it?
>>> >
>>> > Nassif
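
To illustrate the point raised in the thread about zip based types, here is a minimal sketch that uses only the Java standard library. It is not Tika's algorithm and not Autopsy's code. The class name ZipPrefixDemo, its helper methods, the entry names and the random filler payload are all hypothetical and chosen only for the demonstration. It builds two small in-memory archives that share a first entry named [Content_Types].xml and differ only in the second entry, then compares what a 100-byte prefix check can tell against what a stream-based check that walks the entries can tell.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipPrefixDemo {

    // Build a small in-memory ZIP with the given entry names (hypothetical helper).
    static byte[] buildZip(String... entryNames) {
        // Fixed-seed random filler standing in for real part content;
        // random bytes do not compress away, so the first entry spans well over 100 bytes.
        byte[] payload = new byte[512];
        new Random(42).nextBytes(payload);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : entryNames) {
                ZipEntry entry = new ZipEntry(name);
                entry.setTime(1_400_000_000_000L); // fixed timestamp keeps the headers reproducible
                zip.putNextEntry(entry);
                zip.write(payload);
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Prefix-only check: a short magic-number test can only say "this is some ZIP".
    static boolean looksLikeZip(byte[] prefix) {
        return prefix.length >= 4
                && prefix[0] == 'P' && prefix[1] == 'K'
                && prefix[2] == 3 && prefix[3] == 4;
    }

    // Stream-based check: read as many entries as needed and look at their names.
    // The directory names used here are a simplified assumption about OOXML layout.
    static String classify(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/")) return "docx";
                if (name.startsWith("xl/")) return "xlsx";
                if (name.startsWith("ppt/")) return "pptx";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "zip";
    }

    public static void main(String[] args) {
        byte[] docx = buildZip("[Content_Types].xml", "word/document.xml");
        byte[] xlsx = buildZip("[Content_Types].xml", "xl/workbook.xml");

        boolean samePrefix = Arrays.equals(Arrays.copyOf(docx, 100), Arrays.copyOf(xlsx, 100));
        System.out.println("first 100 bytes identical: " + samePrefix);
        System.out.println("prefix check says ZIP for both: "
                + (looksLikeZip(Arrays.copyOf(docx, 100)) && looksLikeZip(Arrays.copyOf(xlsx, 100))));
        System.out.println("stream check: " + classify(docx) + " vs " + classify(xlsx));
    }
}
```

In this sketch the two archives cannot be told apart from their first 100 bytes, while reading further into the stream separates them. That matches the motivation for the Tika.detect(inputStream, fileName) call proposed above, which lets the detector read as many bytes as it needs.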