Re: [sleuthkit-developers] Signature Detection Ingest Module
From: Luís F. N. <lfc...@gm...> - 2014-04-28 23:38:24
Updating: I have not built or tested the develop branch, but judging from the configuration file mismatch_config.xml in the FileExtMismatch module, Autopsy does not seem able to differentiate between the MS Office formats. If this is correct, I think using Tika detection from an InputStream would solve the issue.

2014-04-28 20:18 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> Great news, Brian, thank you.
>
> I took a look at TikaFileTypeDetector and it is using only the first
> 100 bytes of the file for detection. From the Tika.detect(byte[]) doc:
>
> "For best results at least a few kilobytes of the document data are
> needed. See also the other detect() methods for better alternatives when
> you have more than just the document prefix available for type detection."
>
> And Tika's default, when reading from a stream, is currently 64KB, so it
> can correctly detect things like "XML root elements after initial comment
> and DTDs" (MimeTypes doc) and, IMHO, zip-based types (ooxml, odf...), ole2
> and the text detection heuristics would work better.
>
> From my Tika experience, I think it would do better detection using
> Tika.detect(inputStream, fileName), so Tika will read file bytes as needed
> and will use the file name to refine the detection. In some cases Tika
> will spool the entire stream to a temporary file for correct detection, but
> in the general case it will read 64KB. I think reading 64KB instead of only
> 100B makes no significant time difference when reading from a spinning
> magnetic drive, with its high latency, as commonly used for disk image
> storage.
>
>
> 2014-04-28 11:01 GMT-03:00 Brian Carrier <ca...@sl...>:
>
>> Yea, the 3.1 release (which is the develop branch on github) is using
>> Tika's file type detection.
>>
>>
>> On Apr 26, 2014, at 7:57 AM, Luís Filipe Nassif <lfc...@gm...>
>> wrote:
>>
>> > Hi all,
>> >
>> > As I previously mentioned, I did not see a module like this in Autopsy
>> > 3, but read somewhere it will be in Autopsy 3.1, right? Solr, under the
>> > hood, uses Tika for this purpose (and the results are great) before
>> > extracting text from files to index. I think explicitly using Tika for
>> > detection would be good, so Autopsy could inform Solr about the detected
>> > file mime type instead of Solr re-detecting all file signatures again.
>> > What do you think about it?
>> >
>> > Nassif
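
The point about zip-based types can be illustrated with a minimal sketch. It deliberately does not use Tika: it relies only on java.util.zip, and the class name, method names and the fake archive contents are all hypothetical. The entry-name check is a simplification for illustration, not Tika's actual detection logic. It shows that a short prefix reveals only the ZIP magic bytes, while reading further into the stream exposes the part names that distinguish the OOXML formats.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; none of these names come from Tika or Autopsy.
class PrefixVsStreamDemo {

    static final String ZIP = "application/zip";
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Prefix-only check: looks at nothing but the ZIP local header magic "PK\3\4".
    static String detectFromPrefix(byte[] prefix) {
        boolean zipMagic = prefix.length >= 4
            && prefix[0] == 'P' && prefix[1] == 'K' && prefix[2] == 3 && prefix[3] == 4;
        return zipMagic ? ZIP : "application/octet-stream";
    }

    // Stream check: walks the archive entries and refines by well-known OOXML part names.
    static String detectFromStream(byte[] data) {
        if (!ZIP.equals(detectFromPrefix(data))) {
            return "application/octet-stream";
        }
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().startsWith("word/")) return DOCX;
                if (entry.getName().startsWith("xl/")) return XLSX;
            }
            return ZIP;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a tiny in-memory archive that only mimics the OOXML layout.
    static byte[] buildFakeOoxml(String mainPart) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zout.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry(mainPart));
            zout.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (String part : new String[] {"word/document.xml", "xl/workbook.xml"}) {
            byte[] file = buildFakeOoxml(part);
            byte[] first100 = Arrays.copyOf(file, 100); // what a 100-byte detector sees
            System.out.println(part + " prefix -> " + detectFromPrefix(first100));
            System.out.println(part + " stream -> " + detectFromStream(file));
        }
    }
}
```

With Tika itself, the equivalent move would be to pass the stream and file name to Tika.detect(inputStream, fileName) rather than a 100-byte array, leaving it to Tika to decide how much of the stream to read.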