Re: [sleuthkit-developers] Signature Detection Ingest Module

An update: I have not built or tested the develop branch, but judging from
the configuration file mismatch_config.xml of the FileExtMismatch module, it
seems that Autopsy is not able to differentiate between the MS Office
formats. If this is correct, I think using Tika detection from an
InputStream would solve the issue.
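
To illustrate why a short prefix may not be enough for the zip-based Office formats, here is a minimal sketch using only java.util.zip. It is not Tika's implementation and not Autopsy code: the class and method names are hypothetical, the in-memory containers are toy stand-ins for real files, and telling .docx from .xlsx by the "word/" or "xl/" entry prefix is a simplification I am assuming for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: prefix-only sniffing versus reading the container from a stream.
class ContainerSniffSketch {

    static final String UNKNOWN = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Build a toy OOXML-like ZIP in memory: [Content_Types].xml plus one main part.
    static byte[] buildContainer(String mainPart) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry(mainPart));
            zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Look only at the first 100 bytes: all we can check here is the ZIP magic number.
    static String detectFromPrefix(byte[] data) {
        byte[] prefix = Arrays.copyOf(data, Math.min(data.length, 100));
        boolean zipMagic = prefix.length >= 4
                && prefix[0] == 'P' && prefix[1] == 'K' && prefix[2] == 3 && prefix[3] == 4;
        return zipMagic ? ZIP : UNKNOWN;
    }

    // Read entries from the stream as needed and refine the type from the entry names.
    static String detectFromStream(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            boolean sawEntry = false;
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                sawEntry = true;
                if (e.getName().startsWith("word/")) return DOCX;
                if (e.getName().startsWith("xl/")) return XLSX;
            }
            return sawEntry ? ZIP : UNKNOWN;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = buildContainer("word/document.xml");
        byte[] sheet = buildContainer("xl/workbook.xml");
        System.out.println(detectFromPrefix(doc) + " | " + detectFromPrefix(sheet));
        System.out.println(detectFromStream(new ByteArrayInputStream(doc)));
        System.out.println(detectFromStream(new ByteArrayInputStream(sheet)));
    }
}
```

In this sketch the prefix check reports plain application/zip for both containers, while the stream-based check tells them apart. With Tika itself I would expect the equivalent call to be Tika.detect(inputStream, fileName), which reads as many bytes as it needs.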

2014-04-28 20:18 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> Great news, Brian, thank you.
>
> I took a look at TikaFileTypeDetector and it is using only the first
> 100 bytes of the file for detection. From the Tika.detect(byte[]) doc:
>
> "For best results at least a few kilobytes of the document data are
> needed. See also the other detect() methods for better alternatives when
> you have more than just the document prefix available for type detection."
>
> And Tika's default, when reading from a stream, currently is 64KB, so it
> can correctly detect things like "XML root elements after initial comment
> and DTDs" (MimeTypes doc) and, IMHO, zip-based types (ooxml, odf...), ole2
> and the text detection heuristics would work better.
>
> From my Tika experience, I think it would do better detection using
> Tika.detect(inputStream, fileName), so Tika will read file bytes as needed
> and will use the file name for detection refinement. In some cases Tika
> will spool the entire stream to a temporary file for correct detection, but
> in the general case it will read 64KB. I think reading 64KB instead of only
> 100B makes no significant time difference when reading from a spinning
> magnetic drive, with its high latency, as commonly used for disk image
> storage.
>
>
> 2014-04-28 11:01 GMT-03:00 Brian Carrier <ca...@sl...>:
>
>> Yea, the 3.1 release (which is the develop branch on github) is using
>> Tika's file type detection.
>>
>>
>>
>> On Apr 26, 2014, at 7:57 AM, Luís Filipe Nassif <lfc...@gm...>
>> wrote:
>>
>> > Hi all,
>> >
>> > As I previously mentioned, I did not see a module like this in Autopsy
>> > 3, but I read somewhere that it will be in Autopsy 3.1, right? Solr, under
>> > the hood, uses Tika for this purpose (and the results are great) before
>> > extracting text from files to index. I think explicitly using Tika for
>> > detection would be good, so Autopsy could inform Solr about the detected
>> > file mime type instead of Solr re-detecting all file signatures again.
>> > What do you think about it?
>> >
>> > Nassif
>> >
>> > sleuthkit-developers mailing list
>> > sle...@li...
>> > https://lists.sourceforge.net/lists/listinfo/sleuthkit-developers
>>
>>
>