Re: [sleuthkit-developers] Signature Detection Ingest Module

Has anyone taken a look at this? I think using Tika.detect(stream,
filename) would improve Autopsy's file type detection.
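
To illustrate why a short prefix cannot separate the zip-based Office formats discussed in the quoted messages below, here is a minimal sketch that uses only the JDK. It is not Tika's or Autopsy's implementation. The class and method names (ZipTypeSketch, detectFromPrefix, detectFromStream, zipWith) are hypothetical, and the tiny in-memory archives are stand-ins for real .docx and .xlsx files, which contain many more parts. With Tika on the classpath, the stream-based step would be the Tika.detect(stream, filename) call instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipTypeSketch {

    static final String ZIP = "application/zip";
    static final String UNKNOWN = "application/octet-stream";

    /** Prefix-only check: every zip-based format starts with the same "PK\3\4" signature. */
    public static String detectFromPrefix(byte[] prefix) {
        if (prefix.length >= 4 && prefix[0] == 'P' && prefix[1] == 'K'
                && prefix[2] == 3 && prefix[3] == 4) {
            return ZIP;
        }
        return UNKNOWN;
    }

    /** Stream-based check: read entries as needed until one identifies the format. */
    public static String detectFromStream(InputStream in) throws IOException {
        boolean sawEntry = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                sawEntry = true;
                switch (entry.getName()) {
                    case "word/document.xml":
                        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                    case "xl/workbook.xml":
                        return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                    case "ppt/presentation.xml":
                        return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                    default:
                        break; // not a distinguishing part, keep reading
                }
            }
        }
        return sawEntry ? ZIP : UNKNOWN;
    }

    /** Builds a small in-memory zip with the given entry names (demo stand-in only). */
    public static byte[] zipWith(String... entryNames) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf)) {
            for (String name : entryNames) {
                out.putNextEntry(new ZipEntry(name));
                out.write("<x/>".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = zipWith("[Content_Types].xml", "word/document.xml");
        byte[] xlsx = zipWith("[Content_Types].xml", "xl/workbook.xml");

        // The first 100 bytes give the same answer for both files.
        System.out.println(detectFromPrefix(Arrays.copyOf(docx, 100)));
        System.out.println(detectFromPrefix(Arrays.copyOf(xlsx, 100)));

        // Reading further into the stream tells them apart.
        System.out.println(detectFromStream(new ByteArrayInputStream(docx)));
        System.out.println(detectFromStream(new ByteArrayInputStream(xlsx)));
    }
}
```

Both prefixes classify as plain application/zip, while the stream-based check reports distinct word processing and spreadsheet types. That is the kind of difference I would expect from letting the detector read more than the first 100 bytes.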

Nassif

2014-04-28 20:38 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:

> An update: I have not built or tested the develop branch, but the
> configuration file mismatch_config.xml from the FileExtMismatch module
> suggests that Autopsy cannot differentiate between the MS Office
> formats. If this is correct, I think using Tika detection from an
> InputStream would solve the issue.
>
>
> 2014-04-28 20:18 GMT-03:00 Luís Filipe Nassif <lfc...@gm...>:
>
>> Great news, Brian, thank you.
>>
>> I took a look at TikaFileTypeDetector, and it uses only the first
>> 100 bytes of the file for detection. From the Tika.detect(byte[]) doc:
>>
>> "For best results at least a few kilobytes of the document data are
>> needed. See also the other detect() methods for better alternatives when
>> you have more than just the document prefix available for type detection."
>>
>> Tika's default when reading from a stream is currently 64KB, so it
>> can correctly detect things like "XML root elements after initial
>> comment and DTDs" (MimeTypes doc). IMHO, detection of zip-based types
>> (OOXML, ODF...) and OLE2, and the text detection heuristics, would also
>> work better.
>>
>> From my Tika experience, I think detection would be better with
>> Tika.detect(inputStream, fileName), so that Tika reads file bytes as needed
>> and uses the file name to refine the result. In some cases Tika
>> will spool the entire stream to a temporary file for correct detection, but
>> in the general case it will read 64KB. I do not think reading 64KB instead
>> of only 100 bytes makes a significant time difference when reading from a
>> spinning magnetic drive with high latency, as is commonly used for disk
>> image storage.
>>
>>
>> 2014-04-28 11:01 GMT-03:00 Brian Carrier <ca...@sl...>:
>>
>>> Yea, the 3.1 release (which is the develop branch on github) is using
>>> Tika's file type detection.
>>>
>>>
>>>
>>> On Apr 26, 2014, at 7:57 AM, Luís Filipe Nassif <lfc...@gm...>
>>> wrote:
>>>
>>> > Hi all,
>>> >
>>> > As I previously mentioned, I did not see a module like this in Autopsy
>>> > 3, but I read somewhere that it will be in Autopsy 3.1, right? Solr, under
>>> > the hood, uses Tika for this purpose (and the results are great) before
>>> > extracting text from files to index. I think explicitly using Tika for
>>> > detection would be good, so Autopsy could inform Solr of the detected
>>> > file MIME type instead of Solr re-detecting all file signatures. What
>>> > do you think about it?
>>> >
>>> > Nassif
>>> >
>>> > sleuthkit-developers mailing list
>>> > sle...@li...
>>> > https://lists.sourceforge.net/lists/listinfo/sleuthkit-developers
>>>
>>>
>>
>