From: SourceForge.net <no...@so...> - 2011-12-09 15:05:20
Bugs item #3455474, was opened at 2011-12-09 06:46
Message generated for change (Comment added) made by mylka
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=779500&aid=3455474&group_id=150969

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: 1.6.0 - bugs
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Antoni Mylka (mylka)
Assigned to: Antoni Mylka (mylka)
Summary: Some Excel Documents are classified as Word

Initial Comment:
It seems that the Tika magic MIME type identifier uses the following rule to guess that a document is a Word document:

<match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
  <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" type="string" offset="1152:4096" />
</match>

So if a document is an Office document (it matches the parent Office magic) AND contains the string WordDocument (with the characters separated by 0x00 bytes) somewhere in the given offset range, then it is classified as Word. Unfortunately, this fails for Excel workbooks that contain embedded Word documents. I'll file a Tika issue.

----------------------------------------------------------------------

>Comment By: Antoni Mylka (mylka)
Date: 2011-12-09 07:05

Message:
Filed as https://issues.apache.org/jira/browse/TIKA-806

----------------------------------------------------------------------
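
The magic rule quoted in the initial comment can be approximated with a few lines of Python, which may help show why an Excel workbook with an embedded Word document trips it. This is only a sketch, not Tika's implementation: the names `looks_like_word` and `fake_ole2` are hypothetical, the offset range is assumed to mean "the pattern may start anywhere from byte 1152 to byte 4096", and the test buffers are synthetic byte arrays that merely imitate OLE2 directory entries rather than real Office files.

```python
# Sketch of the nested magic rule from the report (not Tika's actual code).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# The rule's pattern is "WordDocument" in UTF-16LE without the final 0x00 byte.
WORD_MARKER = "WordDocument".encode("utf-16-le")[:-1]


def looks_like_word(data: bytes, start: int = 1152, end: int = 4096) -> bool:
    """Return True if data matches the parent magic and the nested marker."""
    if not data.startswith(OLE2_MAGIC):
        return False
    # Assumption: the marker may *start* at any offset in [start, end].
    return data.find(WORD_MARKER, start, end + len(WORD_MARKER)) != -1


def fake_ole2(stream_names, name_offset: int = 2048) -> bytes:
    """Build a synthetic buffer with OLE2 magic and UTF-16LE stream names.

    Hypothetical helper: the layout only imitates a directory sector,
    using one 128-byte slot per entry. It is not a valid OLE2 file.
    """
    buf = bytearray(8192)
    buf[:8] = OLE2_MAGIC
    pos = name_offset
    for name in stream_names:
        raw = name.encode("utf-16-le")
        buf[pos:pos + len(raw)] = raw
        pos += 128
    return bytes(buf)


plain_excel = fake_ole2(["Root Entry", "Workbook"])
excel_with_word = fake_ole2(["Root Entry", "Workbook", "WordDocument"])

print(looks_like_word(plain_excel))
print(looks_like_word(excel_with_word))
```

With these synthetic inputs the plain workbook is not matched, while the workbook that also carries a WordDocument entry is, which mirrors the misclassification described in the report: the rule tests only for the presence of the marker, not for which application owns the top-level document.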