From: SourceForge.net <no...@so...> - 2011-12-09 16:44:28
Bugs item #3455474, was opened at 2011-12-09 06:46
Message generated for change (Comment added) made by mylka
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=779500&aid=3455474&group_id=150969

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: 1.6.0 - bugs
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Antoni Mylka (mylka)
Assigned to: Antoni Mylka (mylka)
Summary: Some Excel Documents are classified as Word

Initial Comment:
It seems that Tika's magic MIME type identifier contains the following rule to guess that a document is a Word document:

  <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
    <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" type="string" offset="1152:4096" />
  </match>

So if a document is an Office document (parent Office magic) AND has the WordDocument string (characters separated with 0x00 bytes) somewhere in the given offset range, then it is classified as Word. Unfortunately this fails with Excel workbooks that contain embedded Word documents. I'll file a Tika issue.

----------------------------------------------------------------------

>Comment By: Antoni Mylka (mylka)
Date: 2011-12-09 08:44

Message:
Updated Tika to a version which includes a patch in rev2603. Keeping this bug open until the TIKA-806 discussion is finished.

----------------------------------------------------------------------

Comment By: Arjohn Kampman (arjohn)
Date: 2011-12-09 07:28

Message:
Technically, "D0 CF 11 E0 A1 B1 1A E1" is the file header for Microsoft's Compound File Structure, which is a kind of "FAT disk in a file". The only way to accurately determine the file type is to parse and examine the container structure.

For more info, see [MS-CFB] at http://msdn.microsoft.com/en-us/library/dd942138%28v=prot.13%29.aspx

----------------------------------------------------------------------

Comment By: Antoni Mylka (mylka)
Date: 2011-12-09 07:05

Message:
Filed as https://issues.apache.org/jira/browse/TIKA-806

----------------------------------------------------------------------
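
The nested magic rule quoted in the initial comment can be sketched in a few lines of Python to show why it misfires. This is only an illustration, not Tika's implementation: the function `looks_like_word` and the helper `fake_container` are hypothetical names, the byte buffers are synthetic stand-ins rather than real Office files, and the sketch assumes that offset="1152:4096" means the match may start anywhere in that range.

```python
# Hypothetical re-creation of the quoted magic rule (not Tika's actual code).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

# "WordDocument" with a 0x00 byte between characters, as in the rule.
WORD_MARKER = b"\x00".join(bytes([c]) for c in b"WordDocument")


def looks_like_word(data: bytes) -> bool:
    """Return True if the buffer satisfies the parent and child match."""
    if data[0:8] != OLE2_MAGIC:
        return False
    # Assumption: the marker may start at any offset from 1152 to 4096.
    window = data[1152:4096 + len(WORD_MARKER)]
    return WORD_MARKER in window


def fake_container(*entry_names: str) -> bytes:
    """Build a synthetic buffer: OLE2 header, padding, then entry names.

    This is NOT a valid compound file; it only places UTF-16LE entry
    names where the magic rule would look for them.
    """
    body = bytearray(OLE2_MAGIC)
    body.extend(b"\x00" * (1152 - len(body)))
    for name in entry_names:
        body.extend(name.encode("utf-16-le"))
        body.extend(b"\x00" * 104)  # arbitrary padding between names
    return bytes(body)


word_doc = fake_container("WordDocument")
excel_only = fake_container("Workbook")
excel_with_embedded_word = fake_container("Workbook", "WordDocument")

print(looks_like_word(word_doc))                  # True
print(looks_like_word(excel_only))                # False
print(looks_like_word(excel_with_embedded_word))  # True (false positive)
```

The third case mirrors the reported bug: the rule only checks that the marker occurs in the byte range, so it cannot tell a top-level WordDocument stream from one belonging to an embedded object. Telling them apart requires reading the container's directory structure, as suggested in the 07:28 comment.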