From: SourceForge.net <no...@so...> - 2009-12-07 03:22:51
Bugs item #2904755, was opened at 2009-11-27 10:42
Message generated for change (Settings changed) made by jwaddell
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=577089&aid=2904755&group_id=85722

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Normaliser
Group: None
Status: Open
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Allan Cunliffe (acunliffe)
>Assigned to: Allan Cunliffe (acunliffe)
Summary: Changing file extensions changes normalisation results

Initial Comment:
Testing Branch - Xena 4.3.16

I took a group of files, which previously passed normalisation, and altered their file extensions. I then ran the altered files through Xena again. Some of these failed and some were treated differently than when they had the correct file extension.

A summary of results:

* CSV file renamed to .xls is guessed as Excel and fails normalisation:
  org.xml.sax.SAXException: Cannot connect to OpenOffice.org - possibly something wrong with the input file
  Trace:
    au.gov.naa.digipres.xena.kernel.normalise.NormaliserManager.parse(NormaliserManager.java:826)
    au.gov.naa.digipres.xena.kernel.normalise.NormaliserManager.normalise(NormaliserManager.java:1005)
    au.gov.naa.digipres.xena.core.Xena.normalise(Xena.java:599)
    au.gov.naa.digipres.xena.core.Xena.normalise(Xena.java:543)
    au.gov.naa.digipres.xena.litegui.NormalisationThread.normaliseFile(NormalisationThread.java:328)
    au.gov.naa.digipres.xena.litegui.NormalisationThread.normaliseStandard(NormalisationThread.java:250)
    au.gov.naa.digipres.xena.litegui.NormalisationThread.run(NormalisationThread.java:191)

* Message file (.msg) with extension removed is guessed as Word and fails normalisation:
  com.sun.star.lang.IllegalArgumentException: URL seems to be an unsupported one.
  Trace:
    au.gov.naa.digipres.xena.kernel.normalise.NormaliserManager.parse(NormaliserManager.java:826)
    au.gov.naa.digipres.xena.kernel.normalise.NormaliserManager.normalise(NormaliserManager.java:1005)
    au.gov.naa.digipres.xena.core.Xena.normalise(Xena.java:599)
    au.gov.naa.digipres.xena.core.Xena.normalise(Xena.java:543)
    au.gov.naa.digipres.xena.litegui.NormalisationThread.normaliseFile(NormalisationThread.java:328)
    au.gov.naa.digipres.xena.litegui.NormalisationThread.normaliseStandard(NormalisationThread.java:250)
    au.gov.naa.digipres.xena.litegui.NormalisationThread.run(NormalisationThread.java:191)

* Some Word files with the file extension removed are guessed as Microsoft Msg and fail normalisation:
  javax.mail.MessagingException: Message body is empty. Perhaps this isn't really a MSG file?

* DOCX, ODP and SXC were guessed as .zip and normalised. These export as .zip files. Normally, these files would export as ODF documents.

----------------------------------------------------------------------

>Comment By: Justin Waddell (jwaddell)
Date: 2009-12-07 14:22

Message:
Have now fixed the problem with Word files guessed as MSG files.

All of the listed problems are now fixed, except for extensionless MSG files being guessed as Word files, which we cannot do anything about.

Fixes made in v3.4.6 of the Office plugin, testing branch.

----------------------------------------------------------------------

Comment By: Justin Waddell (jwaddell)
Date: 2009-12-03 17:05

Message:
Update - the problems with CSV and OOXML/ODF should be fixed. I'll need a copy of the Word files guessed as MSG files to be able to test and fix that problem.

Current fixes made in v3.4.5 of the Office plugin.

----------------------------------------------------------------------

Comment By: Justin Waddell (jwaddell)
Date: 2009-11-27 11:02

Message:
* CSV file renamed to .xls is guessed as Excel and fails normalisation:

This should not happen.

* Message file (.msg) with extension removed is guessed as Word and fails normalisation:

All of Microsoft's pre-OfficeOpenXML files are in the same container format. Some of them have an indicator inside the container of what format they are (we retrieve this indicator using Apache's POIFS library) but some do not. We can't distinguish formats that do not have the indicator - all we know is that it's a Microsoft format of some kind. For these we assume it will be Word, which is the most likely format we will receive. There is not much we can do about this - even the Linux 'file' command just identifies it as a "Microsoft Office Document".

* Some Word files with the file extension removed are guessed as Microsoft Msg and fail normalisation:

This should not happen, as the default should be Word.

* DOCX, ODP and SXC were guessed as .zip and normalised. These export as .zip files. Normally, these files would export as ODF documents.

These files are all really a collection of XML files that have been wrapped in a ZIP archive. Without the extension there is no easy way to determine that it is actually an office document of some kind. Might be able to look at the names of the files inside the ZIP to give some kind of indication.

I'll look at the ones that should not happen, and the one where we might be able to improve the guessing.

----------------------------------------------------------------------
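
The comments above describe guessing by content rather than by extension: the legacy Microsoft formats share one container, and DOCX/ODF files are ZIP archives whose entry names could hint at the real type. A minimal sketch of that idea in plain Java follows. It is not Xena's guesser code. The class ContainerSniffer, its methods and the returned labels are hypothetical; the sketch relies only on the well-known OLE2 and ZIP signatures, the "mimetype" entry used by ODF and OpenOffice.org 1.x packages, and the standard OOXML part names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Illustrative only: hypothetical helper, not part of Xena or its Office plugin. */
final class ContainerSniffer {
    // OLE2 compound file signature shared by legacy Word, Excel, PowerPoint and MSG files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature at the start of an ordinary ZIP archive.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    /** Guesses from the bytes alone, ignoring the file name and extension. */
    static String guess(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return "ole2"; // telling Word from MSG needs a look inside the container
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return guessZip(data);
        }
        return "unknown"; // a CSV renamed to .xls lands here instead of being called Excel
    }

    /** Walks the ZIP entry names looking for ODF or OOXML markers. */
    static String guessZip(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("mimetype")) {
                    // ODF and OpenOffice.org 1.x packages store their MIME type in this entry.
                    return new String(readEntry(zip), StandardCharsets.US_ASCII).trim();
                }
                if (name.equals("word/document.xml")) return "ooxml-word";
                if (name.equals("xl/workbook.xml")) return "ooxml-excel";
                if (name.equals("ppt/presentation.xml")) return "ooxml-powerpoint";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "zip"; // no office markers found, treat as a plain archive
    }

    private static byte[] readEntry(ZipInputStream zip) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        int n;
        while ((n = zip.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a tiny in-memory ZIP with one entry, for trying the sniffer out. */
    static byte[] zipWith(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For example, ContainerSniffer.guess(ContainerSniffer.zipWith("word/document.xml", "<w/>")) would report an OOXML Word package even with no extension to go on. The OLE2 branch deliberately stops at "ole2": as noted in the comment thread, some legacy files carry no usable indicator inside the container, so a fallback such as assuming Word is still needed there.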