== Summary ==
teitodocx: <note> in <cell> yields unopenable docx
== Versions ==
teitodocx is run on Ubuntu 12.10
$ dpkg-query -W tei-p5-xsl2
Microsoft Office is run in MS Windows XP with SP3 installed.
Microsoft Office 2010 Version: 14.0.6123.5001
(Thank you, Microsoft, for making the version string unselectable so that I can't just cut and paste it.)
== How to Reproduce ==
1. Download the attached zip file.
2. Unzip it and cd into the resulting directory.
This will produce test1.docx and test2.docx from the corresponding test1.xml and test2.xml files. The only difference between test1 and test2 is the presence of a <note> inside a <cell> in test1.xml. The note is absent in test2.xml:
$ diff test1.xml test2.xml
< <cell>= more blah<note>gaga</note></cell>
> <cell>= more blah</cell>
The docx files should show an equivalent difference, expressed in the docx format.
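To check that expectation without opening Word, the main document parts of the two files can be compared directly. The following is only a rough sketch: the function names are mine, and it assumes each docx keeps its body in word/document.xml, which is where the unzip step in the Observations section finds it.

```python
import difflib
import xml.dom.minidom
import zipfile


def pretty_document_xml(docx):
    """Return word/document.xml from a .docx (path or file object) as indented lines."""
    with zipfile.ZipFile(docx) as archive:
        raw = archive.read("word/document.xml")
    # Re-indent so that the diff is line-oriented instead of one huge line.
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ").splitlines()


def diff_docx(docx_a, docx_b):
    """Return a unified diff (list of lines) of the main parts of two .docx files."""
    return list(
        difflib.unified_diff(
            pretty_document_xml(docx_a),
            pretty_document_xml(docx_b),
            fromfile="a/word/document.xml",
            tofile="b/word/document.xml",
            lineterm="",
        )
    )


# Example use from the unzipped directory:
#     print("\n".join(diff_docx("test1.docx", "test2.docx")))
```

If the conversion were sound, the diff should be limited to whatever markup represents the note.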
== Expected Results ==
I would expect test1.docx and test2.docx to open normally in Word 2010.
== Actual Results ==
test1.docx causes Word to generate a terse error message. The file cannot be opened.
test2.docx opens normally.
== Observations ==
I'm no Office OpenXML expert so take the following with a whole cup of salt.
Adapt the following to your OS and editor or validator of choice:
$ mkdir t
$ cd t
$ unzip ../test1.docx
$ cd word/
# The next command merely improves readability.
$ xmllint --format document.xml > document.fmt.xml
$ emacsclient document.fmt.xml
Run M-x nxml-mode if needed, and point the schema at the appropriate Office Open XML RELAX NG compact (.rnc) file.
There will be two groups of errors reported by nxml mode:
- Several w:gridCol elements carry a @w:type attribute, which happens to be harmless right now. (Yet, looking at ISO/IEC 29500-1:2012(E) section 17.4.16, I do not see w:gridCol accepting @w:type.)
- A <w:r> element appears in the wrong place (on line 52). This is what causes Word to bail. It occurs only once in test1.docx.
Upon replacing document.xml with a version which does not have the offending <w:r>, Word can open the document. As noted above, the errant occurrences of @w:type do not seem to cause any ill effect.
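For anyone without nxml-mode at hand, the two findings can be approximated with a short script. This is a sketch, not a validator. It rests on an assumption of mine: that the errant <w:r> is a direct child of an element that only holds block-level content (w:body, w:tbl, w:tr or w:tc). I have not confirmed that this is where the run on line 52 sits. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumption: a run must not be a direct child of these block-level containers.
BLOCK_CONTAINERS = {"body", "tbl", "tr", "tc"}


def audit_document_xml(xml_bytes):
    """Return (parents of suspect w:r elements, count of w:gridCol with @w:type)."""
    root = ET.fromstring(xml_bytes)
    prefix = "{" + W + "}"
    suspect_parents = []
    for parent in root.iter():
        if not parent.tag.startswith(prefix):
            continue
        parent_name = parent.tag[len(prefix):]
        if parent_name not in BLOCK_CONTAINERS:
            continue
        for child in parent:
            if child.tag == prefix + "r":
                suspect_parents.append(parent_name)
    typed_grid_cols = sum(
        1 for col in root.iter(prefix + "gridCol") if prefix + "type" in col.attrib
    )
    return suspect_parents, typed_grid_cols


def audit_docx(docx):
    """Run the audit on word/document.xml inside a .docx (path or file object)."""
    with zipfile.ZipFile(docx) as archive:
        return audit_document_xml(archive.read("word/document.xml"))


# Example use:
#     print(audit_docx("test1.docx"))
```

An empty list of suspect parents only means this particular heuristic found nothing; validating against the schema remains the authoritative check.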
For what it is worth, both files crash LibreOffice. I do not know why, but I am tempted to think the problem lies with LibreOffice, because it still crashes after I have removed all the errors reported by nxml-mode. Here's the version used:
$ dpkg-query -W libreoffice-common