#469 teitodocx: <note> in <cell> yields unopenable docx

closed-fixed
5
2012-11-25
2012-11-07
No

== Summary ==

teitodocx: <note> in <cell> yields unopenable docx

== Versions ==

teitodocx is run on Ubuntu 12.10

$ dpkg-query -W tei-p5-xsl2
tei-p5-xsl2 6.18

Microsoft Office is run in MS Windows XP with SP3 installed.

Microsoft Office 2010 Version: 14.0.6123.5001
(Thank you, Microsoft, for making the version string unselectable so that I can't just cut and paste it.)

== How to Reproduce ==

1. Download the attached zip file.

2. Unzip it and cd into the resulting directory.

3. Execute:

$ make

This will produce test1.docx and test2.docx from the corresponding test1.xml and test2.xml files. The only difference between test1 and test2 is the presence of a <note> inside a <cell> in test1.xml. The note is absent in test2.xml:

$ diff test1.xml test2.xml
23c23
< <cell>= more blah<note>gaga</note></cell>
---
> <cell>= more blah</cell>

The docx files should show an equivalent difference, expressed in the docx format.
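
One way to check that expectation is to compare the main document part of the two generated files. The following is a minimal sketch, not part of the attached Makefile; the helper names `document_xml_lines` and `diff_docx` are made up for illustration, and it assumes the content of interest lives in word/document.xml.

```python
import difflib
import zipfile
import xml.dom.minidom


def document_xml_lines(docx):
    """Return word/document.xml from a .docx (path or file object) as pretty-printed lines."""
    with zipfile.ZipFile(docx) as zf:
        raw = zf.read("word/document.xml")
    # Pretty-printing plays the same role as `xmllint --format`: one element per line.
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ").splitlines()


def diff_docx(a, b):
    """Unified diff of the main document part of two .docx files (hypothetical helper)."""
    return list(
        difflib.unified_diff(
            document_xml_lines(a), document_xml_lines(b), "a", "b", lineterm=""
        )
    )
```

Printing the lines of `diff_docx("test1.docx", "test2.docx")` should then show only the markup generated for the <note>, if the conversion behaves as expected.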

== Expected Results ==

I would expect test1.docx and test2.docx to open normally in Word 2010.

== Actual Results ==

test1.docx causes Word to generate a terse error message. The file cannot be opened.

test2.docx opens normally.

== Observations ==

I'm no Office OpenXML expert so take the following with a whole cup of salt.

Adapt the following to your OS and editor or validator of choice:

$ mkdir t
$ cd t
$ unzip ../test1.docx
$ cd word/

# The next command merely improves readability.
$ xmllint --format document.xml > document.fmt.xml

$ emacsclient document.fmt.xml

Run M-x nxml-mode if needed, and set the schema to the appropriate Office Open XML RNC file.

There will be two groups of errors reported by nxml mode:

- The presence of @w:type on several w:gridCol elements, which happens to be harmless right now. (Yet, looking at ISO/IEC 29500-1:2012(E) section 17.4.16, I do not see w:gridCol accepting @w:type.)

- The presence of a <w:r> element in the wrong place (on line 52). This is what causes Word to bail. It occurs only once in test1.docx.

Upon replacing document.xml with a version which does not have the offending <w:r>, Word can open the document. As noted above, the errant occurrences of @w:type do not seem to cause any ill effect.
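
For anyone without nxml-mode at hand, the two groups of errors can be approximated with a short script. This is only a sketch under an assumption: the report does not say where exactly the errant <w:r> sits, so the code guesses that it is a direct child of a table cell (w:tc), which should hold block-level content such as w:p. The function name `find_schema_suspects` is made up for illustration, and the check is no substitute for real schema validation.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, normally bound to the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_schema_suspects(docx):
    """Return (misplaced_runs, typed_gridcols) for a .docx path or file object."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    # Assumption: the errant run is a direct child of w:tc rather than of w:p.
    misplaced_runs = sum(
        len(tc.findall(f"{{{W}}}r")) for tc in root.iter(f"{{{W}}}tc")
    )

    # w:gridCol elements that carry a w:type attribute.
    typed_gridcols = sum(
        1 for col in root.iter(f"{{{W}}}gridCol") if f"{{{W}}}type" in col.attrib
    )
    return misplaced_runs, typed_gridcols
```

If the assumption about the run's position holds, `find_schema_suspects("test1.docx")` should report one misplaced run and a nonzero count of typed w:gridCol elements.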

For what it is worth, both files crash LibreOffice. I do not know why, but I'm tempted to think the problem is in LibreOffice, because it still crashes after I've removed all errors reported by nxml-mode. Here's the version used:

$ dpkg-query -W libreoffice-common
libreoffice-common 1:3.6.2~rc2-0ubuntu3

Discussion

  • files illustrating the bug

     
    Attachments
  • I hate the <note> element more than I can explain. I will attempt to debug this and fix it.

     
  • Yeah, I've had to override the default processing for <note> for my own purposes sometimes. It was not fun.

    I've investigated this bug further because I need to be able to move forward with files that I can open, even if they are not quite what I want format-wise.

    I found that the problem seems to be in <xsl:template match="tei:cell"> in teitodocx.xsl. As of the version mentioned in the bug report, the template contains a choice branch <xsl:when test="count(*)=1 and tei:note">. It is when this branch is taken that teitodocx generates an unopenable docx. If I remove that branch and simplify the code to always execute the contents of the <xsl:otherwise> branch which calls the block-element template, I get a file I can open.
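
To make clear which cells take that branch, here is a small sketch that mimics the XPath test `count(*)=1 and tei:note` outside of XSLT. It is an illustration only, not code from teitodocx.xsl; the helper names `takes_single_note_branch` and `parse_cell` are made up. The point is that `*` counts element children only, so text next to the note does not keep a cell out of the branch.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"


def takes_single_note_branch(cell):
    """Mimic the XPath test count(*)=1 and tei:note for a tei:cell element."""
    # Iterating an element yields its element children only; text nodes are
    # not counted, which matches what * selects in XPath.
    children = list(cell)
    return len(children) == 1 and children[0].tag == f"{{{TEI}}}note"


def parse_cell(fragment):
    """Parse a <cell> fragment in the TEI namespace (hypothetical helper)."""
    return ET.fromstring(f'<row xmlns="{TEI}">{fragment}</row>')[0]


for fragment in (
    "<cell>= more blah<note>gaga</note></cell>",  # line 23 of test1.xml
    "<cell>= more blah</cell>",                   # line 23 of test2.xml
):
    print(fragment, takes_single_note_branch(parse_cell(fragment)))
```

Under this reading, the cell from test1.xml satisfies the test even though it also contains text, while the cell from test2.xml does not.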

     
  • I have done some work on this. Look now.

     
  • It fixes the problem I encountered. However, this will still create an unopenable docx:

    <cell><note place="foot">single note</note></cell>

    It is not clear to me why *I* would create a cell containing only a reference to a footnote, without some text to hang the footnote reference on. Still...

     
  • Try now. I redid the <note> handling passim et seriatim.

     
  • The problem I reported in the comment dated 2012-11-09 11:43:55 PST is still present. I'm going to attach test3.xml to this bug report. It illustrates the issue. I execute:

    $ teitodocx --apphome=/home/ldd/src/tei/trunk/Stylesheets --profiledir=/home/ldd/src/tei/trunk/Stylesheets/profiles test3.xml test3.docx

    The paths point to my local checkout of the svn tree. The resulting test3.docx cannot be opened in Word.

     
  • file illustrating additional problem

     
    Attachments
  • Yet another revision of the XSL has been committed, which processes your file OK. The special case of a <cell> containing just one object is now handled in a more general way.

     
  • I've updated to the latest svn and tried to cause the stylesheet to produce an invalid docx but was not able to.

    So the problem looks fixed. Thank you.

     