#469 teitodocx: <note> in <cell> yields unopenable docx

closed-fixed
5
2012-11-25
2012-11-07
No

== Summary ==

teitodocx: <note> in <cell> yields unopenable docx

== Versions ==

teitodocx is run on Ubuntu 12.10

$ dpkg-query -W tei-p5-xsl2
tei-p5-xsl2 6.18

Microsoft Office is run in MS Windows XP with SP3 installed.

Microsoft Office 2010 Version: 14.0.6123.5001
(Thank you, Microsoft, for making the version string unselectable so that I can't just cut and paste it.)

== How to Reproduce ==

1. Download the attached zip file.

2. Unzip it and cd into the resulting directory.

3. Execute:

$ make

This will produce test1.docx and test2.docx from the corresponding test1.xml and test2.xml files. The only difference between test1 and test2 is the presence of a <note> inside a <cell> in test1.xml. The note is absent in test2.xml:

$ diff test1.xml test2.xml
23c23
< <cell>= more blah<note>gaga</note></cell>
---
> <cell>= more blah</cell>

The docx files should show an equivalent difference, expressed in the docx format.

== Expected Results ==

I would expect test1.docx and test2.docx to open normally in Word 2010.

== Actual Results ==

test1.docx causes Word to generate a terse error message. The file cannot be opened.

test2.docx opens normally.

== Observations ==

I'm no Office OpenXML expert so take the following with a whole cup of salt.

Adapt the following to your OS and editor or validator of choice:

$ mkdir t
$ cd t
$ unzip ../test1.docx
$ cd word/

# The next command merely improves readability.
$ xmllint --format document.xml > document.fmt.xml

$ emacsclient document.fmt.xml

M-x nxml-mode if needed, and set the schema to point to the appropriate Office Open XML rnc.

There will be two groups of errors reported by nxml mode:

- Multiple occurrences of @w:type on w:gridCol elements, which happen to be harmless right now. (Yet, looking at ISO/IEC 29500-1:2012(E), section 17.4.16, I do not see w:gridCol accepting @w:type.)

- The presence of a <w:r> element in the wrong place (on line 52). This is what causes Word to bail. It occurs only once in test1.docx.

Upon replacing document.xml with a version which does not have the offending <w:r>, Word can open the document. As noted above, the errant occurrences of @w:type do not seem to cause any ill effect.
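The manual inspection above can be approximated with a short script. This is only a sketch under stated assumptions: it treats a w:r as "misplaced" when it is a direct child of w:tc or w:body (the report does not say exactly where the stray run sits), and the function names suspicious_nodes and check_docx are made up for illustration. It also counts w:gridCol elements that carry @w:type. It is no substitute for validating against the schema.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(local):
    """Build the namespace-qualified tag/attribute name ElementTree uses."""
    return f"{{{W}}}{local}"


def suspicious_nodes(document_xml):
    """Return (stray_runs, typed_gridcols) for a word/document.xml payload.

    Assumption: a "misplaced" run is a w:r that is a direct child of
    w:tc or w:body, where block-level content such as w:p is expected.
    """
    root = ET.fromstring(document_xml)
    stray_runs = sum(
        1
        for parent in root.iter()
        if parent.tag in (q("tc"), q("body"))
        for child in parent
        if child.tag == q("r")
    )
    # w:gridCol elements that carry a w:type attribute.
    typed_gridcols = sum(
        1 for col in root.iter(q("gridCol")) if q("type") in col.attrib
    )
    return stray_runs, typed_gridcols


def check_docx(path):
    """Read word/document.xml out of a .docx and run the checks on it."""
    with zipfile.ZipFile(path) as zf:
        return suspicious_nodes(zf.read("word/document.xml"))
```

Calling check_docx("test1.docx") on the file built from the attached zip would give a quick indication of whether either pattern is present, without needing Word or Emacs.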

For what it is worth, both files crash LibreOffice. I do not know why, but I'm tempted to think the problem lies with LibreOffice, because it still crashes after I've removed all the errors reported by nxml-mode. Here's the version used:

$ dpkg-query -W libreoffice-common
libreoffice-common 1:3.6.2~rc2-0ubuntu3

Discussion

  • Sebastian Rahtz

    Sebastian Rahtz - 2012-11-07

    I hate the <note> element more than I can explain. I will attempt to debug this and fix it.

     
  • Louis-Dominique Dubeau

    Yeah, I've had to override the default processing for <note> for my own purposes sometimes. It was not fun.

    I've investigated this bug further because I need to be able to move forward with files that I can open, even if they are not quite what I want format-wise.

    I found that the problem seems to be in <xsl:template match="tei:cell"> in teitodocx.xsl. As of the version mentioned in the bug report, the template contains a choice branch <xsl:when test="count(*)=1 and tei:note">. It is when this branch is taken that teitodocx generates an unopenable docx. If I remove that branch and simplify the code to always execute the contents of the <xsl:otherwise> branch which calls the block-element template, I get a file I can open.
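To find out which cells in a TEI file would take that branch, the test count(*)=1 and tei:note can be mirrored outside XSLT. The following is a minimal sketch, not part of the stylesheets: the function name cells_hitting_note_branch is hypothetical, and it assumes the input uses the standard TEI namespace http://www.tei-c.org/ns/1.0.

```python
import xml.etree.ElementTree as ET

# Assumption: the document is in the standard TEI P5 namespace.
TEI = "http://www.tei-c.org/ns/1.0"


def cells_hitting_note_branch(tei_xml):
    """Return the tei:cell elements for which count(*)=1 and tei:note holds.

    That is: the cell has exactly one element child, and that child is a
    tei:note. Text inside the cell does not affect the test.
    """
    root = ET.fromstring(tei_xml)
    hits = []
    for cell in root.iter(f"{{{TEI}}}cell"):
        children = list(cell)  # element children only, like XPath "*"
        if len(children) == 1 and children[0].tag == f"{{{TEI}}}note":
            hits.append(cell)
    return hits
```

Because the test counts only element children, a cell with text followed by a single note satisfies it just as a cell holding nothing but a note does.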

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2012-11-08

    I have done some work on this. Look now.

     
  • Louis-Dominique Dubeau

    It fixes the problem I encountered. However, this will still create an unopenable docx:

    <cell><note place="foot">single note</note></cell>

    It is not clear to me why *I* would create a cell containing only a reference to a footnote, without some text to hang the footnote reference on. Still...

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2012-11-12

    Try now. I redid the <note> handling passim et seriatim.

     
  • Louis-Dominique Dubeau

    The problem I reported in the comment dated 2012-11-09 11:43:55 PST is still present. I'm going to attach test3.xml to this bug report. It illustrates the issue. I execute:

    $ teitodocx --apphome=/home/ldd/src/tei/trunk/Stylesheets --profiledir=/home/ldd/src/tei/trunk/Stylesheets/profiles test3.xml test3.docx

    The paths are those that point to my local check out of the svn tree. The resulting test3.docx cannot be opened in Word.

     
  • Louis-Dominique Dubeau

    file illustrating additional problem

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2012-11-15

    Yet another revision of the XSL committed, which processes your file OK. The special case of a <cell> containing just one object is now handled in a more general way.

     
  • Louis-Dominique Dubeau

    I've updated to the latest svn and tried to cause the stylesheet to produce an invalid docx but was not able to.

    So the problem looks fixed. Thank you.

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2012-11-25
    • status: open --> closed-fixed
     
