#74 OpenDocumentExtractor omits parts of documents

Milestone: 1.3.0 - bugs
Status: closed-fixed
Owner: nobody
Labels: general (25)
Priority: 5
Updated: 2009-07-13
Created: 2008-11-27
Creator: Bill Evans
Private: No

The OpenDocumentExtractor omits parts of Writer documents (and probably of other document types). A document with text like this (also attached):
---begin---
Sentence one. Sentence two.

Paragraph two, one sentence.
---end---
will yield OpenDocument XML like this:
<text:p text:style-name="Standard">Sentence one. <text:s/>Sentence two.</text:p><text:p text:style-name="Standard"/>
<text:p text:style-name="Standard">Paragraph two, one sentence.</text:p>

The bug has to do with the XML parser used and the way it tries to "simplify" the parser handler: text is discarded whenever a nested element, such as <text:s/>, is encountered. As a result, the full text extracted from this document is:
Paragraph two, one sentence.
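
For illustration, here is a minimal sketch of a SAX handler that keeps character data when it meets nested elements. It is not the attached patch and does not use Aperture's classes; the names OdfTextSketch, TextHandler and extract are hypothetical. It assumes a non-namespace-aware parser (so element names arrive as "text:p" and "text:s") and relies on the ODF convention that <text:s/> stands for a space, repeated text:c times when that attribute is present.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Aperture code.
public class OdfTextSketch {

    /** Accumulates text across nested elements instead of discarding it. */
    static class TextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // <text:s/> represents spaces; text:c gives the count (1 if absent).
            if ("text:s".equals(qName)) {
                int count = 1;
                String c = atts.getValue("text:c");
                if (c != null) {
                    try {
                        count = Integer.parseInt(c.trim());
                    } catch (NumberFormatException e) {
                        count = 1; // fall back to a single space on a malformed count
                    }
                }
                for (int i = 0; i < count; i++) {
                    text.append(' ');
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Always append; a nested element never resets the buffer.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // End each paragraph with a newline, including empty paragraphs.
            if ("text:p".equals(qName)) {
                text.append('\n');
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses an XML string and returns the accumulated paragraph text. */
    public static String extract(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(false); // qName is then the prefixed name, e.g. "text:p"
        SAXParser parser = factory.newSAXParser();
        TextHandler handler = new TextHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        // The fragment from this report, wrapped in a root element for parsing.
        String xml = "<root xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
                + "<text:p text:style-name=\"Standard\">Sentence one. <text:s/>Sentence two.</text:p>"
                + "<text:p text:style-name=\"Standard\"/>"
                + "<text:p text:style-name=\"Standard\">Paragraph two, one sentence.</text:p>"
                + "</root>";
        System.out.print(extract(xml));
    }
}
```

With the fragment above, this sketch keeps both sentences of the first paragraph, emits an empty line for the empty paragraph, and then emits the second paragraph. It extracts all character data under the root, so a real extractor would likely need to be more selective about which elements it reads.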

I'm investigating a fix, but I'm no ODF expert, so I may extract too much.

Discussion

  • Bill Evans
    2008-11-27

    File demonstrating the bug.

  • Bill Evans
    2008-11-29

    Patch for this bug.

  • Bill Evans
    2008-11-29

    Test case for the patch.

  • Bill Evans
    2008-11-29

    I've created a patch that fixes this bug, which is attached as aperture.patch.

    I also added a sub-test to OpenDocumentExtractorTest.java that checks for this bug and verifies the fix. While doing so I ran into a problem with a double '/' in AbstractArchiverSubCrawlerTest.java and fixed that as well. Both changes are attached as aperturetest.patch. The new sub-test in ...ExtractorTest requires a new test document, which is attached as openoffice-2.4-writer-multi-space.odt.

     
  • Antoni Mylka
    2008-12-01

    Applied your patch (with some minor documentation-related tweaks) to the trunk. Please close this issue if you think it's all

     
  • Antoni Mylka
    2008-12-30

    Due to a lack of objections, I declare this issue closed.

     
  • Antoni Mylka
    2008-12-30

    • status: open --> closed-fixed
     
  • Antoni Mylka
    2009-07-13

    • milestone: --> 1.3.0 - bugs