#74 OpenDocumentExtractor omits parts of documents

Milestone: 1.3.0 - bugs
Status: closed-fixed
Owner: nobody
Labels: general (25)
Priority: 5
Updated: 2009-07-13
Created: 2008-11-27
Creator: Bill Evans
Private: No

The OpenDocumentExtractor omits parts of Writer documents (and probably of other OpenDocument types). A document with text like this (also attached):
---begin---
Sentence one. Sentence two.

Paragraph two, one sentence.
---end---
will yield OpenDocument XML like this:
<text:p text:style-name="Standard">Sentence one. <text:s/>Sentence two.</text:p><text:p text:style-name="Standard"/>
<text:p text:style-name="Standard">Paragraph two, one sentence.</text:p>

The bug has to do with the XML parser used and the way it tries to "simplify" the parser handler: the text of an element is discarded when that element contains nested elements, such as <text:s/>. So the full text that will be extracted from this document is:
Paragraph two, one sentence.

I'm investigating a fix, but I'm no ODF expert, so I may extract too much.
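
For illustration, here is a minimal sketch of one way a SAX handler can keep text on both sides of a nested element. It is not the attached patch and not Aperture's actual code: the class names OdfTextSketch and ParagraphHandler are hypothetical, and the sample XML is a trimmed-down stand-in for a real content.xml. The idea is to append every characters() event to one buffer instead of keeping only the text of the innermost element, and to expand <text:s/> into the spaces it stands for (its text:c attribute gives the count, defaulting to 1).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the code from aperture.patch.
public class OdfTextSketch {

    /** Accumulates paragraph text across nested elements instead of dropping it. */
    static class ParagraphHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // <text:s/> represents spaces; text:c is the count (1 if absent).
            if ("text:s".equals(qName)) {
                String c = atts.getValue("text:c");
                int count = (c == null) ? 1 : Integer.parseInt(c);
                for (int i = 0; i < count; i++) {
                    out.append(' ');
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Always append; a nested element must not reset the buffer.
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // End each paragraph (including empty ones) with a newline.
            if ("text:p".equals(qName)) {
                out.append('\n');
            }
        }

        String text() {
            return out.toString();
        }
    }

    static String extract(String xml) throws Exception {
        // The default factory is not namespace-aware, so qName is e.g. "text:p".
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ParagraphHandler handler = new ParagraphHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<office:text xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
            + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
            + "<text:p text:style-name=\"Standard\">Sentence one. <text:s/>Sentence two.</text:p>"
            + "<text:p text:style-name=\"Standard\"/>"
            + "<text:p text:style-name=\"Standard\">Paragraph two, one sentence.</text:p>"
            + "</office:text>";
        System.out.print(extract(xml));
    }
}
```

With this approach both sentences of the first paragraph survive, separated by two spaces (the literal one plus the one from <text:s/>). Whether other nested elements (tabs, line breaks, spans) need similar treatment is a separate question that this sketch does not address.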

Discussion

  • Bill Evans

    Bill Evans - 2008-11-27

    File demonstrating the bug.

     
  • Bill Evans

    Bill Evans - 2008-11-29

    Patch for this bug.

     
  • Bill Evans

    Bill Evans - 2008-11-29

    Test case for the patch.

     
  • Bill Evans

    Bill Evans - 2008-11-29

    I've created a patch that fixes this bug, which is attached as aperture.patch.

    I also added a sub-test to OpenDocumentExtractorTest.java to test for this bug and to verify the fix. I ran into a problem with a double '/' in AbstractArchiverSubCrawlerTest.java and fixed that as well. Both changes are attached in aperturetest.patch. The new sub-test in ...ExtractorTest requires a new test document; that is attached as openoffice-2.4-writer-multi-space.odt.

     
  • Antoni Mylka

    Antoni Mylka - 2008-12-01

    Applied your patch (with some minor documentation-related tweaks) to the trunk. Please close this issue if you think it's all

     
  • Antoni Mylka

    Antoni Mylka - 2008-12-30

    Due to lack of objections I declare this issue closed.

     
  • Antoni Mylka

    Antoni Mylka - 2008-12-30
    • status: open --> closed-fixed
     
  • Antoni Mylka

    Antoni Mylka - 2009-07-13
    • milestone: --> 1.3.0 - bugs
     
