COSLoadError parsing an embedded image, PDF g

Help
Stefan
2011-02-21
2013-05-28
  • Stefan
    2011-02-21

    I have some PDFs generated by XSane (http://www.xsane.org) that throw a COSLoadError on parsing:

    de.intarsys.pdf.parser.COSLoadError: EI expected at character index 4851134
        at de.intarsys.pdf.parser.CSContentParser.parseOperationEI(CSContentParser.java:405)
        at de.intarsys.pdf.parser.CSContentParser.parseStream(CSContentParser.java:472)
        at de.intarsys.pdf.parser.CSContentParser.parseStream(CSContentParser.java:433)
        at de.intarsys.pdf.content.CSContent.createFromBytes(CSContent.java:89)
    

    The problem occurs in de.intarsys.pdf.parser.CSContentParser::parseImageData().
    The embedded image is not recognized because the "EI" is placed directly after the image data with no linefeed or space.
    The image embedding code in the PDF looks like this:

    stream
    q
    ..
    BI
    ..
    ID
    xxxxxxxxxxxxxxxxx..xxxxxxxxxxxxEI
    Q
    endstream
    

    See http://tuxatwork.net/pub/jpod/out-0080.pdf for an example.

    I'm not sure whether this PDF syntax is correct, but I have to deal with it, and it's also rendered correctly by Evince (Linux) and Acrobat Reader (Win).
    My workaround is to read ahead after each occurrence of 'E' to check whether this special "end of image" marker, the sequence "EI\nQ\n", has been reached.
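
    A minimal sketch of that read-ahead idea, for illustration only. It is not the jPod code: the class InlineImageEndScanner and the method findEndOfImage are hypothetical names, and the content stream is assumed to be fully available as a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jPod.
class InlineImageEndScanner {
    // Terminator assumed by the workaround: "EI", newline, "Q", newline.
    private static final byte[] END_MARKER =
            "EI\nQ\n".getBytes(StandardCharsets.ISO_8859_1);

    // Returns the index of the 'E' that starts the first "EI\nQ\n" sequence
    // at or after 'from', or -1 if no such sequence exists.
    static int findEndOfImage(byte[] content, int from) {
        for (int i = Math.max(from, 0); i + END_MARKER.length <= content.length; i++) {
            // Only read ahead when the current byte is 'E'.
            if (content[i] == 'E' && matchesAt(content, i)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] content, int offset) {
        for (int j = 0; j < END_MARKER.length; j++) {
            if (content[offset + j] != END_MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "EI" follows the image data directly, with no whitespace in between.
        byte[] content = "BI\nID\nxxExxEI\nQ\n".getBytes(StandardCharsets.ISO_8859_1);
        int dataStart = 6; // first byte after "ID\n"
        int end = findEndOfImage(content, dataStart);
        System.out.println("image data length: " + (end - dataStart));
    }
}
```

    The lone 'E' inside the data is skipped because the bytes after it do not complete the marker; only the final "EI\nQ\n" ends the image.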

    The code says "spec is not clear..", so could you add support for parsing such PDFs to jPod?

    Thanks,
    Stefan.

     
  • mtraut
    2011-02-22

    This is a dark corner of the spec (imho). We tried a lot of workarounds and heuristics and managed to stay alive with the current version and our set of test documents.

    Now that we have another one, we will see whether we can add another hint to the parser. Your workaround certainly can't be applied in general, so we would have to find another clue that won't break existing documents (since there is no fixed syntactic "end of image" indicator, any sequence like "EI\nQ\n" may just as well be valid image data).
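
    To illustrate the concern, here is a small sketch with made-up data in which the image bytes themselves contain "EI\nQ\n". The class FalseTerminatorDemo and the method indexOfMarker are hypothetical and only mimic a plain search for that sequence; they are not jPod code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not part of jPod.
class FalseTerminatorDemo {
    // Plain search for the byte sequence "EI\nQ\n", starting at 'from'.
    static int indexOfMarker(byte[] content, int from) {
        byte[] marker = "EI\nQ\n".getBytes(StandardCharsets.ISO_8859_1);
        outer:
        for (int i = from; i + marker.length <= content.length; i++) {
            for (int j = 0; j < marker.length; j++) {
                if (content[i + j] != marker[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up stream: the image data is "abEI\nQ\ncd", so the marker
        // bytes also occur inside the data, before the real terminator.
        byte[] content = "ID\nabEI\nQ\ncdEI\nQ\n".getBytes(StandardCharsets.ISO_8859_1);
        int dataStart = 3;  // first byte after "ID\n"
        int realEnd = 12;   // where the image actually ends in this example
        int found = indexOfMarker(content, dataStart);
        System.out.println("search stops at " + found + ", real end is " + realEnd);
    }
}
```

    In this example the search stops inside the image data, so the image would be cut short and the remaining bytes would be misread as content stream operators.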