#29 System.Xml.XmlException when attempting to parse v1.2 PDF

Group: 0.1.2.1
Status: wont-fix
Labels: XMP (1)
Priority: 5
Updated: 2015-04-28
Created: 2012-02-21
Creator: JazzyFizzles
Private: No

PDF Clown v0.1.1.0.
Document:
PDF Version: v1.2 (Acrobat 3.x).
PDF Producer: Acrobat Distiller 5.0.2 for Macintosh.
When attempting to access this property:
pdfDoc.Metadata.Content
It reports the following:
'pdfDoc.Metadata.Content' threw an exception of type 'System.Xml.XmlException'
Message: "'dc' is an undeclared namespace. Line 4, position 2."

I am unable to upload the PDF document because it is 3 MB and the SourceForge limit is 256 KB; I can email it directly on request or upload it to a file-sharing site.
Another PDF document with the same PDF version, but created with a different 'producer', is parsed without error.

Discussion

  • Apparently your XMP serialization is invalid: it fails to bind the "dc" namespace prefix (typically associated with Dublin Core metadata) to a namespace declaration. A valid packet declares the prefix like this:

    <x:xmpmeta
      xmlns:x="adobe:ns:meta/"
      x:xmptk="XMP Core 5.4.0"
      >
      <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        >
    . . .
    

    If that's the case, the problem lies with the file producer. To work around this parsing issue, you have to get the metadata stream programmatically and read its contents with a more lenient parser:

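    import org.pdfclown.bytes.IBuffer;
    import org.pdfclown.objects.PdfName;
    import org.pdfclown.objects.PdfStream;
    
    // Resolve the raw metadata stream from the document catalog.
    PdfStream metadataStream =
      (PdfStream)document.getBaseDataObject().resolve(PdfName.Metadata);
    // Get the stream body, i.e. the unparsed XMP packet.
    IBuffer contentBody = metadataStream.getBody();
    . . . // Read the buffer using your parser.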
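
As one possible reading of "a more lenient parser", the sketch below uses the JDK's DOM parser with namespace awareness switched off, so that an unbound prefix such as "dc" is treated as part of the element name instead of being rejected. It is only an illustration under stated assumptions: the class name `LenientXmpReader`, its helper methods and the `SAMPLE` packet are hypothetical, and the sample stands in for the bytes you would obtain from `contentBody` (how to convert the `IBuffer` to a byte array is not shown here; check the PDF Clown API).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of PDF Clown.
final class LenientXmpReader {
    // Hypothetical XMP fragment that uses "dc" without declaring xmlns:dc.
    static final String SAMPLE =
        "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
        + "  <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
        + "    <rdf:Description rdf:about=\"\">\n"
        + "      <dc:title>Sample title</dc:title>\n"
        + "    </rdf:Description>\n"
        + "  </rdf:RDF>\n"
        + "</x:xmpmeta>\n";

    // Parses the XMP bytes; with namespaceAware == false, prefixes are not resolved.
    static Document parse(byte[] xmp, boolean namespaceAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new ByteArrayInputStream(xmp));
    }

    // Returns the trimmed text of the first element with the given raw tag name, or null.
    static String firstText(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        byte[] xmp = SAMPLE.getBytes(StandardCharsets.UTF_8);
        try {
            parse(xmp, true); // strict: fails on the unbound "dc" prefix
        } catch (SAXException e) {
            System.out.println("Strict parse failed: " + e.getMessage());
        }
        Document doc = parse(xmp, false); // lenient: prefix left unresolved
        System.out.println("dc:title = " + firstText(doc, "dc:title"));
    }
}
```

Because the lenient parse does not resolve namespaces, elements have to be looked up by their raw prefixed name ("dc:title"), which only works as long as the producer uses the conventional prefix.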
    
     
    • labels: --> XMP
    • status: open --> wont-fix
    • assigned_to: Stefano Chizzolini
    • Group: --> 0.1.2.1