#34 KeyNotFoundException using TextExtractor

0.1.2.1
closed-out-of-date
None
3
2015-04-17
2012-04-16
No

I use PDFClown to extract plain text from some PDF documents that I'm not allowed to share, unfortunately.
I hope the following information is enough to identify and fix the problem, though.
AssemblyVersion: 0.1.1
This is my high-level code:

StringBuilder builder=new StringBuilder();
using (Stream input=new Stream(ioStream)) {
    using (File inputFile=new File(input)) {
        TextExtractor extractor=new TextExtractor();
        foreach (var page in inputFile.Document.Pages) {
            builder.AppendLine(TextExtractor.ToString(extractor.Extract(page)));
        }
    }
}

The following is the stack trace of the exception:

System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
at org.pdfclown.documents.contents.fonts.SimpleFont.OnLoad()
at org.pdfclown.documents.contents.fonts.Font.Load()
at org.pdfclown.documents.contents.fonts.Font..ctor(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.fonts.SimpleFont..ctor(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.fonts.TrueTypeFont..ctor(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.fonts.Font.Wrap(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.FontResources.Wrap(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.ResourceItems`1.get_Item(PdfName key)
at org.pdfclown.documents.contents.objects.SetFont.GetResource(IContentContext context)
at org.pdfclown.documents.contents.objects.SetFont.GetFont(IContentContext context)
at org.pdfclown.documents.contents.objects.SetFont.Scan(GraphicsState state)
at org.pdfclown.documents.contents.ContentScanner.MoveNext()
at org.pdfclown.documents.contents.ContentScanner.TextWrapper.Extract(ContentScanner level)
at org.pdfclown.documents.contents.ContentScanner.TextWrapper..ctor(ContentScanner scanner)
at org.pdfclown.documents.contents.ContentScanner.GraphicsObjectWrapper.Get(ContentScanner scanner)
at org.pdfclown.documents.contents.ContentScanner.get_CurrentWrapper()
at org.pdfclown.tools.TextExtractor.Extract(ContentScanner level, IList`1 extractedTextStrings)
at org.pdfclown.tools.TextExtractor.Extract(ContentScanner level, IList`1 extractedTextStrings)
at org.pdfclown.tools.TextExtractor.Extract(IContentContext contentContext)

I'm sorry that I cannot provide a sample PDF.
My current work-around in SimpleFont.OnLoad() looks like this, but I really don't know how correct that solution is:

if (glyphWidth > 0) {
    int code;
    if (codes.TryGetValue(charCode, out code)) {
        int idx;
        if (glyphIndexes.TryGetValue(code, out idx)) {
            glyphWidths[idx]=glyphWidth;
        }
    }
}

Discussion

    • status: open --> pending
     
  • To solve this issue properly, the actual cause of the missing glyph index has to be examined; unfortunately, that requires the source document.

     

  • Anonymous
    2012-05-08

    I have come across a bug similar to a previous bug posting. However, it seems that the original poster did not provide a sample PDF file.

    I have a sample PDF file and CLI output that will hopefully help you fix the bug. Both are linked below. Let me know if the links don't work.

    Also, I am creating a content-tweaking application using PDF Clown. My application closely follows the "object" model of the BasicTextExtraction sample because I am parsing at the ContentObject level. In my experience, after the KeyNotFoundException is thrown, an IndexOutOfRange exception is thrown as well. If you try some fixes, I would be happy to test them on my PDFs.

    pdf file: http://dl.dropbox.com/u/370470/Pages%20from%20Iraqs_WMD_Vol1.pdf
    cli output: http://dl.dropbox.com/u/370470/pdfclownCLI%20output.txt

     
    • status: pending --> open
     
    • assigned_to: nobody --> stechio
     
  • This issue has been fixed since version 0.1.2.1 (see branch 0.1.2-Fix).

     
    • status: open --> closed-out-of-date
    • Group: --> 0.1.2.1
    • Priority: 5 --> 3