I use PDFClown to extract plain text from some PDF documents that I'm not allowed to share, unfortunately.
I hope the following information is enough to identify and fix the problem, though.
AssemblyVersion: 0.1.1
This is my high-level code:
    StringBuilder builder = new StringBuilder();
    using (Stream input = new Stream(ioStream))
    {
        using (File inputFile = new File(input))
        {
            TextExtractor extractor = new TextExtractor();
            foreach (var page in inputFile.Document.Pages)
            {
                builder.AppendLine(TextExtractor.ToString(extractor.Extract(page)));
            }
        }
    }
The following is the stack trace of the exception:
System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
at org.pdfclown.documents.contents.fonts.SimpleFont.OnLoad()
at org.pdfclown.documents.contents.fonts.Font.Load()
at org.pdfclown.documents.contents.fonts.Font..ctor(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.fonts.SimpleFont..ctor(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.fonts.TrueTypeFont..ctor(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.fonts.Font.Wrap(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.FontResources.Wrap(PdfDirectObject baseObject)
at org.pdfclown.documents.contents.ResourceItems`1.get_Item(PdfName key)
at org.pdfclown.documents.contents.objects.SetFont.GetResource(IContentContext context)
at org.pdfclown.documents.contents.objects.SetFont.GetFont(IContentContext context)
at org.pdfclown.documents.contents.objects.SetFont.Scan(GraphicsState state)
at org.pdfclown.documents.contents.ContentScanner.MoveNext()
at org.pdfclown.documents.contents.ContentScanner.TextWrapper.Extract(ContentScanner level)
at org.pdfclown.documents.contents.ContentScanner.TextWrapper..ctor(ContentScanner scanner)
at org.pdfclown.documents.contents.ContentScanner.GraphicsObjectWrapper.Get(ContentScanner scanner)
at org.pdfclown.documents.contents.ContentScanner.get_CurrentWrapper()
at org.pdfclown.tools.TextExtractor.Extract(ContentScanner level, IList`1 extractedTextStrings)
at org.pdfclown.tools.TextExtractor.Extract(ContentScanner level, IList`1 extractedTextStrings)
at org.pdfclown.tools.TextExtractor.Extract(IContentContext contentContext)
I'm sorry that I cannot provide a sample PDF.
My current workaround in SimpleFont.OnLoad() looks like this, but I really don't know whether it is correct:
    if (glyphWidth > 0)
    {
        int code;
        if (codes.TryGetValue(charCode, out code))
        {
            int idx;
            if (glyphIndexes.TryGetValue(code, out idx))
            {
                glyphWidths[idx] = glyphWidth;
            }
        }
    }
To solve your issue properly, there is no alternative to examining the actual cause of the missing glyph index; unfortunately, that requires the source document.
I have come across a bug similar to the one in a previous bug posting. However, it seems the original poster did not provide a sample PDF file.
I have a sample PDF file and CLI output that will hopefully help you fix the bug. Both are linked below. Let me know if the links don't work.
Also, I am creating a content-tweaking application using PDF Clown. My application closely follows the "object" model of the BasicTextExtraction sample because I am parsing at the ContentObject level. In my experience, after the KeyNotFoundException is thrown, an IndexOutOfRange exception is thrown. If you try some fixes, I would be happy to test them on my PDFs.
pdf file: http://dl.dropbox.com/u/370470/Pages%20from%20Iraqs_WMD_Vol1.pdf
cli output: http://dl.dropbox.com/u/370470/pdfclownCLI%20output.txt
This issue has been fixed since version 0.1.2.1 (see branch 0.1.2-Fix).