I'm trying to read text from the PDF samples included in the last release. I'm very glad to be able to do this with just a couple of lines of code, but not everything is read correctly. It might be something wrong with my installation: I'm on Windows and have installed the latest patch. So what is wrong? I get "?" instead of some characters. For instance, if I run the text extraction samples on abbott.pdf I get the following lines:
?. … S~ 2. Sf5+, ~ 3. S/B ╪
?. … fxe3 2. Qf6+, Ke4 3. Bg2╪
?. … Ke5 2. Sc6+, Kf6 3. Qxf4╪
44. ?. Sg7, e4 2. Rxe3, Kxe3 3. Sf5╪
One more thing: I get lines like these instead of images.
Also I'm getting some errors on other documents:
An exception happened while running the sample:
System.NotImplementedException: Embedded CFF font file.
oding() in D:\work\PDFClown\DotNET\app\clown\src\it\stefanoc
() in D:\work\PDFClown\DotNET\app\clown\src\it\stefanochizzo
The stack is much longer, but I don't want to clutter this post. That error makes me think that I'm missing something.
So I want to know: is the problem with images treated as text and misread characters only on my side?
Sorry if this was discussed previously.
Despite those strange symbols, what you are reading as extracted from abbott.pdf is perfectly legal and conforms to the actual encoding built into that file. If you don't believe me, there's a simple way to check whether the output is right: open abbott.pdf with Acrobat Reader, select the text you want to verify, then copy and paste it into a Unicode-compliant text editor. You should see the same symbols as above.
Why is there such gibberish?
Before explaining this magical mystery, I have to briefly describe the mechanism involved in text extraction. PDF files are primarily containers of graphics entities; even text characters are treated as such (they are purposely dubbed "glyphs"). When you try to extract text from a page, those glyph codes (which are encoded through an arbitrary internal map) are converted to Unicode-compatible codes through a dedicated glyph-to-Unicode internal map. As this map is optional, it may have been omitted or improperly built by the file producer: in either case, you may see unintelligible symbols!
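To make the mechanism concrete, here is a minimal sketch of the glyph-to-Unicode lookup in Python. It is only an illustration under stated assumptions: the glyph codes, the `to_unicode` table and the `decode_glyphs` helper are all made up for this post and are not PDF Clown's API or the actual contents of abbott.pdf.

```python
# Hypothetical illustration only: not PDF Clown code, not real data from abbott.pdf.
PLACEHOLDER = "?"  # emitted when a glyph code has no Unicode mapping


def decode_glyphs(glyph_codes, to_unicode):
    """Translate font-internal glyph codes into text via a glyph-to-Unicode map."""
    return "".join(to_unicode.get(code, PLACEHOLDER) for code in glyph_codes)


# Made-up internal codes for one font. The (imaginary) file producer
# forgot to map code 1, which is the glyph drawn as "1" on the page.
to_unicode = {2: ".", 3: " ", 4: "S", 5: "g", 6: "7"}

# Codes for the glyphs that the page renders as "1. Sg7".
glyph_codes = [1, 2, 3, 4, 5, 6]

extracted = decode_glyphs(glyph_codes, to_unicode)
print(extracted)  # prints "?. Sg7"
```

The page still looks right on screen because rendering only needs the glyph shapes; the Unicode map matters only when text is extracted or copied.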
Let's start by considering your first chunk:
?. ... S~ 2. Sf5+, ~ 3. S/B ╪
?. ... fxe3 2. Qf6+, Ke4 3. Bg2╪
?. ... Ke5 2. Sc6+, Kf6 3. Qxf4╪
44. ?. Sg7, e4 2. Rxe3, Kxe3 3. Sf5╪
The symbol '?' is a placeholder for an unmapped/mismapped character (in this case '1'): evidently, the generator of this file created a "damaged" Unicode mapping.
Second problem: the lines you get instead of images. These lines are actually character codes corresponding to the glyphs of a symbolic font used to graphically represent a chessboard along with its pieces! Again, if you select some of the chessboard square "images" within Acrobat Reader and copy-paste them into a text editor, you'll see the same gibberish. It's a visual trick adopted by the generator of the file. So, the second magical mystery is unveiled… :-)
Third problem: "Embedded CFF font file" NotImplementedException. There's an item in the ISSUES list (see ISSUES.html in the root folder of the 0.0.8 distribution) which describes a limitation of the 0.0.8 release: "Text extraction: embedded CFF font file format hasn't been supported yet.". This means that Type1/CFF fonts embedded in a PDF file cannot be parsed by 0.0.8, as that format is currently not handled (I did not have enough time to include its implementation). It will be supported by 0.1.1 (i.e., after the next release, 0.1.0).
Thank you Stefano, that was comprehensive.
My brain just resists the idea that some pieces of information are lost. I know PDF is a lossy format, but I didn't expect that it could lose some text.
I have a lot more questions, but I think I should study the PDF spec first.
Thanks a lot.