text extraction

Help
toshy kava
2010-12-28
2013-01-26
  • toshy kava
    2010-12-28

    Hi Stefano,

    I'm trying to read text from the PDF samples included in the last release. I'm very glad to do this with just a couple of lines of code, but not everything is read correctly. It might be something wrong with my installation; I'm on Windows and have installed the last patch. So what is wrong? I get "?" instead of some characters. For instance, if I run the text extraction samples on abbott.pdf I get the following lines:
    ?. … S~ 2. Sf5+, ~ 3. S/B ╪
    ?. … fxe3 2. Qf6+, Ke4 3. Bg2╪
    ?. … Ke5 2. Sc6+, Kf6 3. Qxf4╪
    44. ?. Sg7, e4 2. Rxe3, Kxe3 3. Sf5╪

    One more thing: I get lines like these instead of images:
    {WDNDWDWD}
    {DWDWDWDp}
    {WDW$pDpD}
    {GW0WibDW}
    {WDWDBDW!}
    {DWDPDWDP}
    {WDWDWDWg}
    {IWDnDRDW}

    Also I'm getting some errors on other documents:

    An exception happened while running the sample:
    System.NotImplementedException: Embedded CFF font file.
       at it.stefanochizzolini.clown.documents.contents.fonts.Type1Font.GetNativeEncoding() in D:\work\PDFClown\DotNET\app\clown\src\it\stefanochizzolini\clown\documents\contents\fonts\Type1Font.cs:line 94
       at it.stefanochizzolini.clown.documents.contents.fonts.Type1Font.LoadEncoding() in D:\work\PDFClown\DotNET\app\clown\src\it\stefanochizzolini\clown\documents\contents\fonts\Type1Font.cs:line 123

    The stack is much longer; I don't want to clutter this post. That error makes me think that I'm missing something.

    So I want to know: is the problem with images treated as text and misread characters only on my side?

    Sorry if this was discussed previously.

    Thanks.

     
  • Hi Toshy,

    despite those strange symbols, what you are reading as extracted from abbott.pdf is perfectly legal and conforming with the actual encoding built into that file. If you don't believe me, there's a simple way to check whether the output is right: open abbott.pdf with Acrobat Reader, select the text you want to verify, copy and paste it into a Unicode-compliant text editor - you should see the same symbols as above.

    Why is there such gibberish?
    Before explaining this magical mystery, I have to briefly describe the mechanism involved in text extraction: PDF files are primarily containers of graphics entities, and even text characters are treated as such (they are purposely dubbed "glyphs"). When you try to extract text from a page, those glyph codes (which are encoded through an arbitrary internal map) are converted to Unicode-compatible codes through a dedicated glyph-to-Unicode internal map. As such a map is optional, it may have been omitted or improperly built by the file producer: in either case, you may see unintelligible symbols!
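
    As a rough illustration of the lookup just described, here is a minimal Python sketch. It is not PDF Clown's actual code: the `TO_UNICODE` table, its glyph codes and the `decode_glyphs` helper are all made up for this example. It only shows how a missing entry in the map ends up as '?' in the output:

```python
# Hypothetical glyph-code-to-Unicode table, standing in for the optional
# map a PDF font may carry. The codes are invented; note that no entry
# exists for code 0x11, as if the file producer had left it out.
TO_UNICODE = {
    0x12: "2",
    0x13: "3",
    0x2E: ".",
    0x20: " ",
    0x53: "S",
}

PLACEHOLDER = "?"  # emitted when a glyph code has no Unicode mapping


def decode_glyphs(codes, to_unicode=TO_UNICODE):
    """Translate a sequence of glyph codes into text, one code at a time."""
    return "".join(to_unicode.get(code, PLACEHOLDER) for code in codes)


if __name__ == "__main__":
    # 0x11 is meant to be the glyph for '1', but it is absent from the table.
    print(decode_glyphs([0x11, 0x2E, 0x20, 0x53]))  # prints "?. S"
```

    The glyph is still drawn correctly on the page, because rendering uses the glyph code directly; only the translation to text is affected.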

    Let's start considering your first chunk:

    ?. ... S~ 2. Sf5+, ~ 3. S/B ╪
    ?. ... fxe3 2. Qf6+, Ke4 3. Bg2╪
    ?. ... Ke5 2. Sc6+, Kf6 3. Qxf4╪ 
    44. ?. Sg7, e4 2. Rxe3, Kxe3 3. Sf5╪
    

    The symbol '?' is a placeholder for an unmapped/mismapped character (in this case '1'): evidently, the generator of this file created a "damaged" Unicode mapping.
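
    If the placeholder always stands for the same character, as seems to be the case here, the extracted text could be patched after the fact. A small Python sketch of that idea follows; the `restore_move_number` helper is made up for illustration, and the rule is only a guess that fits these lines, not something PDF Clown does:

```python
import re

# A standalone "?." token, i.e. one at the start of the line or preceded
# by whitespace, is assumed to be a damaged "1." move number.
DAMAGED_MOVE = re.compile(r"(?<!\S)\?\.")


def restore_move_number(line):
    """Replace every standalone '?.' with '1.' and leave the rest alone."""
    return DAMAGED_MOVE.sub("1.", line)


if __name__ == "__main__":
    # Excerpts of the extracted lines quoted above.
    for line in ("?. ... Ke5 2. Sc6+, Kf6",
                 "44. ?. Sg7, e4 2. Rxe3"):
        print(restore_move_number(line))
```

    A fix like this only works when you already know what the missing character should be; it cannot recover information that the file itself does not contain.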

    Second chunk:

    {WDNDWDWD}
    {DWDWDWDp}
    {WDW$pDpD}
    {GW0WibDW}
    {WDWDBDW!}
    {DWDPDWDP}
    {WDWDWDWg}
    {IWDnDRDW}
    

    These lines are actually character codes corresponding to the glyphs of a symbolic font used to graphically represent a chessboard along with its pieces! Again, if you select some of the chessboard square "images" within Acrobat Reader and copy-paste them into a text editor, you'll see the same gibberish. It's a visual trick adopted by the generator of the file. So, the second magical mystery is unveiled… :-)
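
    To see that those eight lines really have the shape of a board, one can simply measure them. A quick Python sketch (the `parse_board` helper is made up for illustration; it makes no attempt to say which code stands for which piece, since that depends on the font):

```python
from collections import Counter

# The eight lines exactly as they came out of the text extraction.
EXTRACTED = """\
{WDNDWDWD}
{DWDWDWDp}
{WDW$pDpD}
{GW0WibDW}
{WDWDBDW!}
{DWDPDWDP}
{WDWDWDWg}
{IWDnDRDW}
"""


def parse_board(text):
    """Drop the surrounding braces and return the rows of square codes."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            rows.append(line[1:-1])
    return rows


if __name__ == "__main__":
    board = parse_board(EXTRACTED)
    # Number of rows, and the set of row lengths found.
    print(len(board), {len(row) for row in board})
    # The two most frequent codes are presumably the empty squares.
    print(Counter("".join(board)).most_common(2))
```

    Eight rows of eight codes each is what you would expect from one glyph per square, with the braces presumably drawing the board's border.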

    Third problem: the "Embedded CFF font file" NotImplementedException. There's an item in the ISSUES list (see ISSUES.html in the root folder of the 0.0.8 distribution) which describes a limitation of the 0.0.8 release: "Text extraction: embedded CFF font file format hasn't been supported yet.". This means that Type1/CFF fonts embedded in a PDF file cannot be parsed by 0.0.8, as that format is currently not handled (I didn't have enough time to include its implementation). It will be supported by 0.1.1 (i.e., the release after the next one, 0.1.0).

    That's it!
    Stefano *<:o)

    http://pdfclown.wordpress.com/2010/09/23/waiting-for-pdf-clown-0-1-release/

     
  • toshy kava
    2010-12-29

    Thank you Stefano, that was comprehensive.

    My brain just resists the idea that some pieces of information are lost. I know PDF is a lossy format, but I didn't expect it could lose text.

    I have a lot more questions, but I think I should study the PDF spec first.

    Thanks a lot.