#64 Ignore formatting(?) characters in PDFs

Simon Fe

Some PDFs can't be searched correctly because they appear to contain formatting characters inside words. DocFetcher seems to interpret these as spaces rather than ignoring them completely (as, say, Google does).

For example, I've attached a page from a scientific paper that I found on the web using Google (from http://cg.informatik.uni-freiburg.de/intern/seminar/87reevesShadowsWithDepthMaps.pdf). If you try searching for, say, 'shadow map', DocFetcher returns nothing, but it will give you a result if you try "s h a d o w" "m a p". This is somewhat tedious.

Any chance someone could get it to ignore (what I assume are) formatting characters?
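
To make the request concrete, here is a minimal sketch of what "ignoring formatting characters" could look like as a post-processing step on extracted text. It assumes the culprits are invisible Unicode format characters (general category Cf, e.g. the soft hyphen U+00AD or the zero-width space U+200B), which is only a guess. If the extractor is inserting real spaces because the PDF positions each glyph separately, this will not help. The class name FormatCharStripper is hypothetical and is not part of DocFetcher or PDFBox.

```java
// Hypothetical helper for illustration; not part of DocFetcher or PDFBox.
final class FormatCharStripper {
    private FormatCharStripper() {}

    /**
     * Returns the text with all Unicode format characters (category Cf)
     * removed, so that "sha<U+200B>dow" becomes "shadow".
     * Ordinary spaces and all visible characters are left untouched.
     */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Character.FORMAT corresponds to Unicode general category Cf.
            if (Character.getType(cp) == Character.FORMAT) {
                continue;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }
}
```

Such a filter would have to run on the extracted text before it is tokenized for the index. A filter applied only to the search query would not change what is already indexed.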


  • Simon Fe

    Simon Fe - 2012-04-30

    section of a PDF that fails to search correctly with DocFetcher

  • Nam-Quang Tran

    Nam-Quang Tran - 2012-04-30


    There's really nothing I can do about this. DocFetcher's PDF support is based on a library called PDFBox (http://pdfbox.apache.org/), and whatever this library can or can't do is the same for DocFetcher.

    Best regards
    q:-) <= Quang

  • Simon Fe

    Simon Fe - 2012-05-01

    Hi Quang,
    Thanks for the info. I realised I was using an old version (1.0.3) and just downloaded 1.1.beta6, which helped somewhat, but it still gets quite a lot of strange spacing.

    I then went to the PDFBox site, downloaded 1.6.0 and tried the "ExtractText" function.
    Although it wasn't absolutely perfect - there were a couple of odd spaces here and there - it was still much better than what is extracted from within DocFetcher (1.1.b6). Is DocFetcher currently using an older version of PDFBox?


    PS: In version 1.0.3 I managed to find a way to make the search box a reasonable size by setting "SearchBoxMaxWidth=800" in the user.properties file. Is there an equivalent facility in 1.1? The default size is just impractically small. :-(

  • Nam-Quang Tran

    Nam-Quang Tran - 2012-05-04

    DocFetcher 1.1 beta 6 uses PDFBox 1.6.0. You can see this by the version number of the pdfbox*.jar file in DocFetcher's lib folder. I'm not sure why DocFetcher's text extraction results are worse than what is returned by the ExtractText method. I'll take a closer look at that when I have time (which I don't, at the moment).

    In DocFetcher 1.1 beta 6, the user.properties file is gone and was replaced by a program.conf file, which is inside the conf folder.
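
The version check described above (reading the version number off the pdfbox*.jar file name in the lib folder) can be sketched in a few lines. This is only an illustration: the class PdfBoxJarVersion is hypothetical, and the file name pattern assumes names such as pdfbox-1.6.0.jar or pdfbox-app-1.6.0.jar, which may not match every DocFetcher release.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of DocFetcher.
final class PdfBoxJarVersion {
    // Assumed naming scheme: "pdfbox", optional "-word" parts, "-<version>.jar".
    private static final Pattern JAR_NAME =
            Pattern.compile("pdfbox(?:-[a-z]+)*-(\\d+(?:\\.\\d+)*)\\.jar");

    private PdfBoxJarVersion() {}

    /** Returns the version embedded in a jar file name, or null if it does not match. */
    static String fromFileName(String fileName) {
        Matcher m = JAR_NAME.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    /** Scans a folder and returns the first PDFBox version found, or null. */
    static String findInFolder(File libDir) {
        File[] files = libDir.listFiles();
        if (files == null) {
            return null; // not a directory, or not readable
        }
        for (File f : files) {
            String version = fromFileName(f.getName());
            if (version != null) {
                return version;
            }
        }
        return null;
    }
}
```

Calling PdfBoxJarVersion.findInFolder(new File("lib")) from DocFetcher's installation directory should then report the bundled version, assuming the jar follows the naming scheme above.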

  • Simon Fe

    Simon Fe - 2012-05-08

    Thanks. FWIW I've also used "DocSearcher". That too, apparently, uses PDFBox but doesn't seem to have the problem. (I was using it originally, but the incremental/rebuild index seems to be broken, at least for networked files.)

