#64 Ignore formatting(?) characters in PDFs

open
nobody
None
5
2012-09-17
2012-04-30
Simon Fe
No

Some PDFs cannot be searched correctly because they appear to contain formatting characters inside words. DocFetcher seems to interpret these as spaces rather than ignoring them completely (as, say, Google does).

For example, I've attached a page from a scientific paper that I found on the web using Google (from http://cg.informatik.uni-freiburg.de/intern/seminar/87reevesShadowsWithDepthMaps.pdf). If you search for, say, 'shadow map', DocFetcher returns nothing, but it does give you a result if you try "s h a d o w" "m a p". This is somewhat tedious.

Any chance someone could get it to ignore (what I assume are) formatting characters?
Thanks
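
The symptom described above (the letters of a word coming out of text extraction separated by spaces) can be illustrated with a small post-processing sketch. This is only a heuristic under stated assumptions, not how DocFetcher or PDFBox work: the function name `collapse_letter_spacing` is hypothetical, and the sketch assumes the spurious separators arrive as single ordinary spaces while genuine word boundaries are wider.

```python
import re

# A run of at least three single letters, each separated by exactly one
# space, e.g. "s h a d o w". Requiring three letters avoids gluing together
# ordinary pairs of one-letter tokens such as "x y".
_SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] ){2,}[^\W\d_](?!\S)")


def collapse_letter_spacing(text: str) -> str:
    """Hypothetical helper: rejoin letter-spaced words in extracted text."""
    # Remove the single spaces inside each matched run; everything else,
    # including wider gaps between runs, is left untouched.
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    # Assumes two spaces mark the real word boundary.
    print(collapse_letter_spacing("s h a d o w  m a p"))
```

If the extracted text uses single spaces throughout, the same call turns "s h a d o w m a p" into "shadowmap", so the word boundary is lost. A robust fix would probably need the character positions from the PDF rather than the extracted string alone.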

Discussion

  • Simon Fe
    2012-04-30

    section of a PDF that fails to search correctly with DocFetcher

     
  • Nam-Quang Tran
    2012-04-30

    Hi,

    There's really nothing I can do about this. DocFetcher's PDF support is based on a library called PDFBox (http://pdfbox.apache.org/), and whatever this library can or can't do is the same for DocFetcher.

    Best regards
    q:-) <= Quang

     
  • Simon Fe
    2012-05-01

    Hi Quang,
    Thanks for the info. I realised I was using an old version (1.0.3) and just downloaded 1.1.beta6, which helped somewhat, but it still gets quite a lot of strange spacing.

    I then went to the PDFBox site, downloaded 1.6.0 and tried the "ExtractText" function.
    Although it wasn't absolutely perfect - there were a couple of odd spaces here and there - it was still much better than what is extracted from within DocFetcher (1.1.b6). Is DocFetcher currently using an older version of PDFBox?

    Thanks
    Simon

    PS: In version 1.0.3 I managed to find a way to make the search box a reasonable size by setting "SearchBoxMaxWidth=800" in the user.properties file. Is there an equivalent facility in 1.1? The default size is just impractically small. :-(

     
  • Nam-Quang Tran
    2012-05-04

    DocFetcher 1.1 beta 6 uses PDFBox 1.6.0. You can see this by the version number of the pdfbox*.jar file in DocFetcher's lib folder. I'm not sure why DocFetcher's text extraction results are worse than what is returned by the ExtractText method. I'll take a closer look at that when I have time (which I don't, at the moment).
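
The version check described above (reading the version number off the pdfbox*.jar file name in the lib folder) could be scripted roughly as follows. This is a minimal sketch: the function names are hypothetical, and the file-name scheme ("pdfbox-1.6.0.jar" or "pdfbox-app-1.6.0.jar") is an assumption rather than something confirmed in this thread.

```python
import re
from pathlib import Path

# Assumed naming scheme: "pdfbox-<version>.jar", optionally with extra
# name parts before the version, e.g. "pdfbox-app-1.6.0.jar".
_JAR_VERSION = re.compile(r"pdfbox.*?-(\d+(?:\.\d+)*)\.jar$")


def version_from_jar_name(name):
    """Return the version embedded in a PDFBox jar file name, or None."""
    m = _JAR_VERSION.match(name)
    return m.group(1) if m else None


def pdfbox_versions(lib_dir):
    """Map each pdfbox*.jar found in lib_dir to its parsed version."""
    result = {}
    for jar in sorted(Path(lib_dir).glob("pdfbox*.jar")):
        result[jar.name] = version_from_jar_name(jar.name)
    return result
```

Calling `pdfbox_versions` with the path to DocFetcher's lib folder (wherever that is on your system) returns a dictionary of jar names and versions. A value of None means the file name did not follow the assumed scheme.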

    In DocFetcher 1.1 beta 6, the user.properties file is gone and was replaced by a program.conf file, which is inside the conf folder.

     
  • Simon Fe
    2012-05-08

    Thanks. FWIW I've also used "DocSearcher". That too, apparently, uses PDFBox but doesn't seem to have the problem. (I was using it originally, but the incremental/rebuild index seems to be broken, at least for networked files.)