

Taking a long time parsing PDF

baba singh
  • baba singh


    I noticed that when a PDF is very large, OSS takes a long time to move on to
    the next content. I reduced the size of the PDF input stream that is read, but
    it seems that OSS still downloads the whole PDF somewhere before moving to the
    next link. The parser has already finished parsing, yet crawling of the next
    link does not start.

    Is there a solution? Maybe I am missing something.



  • I suppose you have already tried changing the "sizeLimit" value in the
    "parser.xml" file (and restarting OSS):

    <parser name="PDF parser" class="com.jaeksoft.searchlib.parser.PdfParser" sizeLimit="8388608">
  • baba singh

    Yes, I did, but I presume it still downloads the whole document. I set the
    limit to a mere 100000. It completes the parsing, but I think it does not crawl
    the next link until the complete download has finished.

    I would appreciate a hint on what I can do.
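
For illustration only, here is a minimal Java sketch of what a size-limited read looks like at the stream level. The names `LimitedRead` and `readUpTo` are hypothetical and are not part of OSS, and the sketch makes no claim about how OSS or its HTTP client actually handles the connection. It only shows that stopping at a byte limit leaves the rest of the stream unread. If the whole document is still being fetched, one thing that may be worth checking is whether the code that closes the HTTP response reads the remaining bytes, since some HTTP client libraries do that so the connection can be reused.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class LimitedRead {

    // Hypothetical helper: reads at most `limit` bytes from `in` and returns them.
    // The remainder of the stream is left unread; nothing is drained here.
    public static byte[] readUpTo(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = limit;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // stream ended before the limit was reached
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large PDF body: 1,000,000 zero bytes held in memory.
        InputStream body = new ByteArrayInputStream(new byte[1_000_000]);
        byte[] head = readUpTo(body, 100_000);
        System.out.println(head.length + " bytes read, "
                + body.available() + " left unread");
    }
}
```

With an in-memory stream the unread bytes cost nothing. With a network stream they are only skipped if the connection is aborted rather than drained when it is closed, which is the assumption to verify in the crawler.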