Associate HTML text description to PDF parser

Help
baba singh
2010-07-23
2012-09-13
  • baba singh

    baba singh - 2010-07-23

    Hi,

    I am new to OSS. I am trying to use and update the server. This is what I
    would like to do.

    The website to be crawled contains some PDF files, each with a description.
    I would like to be able to search on the description and get the
    corresponding PDF document. I can search on the content and title of the
    PDF. I can also search on the description in the web page (HTMLParser), but
    I need to somehow link them.

    How can this be done?

    regards,

    bbs

     
  • Emmanuel Keller

    Emmanuel Keller - 2010-07-28

    Hi

    The default memory parameter (256MB) is usually too small to handle large
    documents efficiently. If you are using the 1.2 version, you can see the
    memory usage in the /system/runtime panel.

    I suggest adding this line to the start.sh script and restarting OSS:

     export CATALINA_OPTS="-d64 -Xms1024m -Xmx1024m -server"
    

    You can replace 1024 with a larger value if needed.

    Regards,

    EK.

     
  • baba singh

    baba singh - 2010-07-28

    Thank you, but I think the problem is different (I may be wrong).

    I think it is downloading the whole document even if I reduce the sizelimit.
    Could that be the problem? If so, how would I tell OSS to limit the
    download to the specified sizelimit?

    regards,

    bbs

     
  • baba singh

    baba singh - 2010-07-28

    Ignore my last reply, it was meant for another post :)

    Maybe you did not understand my problem. I have to crawl HTML pages that
    contain text and links to PDF files. The text in the HTML file describes the
    PDF file. Currently, when I run OSS, it starts HTMLParser and PDFParser. I
    can search on the fields indexed by HTMLParser (HTML body, title, etc.) and
    by PDFParser (content of the PDF doc) separately, but I cannot do a combined
    search. I would like to be able to associate the HTML text description with,
    for example, the PDF title.

    hope you understand,

    bbs
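
    The association described above can be sketched outside OSS. The following
    is only an illustration under stated assumptions, not part of OSS or its
    parsers: it assumes each PDF link sits in the same <p> or <li> element as
    its description, and the names PdfLinkCollector and
    extract_pdf_descriptions are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect (pdf_url, description) pairs from one HTML page.

    Assumption: a PDF link and its description share a <p> or <li> element.
    """

    BLOCK_TAGS = {"p", "li"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []        # finished (pdf_url, description) pairs
        self._text = []        # text fragments of the current block
        self._pdfs = []        # PDF URLs found in the current block
        self._in_block = False

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()      # close any block left open by sloppy markup
            self._in_block = True
        elif tag == "a" and self._in_block:
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self._pdfs.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._in_block:
            self._text.append(data)

    def _flush(self):
        # Normalise whitespace and attach the block text to each PDF in it.
        description = " ".join(" ".join(self._text).split())
        for url in self._pdfs:
            self.pairs.append((url, description))
        self._text, self._pdfs = [], []
        self._in_block = False


def extract_pdf_descriptions(html, base_url):
    """Return a list of (absolute_pdf_url, description) for one page."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    parser._flush()            # handle a block that was never closed
    return parser.pairs
```

    Each resulting pair could then be stored next to the PDF (for example in a
    database table) so that the description and the document are indexed
    together.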

     
  • Emmanuel Keller

    Emmanuel Keller - 2010-07-28

    Yes, I understand your need. At present, the web crawler cannot handle that.

    If it is your own site, and if OSS can connect to the database, you could use
    the database crawler. The database crawler is able to get data from a table
    (or a view) and combine it with file parsing.
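
    As a rough sketch of the kind of table or view such a crawler could read,
    with one row per PDF holding both the file location and its description:
    the schema, table and column names below are hypothetical, SQLite is used
    only so the example runs on its own, and the OSS database crawler
    configuration itself is not shown.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with one row per PDF (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pdf_documents (
            id          INTEGER PRIMARY KEY,
            file_path   TEXT NOT NULL,   -- where the PDF can be read from
            description TEXT NOT NULL    -- text shown next to the link on the site
        );
        -- A view exposing exactly the columns to be indexed together.
        CREATE VIEW pdf_for_indexing AS
            SELECT id, file_path, description FROM pdf_documents;
        """
    )
    conn.executemany(
        "INSERT INTO pdf_documents (file_path, description) VALUES (?, ?)",
        [
            ("/data/pdf/report-2009.pdf", "Annual report for 2009"),
            ("/data/pdf/guide.pdf", "User guide for new users"),
        ],
    )
    conn.commit()
    return conn


def rows_for_indexing(conn):
    """Return (id, file_path, description) tuples, one per document to index."""
    return conn.execute(
        "SELECT id, file_path, description FROM pdf_for_indexing ORDER BY id"
    ).fetchall()
```

    With rows shaped like this, the description column and the parsed content
    of the file at file_path would end up in the same indexed document, which
    is the link between description and PDF that was asked for.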

     
