I am new to OSS. I am trying to set up and configure the server. This is what I
would like to do.
A website I need to crawl contains PDF files, each with a description. I would
like to be able to search on the description and get the corresponding PDF
document. I can search on the content and title of the PDF, and I can also
search on the description in the web page (HTMLParser), but I need to somehow
link them. How is this possible?
The default memory setting (256MB) is usually too small to handle large
documents efficiently. If you are using version 1.2, you can see the memory
usage in the /system/runtime panel.
I suggest adding this line to the start.sh script and restarting OSS:
export CATALINA_OPTS="-d64 -Xms1024m -Xmx1024m -server"
You can replace 1024 with a larger value if needed.
Thank you, but I think the problem is different (I may be wrong).
I think it is downloading the whole document even if I reduce the sizelimit.
Could that be the problem? If so, how would I tell OSS to limit the download
to the specified sizelimit?
Ignore my last reply... it was meant for another post :)
Maybe you did not understand my problem. I have to crawl HTML pages containing
text and links to PDF files; the text in the HTML page describes the PDF file.
Currently, when I run OSS it starts both the HTMLParser and the PDFParser. I
can search the indexes of the HTMLParser (HTML body, title, etc.) and the
PDFParser (content of the PDF document) separately, but I cannot do a combined
search. I would like to be able to associate the HTML text description with,
for example, the PDF title.
Hope you understand,
Yes, I understand your need. For now, the web crawler cannot handle that.
If it is your own site, and if OSS can connect to the database, you could use
the database crawler. The database crawler is able to fetch data from a table
(or a view) and combine it with file parsing.
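Since the database crawler reads rows from a table or a view, one way to link a description to its PDF is to keep both in the same row, so a hit on the description field also carries the file location. Here is a minimal sketch of such a table using SQLite; the table and column names (documents, description, pdf_url) are hypothetical, and the OSS crawler configuration that would map these columns to index fields is not shown:

```python
# Sketch: one row per PDF, pairing the page's description text with the
# file's location. Adapt the names to your own schema; OSS's database
# crawler would be pointed at this table (or an equivalent view).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        description TEXT,   -- text taken from the HTML page
        pdf_url TEXT        -- link to the PDF file to be parsed
    )
""")
conn.execute(
    "INSERT INTO documents (description, pdf_url) VALUES (?, ?)",
    ("Annual report", "http://example.com/files/report.pdf"),
)
conn.commit()

# A search on the description now also yields the PDF location,
# which is the link the original question was missing.
row = conn.execute(
    "SELECT pdf_url FROM documents WHERE description LIKE ?",
    ("%report%",),
).fetchone()
print(row[0])
```

If the site is generated from a database, such a view may already be derivable from existing tables with a simple JOIN.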