On trying to crawl with the pattern http://www.cytoscape.org/*
(the * matching all content under that URL),
OSS 1.4.1beta3 failed right after the homepage (1 page processed) and declared the crawl finished.
I have attached oss.log.
The log shows:
13:58:58,073 root - com.jaeksoft.searchlib.SearchLibException: File not found
com.jaeksoft.searchlib.web.ServletException: com.jaeksoft.searchlib.SearchLibException: File not found
but judging by the timestamp and content, that comes from the screenshot function I used some minutes earlier (it was deactivated again for this ticket and should not affect this issue).
More closely correlated to the time of the failure is:
14:03:45,076 root - RELOAD - Hourly - Fri Feb 01 13:00:00 CET 2013 - Count:3 - Average:8.666667 - Min:2 - Max:12
14:05:11,598 root - The element type "link" must be terminated by the matching end-tag "</link>".
14:05:12,186 root - Expected scheme-specific part at index 7: mailto:
I made a few more attempts.
Sometimes the crawl fails. Right now I have one that is doing its job (44 pages so far, with a limit of 100 pages per host).
I have observed this behaviour with OSS 1.3.1 as well.
The last crawl failed at 42 pages (not 44 as mentioned above).
A look at the logs again shows:
4:25:33,260 root - The element type "link" must be terminated by the matching end-tag "</link>".
14:25:43,025 root - The element type "link" must be terminated by the matching end-tag "</link>".
14:25:43,135 root - Illegal character in path at index 25: http://www.cytoscape.org/<?= $latest_download_link ?>
14:25:43,135 root - Illegal character in path at index 25: http://www.cytoscape.org/<?= $latest_release_notes_link ?>
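Those two "Illegal character in path" lines come from unprocessed PHP template tags (`<?= ... ?>`) leaking into the page's HTML, so the extracted "URLs" contain characters that are not legal raw in a URL path. A minimal illustration of the check, assuming a simplified RFC 3986 path character set (OSS itself uses Java's URI handling, which produced the messages above; this sketch is mine):

```python
import re

# Simplified RFC 3986 "pchar" set: characters allowed raw in a URL path
# (unreserved, sub-delims, ":", "@", "/", and "%" for percent-escapes).
# Illustration only; not the validation code OSS actually runs.
PATH_CHARS = re.compile(r"^[A-Za-z0-9\-._~%!$&'()*+,;=:@/]*$")

path = "/<?= $latest_download_link ?>"   # PHP tag leaked into the HTML
print(bool(PATH_CHARS.fullmatch(path)))  # False: '<', '>', '?' and ' ' are not legal raw
print(bool(PATH_CHARS.fullmatch("/download.html")))  # True: ordinary path
```

So the crawler is right to reject these links; the underlying fault is on the website side, where the template variables were never expanded.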
The full log file is attached.
So it seems to be the missing end tag that makes OSS abort.
A third check also gives:
14:36:44,329 root - The element type "link" must be terminated by the matching end-tag "</link>".
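For illustration of that error: a strict XML parser rejects a page at an unterminated `<link>` element, while a tolerant HTML parser gets past it. A minimal Python sketch (the HTML fragment and parser choice are my own, not OSS internals):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Fragment mimicking the fault: <link> is opened but never closed.
broken = '<head><link rel="stylesheet" href="style.css"></head>'

try:
    ET.fromstring(broken)              # strict XML parsing, as in the log
except ET.ParseError as e:
    print("strict parser failed:", e)  # mismatched tag

class LinkCollector(HTMLParser):
    """Lenient parsing: collects <link> tags, closed or not."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            self.links.append(dict(attrs))

lenient = LinkCollector()
lenient.feed(broken)   # no exception raised
print(lenient.links)   # [{'rel': 'stylesheet', 'href': 'style.css'}]
```

In valid HTML, `<link>` is a void element and needs no end tag, which is why browsers render the page fine while an XML-strict parser reports it as malformed.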
Do you mean the crawler terminates with the "Aborted" status?
I made a test crawl for the same URL and 8 documents were indexed.
All the other URLs are images.
The URL browser shows 84 URLs.
No, the status "Aborted" is never shown.
The crawl ends with a different number of documents each time (using different crawlers: Indices -> Create a new index -> and so on...).
You can also retest this with one crawler:
run once and increment manually. You will see that the number of documents increases.
Do you mean that you started the crawler with the "RunOnce" option, and once it stopped you started it again with "RunOnce", and the number of documents kept increasing?
You can also show this with different crawlers -> different numbers of documents for the URL given above.
Here is the explanation for why the number of documents increases with each "RunOnce" crawl session.
The pattern is http://www.cytoscape.org/*, which means the crawler will crawl all the URLs of the website.
In the first crawl session, the crawler fetches the homepage http://www.cytoscape.org/ and then stops because of the "RunOnce" option, but the links extracted from the homepage are added to the URL browser (you can check them under Crawler/Web/URL browser).
When the crawler is started for the next session, it crawls the URLs extracted during the first session.
That is why the number of documents increases with each crawl session.
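The incremental behaviour described above can be sketched with a toy URL frontier (the link graph and function names are illustrative, not OSS code):

```python
# Hypothetical sketch of the "RunOnce" behaviour: each session crawls
# only the URLs already known, and newly extracted links are queued for
# the NEXT session. Illustration only, not the OSS implementation.

# toy link graph standing in for www.cytoscape.org
LINKS = {
    "/": ["/download.html", "/documentation.html"],
    "/download.html": ["/release_notes.html"],
    "/documentation.html": [],
    "/release_notes.html": [],
}

def run_once(frontier, indexed):
    """Crawl one session: process the current frontier only."""
    next_frontier = []
    for url in frontier:
        indexed.add(url)                    # "document" gets indexed
        for link in LINKS.get(url, []):
            if link not in indexed:
                next_frontier.append(link)  # crawled next session
    return next_frontier

indexed = set()
frontier = ["/"]
frontier = run_once(frontier, indexed)  # session 1: only the homepage
frontier = run_once(frontier, indexed)  # session 2: links found in session 1
print(len(indexed))                     # 3 documents after two sessions
```

Each "RunOnce" session therefore indexes one more level of the site, which matches the growing document counts reported above.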
Ah, the crawler works incrementally.
That was not clear to me. I had thought it would fetch everything in a single run (get all that is there) and that would be it.
I have retested with two different crawlers.
Both work as you describe.
My statement about different end points on every crawl was wrong.
Sorry, I had mixed it up in my mind.
Thank you for the clarification, Naveen.
This ticket can be closed. (Maybe you want to take something from it for the documentation.)