If a URL is added while crawling and is later removed from the website, it should also be removed from the main index when the site is re-crawled.
Some websites use a temporary or permanent redirect for removed pages.
Would it be possible to choose to remove these URLs too, from both the main index and the URL database?
The idea is to be able to tell the crawler which URLs to remove depending on the server response.
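To make the request concrete, here is a minimal Python sketch of such a rule. It is not the crawler's actual code: the function name `should_remove` and the default set of status codes are assumptions chosen for illustration, and the set is meant to be user-configurable.

```python
# Hypothetical policy: HTTP status codes the user chooses to treat as
# "this page has been removed". Not taken from the crawler itself.
DEFAULT_REMOVE_ON = frozenset({
    301,  # Moved Permanently
    302,  # Found (temporary redirect)
    307,  # Temporary Redirect
    404,  # Not Found
    410,  # Gone
})


def should_remove(status_code, remove_on=DEFAULT_REMOVE_ON):
    """Return True if a URL answering with status_code should be dropped
    from the main index and the URL database."""
    return status_code in remove_on
```

A site that only wants to drop missing pages, and keep redirected ones, could call `should_remove(status, remove_on={404, 410})`.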
It has been implemented in v1.5.2.
In the URL Browser (Crawler/Web/URL Browser) you now have a new function:
"Synchronize the selected URLs with the index".
First select all the valid URLs, then call this function to delete from the index any document that is not among the selected URLs.
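Conceptually, this amounts to a set difference between the URLs in the index and the selected URLs. The Python sketch below only illustrates that idea and is not the actual implementation; `urls_to_delete` and its arguments are hypothetical names.

```python
def urls_to_delete(index_urls, selected_urls):
    """Return the indexed URLs that are absent from the selection,
    i.e. the documents a synchronization would delete."""
    selected = set(selected_urls)
    # Sorted only to make the result deterministic.
    return sorted(url for url in set(index_urls) if url not in selected)


# Example data, invented for illustration.
indexed = ["http://example.com/a", "http://example.com/b", "http://example.com/old"]
valid = ["http://example.com/a", "http://example.com/b"]
stale = urls_to_delete(indexed, valid)
```

Here `stale` contains only `http://example.com/old`, the one document that is in the index but not in the selection.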
You can automate this with the scheduler: a new task has been created, "Web Crawler - URL database".
Nightly builds of 1.5.2 are available here: