#279 Remove 404 pages from the index.

Milestone: v1.5
Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2014-02-10
Created: 2013-05-31
Creator: Naveen A.N
Private: No

If a URL is added to the index during a crawl and the page has since been removed from the website, the URL should also be removed from the main index when the site is re-crawled.

Discussion

  • Laurent
    2013-11-18

    Some websites use a temporary or permanent redirect for removed pages.
    Would it be possible to choose to remove these URLs from the main index and the URL database as well?
    It would be useful to be able to tell the crawler which URLs to remove depending on the server response.
    Thanks,
    Laurent
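
To make the request above concrete, here is a minimal sketch of a response-based removal policy. It is not OpenSearchServer code: the function name `should_remove`, the flag `treat_redirects_as_removed`, and the choice of status codes are assumptions made purely for illustration.

```python
# Hypothetical removal policy, not part of OpenSearchServer.
# Status codes that always mark a URL as removed (assumed defaults).
REMOVED_STATUSES = frozenset({404, 410})

# Redirect codes that some sites return for removed pages.
REDIRECT_STATUSES = frozenset({301, 302, 307, 308})


def should_remove(status_code, treat_redirects_as_removed=False):
    """Return True if a URL with this HTTP status should be dropped."""
    statuses = set(REMOVED_STATUSES)
    if treat_redirects_as_removed:
        # Opt-in: also treat temporary and permanent redirects as removals.
        statuses |= REDIRECT_STATUSES
    return status_code in statuses
```

With the flag left off, only 404 and 410 responses count as removals; turning it on also drops redirected URLs, which is the behaviour asked for here.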

     
  • Hi Laurent,

    This has been implemented in v1.5.2.
    The URL Browser (Crawler/Web/URL Browser) now has a new function:
    "Synchronize the selected URLs with the index".

    First select all valid URLs, then call this function to delete from the index any document whose URL is not among the selected URLs.

    You can automate this with the scheduler; a new task has been created for it: "Web Crawler - URL database".

    Nightly builds of 1.5.2 are available here:
    http://www.open-search-server.com/ftp
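
The synchronization described in this reply comes down to a set difference: every indexed URL that is not in the selected set gets deleted. The sketch below only illustrates that idea under assumptions. `urls_to_delete` is a hypothetical helper, not the OpenSearchServer implementation, and it works on plain lists of URL strings.

```python
def urls_to_delete(indexed_urls, selected_urls):
    """Return the indexed URLs that are absent from the selected (valid) URLs.

    Hypothetical helper: both arguments are iterables of URL strings.
    The result is sorted so that the output is deterministic.
    """
    selected = set(selected_urls)
    return sorted(url for url in set(indexed_urls) if url not in selected)


# Example with made-up URLs: /old is indexed but no longer valid.
indexed = ["http://example.com/", "http://example.com/old"]
valid = ["http://example.com/", "http://example.com/new"]
stale = urls_to_delete(indexed, valid)
```

In this example `stale` contains only `http://example.com/old`. The URL that appears only in the valid list is ignored, since synchronization removes documents and does not add any.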