#279 Remove 404 pages from the index.

Milestone: v1.5
Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2014-02-10
Created: 2013-05-31
Creator: Naveen A.N
Private: No

If a URL was added to the index during a crawl and the page has since been removed from the website, a re-crawl should remove it from the main index too.

Discussion

  • Laurent

    Laurent - 2013-11-18

    Some websites use a temporary or permanent redirect for removed pages.
    Would it be possible to choose to remove these URLs from the main index and the URL database as well?
    That is, to be able to tell the crawler which URLs to remove depending on the server response.
    thx
    laurent
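
The request above amounts to a configurable mapping from server response to crawler action. A minimal sketch of such a policy follows, assuming that 404 and 410 always mean "remove" and that redirects are removed only when an option is enabled. The names `Action`, `action_for_status` and `remove_redirects` are hypothetical and are not part of OpenSearchServer.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    REMOVE = "remove"


# Assumption: these status codes always mean the page is gone.
GONE_STATUSES = frozenset({404, 410})
# Assumption: these are the redirect codes a site might return for removed pages.
REDIRECT_STATUSES = frozenset({301, 302, 303, 307, 308})


def action_for_status(status, remove_redirects=False):
    """Decide whether a re-crawled URL should stay in the index."""
    if status in GONE_STATUSES:
        return Action.REMOVE
    # Redirects are only treated as removals when explicitly requested.
    if remove_redirects and status in REDIRECT_STATUSES:
        return Action.REMOVE
    return Action.KEEP
```

With this policy, `action_for_status(301)` keeps the URL, while `action_for_status(301, remove_redirects=True)` marks it for removal.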

     
  • Emmanuel Keller

    Emmanuel Keller - 2014-02-10

    Hi Laurent,

    It has been implemented in v1.5.2.
    The URL Browser (Crawler/Web/URL Browser) now has a new function:
    "Synchronize the selected URLs with the index".

    First select all the valid URLs, then call this function to delete every document in the index that is not among the selected URLs.
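
Conceptually, the synchronization described above is a set difference between the indexed URLs and the selected URLs. The sketch below illustrates only that logic, under the assumption that both sides can be listed as plain URL strings. `urls_to_delete` is a hypothetical helper, not the OpenSearchServer implementation.

```python
def urls_to_delete(indexed_urls, selected_urls):
    """Return the indexed URLs that are absent from the selected (valid) URLs."""
    selected = set(selected_urls)
    # Deduplicate the index side and sort for a stable, reviewable result.
    return sorted(url for url in set(indexed_urls) if url not in selected)


indexed = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
]
selected = ["http://example.com/a", "http://example.com/c"]
stale = urls_to_delete(indexed, selected)
```

Here `stale` contains only `http://example.com/b`, the one document that would be deleted from the index.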

    You can automate this with the scheduler; a new task has been created for it: "Web Crawler - URL database".

    Nightly builds of 1.5.2 are available here:
    http://www.open-search-server.com/ftp

     
