Ensuring all pages are indexed

Chuck
2014-03-27
2014-04-05
  • Chuck

    Chuck - 2014-03-27

    Hello!

    We've been running OpenSearchServer for some months now and there's one thing I just don't have a handle on yet.

    I found out early on that when we first run the crawler on a site, it seems to index the start page and then list, but not index, pages that are linked from that page. On a second crawl, it indexes the pages that were linked from the first, but not any pages linked from the newly indexed pages. As such, we have to re-run the crawl a number of times until all pages are indexed.

    That's fine when doing a manual crawl, but we need our sites to be indexed periodically to include new pages. To do that we use the Scheduler to run a Web Crawler - Start task at specific times when the servers aren't all that busy. The problem is that if someone adds a new page to the site, the crawl picks up the new link but doesn't index the page until the next time the job runs.

    So, the question is whether I'm missing something here or if there's a way to ensure that all new pages are indexed at the time set by the scheduler. I looked through the discussions here and see some people mentioning this, but haven't found a strategy to deal with this yet.

    Thanks.

     
  • Naveen A.N

    Naveen A.N - 2014-03-28

    Hello Chuck,

    Are you using the "RunOnce" option from the crawler?

    There are two options to run the crawler:

    1) RunOnce - runs the crawler only once and then stops. It crawls the URLs that were discovered in the previous crawl session.

    2) RunForever - keeps the crawler running. It crawls continuously until it has crawled all of the URLs available from the patterns that were added.

    --Naveen.A.N

     
    • Chuck

      Chuck - 2014-03-28

      Naveen:

      Thanks for the response. I'm curious though. In testing the "RunForever" option on a small Web site (<100 pages), it does just what it says, runs forever. If I select Web Crawler Start and Run Once = false in the scheduler, won't it loop forever re-checking the site?

      That brings up the question as to whether that's a bad thing anyway. I assume it will only check pages that haven't been updated for the amount of time specified. Can we have it doing this for all our Web sites without causing system resource issues?

      Which brings up another question. Are the settings under the Crawler -> Web -> Crawl Process tab the ones that would be used when the crawler is started by the scheduler? I would assume so.

      Lastly, is there a way to have OpenSearchServer simply run through our sites sequentially? Currently we have them in separate scheduler jobs.

      The situation here is that we already have some 20 Web sites being indexed and would like to add quite a few more. Some of our sites are pretty large, with upwards of 1000 pages, and a few are considerably larger. Also, these sites often have pages updated every day. We'd like a lag of no more than 1 day for the indexing of those new pages. I'm just looking for strategies to make that happen without overloading the search server. Already I'm thinking it needs to be on its own hardware.

      Thoughts?

      Thanks,

      Chuck

       
  • Alexandre Toyer

    Alexandre Toyer - 2014-04-03

    Hello Chuck,

    Choosing "Run once = false" in the "Web crawler start" task of a scheduler job is exactly like choosing "Run forever" when starting the web crawler from the "Crawl process" tab.

    Crawling is periodic: pages will be re-crawled after the period chosen in the "Fetch interval between re-fetches" field.

    Choosing "Run forever" does not mean that crawler will crawl your websites each second, it only means that when the configured period of time is reached it will re-crawl and re-index previously indexed pages. You can do the test on a small website to understand: once every pages of the website will be crawled you will see lots of crawling sessions with only "0" for each column, in the "Crawl process" tab. It means that crawler is active but it has no URL to crawl yet. Once the configured period of time for re-fetch will be reached it will start crawling again the URLs.

    Settings under the "Crawl process" tab are indeed used when starting the crawler from a scheduler job.

    As a crawling strategy you could, for example, divide your websites into several indexes depending on the re-fetch time you want for each: one index for a re-fetch of one week, another index for a re-fetch of one day, and so on.
    It is then easy to merge those different indexes into yet another one with the "Merge index" task of the scheduler. Don't forget to add a "Stop crawl" task before merging anything, and a "Start crawl" task after the merge.

    Regards,
    Alexandre

     
    • Chuck

      Chuck - 2014-04-04

      Alexandre:

      Thank you for the information on the "Run forever" option and the settings under the "Crawl process" tab. That all makes sense. I assume then that using "Run forever" will have minimal impact on processor and memory usage, which appears to me to be the case.

      Since we only search one website at a time (a search feature for users to find information in only that website), we started by creating a separate index for each Web site. Is there another practical way to approach this where we could index multiple websites together but only search one at a time? At this point I'm not sure how to do that.

      Since we need updates to the sites to be reflected in the search results reasonably quickly (within 24 hours), I believe we'll have to run with a re-fetch of 1 day, so merging different re-fetch times, while interesting, may not help us.

      It became apparent that our OSS instance was running out of memory, so we increased the resources available to it; it now has 8 processors and 2 GB of memory. That may have solved some of our problems. I'm now considering just letting all of the indexes run continuously and seeing what happens.

       
  • Alexandre Toyer

    Alexandre Toyer - 2014-04-04

    Hello Chuck,

    You could decide to work with one index only and index all of your websites in it. You would just need to filter each query by adding a filter on the "host" field. Have a look at the "Filters" tab of a query template; for instance, you would use "host:www.example.com".
    You could create as many query templates as you have websites and statically add a filter to each, or you could work with one query that you filter dynamically (depending on what you are using as a "front-end" for your users...).
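
    For instance, a dynamic filter can be as simple as mapping the website chosen in your front-end to a value for the "host" field. A small PHP sketch (the hosts and variable names are only examples):

    // Example only: map the site chosen in the front-end to the indexed host.
    $hosts = array(
        'intranet' => 'intranet.example.com',
        'blog'     => 'blog.example.com',
    );

    $site = isset($_GET['site']) ? $_GET['site'] : 'intranet';
    if (!isset($hosts[$site])) {
        $site = 'intranet';   // fall back to a known site instead of searching everything
    }

    // This value goes into the query filter, exactly as you would type it in the "Filters" tab:
    $hostFilter = 'host:' . $hosts[$site];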

    2 GB memory may still be a bit low, depending on your volumes of data... But 8 processors is pretty good :)

    Regards,
    Alexandre

     
    • Chuck

      Chuck - 2014-04-04

      Alexandre:

      I think I'm getting close to the solution here.

      We have a PHP class we wrote to use the API for searches. I did have 'filters' in my default request, but it had been set to false. Looking around a bit I was able to sort out the following, which for our requests gets JSON-encoded before being submitted.

      // $request is our default request array (name illustrative); it is
      // json_encode()d before being sent to the search API.
      $request['filters'] = array(
          array(
              "type"     => "QueryFilter",
              "negative" => false,
              "query"    => "host:" . $this->website
          )
      );

      This seems to work right. I added a second pattern for another Web site and had it re-index. I can now select the desired website using the above. I assume this is correct.
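
      For reference, here's roughly how the request then gets encoded and submitted. This is only a sketch, not our exact class code; the host, port, index and template names are placeholders, and the endpoint path is just my reading of the REST API docs, so it may differ between OSS versions.

      // Rough sketch: encode the request built above and POST it to the OSS search API.
      $ossBaseUrl   = 'http://oss-host:9090';   // placeholder host; 9090 is the default OSS port
      $indexName    = 'websites';               // placeholder index name
      $templateName = 'search';                 // placeholder query template name
      $searchUrl    = $ossBaseUrl . '/services/rest/index/' . $indexName
                    . '/search/field/' . $templateName;

      $body = json_encode($request);            // $request holds 'query', 'filters', etc.

      $ch = curl_init($searchUrl);
      curl_setopt($ch, CURLOPT_POST, true);
      curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
      curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      $response = curl_exec($ch);
      curl_close($ch);

      $results = json_decode($response, true);  // decoded OSS response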

      Is there any one place where I can find documentation on all of what I can put into these requests?

      I'm going to do a bit more testing and will probably create a new index into which I can add a number of our websites then see how all that works.

      Thanks,

      Chuck

       
