From: Igor S. <oz...@gr...> - 2000-09-15 17:30:44
To Rodrigo:
We need to add a table (or tables) to our database that will store the URLs and
some statistics along with them. We need a mechanism that uses those statistics
for dispatching/scheduling URLs to the Clients to crawl. We have already
discussed this issue.
When Clients connect to the Server, they make a request. The response is a
list of URLs to crawl. This module should offer an interface that returns such
a list and records the action appropriately in the DB.
I would like you to work on this module.
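Just to make this concrete, here is a rough sketch of what I picture for the
table and the dispatch call. I'm using Python and sqlite only for illustration;
the column names and get_urls_to_crawl() are placeholders I made up, not a
final design:

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS urls (
    url           TEXT PRIMARY KEY,
    last_crawled  TIMESTAMP,           -- NULL means never crawled
    next_crawl    TIMESTAMP NOT NULL,  -- when the page is due again
    crawl_count   INTEGER DEFAULT 0,   -- per-URL statistics go here
    change_count  INTEGER DEFAULT 0,
    assigned_to   TEXT                 -- Client currently holding it, if any
);
"""

def get_urls_to_crawl(conn, client_id, batch_size=100):
    """Return a batch of due URLs for one Client and record the assignment."""
    cur = conn.execute(
        "SELECT url FROM urls "
        "WHERE next_crawl <= CURRENT_TIMESTAMP AND assigned_to IS NULL "
        "ORDER BY next_crawl LIMIT ?",
        (batch_size,))
    urls = [row[0] for row in cur.fetchall()]
    conn.executemany(
        "UPDATE urls SET assigned_to = ? WHERE url = ?",
        [(client_id, u) for u in urls])
    conn.commit()
    return urls

The real thing will run against our actual database, of course; the point is
only the shape of the interface: the Client asks, the Server hands back a batch
and records who got what.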
You should devise an algorithm that will figure out how to schedule the
URLs. Use some of the ideas we have already exchanged via our emails. I
pasted an excerpt from an older email at the bottom of this msg.
You must take several things into account in your model, though:
1) At the beginning, we will not have many Clients at our disposal, and our
database will be overwhelmed with new URLs to be crawled. You must design
this algorithm so that even when we have millions of URLs that were never
crawled, our Clients will still get back to the pages we have already crawled,
so that our database stays as up-to-date as possible (I put a small sketch of
this right after this list). Remember, our goal is to have the most up-to-date
search engine on the net.
2) In the future, we will provide a means to measure each Client's crawling
performance (pages crawled per day), so that we can assign an appropriate
number of URLs to each one of them. Don't worry about this one for now.
3) Also, we must think about security. We may need to introduce a certain
amount of redundancy in order to check whether we get good data from our
Clients. For example, we may have 10% redundancy in crawling. If the data from
two Clients does not match, a third Client may be assigned to crawl the page in
question and figure out which Client "cheated". Of course, the page may have
changed in the short span between the two Clients' crawls, and we may
wrongfully conclude that a Client is rogue. Anyway, I say, don't worry
about security for now. Let's leave this for a later stage.
4) The URL scheduling algorithm must be highly configurable and modular
enough so that we may add new capabilities to it easily.
5) Many other things I haven't accounted for, like, for example, taking the
proximity of Clients to sites into account when dispatching the URLs...
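About points 1) and 4), here is the small sketch I promised. The
recrawl_fraction knob and the numbers are made up; the idea is just that the
dispatch policy lives in a config object we can extend later:

import random

class SchedulerConfig:
    """The knobs we can tune without touching the dispatch code (point 4)."""
    def __init__(self, recrawl_fraction=0.5, batch_size=100):
        # Fraction of every batch reserved for pages we have already crawled,
        # so the backlog of brand-new URLs cannot starve re-crawls (point 1).
        self.recrawl_fraction = recrawl_fraction
        self.batch_size = batch_size

def build_batch(never_crawled, due_for_recrawl, config):
    """Mix never-crawled URLs with already-crawled pages that are due again."""
    n_old = int(config.batch_size * config.recrawl_fraction)
    n_new = config.batch_size - n_old
    batch = due_for_recrawl[:n_old] + never_crawled[:n_new]
    random.shuffle(batch)  # so one Client doesn't get only one kind of work
    return batch

Later additions (Client performance, redundancy, proximity) should then mostly
mean new fields on the config and a few more lines in build_batch().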
From an old message:
About dispatching/scheduling URLs to Clients:
[ozra] Dispatching (a term I borrowed from Robert) is a mechanism for
scheduling URLs to Clients for crawling. Here is my suggestion on how to
schedule the URLs.
Every page that is crawled for the first time by our system is automatically
scheduled to be crawled again in (say) two weeks. If in two weeks a Client
crawls the page and finds that the page has changed, the next crawling time
will be set to one week, or half the previous interval; if the page changes
again the next week, the interval will be halved again to about 3 days, and so
on. If, on the other hand, a page didn't change, we might perhaps double the
next scheduled interval from two weeks to a month, etc.
[Rodrigo] Hmmm, sounds good to me... just change doubling and halving
to multiplying and dividing by 1.5, I guess that's a more proper
value... also, we have to consider the situation where a client starts
crawling a HUGE site (Geocities, for instance)... of course no one client will
crawl all of it, so we have to make it schedule the parts it doesn't... and
develop a good scheme so that no two clients will be crawling the same
thing, and no pages will be left uncrawled...
[ozra] Let's not forget that for each URL that is to be crawled, the Client
needs to get "permission" from the Server. No exceptions.
---end msg---
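To make sure we mean the same thing, here is a tiny sketch of the interval
adjustment from the excerpt above, with Rodrigo's 1.5 factor. The one-day floor
and two-month ceiling are numbers I just made up for the sketch:

from datetime import timedelta

FACTOR = 1.5                           # Rodrigo's multiplier/divisor
INITIAL_INTERVAL = timedelta(days=14)  # first re-crawl two weeks after discovery
MIN_INTERVAL = timedelta(days=1)       # floor, made up for the sketch
MAX_INTERVAL = timedelta(days=60)      # ceiling, made up for the sketch

def next_interval(current, page_changed):
    """Shrink the re-crawl interval when the page changed, grow it otherwise."""
    new = current / FACTOR if page_changed else current * FACTOR
    return max(MIN_INTERVAL, min(MAX_INTERVAL, new))

So a page that keeps changing converges toward daily crawls, while a page that
never changes drifts out toward the cap.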
Give me your thoughts on this.
Cheers,
ozra.
--------------------------------------------------------------
Igor Stojanovski Grub.Org Inc.
Chief Technical Officer 5100 N. Brookline #830
Oklahoma City, OK 73112
oz...@gr... Voice: (405) 917-9894
http://www.grub.org Fax: (405) 848-5477
--------------------------------------------------------------