RE: Fw: [Grub-develop] Indexing plans for the data? Users want to know.

OK, so if someone (possibly me) wanted to start using the results for a search engine, I think a couple of things would have to happen.

1. The client would have to be a bit smarter.  Since the point of this is to avoid having an office building full of computers grinding away at these results, the client should do as much of the processing as possible.  For example, it could process and index a page and send back only things like keyword counts, page position, relevancy, weight, etc., as predefined by some set of rules.  That said, this is where people can hack the client and send back whatever they want.  Time to close the source?  Or possibly, and hopefully, make it plug-in capable.  The plug-ins would do different things with the results, and the plug-ins could be closed while leaving the client open.  This would also make the client extensible to do other things.  Another way to prevent the hack would be to have multiple clients return results for the same URLs and compare them.

2.  The search engine server can take the client results, store them, and use them however it sees fit, obviously with the goal of having better results than Google (impossible?).  The search engine server should be able to grab results via a web service, like the Google API.  Then any engine can grab the results, process them, and come up with the best scheme for deciding what's relevant.
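
To make the two points above a bit more concrete, here is a rough sketch in Python.  None of this is Grub code: the function names (summarize_page, accept_by_majority, build_index), the summary format, and the strict-majority rule are all assumptions made up for illustration.  The client reduces a page to keyword counts and first positions, the server accepts a summary only when most of the clients that crawled the same URL agree on it, and the accepted summaries are merged into a simple keyword index.

```python
import hashlib
import json
import re
from collections import Counter, defaultdict

WORD_RE = re.compile(r"[a-z0-9]+")


def summarize_page(url, text):
    """Client side (hypothetical): reduce a page to keyword counts and first positions."""
    words = WORD_RE.findall(text.lower())
    counts = Counter(words)
    first_pos = {}
    for i, word in enumerate(words):
        first_pos.setdefault(word, i)  # keep only the first occurrence
    keywords = {
        word: {"count": count, "first_pos": first_pos[word]}
        for word, count in counts.items()
    }
    return {"url": url, "total_words": len(words), "keywords": keywords}


def fingerprint(summary):
    """Stable hash of a summary so that reports can be compared cheaply."""
    blob = json.dumps(summary, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def accept_by_majority(reports):
    """Server side (hypothetical): return the summary that a strict majority
    of clients reported for one URL, or None if there is no such majority."""
    if not reports:
        return None
    tally = Counter(fingerprint(r) for r in reports)
    digest, votes = tally.most_common(1)[0]
    if votes * 2 <= len(reports):
        return None
    return next(r for r in reports if fingerprint(r) == digest)


def build_index(summaries):
    """Merge accepted summaries into a keyword -> {url: count} index."""
    index = defaultdict(dict)
    for summary in summaries:
        for word, info in summary["keywords"].items():
            index[word][summary["url"]] = info["count"]
    return index


if __name__ == "__main__":
    url = "http://example.com/"
    honest = summarize_page(url, "Grub crawls the web. Grub is a crawler.")
    forged = summarize_page(url, "buy pills")
    accepted = accept_by_majority([honest, dict(honest), forged])
    print(build_index([accepted])["grub"])
```

Exact matching is deliberately strict here.  Pages that change between two crawls would produce different summaries, so a real comparison would probably need some tolerance, and a majority vote does nothing against a group of clients that collude.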

If nothing else, this would be a very interesting project to actually make grub commercially viable and possibly get a lot of new attention seeing as how it might actually be useful.

Travis Reeder 

----- Original Message -----
From: "Kord Campbell" <ko...@gr...>
To: <gru...@li...>
Cc: <gru...@li...>
Sent: Monday, December 30, 2002 4:06 PM
Subject: [Grub-develop] Indexing plans for the data? Users want to know.

> Hi,
>
> I copied the general list on this email as I thought everyone
> might get something out of the explanation that I give in
> response to Travis' concerns.
>
> 1.  Is there any indexing happening right now?
>
> First, and as many of you may know, we do NOT index the results
> from the crawls that are done by the clients.  However, we do
> keep the status info of the URLs and the returned data for the
> last 24 hour crawl cycle.
>
> 1a.  What is being done with the client results?
>
> The URL meta data (update rate, update time, down rate, etc.)
> is available through an XML interface with our SQL server, and
> the crawl data is available via an ftp site.  We have, on
> occasion, had people request access to this data.  If anyone
> wishes access to these resources, we will try to oblige.  Of
> course people wishing to pull a full feed from us or do 1,000s
> of queries to the database (small server here folks) will need
> to discuss other options with us.
>
> Please also keep in mind that we are still in TESTING, and that
> the results returned right now are NOT 100% reliable.  This means
> if someone were using our data, we couldn't guarantee that the
> data was good, and that the crawl rate would be stable.
>
> Time will fix this, of course.  ;)
>
> 2.  What database platform are you using?
>
> MySQL.  It's quite fast - seriously.
>
> 3.  What rules are you setting for ranking keywords, ranking pages, etc?
>
> Again, we are a CRAWLING engine, not a search engine.  When the
> time comes, we expect other search engines to pull data from
> the service.  This means they don't have to crawl their own set
> of URLs, which decreases crawl bandwidth on the net, and increases
> the crawl rate of the sites - which also increases the quality and
> relevance of a search done on those sites.
>
> If anyone has any questions or comments about any of this, please
> feel free to post to the list!
>
> Happy holidays!
>
> Kord
>
> >
> > Message: 1
> > Date: Sun, 29 Dec 2002 16:01:58 -0700 (MST)
> > From: tr...@sp...
> > To: gru...@li...
> > Subject: [Grub-develop] Search page
> >
> > What are the plans for this area?  Is anybody working on indexing and
> > getting the actual search page going?  I'm finding it kind of useless to be
> > running the client for no purpose.  Like what's the point of running it
> > right now if nobody can reap the benefits?
> >
> > So here's some questions:
> > 1.  Is there any indexing happening right now?  What is being done with
> > the client results?
> > 2.  What database platform are you using?
> > 3.  What rules are you setting for ranking keywords, ranking pages, etc?
> >
> > Travis Reeder
> > Space Program
> > http://www.spaceprogram.com
> >
> >
> >
> > --__--__--
> >
> > _______________________________________________
> > Grub-develop mailing list
> > Gru...@li...
> > https://lists.sourceforge.net/lists/listinfo/grub-develop
> >
> >
> > End of Grub-develop Digest
> >
>
> --
> --------------------------------------------------------------
> Kord Campbell                                       Grub, Inc.
> President                      5500 North Western Avenue #101C
>                                        Oklahoma City, OK 73118
> ko...@gr...                            Voice: (405) 848-7000
> http://www.grub.org                        Fax: (405) 848-5477
> --------------------------------------------------------------