From: Dru D. <dr...@ya...> - 2003-08-22 03:29:35
> Just to get everyone started, I'll start proposing some code we need to
> work on. I do have some code I have written previously, that we could
> use as a base, but I'm afraid of stunting creativity, so I'd like to
> brainstorm a little here first, then gather our thoughts and start
> coding.
>
> Here's some thoughts:
>
> There are two main divisions of code: indexing, and searching.
>
> The indexer, I believe, should have the following qualities:
>
> - configurable by a simple INI style conf file
> - resilient to reboots (in other words, needs to be able to continue
>   where it left off)
> - distributable (so we can have 10, 20, 30, etc, indexers running
>   simultaneously on different machines, all indexing different URL's)
> - somewhat portable, possibly a linux,freebsd, and windows versions
> - optimized for speed (which means it adjusts for a slow system on a
>   fast link, or fast system on a slow link)
>
> I'm envisioning a "master" indexer, which delegates certain batches of
> URL's to be indexed to each indexing "client". The client requests a
> batch of URLs, indexes them, then sends the indexed data back to the
> master, which then incorporates that data into the full index. This way
> we can spread out many indexers on high speed internet connections, and
> only send the master the already indexed data for inclusion in the main
> system. If we can create a windows client (like the SETI project did),
> and a "buzz" for the coolness of helping the only open source, free,
> non-profit search engine on the planet, we could potentially get
> hundreds or thousands of machines indexing the internet for us, free.

I like this idea, but we need to make sure we write a really tight
secure client. I remember there was a security vulnerability with
SETI@Home and I had to block it at our corp. firewall.

> We'll have to decide how to store data, or index it, the most efficient way.
> I have thought about this, and also tried many many options already,
> some failing, and some even working fairly well.
>
> I believe that plain old unix filesystems are fast, very fast.

I agree. Hopefully no one will want to run our db's on a windows box ;-)

> Remember, databases use the filesystem to store their data, so the db is
> only as fast as its filesystem. It's all about HOW you organize that
> data. If we know how we search for that data, then we can custom make a
> db structure, file structure or layout, etc, that would allow us to find
> data at amazingly fast speeds.
>
> More thoughts to come over the next few days..

I needed a little primer on search engines, and I found a decently
written one here:
http://computer.howstuffworks.com/search-engine.htm
as well as another good resource:
http://www.searchenginewatch.com/

-Dru

=====
http://www.drusshop.com
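
To make the quoted master/client idea a bit more concrete, here is a rough
Python sketch of the batch hand-off. Nothing in it comes from existing
project code: the names `Master`, `request_batch`, `submit` and
`index_batch`, the JSON state file, and the `fetch` callable are all
assumptions made for illustration. It runs in one process with no
networking, and only shows the flow: the master leases a batch of URLs, a
client turns pages into term-to-URL postings, and the master merges them.
State is written to disk after every step so a restarted master can
continue where it left off.

```python
import json
import os
import re
from collections import defaultdict


class Master:
    """Hypothetical master: leases URL batches and merges client results."""

    def __init__(self, state_path, seed_urls=()):
        self.state_path = state_path
        self.pending = list(seed_urls)   # URLs not yet handed out
        self.leased = {}                 # batch id -> URLs a client holds
        self.index = defaultdict(set)    # term -> set of URLs containing it
        self.next_id = 0
        if os.path.exists(state_path):
            self._load()                 # saved state wins over seed_urls

    def _load(self):
        with open(self.state_path) as f:
            state = json.load(f)
        self.pending = state["pending"]
        # Batches still leased at shutdown are queued again, because we
        # cannot know whether their clients ever finished them.
        for urls in state["leased"].values():
            self.pending.extend(urls)
        for term, urls in state["index"].items():
            self.index[term] = set(urls)
        self.next_id = state["next_id"]

    def _save(self):
        state = {
            "pending": self.pending,
            "leased": self.leased,
            "index": {t: sorted(u) for t, u in self.index.items()},
            "next_id": self.next_id,
        }
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.state_path)  # swap in the new file in one step

    def request_batch(self, size):
        """Lease up to `size` URLs; returns (None, []) when none are left."""
        if not self.pending:
            return None, []
        batch, self.pending = self.pending[:size], self.pending[size:]
        batch_id = self.next_id
        self.next_id += 1
        self.leased[batch_id] = batch
        self._save()
        return batch_id, batch

    def submit(self, batch_id, postings):
        """Merge a client's postings; reject batch ids we did not lease."""
        if self.leased.pop(batch_id, None) is None:
            return False
        for term, urls in postings.items():
            self.index[term].update(urls)
        self._save()
        return True


def index_batch(urls, fetch):
    """Hypothetical client step: map each lowercase word to its URLs."""
    postings = defaultdict(set)
    for url in urls:
        for term in re.findall(r"[a-z0-9]+", fetch(url).lower()):
            postings[term].add(url)
    return {term: sorted(found) for term, found in postings.items()}
```

A client loop would call `request_batch`, run `index_batch` with whatever
page fetcher it uses, and pass the result to `submit`. A real version would
also need authentication and validation of what clients send back, which
is the security concern raised above.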
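
On the point that it is all about how the data is organized on the
filesystem, one possible layout (an assumption, not something anyone has
proposed yet) is one small file per search term, with the path derived
from a hash of the term. A lookup is then a path computation plus a single
file read, with no directory scan. The function names and the two-level
directory fan-out below are hypothetical choices for the sketch.

```python
import hashlib
from pathlib import Path


def posting_path(root, term):
    """Map a term to root/ab/cd/<sha1>.txt so no directory grows too large."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return Path(root) / digest[:2] / digest[2:4] / (digest + ".txt")


def add_posting(root, term, url):
    """Append one URL to the term's posting file, creating it if needed."""
    path = posting_path(root, term)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")


def lookup(root, term):
    """Return the URLs recorded for a term, or an empty list if none."""
    path = posting_path(root, term)
    if not path.exists():
        return []
    return path.read_text(encoding="utf-8").splitlines()
```

This sketch does no de-duplication or ranking and appends one line per
posting, so it only illustrates the lookup path, not a finished index
format.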