Re: [Sprawler-general] Re: Basic first steps

Dru Dru wrote:
>> Just to get everyone started, I'll start proposing some code we need to
>> work on.  I do have some code I have written previously, that we could
>> use as a base, but I'm afraid of stunting creativity, so I'd like to
>> brainstorm a little here first, then gather our thoughts and start
>> coding.
>> 
>> Here's some thoughts:
>> 
>> There are two main divisions of code: indexing and searching.
>>
>> The indexer, I believe, should have the following qualities:
>>
>> - configurable by a simple INI style conf file
>> - resilient to reboots (in other words, needs to be able to continue
>>   where it left off)
>> - distributable (so we can have 10, 20, 30, etc, indexers running
>>   simultaneously on different machines, all indexing different URLs)
>> - somewhat portable, possibly with Linux, FreeBSD, and Windows versions
>> - optimized for speed (which means it adjusts for a slow system on a
>>   fast link, or fast system on a slow link)
>> 
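To make the first two bullets a little more concrete, here is a rough
Python sketch. Everything in it is an assumption for discussion, not
existing Sprawler code: the [indexer] section, the key names
(master_host, batch_size, state_file), the example host name, and the
JSON state file are all made up. It reads an INI style conf and keeps
the list of pending URLs on disk so an indexer could pick up where it
left off after a reboot.

```python
import configparser
import json
import os
import tempfile

# Hypothetical conf file contents; section and key names are invented.
SAMPLE_CONF = """\
[indexer]
master_host = master.example.org
batch_size = 100
state_file = indexer.state
"""


def load_conf(text):
    """Parse INI style text into a plain dict of indexer settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["indexer"]
    return {
        "master_host": section.get("master_host"),
        "batch_size": section.getint("batch_size"),
        "state_file": section.get("state_file"),
    }


def save_state(path, pending_urls):
    """Write pending URLs to a temp file, then rename it into place.

    The rename means a reboot in the middle of a write leaves the old
    state file intact instead of a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"pending": pending_urls}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, path)


def load_state(path):
    """Return the pending URLs saved earlier, or an empty list."""
    if not os.path.exists(path):
        return []
    with open(path) as fh:
        return json.load(fh)["pending"]
```

The idea would be to call save_state() each time a URL is finished, and
load_state() once at startup.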
>> I'm envisioning a "master" indexer, which delegates certain batches of
>> URLs to be indexed to each indexing "client".  The client requests a
>> batch of URLs, indexes them, then sends the indexed data back to the
>> master, which then incorporates that data into the full index.  This way
>> we can spread out many indexers on high speed internet connections, and
>> only send the master the already indexed data for inclusion in the main
>> system.  If we can create a Windows client (like the SETI project did),
>> and a "buzz" for the coolness of helping the only open source, free,
>> non-profit search engine on the planet, we could potentially get
>> hundreds or thousands of machines indexing the internet for us, free.
> 
> 
> I like this idea, but we need to make sure we write a
> really tight, secure client. I remember there was a
> security vulnerability with SETI@home and I had to
> block it at our corp. firewall.

Good point - we're in luck though, we have someone here who is into
internet security, and who I'm sure wouldn't mind making this project nice
and secure like that.. :)
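
For the sake of discussion, here is a minimal in-process Python sketch
of the master/client batch handoff described above, with no networking
at all. The class and function names (Master, request_batch, submit,
index_batch) and the whitespace "tokenizer" are placeholders I made up,
not a proposed design. The one security-minded bit is that the master
only accepts results for a batch it actually handed out, and only for
the URLs in that batch. That covers just one small piece of the
security picture.

```python
from collections import defaultdict, deque


class Master:
    """Hands out batches of URLs and merges the results it gets back."""

    def __init__(self, urls, batch_size=2):
        self.queue = deque(urls)
        self.batch_size = batch_size
        self.outstanding = {}           # batch id -> URLs handed out
        self.index = defaultdict(set)   # word -> set of URLs
        self.next_id = 0

    def request_batch(self):
        """Return (batch_id, urls), or (None, []) when nothing is left."""
        if not self.queue:
            return None, []
        count = min(self.batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        batch_id = self.next_id
        self.next_id += 1
        self.outstanding[batch_id] = batch
        return batch_id, batch

    def submit(self, batch_id, partial_index):
        """Merge a client's results; reject batches we never handed out."""
        if batch_id not in self.outstanding:
            return False
        allowed = set(self.outstanding.pop(batch_id))
        for word, urls in partial_index.items():
            # Drop any URL that was not part of this client's batch.
            accepted = set(urls) & allowed
            if accepted:
                self.index[word].update(accepted)
        return True


def index_batch(batch, fetch):
    """Client side: build a word -> URLs map for one batch.

    `fetch` is any callable that returns the page text for a URL.
    """
    partial = defaultdict(set)
    for url in batch:
        for word in fetch(url).lower().split():
            partial[word].add(url)
    return partial
```

A client loop would then just be: request_batch(), index_batch(), and
submit(), repeated until request_batch() returns an empty batch.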

>> We'll have to decide how to store data, or index it, the most efficient
>> way.  I have thought about this, and also tried many, many options
>> already, some failing, and some even working fairly well.
>>
>> I believe that plain old unix filesystems are fast, very fast.
> 
> 
> I agree. Hopefully no one will want to run our db's on a
> Windows box ;-)

oh.. that's just not funny.. ;P

>> Remember, databases use the filesystem to store their data, so the db is
>> only as fast as its filesystem.  It's all about HOW you organize that
>> data.  If we know how we search for that data, then we can custom make a
>> db structure, file structure or layout, etc, that would allow us to find
>> data at amazingly fast speeds.
>> 
>> More thoughts to come over the next few days..
>> 
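As one possible illustration of "organize the data around how we search
for it", here is a small Python sketch that assumes lookups are by
single word. This is only one layout to kick around, not something
already decided or tested at scale: each word gets one file of URLs, and
the file's location is computed from a SHA-1 hash of the word, with two
levels of two-character directories so no single directory gets huge.
The function names are made up.

```python
import hashlib
import os


def posting_path(root, word):
    """Map a word to a file path such as root/ab/cd/abcd...."""
    digest = hashlib.sha1(word.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest)


def add_posting(root, word, url):
    """Append a URL to the file that holds this word's matches."""
    path = posting_path(root, word)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(url + "\n")


def lookup(root, word):
    """Return every URL recorded for a word, in the order added."""
    path = posting_path(root, word)
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]
```

A lookup never scans anything: the path is computed straight from the
word, so finding a word's file costs one hash and one open().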
> 
> 
> I needed a little primer on search engines, and I
> found a decently written one here:
> http://computer.howstuffworks.com/search-engine.htm as
> well as another good resource:
> http://www.searchenginewatch.com/

Great pointers!  We should put these up on our website.. which doesn't
exist yet.  Currently, rdickey (ross) is working on this, but he just
went back to college, so he'll be kind of busy.. If you are
interested, please take the bull by the horns and run with it..

Eric

-- 
------------------------------------------------------------------
Eric Anderson	   Systems Administrator      Centaur Technology
All generalizations are false, including this one.
------------------------------------------------------------------