[Sprawler-general] Re: Basic first steps

> Just to get everyone started, I'll start proposing some code we need to
> work on.  I do have some code I have written previously that we could
> use as a base, but I'm afraid of stunting creativity, so I'd like to
> brainstorm a little here first, then gather our thoughts and start
> coding.
>  
> Here are some thoughts:
>
> There are two main divisions of code: indexing and searching.
>  
> The indexer, I believe, should have the following qualities:
>
> - configurable by a simple INI-style conf file
> - resilient to reboots (in other words, it needs to be able to continue
>   where it left off)
> - distributable (so we can have 10, 20, 30, etc. indexers running
>   simultaneously on different machines, all indexing different URLs)
> - somewhat portable, possibly with Linux, FreeBSD, and Windows versions
> - optimized for speed (which means it adjusts for a slow system on a
>   fast link, or a fast system on a slow link)
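
To make the first two bullets concrete, here is a rough sketch in Python of
an INI-style conf file plus a small state file the indexer could reload after
a reboot. Everything named here (the [indexer] section, master_host,
batch_size, state_file, and the helper functions) is made up for
illustration; none of it comes from existing Sprawler code.

```python
import configparser
import json
import os

# Hypothetical conf file contents; section and key names are assumptions.
SAMPLE_CONF = """
[indexer]
master_host = master.example.org
batch_size = 50
state_file = indexer_state.json
"""


def load_conf(text):
    """Parse INI-style text and return the [indexer] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser["indexer"]


def save_state(path, pending):
    """Record the URLs still to be indexed.

    Writes to a temp file and renames it over the old one, so a crash
    mid-write should not leave a half-written state file behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"pending": pending}, fh)
    os.replace(tmp, path)


def load_state(path):
    """Return the pending URLs from the last run, or [] on a fresh start."""
    if not os.path.exists(path):
        return []
    with open(path) as fh:
        return json.load(fh)["pending"]
```

On startup the indexer would call load_state() first and only ask for new
work if the list comes back empty; after each URL it would call save_state()
with whatever is left.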
>  
> I'm envisioning a "master" indexer, which delegates certain batches of
> URLs to be indexed to each indexing "client".  The client requests a
> batch of URLs, indexes them, then sends the indexed data back to the
> master, which then incorporates that data into the full index.  This way
> we can spread out many indexers on high-speed internet connections, and
> only send the master the already indexed data for inclusion in the main
> system.  If we can create a Windows client (like the SETI project did),
> and a "buzz" for the coolness of helping the only open source, free,
> non-profit search engine on the planet, we could potentially get
> hundreds or thousands of machines indexing the internet for us, free.

I like this idea, but we need to make sure we write a really tight,
secure client. I remember there was a security vulnerability with
SETI@Home, and I had to block it at our corporate firewall.
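
Here is one way the master/client hand-off could look, as an in-memory
Python sketch with no networking. The Master class, its method names, and
index_batch() are all hypothetical. The one security-minded assumption is
that the master only accepts results for a batch it actually handed out,
and only for URLs that were in that batch.

```python
from collections import defaultdict, deque
import itertools


class Master:
    """Hands out URL batches and merges returned postings into one index."""

    def __init__(self, urls):
        self.pending = deque(urls)      # URLs nobody has taken yet
        self.in_flight = {}             # batch_id -> list of URLs handed out
        self.index = defaultdict(set)   # word -> set of URLs containing it
        self._ids = itertools.count(1)

    def request_batch(self, size):
        """Give a client up to `size` URLs; (None, []) when nothing is left."""
        count = min(size, len(self.pending))
        batch = [self.pending.popleft() for _ in range(count)]
        if not batch:
            return None, []
        batch_id = next(self._ids)
        self.in_flight[batch_id] = batch
        return batch_id, batch

    def submit(self, batch_id, postings):
        """Merge a client's results; reject unknown or repeated batch ids."""
        if batch_id not in self.in_flight:
            return False
        allowed = set(self.in_flight.pop(batch_id))
        for word, urls in postings.items():
            kept = allowed.intersection(urls)  # drop URLs we never assigned
            if kept:
                self.index[word].update(kept)
        return True

    def requeue(self, batch_id):
        """Put a batch back if its client disappeared."""
        self.pending.extend(self.in_flight.pop(batch_id, []))


def index_batch(pages):
    """Client side: turn {url: page text} into {word: set of urls}."""
    postings = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            postings[word].add(url)
    return postings
```

A real client would fetch each page itself and ship the postings back over
the wire; the point here is only the request, index, submit cycle and the
requeue path for clients that never report back.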
>  
> We'll have to decide how to store data, or index it, the most efficient
> way.  I have thought about this, and also tried many, many options
> already, some failing, and some even working fairly well.
>
> I believe that plain old Unix filesystems are fast, very fast.

I agree. Hopefully no one will want to run our DBs on a
Windows box ;-)

> Remember, databases use the filesystem to store their data, so the db is
> only as fast as its filesystem.  It's all about HOW you organize that
> data.  If we know how we search for that data, then we can custom-make a
> db structure, file structure or layout, etc., that would allow us to find
> data at amazingly fast speeds.
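
As one possible reading of "custom make a file structure", here is a Python
sketch that stores each search term's URL list in its own file, with the
path derived from a hash of the term so a lookup opens exactly one file.
The two-level directory fan-out, the .idx extension, and the function names
are my assumptions, not a layout anyone has settled on.

```python
import hashlib
import os


def posting_path(root, term, levels=2):
    """Map a term to root/ab/cd/<hash>.idx using the first hex digits.

    Spreading files across subdirectories keeps any single directory
    from growing huge, which many filesystems handle poorly.
    """
    digest = hashlib.sha1(term.lower().encode("utf-8")).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(root, *parts, digest + ".idx")


def add_posting(root, term, url):
    """Append one URL to the term's posting file, creating it if needed."""
    path = posting_path(root, term)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(url + "\n")


def lookup(root, term):
    """Return every URL recorded for the term, in insertion order."""
    path = posting_path(root, term)
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]
```

This only covers single-term lookups; multi-word queries, ranking, and
duplicate URLs would all need more thought.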
>  
>  More thoughts to come over the next few days..
>  

I needed a little primer on search engines, and I found a decently
written one here:
http://computer.howstuffworks.com/search-engine.htm
as well as another good resource:
http://www.searchenginewatch.com/

-Dru

=====
http://www.drusshop.com
