From: <mni...@mo...> - 2004-08-18 23:32:31

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> I don't see any error? Can you try re-sending?

When I have something to commit I will.

>> FYI I'm getting the following error on commits to cvs.

Checking in lib/Sprawler.pm;
/cvsroot/sprawler/sprawler/lib/Sprawler.pm,v  <--  Sprawler.pm
new revision: 1.11; previous revision: 1.10
done

mojo
From: Eric A. <and...@ce...> - 2004-08-18 14:56:28

I don't see any error? Can you try re-sending?

Mojo B. Nichols wrote:
> Eric,
>
> FYI
> I'm getting the following error on commits to cvs.
>
> Mojo
>
> --
> An American is a man with two arms and four wheels.
>     -- A Chinese child

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Talk sense to a fool and he calls you foolish.
------------------------------------------------------------------
From: <mni...@mo...> - 2004-08-18 03:56:17

Eric,

FYI
I'm getting the following error on commits to cvs.

Mojo

--
An American is a man with two arms and four wheels.
    -- A Chinese child
From: <mni...@mo...> - 2004-06-23 04:26:27

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> Mojo B. Nichols wrote:
>>> Actually I think it may be my perl and just the sockets... can
>>> somebody else try this on linux? I said client because that
>>> fails, but upon closer inspection it doesn't seem that simple.
>>
>> Whew, no, not my sockets. Basically the problem was twofold: one, my
>> client database didn't have my client in there. If I add it
>> blindly to the database that takes it past that point. Then my url
>> seed db was either empty or broken or something. Removing that
>> index allowed it to start working (reseeded it etc). I'm going to
>> shuffle this off to the side and see if I can figure out where it
>> went wrong. Perhaps the seeded url db is in cvs? I'll check it out.
>
> Glad to hear you got it working! That makes me feel better
> anyway.. :)
>
>> I'm curious about this client db and its intended use.
>
> Basically, to keep clients from being able to check in/upload data
> for *ANY* arbitrary url they desire. That way, an evil indexer
> can't fake an index for its own website, with all kinds of fake or
> misleading data in it, causing our index to be invalid. They
> request URLs to index, then they must check in those URLs. You
> can't check in URLs you have not checked out..
>
> Maybe it's time we have someone write up some documentation on all
> this? How to use each piece, with example and syntax, etc.. What
> do you think?

Last I checked, documentation was pretty much there, although this being a
new method, it may need to be added. It sounds as if we need to add them to
the client db upon sending a set of urls. I have to think about this a
little more.

I thought we could use client redundancy and checksums to ensure index
integrity. As we receive a batch we put it in a queue; as soon as its
redundant client (or clients) return with indexes and those check out, the
master accepts them. The theory here is that a rogue client would then have
to occupy a large percentage of client machines to ever get skewed results
in.

As for the client check-in alone, it seems like it could still be
manipulated. If they obtain a set of urls, what is to prevent them from
under-reporting those urls, or other such mischievousness? Anyway, at the
end of the day it probably doesn't hurt to ensure that urls sent to a
client come back from that client, so I'm not really arguing against it. In
fact it would be a necessary step in preventing a rogue client from just
sending the required number of skewed indexes to try to fool the master in
the redundancy scheme.

We can dub the redundancy scheme RAIC, Redundant Array of Independent
Clients :-) or some other such nonsense.

mojo

--
When the Apple IIc was introduced, the informative copy led off with a
couple of asterisked sentences:

    It weighs less than 8 pounds.*
    And costs less than $1,300.**

In tiny type were these "fuller explanations":

    *  Don't asterisks make you suspicious as all get out?  Well, all
       this means is that the IIc alone weights 7.5 pounds.  The power
       pack, monitor, an extra disk drive, a printer and several bricks
       will make the IIc weigh more.  Our lawyers were concerned that
       you might not be able to figure this out for yourself.

    ** The FTC is concerned about price fixing.  You can pay more if
       you really want to.  Or less.

    -- Forbes
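[Editor's note: the redundancy scheme Mojo sketches above is easy to prototype. The following is only an illustration of the idea, not Sprawler code; the assignment id, the two-client quorum, and the assumption that two honest clients produce byte-identical index data (so their MD5 digests match) are all assumptions made for this sketch.]

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Pending batches, keyed by the URL-set id the master handed out.
    my %pending;

    # Stand-in for handing a verified batch to the queue processor.
    sub accept_batch {
        my ($assignment_id, $index_data) = @_;
        print "accepted batch $assignment_id (" . length($index_data) . " bytes)\n";
    }

    # Called whenever a client returns its index for a given assignment.
    sub receive_batch {
        my ($assignment_id, $client_id, $index_data) = @_;
        push @{ $pending{$assignment_id} },
             { client => $client_id, digest => md5_hex($index_data), data => $index_data };
        _maybe_accept($assignment_id);
    }

    # Accept only once two independent clients report the same checksum,
    # so a single rogue client cannot push skewed data into the index.
    sub _maybe_accept {
        my ($assignment_id) = @_;
        my @got = @{ $pending{$assignment_id} };
        return if @got < 2;
        my %by_digest;
        push @{ $by_digest{ $_->{digest} } }, $_ for @got;
        for my $group (values %by_digest) {
            if (@$group >= 2) {
                accept_batch($assignment_id, $group->[0]{data});
                delete $pending{$assignment_id};
                return;
            }
        }
    }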
From: Eric A. <and...@ce...> - 2004-06-23 03:39:29

Mojo B. Nichols wrote:
>> Actually I think it may be my perl and just the sockets... can
>> somebody else try this on linux? I said client because that fails,
>> but upon closer inspection it doesn't seem that simple.
>
> Whew, no, not my sockets. Basically the problem was twofold: one, my
> client database didn't have my client in there. If I add it blindly
> to the database that takes it past that point. Then my url seed db
> was either empty or broken or something. Removing that index allowed
> it to start working (reseeded it etc). I'm going to shuffle this off
> to the side and see if I can figure out where it went wrong. Perhaps
> the seeded url db is in cvs? I'll check it out.

Glad to hear you got it working! That makes me feel better anyway.. :)

> I'm curious about this client db and its intended use.

Basically, to keep clients from being able to check in/upload data for
*ANY* arbitrary url they desire. That way, an evil indexer can't fake an
index for its own website, with all kinds of fake or misleading data in it,
causing our index to be invalid. They request URLs to index, then they must
check in those URLs. You can't check in URLs you have not checked out..

Maybe it's time we have someone write up some documentation on all this?
How to use each piece, with example and syntax, etc.. What do you think?

Eric

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Talk sense to a fool and he calls you foolish.
------------------------------------------------------------------
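[Editor's note: Eric's rule ("you can't check in URLs you have not checked out") boils down to a small ledger on the controller side. A minimal sketch, assuming a DB_File-backed hash and made-up function names; the real client db format is not shown anywhere in this thread.]

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a simple checkout ledger: key = "client_id url", value = checkout time.
    # DB_File and this key layout are assumptions for the sketch.
    tie my %checked_out, 'DB_File', 'checkout.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "cannot open checkout.db: $!";

    # Record which URLs a client has been handed.
    sub checkout_urls {
        my ($client_id, @urls) = @_;
        $checked_out{"$client_id $_"} = time for @urls;
        return @urls;
    }

    # A client may only check in URLs it actually checked out.
    sub may_check_in {
        my ($client_id, $url) = @_;
        return exists $checked_out{"$client_id $url"};
    }

    # Example: reject an upload for a URL this client never requested.
    checkout_urls('client-1', 'http://www.sprawler.com/');
    print may_check_in('client-1', 'http://www.sprawler.com/') ? "ok\n" : "reject\n";
    print may_check_in('client-1', 'http://evil.example/')     ? "ok\n" : "reject\n";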
From: <mni...@mo...> - 2004-06-23 02:34:03

>>>>> "Mojo" == Mojo B Nichols <mni...@mo...> writes:
>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

>> Mojo B. Nichols wrote:
>>> the client check seems somewhat broken.
>>
>> Uhh, how about some additional details?  Eric

> Actually I think it may be my perl and just the sockets... can
> somebody else try this on linux? I said client because that fails,
> but upon closer inspection it doesn't seem that simple.

Whew, no, not my sockets. Basically the problem was twofold: one, my client
database didn't have my client in there. If I add it blindly to the
database that takes it past that point. Then my url seed db was either
empty or broken or something. Removing that index allowed it to start
working (reseeded it etc). I'm going to shuffle this off to the side and
see if I can figure out where it went wrong. Perhaps the seeded url db is
in cvs? I'll check it out.

I'm curious about this client db and its intended use.

Thanks,

--
HP had a unique policy of allowing its engineers to take parts from stock
as long as they built something. "They figured that with every design, they
were getting a better engineer. It's a policy I urge all companies to
adopt."
    -- Apple co-founder Steve Wozniak, "Will Wozniak's class give Apple to
       teacher?"  EE Times, June 6, 1988, pg 45
From: <mni...@mo...> - 2004-06-22 21:24:35

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> Mojo B. Nichols wrote:
>> the client check seems somewhat broken.
>
> Uhh, how about some additional details?  Eric

Actually I think it may be my perl and just the sockets... can somebody
else try this on linux? I said client because that fails, but upon closer
inspection it doesn't seem that simple.

Thanks,

Mojo

--
Life is both difficult and time consuming.
From: Eric A. <and...@ce...> - 2004-06-22 13:06:40

Mojo B. Nichols wrote:
> the client check seems somewhat broken.

Uhh, how about some additional details?

Eric

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Talk sense to a fool and he calls you foolish.
------------------------------------------------------------------
From: <mni...@mo...> - 2004-06-19 14:27:19

the client check seems somewhat broken.

--
Nobody ever died from oven crude poisoning.
From: <mni...@mo...> - 2004-06-10 10:53:42

Hi all,

When I attempt to run sprawler locally this is what I get:

./master.pl
No urls in db to index!  Seeding url index ...
78 urls added to seed list.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92, <PARENT> line 1.
print() on closed filehandle LOG at lib/Sprawler.pm line 92, <CHILD> line 1.
Use of uninitialized value in string eq at ./master.pl line 125.
Client tes...@sp...-1031080407379 attempted to steal from us!
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.

and

./indexer.pl -s localhost
Prototype mismatch: sub Sprawler::Client::get ($) vs none at lib/Sprawler/Client.pm line 127.
Index path: ./indexes/tes...@sp...-1031080407379/
Indexable content types: text/html text/plain
Requesting urls

Before I dig into what's causing this, does anybody else see this behavior
when trying this simple test? I think our biggest challenge will be keeping
our programs relatively platform agnostic. Eric, I understand this works
for you in FreeBSD?

Thanks,

mojo

--
World Domination, One CPU Cycle At A Time

Forget about searching for alien signals or prime numbers. The real
distributed computing application is "Domination@World", a program to
advocate Linux and Apache to every website in the world that uses Windows
and IIS.

The goal of the project is to probe every IP number to determine what kind
of platform each Net-connected machine is running. "That's a tall order...
we need lots of computers running our Domination@World clients to help
probe every nook and cranny of the Net," explained Mr. Zell Litt, the
project head.

After the probing is complete, the second phase calls for the data to be
cross-referenced with the InterNIC whois database. "This way we'll have the
names, addresses, and phone numbers for every Windows-using system
administrator on the planet," Zell gloated. "That's when the fun begins."

The "fun" part involves LART (Linux Advocacy & Re-education Training), a
plan for extreme advocacy. As part of LART, each Linux User Group will
receive a list of the Windows-using weenies in their region. The LUG will
then be able to employ various advocacy techniques, ranging from a
soft-sell approach (sending the target a free Linux CD in the mail) all the
way to "LARTcon 5" (cracking into their system and forcibly installing
Linux).
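[Editor's note: the repeated "print() on closed filehandle LOG" warnings usually mean the code prints to LOG without checking that the open succeeded, or after a forked child has lost the handle. A defensive pattern, sketched here from scratch rather than taken from the real Sprawler.pm:]

    use strict;
    use warnings;

    my $log_fh;   # lexical handle instead of the bareword LOG

    sub open_log {
        my ($path) = @_;
        open($log_fh, '>>', $path)
            or do { warn "could not open log '$path': $!"; $log_fh = undef; };
    }

    sub log_msg {
        my ($msg) = @_;
        # Guard every write: fall back to STDERR if the log never opened
        # (or was closed under us), instead of warning on a dead handle.
        if (defined $log_fh and fileno $log_fh) {
            print {$log_fh} scalar(localtime), " $msg\n";
        } else {
            print STDERR "LOG unavailable: $msg\n";
        }
    }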
From: <mni...@mo...> - 2004-06-10 10:48:10

This is a test. If it had been a real mail you would have wanted to read it.

mojo

--
People are very flexible and learn to adjust to strange surroundings --
they can become accustomed to read Lisp and Fortran programs, for example.
    -- Leon Sterling and Ehud Shapiro, Art of Prolog, MIT Press
From: <ben...@id...> - 2004-05-22 12:22:50

Dear Open Source developer

I am doing a research project on "Fun and Software Development" in which I
kindly invite you to participate. You will find the online survey under
http://fasd.ethz.ch/qsf/. The questionnaire consists of 53 questions and
you will need about 15 minutes to complete it.

With the FASD project (Fun and Software Development) we want to define the
motivational significance of fun when software developers decide to engage
in Open Source projects. What is special about our research project is that
a similar survey is planned with software developers in commercial firms.
This procedure allows the immediate comparison between the involved
individuals and the conditions of production of these two development
models. Thus we hope to obtain substantial new insights into the phenomenon
of Open Source Development.

With many thanks for your participation,
Benno Luthiger

PS: The results of the survey will be published under
http://www.isu.unizh.ch/fuehrung/blprojects/FASD/. We have set up the
mailing list fa...@we... for this study. Please see
http://fasd.ethz.ch/qsf/mailinglist_en.html for registration to this
mailing list.

_______________________________________________________________________
Benno Luthiger
Swiss Federal Institute of Technology Zurich
8092 Zurich
Mail: benno.luthiger(at)id.ethz.ch
_______________________________________________________________________
From: <mni...@mo...> - 2004-05-04 02:30:55

>>>>> "J" == J Kasprzak <ja...@ka...> writes:

> Hello. Well, I couldn't help but notice that it's been quiet here
> lately. I haven't seen anything posted here for a while and haven't
> seen anything committed to the repository lately. I'd just like to
> know what's been happening, and be kept up to date, etc. I
> personally have not been as active with this lately, as I was on a
> short vacation last week and had another nasty cold. But I'm still
> looking for ways to help, even though I haven't been able to code (my

I'm still here, but have been busy with guests literally from the beginning
of April to now. The last scheduled guest left today, so I will begin
coding again soon. Actually I did a bug fix, but some other things weren't
working so I haven't committed. I plan on getting back into full swing
here, soon.

regards,

mojo

--
Having children is like having a bowling alley installed in your brain.
    -- Martin Mull
From: J. K. <ja...@ka...> - 2004-05-03 04:24:11

Hello. Good to hear from you. I was wondering what was happening; any time
there isn't anything going on (no talk, no new code, no website updates,
etc.) I get concerned.

In fact, I was wondering about what I should be doing. If the other members
here don't seem to be working at it, then I wondered if I should keep
working. This is the reason I did not send a reply to the message on the
Sprawler Map within a few days as I said I would. If we were not going to
go over important matters and compare notes, etc. then I figured I might be
wasting my time. About all I wanted was to know that the rest of you were
active here, because I certainly can't work on this alone.

But I must say that I am glad that our leader, Eric, appears to have the
determination needed for this project to be a success. Without his support,
I cannot see this project taking off. This is his vision, and as the person
who has the vision, the importance of that position cannot be
underestimated.

Anyway, what about the rest of you? It'd be good to hear from all of you as
well.

Thanks,
J.K.

> Ok - first, it HAS been quiet around here. I've pretty much been 150%
> busy at work, so I've been disconnected from the project. I'll be back
> in action very soon, and I guarantee I'll make up for the lost time, so
> now's a good time to start studying the code, play with it, etc,
> because it's going to be a coding frenzy when I start up again soon. :)
> I'll reread these notes, and comment on them separately..
>
> Anyone who's kicking back and waiting for some action - prepare for
> Sprawler's most important time..
>
> Eric
From: Eric A. <and...@ce...> - 2004-04-30 13:35:15

J. Kasprzak wrote:
> Hello.
>
> Well, I couldn't help but notice that it's been quiet here lately. I
> haven't seen anything posted here for a while and haven't seen anything
> committed to the repository lately. I'd just like to know what's been
> happening, and be kept up to date, etc. I personally have not been as
> active with this lately, as I was on a short vacation last week and had
> another nasty cold.

[..snip good stuff..]

Ok - first, it HAS been quiet around here. I've pretty much been 150% busy
at work, so I've been disconnected from the project. I'll be back in action
very soon, and I guarantee I'll make up for the lost time, so now's a good
time to start studying the code, play with it, etc, because it's going to
be a coding frenzy when I start up again soon. :)

I'll reread these notes, and comment on them separately..

Anyone who's kicking back and waiting for some action - prepare for
Sprawler's most important time..

Eric

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
From: J. K. <ja...@ka...> - 2004-04-24 04:37:13
Hello. Well, I couldn't help but notice that it's been quiet here lately. I haven't seen anything posted here for a while and haven't seen anything committed to the repository lately. I'd just like top know what's been happening, and be kept up to date, etc. I personally have not been as active with this lately, as I was on a short vacation last week and had another nasty cold. But I'm still loking for ways to help, even though I haven't been able to code (my PC needs a RAM upgrade for me to be running the code and testing it the way I want.) Eric, I got your reply to my comments on the Sprawler Map, and I plan on commenting on that later on in a later reply (which I expect to come in the next few days.) So as I wonder what the four of you are doing, let me just dump some more information from notes that I have scribbled. Here you are: Here are some more calculations and thoughts on them. Just as a quick reminder, here are the variables I'm using: M: number of master machines N: number of indexer machines p: pages indexed per indexer s: (average) size of indexed data per web page I should also note that we have a value, which I've been calling max(M/N), which is the maximum number of masters we allow for indexers. Maybe I should represent this value with a single letter, as I do with the other values. But I'll leave this as it is for now. It's just a disorganized list of scratch pad calculations which I or we can make more formal later. The values I've been using for these variables are as follows: max(M/N) = 100 s = 100 kB = 0.1 MB p = 1000 pages I'm not saying that these are the numbers we will go with, or that these numbers are the best. It's just what I've been working with for now based on an initial estimated guess of what might be best. And remember, it's not so much the numbers that are important, it's the letters. I've come up with equations such as the one for average amout of data sent to master in an indexing cycle (which is max(M/N)*p*s = 10 GB) and these are what reveal the whole framework within which we are working, and put numerical values in to come up with optimal results. So here in this situation, in the worst case, where all data is received at once, 10 GB of data is sent to each master, and so it'll need to have that all copied over to the NFS disks before the indexers bring in any more data. I figure I already covered the bandwidth issue, so I'm bringing up disk space now. Although the more I look at this, the more it appears that bandwidth will be a bottleneck. So in this case 10 GB of data should be available for receiving data from the indexers, but that's not taking into account what I'm going to bring up next. And those are the issues of expansion and redundancy. Our engine will expand to take in more pages, and we'll need to have redundancy in the design in case masters go down. First, let's talk about expansion. Hardware expansion is relatively stagnant, I think. It's easier to increase max(M/n) than it is to get more hardware and money. We can fairly easily adjust p, and max(M/N). s is not something that'll change as much. As we look for more ways of indexing it'll gradually increase. What we need is a redundancy factor, where we decide how much more each machine should handle. Suppose we set that factor to 2, in other words, each machine can handle twice as much as it can in its worst case. This is something we'll need to maintain. As when one machine goes down, another can pick up and handle it. 
But designing large hardware systems isn't something I'm as familiar with, so please correct me if I'm dead wrong. And perhaps we need and expansion factor as well. Should machines be able to handle double their worst-case load in order to make sure it can handle expansion in the size of the index? As we grow, we'll need to be able to handle it. Bandwidth per machine may be what's most static, that's my concern at this time. Or we may need more masters to compensate for lack of bandwidth. Can we get ourselves some T3s? Now for some more calculations. Estimation of Client Indexing Time ---------------------------------- A client (or indexer) to does the following four things over and over again: 1. Downloads URLs (and perhaps information corresponding to them) 2. Downloads the HTML that from each URL 3. Indexes the data for each URL's HTML 4. Sends this indexed data back to the master So based on my original numbers, here's my calculation of how much time a client spends indexing, so we can get an idea of what indexing cycle length is. 1) Downloading 1000 URLs and their corresponding data on them would just likely be 100 kB (assuming 100 bytes of data per URL, but we could include what links to the page or anything else that's have the indexer know more about the URL, more about this idea later.) So this could become a 1 MB download, which would only be significant in length if you're using dialup. I'm assuming we won't have may dialup indexers, and so I won't even factor this in. 2) Assume 25 kb of HTML per web page. (Let w =25 kB) Then data downloaded = w*p = 0.025 MB * 1000 = 25 MB Assuming 1.5 MB/s of bandwidth, this phase takes: 25 MB ----- 1.5 MB/s = 216.7 s = 3 min 37 s 3) Assume: indexer time = 10 s / page = i (LEt i= 10s / page) Time to complete this phase = p*i = 1000 pages * 10 s/page = 10 000 s = 166.6 s = 2 h 46 min 4) Worst case time to send it back : I already estimated to be 1 h 51 min in worst case. Total indexing time = 4 hours 40 minutes And much of that time will be spend with a connection open, sending indexed data back, or getting HTML from web pages. One of the reasons I decided that we have each indexer index more pages per cycle would be so that it'd have more analysis to do, and would have less time with connections open. Perhaps I'm a little too concerned with making this more SETI@home-like, even though it can't be, by nature, with the way that it just goes out and grabs data all the time, from the web and from masters. Now time to consider two big questions: 1) How many pages will we index at a given time? Call this value P. 2) How much time do we want to spend (re)indexing the web? Let's call this goal for a time frame T. T, for now, is just something we are setting. But what it may all come down to is, which of the above questions is more important? For now, let's just say that our goal is to reindex our goal in terms of number of web pages once a week. Then T = 168 hours = 10 080 min = 604 800 s The worst-case cycle length was calculated to be 5 hours (based on my numbers above.) Therefore, 33 indexing cycles could be done in this time. But number of cycles may not be what's important here. Number of pages indexed = 33*p*M = 33*1000*10 = 330 000 pages But what if our goal is to index 1 000 000 (10^6) pages? Most obvious (and pherps worst) answer: Do another two weeks of indexing. What we'd prefer: having more indexers and corresponding masters. We can just triple the M value. 
The max(M/N) ratio remains the same though, so we also triple the number of indexers. So if we have 100 masters, triple that to 300, then we need 3000 indexers. And they won't just arrive. Much will need to be done from our own end to control project growth. And what about adjusting p? Well, number of cycles and pages indexed per cycle are inversely proprtional. The time that it all takes gets thrown off as well. And so, this just emphasizes the importance of completing indexing cycles very quickly, but we need to keep clients in mind, as we don't want them to consume excessive bandwidth, or take up excessive HDD space, etc. So we may need plenty of tweaks for speeding things up, lots of good hardware, and to find just the right amount of responsibility-sharing between clients and indexers. More on this to follow. Thanks, J.K. |
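[Editor's note: these scratch-pad figures are easy to keep honest with a tiny script that recomputes them from the assumed constants. The values below are simply the ones used in this post (100 indexers per master, 1000 pages per cycle, 100 kB of index data and 25 kB of HTML per page, 10 s of indexing time, 1.5 MB/s of bandwidth); they are not agreed project parameters. Changing a constant and re-running it is quicker than redoing the arithmetic by hand.]

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Scratch-pad constants from the discussion above (assumptions, not decisions).
    my $max_m_over_n = 100;     # indexers per master, worst case
    my $p            = 1000;    # pages indexed per indexer per cycle
    my $s            = 0.1;     # MB of indexed data per page (100 kB)
    my $w            = 0.025;   # MB of HTML per page (25 kB)
    my $i            = 10;      # seconds of indexing time per page
    my $bw           = 1.5;     # MB/s of bandwidth (master and client)

    # Worst case: every indexer uploads to its master at once.
    my $data_per_master_mb = $max_m_over_n * $p * $s;
    my $upload_secs        = $data_per_master_mb / $bw;

    # One client's cycle: fetch HTML, index it, then upload during the
    # worst-case window where it shares the master's bandwidth with everyone.
    my $fetch_secs  = ($w * $p) / $bw;
    my $index_secs  = $i * $p;
    my $client_secs = $fetch_secs + $index_secs + $upload_secs;

    printf "data per master per cycle : %.0f MB\n",  $data_per_master_mb;
    printf "worst-case upload time    : %.1f min\n", $upload_secs / 60;
    printf "client cycle length       : %.1f h\n",   $client_secs / 3600;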
From: Eric A. <and...@ce...> - 2004-03-31 16:25:01
J. Kasprzak wrote: >Well, that map is just what I figured we needed. It does look quite good, >and after going over it, there are quite a few things I'd like to say, >mostly concerning the physical aspects of it. > >First, about the logical aspect, I have nothing against what's currently >there. I can't think of anything wrong with it really, maybe a few subtle >changes could be done about in each of the modules there (ie. controller >vs. processor, maybe controller could do some of what procesor does, but >need more detail.) And since I need more details, I'll just focus mostly >on the physical design. > >So here's what I have to say. > >Physical Design Issues >---------------------- > >Ratio of Indexer Machines to Master Machines: It appears to be N:1, in that >each master in the layer has N indexers assigned to it. Is this assignment >static or dynamic? And how large or small should N be? Perhaps it should all >be dynamically assigned, as we would want the load to be distributed as >much as possible. And a maximum value of N needs to be set up for each >master, assuming hardware differences. (Or are all indexers created >equal?) How many connections should each one handle? > >Ratios In General: For each level, it says there are N machines. As an >individual who is something of a mathematics geek, this seems to imply that >all levels have equal numbers of machines. It may seem that I'm nit-picking >here, but this could lead to confusion, especially when looking over the >last section. > > I'm glad you mentioned that. Unfortunately, I wasn't too clear on that part. Here's what I was thinking of doing. The harvester's all talk to *apparently* the same controller - but it isn't. If we have 5 controllers, we have one DNS name (cntlr.sprawler.com) that uses DNS round robin to distribute load between as many (or as little) actual controller machines as we need. We can add machines, and within minutes, the load will the distributed to those new hosts. We can remove them, etc. If we lose a machine (hardware failure, etc), we remove it from the DNS round robin listing and within about a minute it disappears. The harvester will be smart enough to know if it gets a reset connection (host unavailable, broken, etc) it waits x seconds and retries (most probably getting a new machine, which should work). This also gives us the flexibility to use any load balancing/distributing application we want, with no notice or change to the clients. There can be any number of any of the machines on the map. I made them all the same out of symmetry, but there's no real reason there needs to be any specific number of each. We'll grow the different classes of machines as needed (and requires barely any changes to add/remove additional machines). >Adaptability: This concerns the overall structure for all levels. How well >can it adapt to changes? More specifically, can it adapt well to machines >being added/upgraded on the fly? Would configuration files need to be >changed? This last point ties in with the last section somewhat. > > I think I answered this above, but if I didn't answer to your satisfaction, please say so and I can go into more detail. >Possible Redundancy: Does each master know what other masters are doing? Do >they need to? How do URLs and data get partitioned among master machines? Do >they all have access to a centralized data store of URLs? The physical >design does not seem to indicate where this data on URLs and phases each >URL is in is stored. 
> > The controllers (masters) have no idea what any other controller is doing, or even if there are any other controllers. They don't need to know. They all access a central repository of URLs, and pull (somewhat randomly) a list of URLs to hand out to the harvesters. The actual data is stored on NFS shared disks that the controllers can access, and also the processors and maintainers can also access. This is why we can pop in more controllers at any given time, or remove them. It's all at our whim. >Client Idle Time: If there's a situation where the indexer cannot connect to >the master, what can it do? If servers go down, the queue could then become >excessively long. And what if there's nothing left for the client to index >before the next reindexing cycle (dare I suggest this so soon?) Maybe the >client can just reindex the URLs again. > > If the harvester cannot connect to a controller, it sleeps x seconds and tries again (hopefully getting another controller and moving on). If it continues to fail, it will try forever. The queue should not get huge on us - only if we are gathering data from the harvesters and not handing out URLs, or vice versa - we are handing out URLs, but not gathering the data back - which shouldn't happen, and even if it does, it should not be a real problem except for a little heavy traffic when things come back alive. There will never be a time that we have nothing to index - the maintainer will see to that. It's job is to make sure that there are always URLs to be indexed by selecting URLs that have already been indexed to be re-indexed, and re-inject them into the indexing pool. >General Bad Stuff: Eric mentioned "nastygrams" could theoretically be sent >over and perhaps the data sent over could be spoofed. And it certainly is >a nasty world out there, with people who may test to see how robust our >servers really are. We may need to consistently work on methods to keep >from being flooded with DDoS attacks and other related floods, and so we >may need keep checking for what is legitimate data and what isn't. But >with indexers, perhaps we can only start out with people we trust. > > Definitely! In fact, we'll probably only go to a dstributed indexing system once we are well on our way to a working engine, so we should have some publicity, and hopefully a few more good developers on the team. >Separation of Responsibility Between WWW Server and Compute Machines: Who >should do the parsing of user input? Maybe a bit of both would do. All >user input could be encoded into the URL string (as Google does) or maybe >it could all be put into an easily-parsable form on the web side. And >should CGI be used? PHP is free and fast, and doesn't use CGI (right?) . >The data on what's searched for canbe sent through it, and the compute >machine can take the data in perhaps aneasily-parsable form. But this may >be more software-related, and we need to determine the ratio of WWW >machines and Compute machines to find the right balance. I think that >maybe there should be more WWW Servers, as taking in queries may not be >all they do. Remember that we may have personalization features and other >similar things there. > > I was thinking something similar. What I had envisioned was the WWW servers would do all HTML related creation, user interfacing, etc. It would take a query from a user, order it in a certain fashion, and request data from the compute machines (via TCP most likely). 
The WWW servers would send a search request for the different "tokens" that were entered (the WWW server parses this, and determines which compute machines to contact, what type of query, etc - more on this later on down the road) to the selected compute machine, and the compute machine responds back with the data to be shown to the user. Most of the time, the WWW server will need to contact several compute machines (and these are scalable also in a similar fashion as above), and take the data from each and compile it into a list. The WWW server will be almost completely doing web requests and HTML (or PHP) tricks, but some minimal computations could be done. >Reducing Disk Access: Much memory would be needed to keep what is most >likely to be requested in memory. And what we need is a good algorithm for >determining which data is most likely to be requested? Would it really be >what's most recent? Wouldn't it be what requested most often? Some >combination of both? And do these are have access to the same data on >this? Just more issues withdistributed computing. > > I think what we'll end up with is something that keeps in memory the most recent and most popular terms. Each compute machine will have it's own cache (which of course will be different than the others). I also think we'll have compute machines for different sections of index data. We can break up the access anyway we want - it really doesn't matter. We can have "rulesets" that the compute machines use to determine their "scope" of search, and the WWW servers will be told which compute machines have which scopes. This gives us the ability to spread the memory caches across scopes, machines, and the index as a whole, still using cheap hardware. >Priorities: We need to determine how to allocate resources. With the money >that we get, certain percentages can be allocated to certain places. But >how should this be done? The Computer and Master layers will need plenty of >hardware, but which may be more important? Is indexing more important than >searching may be what that last question comes down to. And this applies >for all layers. Which need more and/or better hardware? > > This is going to be interesting. I think we'll have to attack this one day at a time. Right now, we need disk space, and a few machines to use to get an initial setup going. Once we start indexing full-time, we'll need LOTS of disk space, and several machines for disk servers. As we get a larger index, we'll want to add compute and WWW machines, so we can support the increasing number of searches that we'll be getting. It will hit a break point, where suddenly many people know about us, and use us, yet we're still small and growing, so we'll be at a critical time. Hopefully I can score some good hardware deals with a few vendors before that happens. >-------------------------------------------------------------------------- > >Alright, now in that last section, I mostly raised questions. Now I'm >going to see what I can come up with for answers. > >First, answers to questions on the interaction of masters with indexer >clients, where I asked how many indexers each master could handle at a >time. Here, I model the system and come up with a little notation in order >to help us quantify everything and come up with some numbers that will be >useful in the overalldesign of the system. > >Let's define some variables first. 
> >M: the number of masters >N: the number of indexers >p: number of pages that each client indexes in each indexing cycle >s: size of file of indexed data for a URL > >So what we need to do is somehow come up with a maximum value for what M/N >should be. If the M:N ratio is too high, that'll lead to masters being >bogged down with requests for URLs and to have indexed data stored. And we >don't want that. Now thorughout this, I'm assuming the worst cases for >each indexing cycle. And an indexing cycle is, as you probably know, the >whole cycle of the master finding URLs in the "to be indexed" state, then >having clients request these URLs, then index data in them, and send the >indexed data back. > >So let max(M/N) be this maximum value. > >Whenever new indexers are added and >registered, the M/N ratio increases, and perhaps the server can be updated >of these changes and somehow it'll need to know how to assign it to a >master server at a time that isn't handling as many indexers at a time. >This assignment would be done dynamically, though perhaps I'm stating the >obvious here. The system just needs to know how many clients each master >is handling. > >But then p, the number of URLs that an indexer requests, could also be >made more dynamic, rather than just having a value in a configuration file >for it. The value p could be related to the length of each reindexing >interval, and I'll cover the importance of that next. > >But here are some quick little equations to put values into: > >Number of pages being indexed per cycle = p * N >Size of data handled by master per indexing cycle = max(M/N) * p * s > >In the worst case, the master handles all of this data at once. > >We want to maximize p*n. (To index as many pages as possible per cycle.) > >But max(M/N)*p*s should be capped. But how? We want to limit the amount of >time clients spend sending data, and masters spend taking it, right? We need >to take bandwith per master into consideration. > >Let's say masters get all the data at once: which is the worst case, and >would ideally be avoided. (client badwidth and processing speed can vary >quite a bit, and this could actually be good news for us, causing us to >avoid this, but I digrress.) > >Quick experiment: > >Say a master has 1.5 MBps of bandwith. (as do clients.) > >Let s=100 kB (we've been using this as a worst-case maximum figure) >Let M/N = 100 >Let p = 1000 pages/cycle > >Then data sent at given time = 100 * 1000 pages/cycle* 0.1 MB/page > = 10 000 MB/cycle > >Worst case time per cycle = 10 000 MB / 1.5 MB/s > = 6 667 s > = 111.11 min > = 1 h 51 min 7 s > > >Now two hours to upload the data does not look good, but this is absolute >worst case, and it does show the inportance of capping p and max(M/N) >values. We can just keep playing with these values, and it's what we're >working with, unless there are a few things here I'm wrong about. > >Another issue: Given p, how long would each indexer take to index p pages? >In other words, to go through its mini-cycle? What it works out to is is >follows: > >average time for indexer to >complete its part of cycle = p*avg(pagesize) + avg(indexing time per page) > --------------- > avg(bandwidth) > > >We will need to find out these averages above to get total cycle length, >which should be related to the reindexer interval. Perhaps it shoud be >dynamic based on statistics we compile as we index pages? 
Add that last >equation to worst-case time per cycle, and time master takes to find what >needs to be indexed, and there you have cycle length. > >One last thing I should mention is that these are just things that came from >my scratch pad. I've been quite interested in how such a large system would >work, and we'll need to come up with some numbers here. Now, if I'm wrong >about anything here, now would be a good time to tell me. And if so, you can >correct me and this data can be made more formal (ie. I haven't assigned >variables to worst case indexing times, and maybe I should.) > > This is awesome. I love it. This is exactly what I hope to see from someone working on this project. Ok - I think your numbers are *extreme* worst case scenario, but good to see nonetheless. There are a few things I'd say about them: first, the clients will naturally stagger themselves out a bit, so it won't take all clients 2 hrs to send their data. Plus, you are figuring 100 clients, with 1000 pages per cycle - which is a little high I think. It will probably be more like 50 pages per cycle (we don't want to slam our servers when they upload, we don't want one client to be responsible for too many URLs, and we don't want to use up too much space on the client's disk). I think you should incorporate some of these calculations in the code. Add some simple routines to time and average the page sizes, download times per page, per cycle, upload times, etc. It would be good for us to know, and nice for debugging. Eric -- ------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Today is the tomorrow you worried about yesterday. ------------------------------------------------------------------ |
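[Editor's note: Eric's request to "time and average the page sizes, download times per page, per cycle, upload times" could live in one small helper shared by the harvester and controller. A possible shape, with hypothetical names; there is no Sprawler::Stats module anywhere in this thread, so this is a sketch, not existing code.]

    package Sprawler::Stats;   # hypothetical helper module
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my %samples;   # metric name => list of observed values

    sub start_timer { return [gettimeofday] }

    sub stop_timer {
        my ($metric, $t0) = @_;
        record($metric, tv_interval($t0));
    }

    sub record {
        my ($metric, $value) = @_;
        push @{ $samples{$metric} }, $value;
    }

    sub average {
        my ($metric) = @_;
        my @v = @{ $samples{$metric} || [] };
        return 0 unless @v;
        my $sum = 0;
        $sum += $_ for @v;
        return $sum / @v;
    }

    sub report {
        printf "%-20s avg %.3f over %d samples\n",
               $_, average($_), scalar @{ $samples{$_} }
            for sort keys %samples;
    }

    1;

The harvester would then wrap each fetch in start_timer/stop_timer('download_per_page', $t0), call record('page_size_kb', ...) after parsing, and dump report() at the end of a cycle.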
From: <mni...@mo...> - 2004-03-31 13:24:53

My additions below. I'm thinking for testing we should schedule a time
frame and try to get everyone on the irc channel, to discuss what we see. I
guess from 9:00pm EST to 12:00pm EST is good for me.

Here is the updated to-do list (as promised). Please feel free to pick
something you like off the list, send an email to this list (-devel), and
let us know what you're working on.

Harvester: (indexer.pl)
(my additions, although I've been a little out of the loop lately)
-----------------------
 o indexer doesn't completely store data in the local db files. for
   instance, the urls are stored, but not the text linking to those urls
   (the linked text should be stored with each url) (open)
 o there are a lot of index types missing (header text, small text, strong
   text, etc, etc) (open)
 o fix pick_lanquage method (Eric)
 o test and select an html parser (HTML::Parser, XML::Parser, TokeParser,
   Pull Parser) based on efficiency (Ilya)
 o methods for determining font clashes (open)
 o renaming of all classes, methods to reflect current naming convention.
   (mojo)

Controller: (master.pl)
-----------------------
 o Controller needs to check for "nastigrams" - characters and such that
   could cause the Controller to execute commands on behalf of the user it
   is running as (see the sketch after this list). (open)(mojo)
 o Methods for toggling states, re-indexing, etc (open).
 o Patch needed to make Controller only allow checkout of a max number of
   urls per Harvester, so we need to check how many they currently have
   checked out, and get the difference. (open)

Queue Processor: (queue.pl)
---------------------------
 o add a queue processing agent that goes through the db files sent by the
   harvester to the controller, parses the data out, and puts it in the
   index tree. (open)

Queue Maintainer: (maintainer.pl)
---------------------------------
 o Agent that runs independently of other programs, goes through the state
   db's, finds urls that need reindexing, and re-injects them into the
   queue by changing their state (and moving them to the corresponding
   state db). (open)

General
-------
 o TESTERS! We need your bandwidth! This is an easy way to get involved!
   (EVERYONE)
 o Should we try a certain time and have everyone join #sprawler?
 o Design good user interface for web front end (open)
 o general error checking and code robustness. (open)
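[Editor's note: for the "nastigrams" item under Controller, one conservative approach is to whitelist what a harvester may send instead of trying to enumerate dangerous characters. The field names and patterns below are illustrative guesses, not the actual Sprawler protocol.]

    use strict;
    use warnings;

    # Accept only a conservative character set in client-supplied fields,
    # so nothing a harvester sends can reach a shell or eval unescaped.
    my %pattern = (
        client_id => qr{^[\w.@-]{1,64}$},
        url       => qr{^https?://[\w.-]+(?::\d+)?(?:/[\w\-./?&=%~+]*)?$}i,
    );

    sub sanitize_field {
        my ($field, $value) = @_;
        my $re = $pattern{$field} or return undef;   # unknown field: reject
        return $value =~ $re ? $value : undef;
    }

    # Example: drop the whole request if any field fails.
    my %request = (client_id => 'test-client-1',
                   url       => 'http://www.sprawler.com/index.html');
    while (my ($field, $value) = each %request) {
        defined sanitize_field($field, $value)
            or die "rejecting request: bad $field\n";
    }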
From: <mni...@mo...> - 2004-03-31 13:14:52

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> Processor: (processor.pl)
> -------------------------

I feel processor in general is too vague of a term. Don't get me wrong, I
don't care that much what it's named, but how about just queue.pl?

> Maintainer: (maintainer.pl)
> ----------------------------
>  o Agent that runs independently of other programs, goes through the
>    state db's and finds urls that need reindexing, and re-injects them
>    into the queue by changing their state (and moving them to the
>    corresponding state db). (open)

Jury's still out on this one.
From: J. K. <ja...@ka...> - 2004-03-31 06:04:01
Well, that map is just what I figured we needed. It does look quite good, and after going over it, there are quite a few things I'd like to say, mostly concerning the physical aspects of it. First, about the logical aspect, I have nothing against what's currently there. I can't think of anything wrong with it really, maybe a few subtle changes could be done about in each of the modules there (ie. controller vs. processor, maybe controller could do some of what procesor does, but need more detail.) And since I need more details, I'll just focus mostly on the physical design. So here's what I have to say. Physical Design Issues ---------------------- Ratio of Indexer Machines to Master Machines: It appears to be N:1, in that each master in the layer has N indexers assigned to it. Is this assignment static or dynamic? And how large or small should N be? Perhaps it should all be dynamically assigned, as we would want the load to be distributed as much as possible. And a maximum value of N needs to be set up for each master, assuming hardware differences. (Or are all indexers created equal?) How many connections should each one handle? Ratios In General: For each level, it says there are N machines. As an individual who is something of a mathematics geek, this seems to imply that all levels have equal numbers of machines. It may seem that I'm nit-picking here, but this could lead to confusion, especially when looking over the last section. Adaptability: This concerns the overall structure for all levels. How well can it adapt to changes? More specifically, can it adapt well to machines being added/upgraded on the fly? Would configuration files need to be changed? This last point ties in with the last section somewhat. Possible Redundancy: Does each master know what other masters are doing? Do they need to? How do URLs and data get partitioned among master machines? Do they all have access to a centralized data store of URLs? The physical design does not seem to indicate where this data on URLs and phases each URL is in is stored. Client Idle Time: If there's a situation where the indexer cannot connect to the master, what can it do? If servers go down, the queue could then become excessively long. And what if there's nothing left for the client to index before the next reindexing cycle (dare I suggest this so soon?) Maybe the client can just reindex the URLs again. General Bad Stuff: Eric mentioned "nastygrams" could theoretically be sent over and perhaps the data sent over could be spoofed. And it certainly is a nasty world out there, with people who may test to see how robust our servers really are. We may need to consistently work on methods to keep from being flooded with DDoS attacks and other related floods, and so we may need keep checking for what is legitimate data and what isn't. But with indexers, perhaps we can only start out with people we trust. Separation of Responsibility Between WWW Server and Compute Machines: Who should do the parsing of user input? Maybe a bit of both would do. All user input could be encoded into the URL string (as Google does) or maybe it could all be put into an easily-parsable form on the web side. And should CGI be used? PHP is free and fast, and doesn't use CGI (right?) . The data on what's searched for canbe sent through it, and the compute machine can take the data in perhaps aneasily-parsable form. But this may be more software-related, and we need to determine the ratio of WWW machines and Compute machines to find the right balance. 
I think that maybe there should be more WWW Servers, as taking in queries may not be all they do. Remember that we may have personalization features and other similar things there. Reducing Disk Access: Much memory would be needed to keep what is most likely to be requested in memory. And what we need is a good algorithm for determining which data is most likely to be requested? Would it really be what's most recent? Wouldn't it be what requested most often? Some combination of both? And do these are have access to the same data on this? Just more issues withdistributed computing. Priorities: We need to determine how to allocate resources. With the money that we get, certain percentages can be allocated to certain places. But how should this be done? The Computer and Master layers will need plenty of hardware, but which may be more important? Is indexing more important than searching may be what that last question comes down to. And this applies for all layers. Which need more and/or better hardware? -------------------------------------------------------------------------- Alright, now in that last section, I mostly raised questions. Now I'm going to see what I can come up with for answers. First, answers to questions on the interaction of masters with indexer clients, where I asked how many indexers each master could handle at a time. Here, I model the system and come up with a little notation in order to help us quantify everything and come up with some numbers that will be useful in the overalldesign of the system. Let's define some variables first. M: the number of masters N: the number of indexers p: number of pages that each client indexes in each indexing cycle s: size of file of indexed data for a URL So what we need to do is somehow come up with a maximum value for what M/N should be. If the M:N ratio is too high, that'll lead to masters being bogged down with requests for URLs and to have indexed data stored. And we don't want that. Now thorughout this, I'm assuming the worst cases for each indexing cycle. And an indexing cycle is, as you probably know, the whole cycle of the master finding URLs in the "to be indexed" state, then having clients request these URLs, then index data in them, and send the indexed data back. So let max(M/N) be this maximum value. Whenever new indexers are added and registered, the M/N ratio increases, and perhaps the server can be updated of these changes and somehow it'll need to know how to assign it to a master server at a time that isn't handling as many indexers at a time. This assignment would be done dynamically, though perhaps I'm stating the obvious here. The system just needs to know how many clients each master is handling. But then p, the number of URLs that an indexer requests, could also be made more dynamic, rather than just having a value in a configuration file for it. The value p could be related to the length of each reindexing interval, and I'll cover the importance of that next. But here are some quick little equations to put values into: Number of pages being indexed per cycle = p * N Size of data handled by master per indexing cycle = max(M/N) * p * s In the worst case, the master handles all of this data at once. We want to maximize p*n. (To index as many pages as possible per cycle.) But max(M/N)*p*s should be capped. But how? We want to limit the amount of time clients spend sending data, and masters spend taking it, right? We need to take bandwith per master into consideration. 
Let's say masters get all the data at once, which is the worst case and
would ideally be avoided. (Client bandwidth and processing speed can vary
quite a bit, and this could actually be good news for us, causing us to
avoid this, but I digress.)

Quick experiment:

Say a master has 1.5 MB/s of bandwidth (as do clients).

Let s = 100 kB (we've been using this as a worst-case maximum figure)
Let M/N = 100
Let p = 1000 pages/cycle

Then data sent at a given time = 100 * 1000 pages/cycle * 0.1 MB/page
                               = 10 000 MB/cycle

Worst case time per cycle = 10 000 MB / 1.5 MB/s
                          = 6 667 s
                          = 111.11 min
                          = 1 h 51 min 7 s

Now two hours to upload the data does not look good, but this is the
absolute worst case, and it does show the importance of capping p and
max(M/N) values. We can just keep playing with these values, and it's what
we're working with, unless there are a few things here I'm wrong about.

Another issue: given p, how long would each indexer take to index p pages?
In other words, to go through its mini-cycle? What it works out to is as
follows:

average time for indexer to      p*avg(pagesize)
complete its part of cycle   =   ---------------  +  avg(indexing time per page)
                                 avg(bandwidth)

We will need to find out these averages above to get total cycle length,
which should be related to the reindexer interval. Perhaps it should be
dynamic, based on statistics we compile as we index pages? Add that last
equation to the worst-case time per cycle, plus the time the master takes
to find what needs to be indexed, and there you have cycle length.

One last thing I should mention is that these are just things that came
from my scratch pad. I've been quite interested in how such a large system
would work, and we'll need to come up with some numbers here. Now, if I'm
wrong about anything here, now would be a good time to tell me. And if so,
you can correct me and this data can be made more formal (i.e. I haven't
assigned variables to worst case indexing times, and maybe I should).

Thanks,
J.K.

> I've written up a "map", or floorplan of some of the conceptual layout
> of the project - both physical and logical. I've put it here:
>
> http://www.sprawler.com/Sprawler-map.pdf
>
> Please feel free to comment, ask questions, etc. Specially on the
> naming of things.
>
> Eric
From: Eric A. <and...@ce...> - 2004-03-31 04:29:14

Here is the updated to-do list (as promised). Please feel free to pick
something you like off the list, send an email to this list (-devel), and
let us know what you're working on.

Harvester: (indexer.pl)
-----------------------
 o indexer doesn't completely store data in the local db files. for
   instance, the urls are stored, but not the text linking to those urls
   (the linked text should be stored with each url) (open)
 o there are a lot of index types missing (header text, small text, strong
   text, etc, etc) (open)
 o fix pick_lanquage method (Eric)
 o test and select an html parser (HTML::Parser, XML::Parser, TokeParser,
   Pull Parser) based on efficiency (Ilya)
 o methods for determining font clashes (open)

Controller: (master.pl)
-----------------------
 o Controller needs to check for "nastigrams" - characters and such that
   could cause the Controller to execute commands on behalf of the user it
   is running as. (open)
 o Methods for toggling states, re-indexing, etc (open).
 o Patch needed to make Controller only allow checkout of a max number of
   urls per Harvester, so we need to check how many they currently have
   checked out, and get the difference (see the sketch after this list).
   (open)

Processor: (processor.pl)
-------------------------
 o add a queue processing agent that goes through the db files sent by the
   harvester to the controller, parses the data out, and puts it in the
   index tree. (open)

Maintainer: (maintainer.pl)
---------------------------
 o Agent that runs independently of other programs, goes through the state
   db's, finds urls that need reindexing, and re-injects them into the
   queue by changing their state (and moving them to the corresponding
   state db). (open)

General
-------
 o TESTERS! We need your bandwidth! This is an easy way to get involved!
   (EVERYONE)
 o Design good user interface for web front end (open)
 o general error checking and code robustness. (open)

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
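[Editor's note: the max-checkout patch on this list mostly amounts to bookkeeping: track how many URLs each harvester already holds and hand out only the difference. A sketch under assumed names; the 50-URL cap echoes the per-cycle figure Eric mentions elsewhere in the thread, but is not a settled value.]

    use strict;
    use warnings;

    my $MAX_URLS_PER_HARVESTER = 50;   # per-cycle cap discussed on the list

    # How many URLs each harvester currently has checked out (assumed
    # in-memory view; the real Controller would read this from the state db).
    my %checked_out_count;

    # Hand out at most the difference between the cap and what is already out.
    sub urls_to_hand_out {
        my ($client_id, $requested) = @_;
        my $already = $checked_out_count{$client_id} || 0;
        my $room    = $MAX_URLS_PER_HARVESTER - $already;
        $room = 0 if $room < 0;
        my $grant = $requested < $room ? $requested : $room;
        $checked_out_count{$client_id} += $grant;
        return $grant;
    }

    # When the harvester checks URLs back in, free up its quota.
    sub check_in {
        my ($client_id, $count) = @_;
        $checked_out_count{$client_id} -= $count;
        $checked_out_count{$client_id} = 0 if $checked_out_count{$client_id} < 0;
    }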
From: Eric A. <and...@ce...> - 2004-03-31 04:28:21
|
As noted in the sprawler-map.pdf I pointed to a couple of weeks back, I proposed changing the names of the various parts of the Sprawler software. Since I have heard no objections, I plan on doing that this weekend unless I hear cries and complaints by then.

New to-do list coming up..

Eric

--
------------------------------------------------------------------
Eric Anderson          Sr. Systems Administrator
Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
|
From: Eric A. <and...@ce...> - 2004-03-31 04:26:44
|
Just to keep everyone in the loop, I've recently built another Sprawler development/test box - this machine has the following specs:

Processor:  P4 1.8GHz (512K Cache)
Memory:     768MB RAM (DDR 333)
Disk:       ~350GB usable disk space
Connection: T1

It has room for growth, so as we need more disk space for testing/building indexes, we'll add drives. Once this machine is maxed out, we should be ready to roll into a full layout. If anyone has spare hard disks lying around, feel free to donate them.

Eric

--
------------------------------------------------------------------
Eric Anderson          Sr. Systems Administrator
Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
|
From: J. K. <ja...@ka...> - 2004-03-26 06:40:40
|
Hello there.

> I think we must to add next fields of HTTP header:
>
> Expires:
> it may be helpful for master - not index document earle then Expires
> date.
>
> Last-Modified:
> it might be a good for indexer:
> if (date_now>Expires: and Last-Modified:<Expires:)
> {
> not index this document
> }
>
> or
>
> if (date_of_index<=Last-Modified:)
> {
> not index this document
> }

I supported the whole idea of using this information in the HTTP headers. But after chatting with Eric on the #sprawler channel on irc.freenode.net the other day, he informed me that the "Last-Modified" date isn't always accurate. That doesn't mean we can't use it, it just means we can't rely on it.

But there is some other information we can get, and that would be the size of the document, which you may notice is what Google stores for pages it has cached (for some reason, non-cached pages don't have this information in Google's results.) Now, I do understand the pages may change before we can reindex them, but do they tend to change so much that the size data would be that far off? Having that stored on a results page can give users an idea of the size of a page, which can be very useful if they want to know how much content is there. It's also good to know when you have a dialup connection. :)

Here's a tweak for you: I see that we have a line in Client.pm that says:

  my @docheader=LWP::Simple::head($document);

Also added was:

  $self->{CONTENT_TYPE}=undef;

Well, we can add this line after where we get the header:

  $self->{SIZE_IN_BYTES}=$docheader[1];

Of course, the SIZE_IN_BYTES value would need to be declared first, but you get the idea. If I'm not mistaken, other values returned from the header are the modification time and expiration date (in that order in the array returned by the function.) We can take that info as well, but what we do with it is another story.

Anyway, just thought I'd throw that in. I've mentioned that any data we can get from headers may be valuable, and size can also be good for gathering statistics, giving us a better idea of average web page size for our purposes. It'll be good to know, so we'll know how much hardware we'll need. And then there's the issue of caching web pages, where page size, of course, is definitely a factor.

On a completely unrelated note, I plan on commenting on the Sprawler map within the next few days (this time I'll make sure of it, Eric.) So much to do, so little time.

Thanks,
J.K.
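[For reference, here is a small standalone Perl sketch of the kind of header capture the message above describes. In list context, LWP::Simple::head() returns ($content_type, $document_length, $modified_time, $expires, $server). The hash keys other than SIZE_IN_BYTES and CONTENT_TYPE, the example url, and the reindex policy are illustrative assumptions, not the actual Client.pm code.]

#!/usr/bin/perl
# Standalone sketch of pulling size/date information from a HEAD request,
# in the spirit of the Client.pm tweak above.
use strict;
use warnings;
use LWP::Simple ();

my $document = 'http://www.example.com/';   # placeholder url

# In list context head() returns:
#   ($content_type, $document_length, $modified_time, $expires, $server)
my @docheader = LWP::Simple::head($document);

my %page = (
    CONTENT_TYPE  => $docheader[0],
    SIZE_IN_BYTES => $docheader[1],   # may be undef if the server omits Content-Length
    LAST_MODIFIED => $docheader[2],   # epoch seconds; not always trustworthy
    EXPIRES       => $docheader[3],
);

# Example policy sketch: skip reindexing if the page claims not to have
# changed since we last indexed it (a hypothetical timestamp here).
my $date_of_index = time() - 7 * 24 * 3600;
if (defined $page{LAST_MODIFIED} && $page{LAST_MODIFIED} <= $date_of_index) {
    print "skip: not modified since last index\n";
} else {
    print "reindex: modified or unknown\n";
}
|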
From: J. K. <ja...@ka...> - 2004-03-26 06:22:41
|
Hello.

> I think it looks great, and is very well done. I have only read through
> the beginning in detail, and skimmed the rest, but it looks awesome!
> I'm curious about what everyone thinks we should do - have a separate
> document like this for the documentation, or insert it inline with the
> code? I like inline, but it can make it harder to skim through code
> without all the docs.

Well, it's good to hear that you (and Mojo) like what I've come up with. Mojo said that this will facilitate changes and new additions, and that's just what I was hoping to do with this. It might be best to leave it as it is in order to help us understand it, see the big picture, etc. And having it inline would be good as well. You can just copy and paste it all in, and whenever changes are made, if you'd like to have it externally, all it'll take is a quick script to update the external file. I think if we just mark the commented data with extra sharp symbols (which can also be done by a short script), then this other script for updating the external documentation can be run automatically. In fact, I'm attaching a text file with two separate Perl scripts for doing those two things.

>> Did I go into too much or too little detail? There are some
>> inconsistencies in the format, and what is in it (i.e. I didn't include
>> the who-calls-what, but that may be a little too detailed for internal
>> documentation; perhaps it's more appropriate for more external
>> documentation, which I may come up with next.) Also, it might not be
>> that easy to understand, in particular, where there are optional
>> parameters.
>
> Perfect amount of detail I think.

Perhaps it is, although maybe I should include more of the who-calls-who for each method wherever possible, in order to give a better idea of not just what the methods do, but how they interact. Also, maybe there are some ways I can change the format to make it look more readable. I'd like to hear any suggestions you may have.

>> You can tell me what it is that you'd like done here, as I do plan on
>> expanding on it, cleaning it up, etc. I say we should also keep the big
>> picture in mind, and while Eric's Sprawler map was highly informative,
>> we need to bridge the gap between that and the code. This is to help with
>> that. You could think of it as "Sprawler Code for Dummies" or whichever
>> you prefer. Anyway, I think I'll go look through that TODO list for
>> another task to work on, even though I may work a little more on this.
>> You can do what you want with it, though. And maybe this is a file that
>> can be kept separate, for a lookup of all classes and the methods in
>> them.
>
> Maybe we should put all these notes/docs on the website? The SF website?

Hmm, I was actually thinking of having it in the repository. It is something that'd be updated quite often, and since it gives somewhat low-level details on the code, maybe it should live with the code in the repository. I was thinking it'll be easier to just commit this rather than update any of our websites.

>> On a remotely-related note, here's something to take a look at:
>> http://sourceforge.net/projects/dotproject/
>> I've taken a look at it and am thinking that is something we could use.
>> And I must say I like their home page. It's just another reason we can
>> think of setting up our own blog. And with this project, we can keep
>> our internal documentation there, rather than have to go through
>> mailing list archives. Just a thought.
>
> I've looked at that before. Very interesting - we could use it, but we
> don't even use the sourceforge stuff that much as it is. Maybe we should
> start? There are task managers, etc.
>
> In fact - what do you think about putting all this documentation,
> todo's, etc, in the task manager and documentation manager? Would you
> like to maintain this?

Silly me, I hadn't thought very much about using what's there. We definitely need to have the Task Manager that's currently there updated (it hasn't been in months), and perhaps the whole project description as well, as I don't think we're the first to have open source search technology. We could also use the Doc Manager, although that isn't something I've seen used very often. Maybe I can look into that, and then I can tell you if that's something I'd like to work on (or if we should use what's there at all.) But those are some good ideas.

Thanks,
J.K.

> Eric
>
> --
> ------------------------------------------------------------------
> Eric Anderson          Sr. Systems Administrator
> Centaur Technology
> Today is the tomorrow you worried about yesterday.
> ------------------------------------------------------------------
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IBM Linux Tutorials
> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> GenToo technologies. Learn everything from fundamentals to system
> administration. http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> _______________________________________________
> Sprawler-devel mailing list
> Spr...@li...
> https://lists.sourceforge.net/lists/listinfo/sprawler-devel
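[The two attached scripts mentioned above aren't in the archive. As a rough idea of what such an extraction pass could look like, here is a hypothetical sketch that pulls doubled-sharp (##) comment blocks out of the .pm files into an external documentation file; the ## convention, file paths, and output name are assumptions for illustration, not the actual attachment.]

#!/usr/bin/perl
# Hypothetical sketch of an "extract inline docs" pass, in the spirit of
# the attachment described above.  Assumes documentation comments in the
# modules are marked with a doubled sharp (##).
use strict;
use warnings;

my $outfile = 'DOCUMENTATION.txt';        # assumed output file
open my $out, '>', $outfile or die "can't write $outfile: $!";

for my $module (glob 'lib/*.pm') {        # e.g. lib/Sprawler.pm
    open my $in, '<', $module or die "can't read $module: $!";
    print {$out} "==== $module ====\n";
    while (my $line = <$in>) {
        # Keep only the doc comments, stripping the leading '## '.
        if ($line =~ /^\s*##\s?(.*)$/) {
            print {$out} "$1\n";
        }
    }
    print {$out} "\n";
    close $in;
}
close $out;
print "wrote $outfile\n";
|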