Messages per month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 |  |  |  |  |  |  |  | 14 | 17 |  |  |  |
| 2001 |  |  |  |  | 1 | 6 |  |  | 2 |  |  |  |
| 2002 |  | 2 |  |  |  |  |  |  |  |  |  | 2 |
| 2003 | 1 |  |  |  | 1 |  |  | 1 |  |  |  |  |
| 2004 | 1 |  | 2 |  |  |  | 1 |  |  |  |  |  |
| 2005 |  |  |  |  |  |  |  |  |  |  |  | 5 |
| 2006 | 4 | 3 | 5 | 1 | 7 | 16 | 2 | 2 | 4 | 20 | 17 | 6 |
| 2007 | 34 | 15 | 1 | 4 |  | 1 | 1 | 26 | 13 | 1 | 3 | 4 |
| 2008 | 6 | 3 | 29 | 19 | 12 | 9 | 23 | 9 | 6 | 10 | 31 | 45 |
| 2009 | 62 | 11 | 42 | 24 | 82 | 80 | 39 | 12 | 28 | 30 | 7 | 4 |
| 2010 | 1 |  | 45 | 57 | 65 | 75 | 31 | 45 | 26 |  |  |  |
| 2011 |  |  |  |  |  | 1 |  |  |  |  |  |  |
From: Booker N. <ewl...@be...> - 2006-01-10 08:03:06
|
This One is Strong UP 0.50 (28.57%) Jan 9th Alone Huge PR Campaign Running for Tuesday Jan 10th We expect explosive growth thru Friday Infinex Ventures Inc. (IFNX) Current Price: 2.25 UP 0.50 (28.57%) OTC: IFNX.OB COMPANY OVERVIEW Aggressive and energetic, Infinex boasts a dynamic and diversified portfolio of operations across North America, with an eye on international expansion. Grounded in natural resource exploration, Inifinex also offers investors access to exciting new developments in the high-tech sector and the booming international real estate market. Our market based experience, tenacious research techniques, and razor sharp analytical skills allow us to leverage opportunities in emerging markets and developing technologies. Identifying these opportunities in the earliest stages allows us to accelerate business development and fully realize the companyЎ¦s true potential. Maximizing overall profitability and in turn enhancing shareholder value. Current Press Release Infinex Ventures Inc. (IFNX - News) is pleased to announce the appointment of Mr. Stefano Masullo, to its Board of Directors. Mr. Michael De Rosa the President says, "Mr. Masullo's varied background in finance, engineering and economics, as well as his experience of over 10 years as a Board member of a vast number of International companies, will make him a valuable addition to the Infinex Board. His appointment will show our commitment to the financial, engineering and business structure of our Company." Mr. Masullo attended the University of Luigi Bocconi, in Milan Italy, where he graduated in industrial, economic and financial sciences. Mr. Masullo first began his well rounded career during one of his years at University (1986-1987), where he assisted the Director of Faculty of Finance in finance and investment. |
From: Timmy C. <wtu...@cn...> - 2006-01-07 22:16:37
|
H0t st0ck for New year time!! Infinex Ventures Inc. (IFNX) Current Price: 1.75 Expected 5 day: 2.30 COMPANY OVERVIEW Aggressive and energetic, Infinex boasts a dynamic and diversified portfolio of operations across North America, with an eye on international expansion. Grounded in natural resource exploration, Inifinex also offers investors access to exciting new developments in the high-tech sector and the booming international real estate market. Our market based experience, tenacious research techniques, and razor sharp analytical skills allow us to leverage opportunities in emerging markets and developing technologies. Identifying these opportunities in the earliest stages allows us to accelerate business development and fully realize the company¡¦s true potential. Maximizing overall profitability and in turn enhancing shareholder value. Current Press Release Infinex Ventures Inc. (IFNX - News) is pleased to announce the appointment of Mr. Stefano Masullo, to its Board of Directors. Mr. Michael De Rosa the President says, "Mr. Masullo's varied background in finance, engineering and economics, as well as his experience of over 10 years as a Board member of a vast number of International companies, will make him a valuable addition to the Infinex Board. His appointment will show our commitment to the financial, engineering and business structure of our Company." Mr. Masullo attended the University of Luigi Bocconi, in Milan Italy, where he graduated in industrial, economic and financial sciences. Mr. Masullo first began his well rounded career during one of his years at University (1986-1987), where he assisted the Director of Faculty of Finance in finance and investment. |
From: Roscoe - 2006-01-04 00:56:44
|
St ock Alert iPackets International, Inc. Global Developer and Provider of a Wide Range of Wireless and Communications Solutions for Selected Enterprises Including Mine-Safety (Source: News 1/3/06) OTC: IPKL Price: .35 Huge PR For Wednesday is Underway on IPKL. Short/Day Trading Opportunity for You? Sometimes it is Bang-Zoom on These Small st ocks..As Many of You may Know Recent News: Go Read the Full Stories Now! 1)iPackets International Receives US$85,000 Down Payment for Its First iPMine Deployment in China 2)iPackets International Attends Several Mining Trade Shows and Receives Tremendous Response for Its iPMine Mine-Safety Product Watch This One Trade on Wednesday! Radar it Right Now.. _______________ Information within this email contains 4rward l00 king sta tements within meaning of Section 27A of the Sec urities Act of nineteen thirty three and Section 21B of the Se curities Exchange Act of nineteen thirty four. Any statements that express or involve discussions with respect to predictions, expectations, beliefs, plans, projections, objectives, goals, assumptions future events or performance are not statements of historical fact and may be 4rward 1o0king statements. 4rward looking statements are based on ex pectations, es timates and pr ojections at the time the statements are that involve a number of risks and uncertainties which could cause actual results or events to differ materially from those presently featured Com pany is not a reporting company under the SEC Act of 1934 and there is limited information available on the company. The Co-mpany has a nominal ca sh position.It is an operating company. The company is going to need financing. If that financing does not occur, the company may not be able to continue as a going concern in which case you could lose your in-vestment. The pu blisher of this new sletter does not represent that the information contained in this mes sage states all material facts or does not omit a material fact necessary to make the statements therein not misleading. All information provided within this e_ mail pertaining to in-vesting, st 0cks, se curities must be understood as information provided and not inv estment advice. Remember a thorough due diligence effort, including a review of a company's filings when available, should be completed prior to in_vesting. The pub lisher of this news letter advises all readers and subscribers to seek advice from a registered professional securities representative before deciding to trade in st0cks featured this e mail. None of the material within this report shall be construed as any kind of in_vestment advice or solicitation. Many of these com panies on the verge of bankruptcy. You can lose all your mony by inv esting in st 0ck. The publisher of this new sletter is not a registered in-vestment advis0r. Subscribers should not view information herein as legal, tax, accounting or inve stment advice. In com pliance with the Securities Act of nineteen thirty three, Section 17(b),The publisher of this newsletter is contracted to receive fifteen th0 usand d0 l1ars from a third party, not an off icer, dir ector or af filiate shar eh0lder for the ci rculation of this report. Be aware of an inherent conflict of interest resulting from such compensation due to the fact that this is a paid advertisement and is not without bias. All factual information in this report was gathered from public sources, including but not limited to Co mpany Press Releases. 
Use of the information in this e mail constitutes your acceptance of these terms. |
From: Anderson H. <bo...@di...> - 2005-12-27 17:23:51
|
KOKO PETROLEUM (KKPT) - THIS STOCK IS UNDISCOVERED S T O C K GEM Current Price: 1.20 Symbol - KKPT Watch out the stock go crazy Tuesday morning KOKO Petroleum, Inc. (KKPT) issued an update on its working interest investment in two wells in the prolific Barnett Shale Play located in northern Texas. Under the terms of the participation agreement with Rife Energy Operating, Inc. (the program's operator), KOKO Petroleum has acquired a minority working interests (approx. 10%) in the drilling and completion of two wells; the Boyd #1 and the Inglish #2 both of which have been drilled but not yet completed. The operator is in the process of setting casing on the Inglish 2 and the Boyd is awaiting a sufficient water supply to start the completion. Due to the heavy influx of major operators in the area (Encana and XTO), scheduling completions and any other types of oil field services has been very difficult. Operators in the area have had to schedule well completions three to four months in advance. This coupled with the fact that Northern Texas has experienced a major drought causing serious shortfalls of local water. Rife, as an alternative, has drilled a water well, which was the source of drilling water for the Inglish 2 and Boyd 1. Rife has five wells that have been drilled and are awaiting completions. The Barnett Shale is the largest natural gas play in Texas. It is presently producing 900 MMCF of gas per day and is considered one of the largest U.S. domestic natural gas plays with sizable, remaining resource potential. The first Barnett Shale wells were drilled and completed in the early 1980s by Mitchell Energy of Houston, Texas. According to an in-depth 2004 sector report on the Barnett Shale, developed by Morgan Stanley (MWD), the Barnett Shale play is estimated to hold reserves in the non-core area that could be as high as 150 BCF per 1,000 acres. The report estimated that because of the amount of gas available in the area, successful wells in the Barnett Shale should be economically viable in almost any gas price environment. "The well logs are very encouraging, as were the wells they offset. Our operator is very resourceful and we should have these wells completed by the end of the year," says Ted Kozub, President of KOKO Petroleum, Inc. On the Corsicana front, KOKO and its Partner have applied for the drilling permits to commence the first 15 Nacatoch wells, casing is being delivered to the site and drilling will commence upon receipt of the permits. |
From: Lenard P. <mw...@ke...> - 2005-12-22 17:58:38
|
Explosive St=ck Alert Doll Technology Group Inc. Global Manufacturer and Marketer of "Clean & Green" Products and Technology Solutions(Source: News 12/6/05) OTC: DTGP Price: .14 Huge PR Campaign Underway For Thursday's Trading **DTGP** Can You Make Some Fast Money On This One? RECENT NEWS: Go Read The Full Stories Right Nowii 1)Doll Technology Group Begins U.S. Trials of AquaBoost(TM) 2)Doll Technology Group Announces Strategic Partnership With Land and Sea Development to Market BlazeTamer(TM) Fire Retardant Product- Initial Purchase Order Valued at Over $1.1 Million RedBrooks Laboratory, a DTGP subsidiary, is a full service independent facility that tests, qualifies and certifies all Doll Technology Group's products and services. The laboratory is one of the few government certified facilities for the testing of fire suppression systems for the aerospace, maritime, and general industries. (Source: News 12/2/05) Watch This One Trade on Thursday Radar it Right Now.. information within this email contains 4rward l00king statements within the m eaning of Sect ion twenty seven A of the Securities Act of nin eteen thirty three and Section twenty oneB of the Secu rities Exch ange Act of nineteen thirty four. Any statements that expr ess or involve discuss ions with respect to predi ctions, exp ectations, belie fs, pl ans, proj ections, objectives, g oals, assumpt ions or future events or perf ormance are not stat ements of his torical fact and may be 4 rward 1o0king statem ents. 4 rward looking stat ements are based on e xpectations, estimates and proj ections at the time the stat ements are made that in volve a nu mber of ri sks and uncer tainties wh ich could cause actual res ults or eve nts to dif fer mate rially from those p resently anticipa ted.Today's fea tured Compa ny is not a repr ting compan y und er the SEC Act of ninteen thirty four and theref ore there is limi ted inform tion availab le on the com pany. As with many micr ocap st=cks, today's company has dis closable material items you need to consider in order to make an informed and intelligent in_vestment decision. These items include: A nominal cash position. it is an operating Company. The company is going to need financing. if that financing does not occur, the company may not be able to continue as a going concern in which case you could lose your entire in-vestment. The publisher of this newsletter does not represent that the informa tion contained in this message states all ma terial facts or does not omit a mat erial fact neces sary to make the state ments therein not misle ading. All in formation provided within this e_ mail perta ining to in- vesting, st=cks, securities must be understood as informat ion provi ded and not in vest ment advice. Remember a tho rough due dilige nce effort, inc luding a review of a comp any's filings when available, should be compl eted prior to in_ vesting. The pu blisher of this newsletter advises all read ers and subs cribers to seek adv ice from a reg istered profe ssional secu rities re presentative before deciding to trade in st=cks featured within this e_ mail. None of the mat erial within this repo rt shall be co nstrued as any kind of in_vestment advice or solicitation. Many of these companies are on the verge of bankruptcy. You can lose all your mony by inv esting in this st=ck. The publisher of this newsletter is not a regis tered in- vestment advis0r. Subscribers should not view information herein as legal, t x, account ing or in vestment advice. 
in comp liance with the Secur ities Act of nineteen thirty three, Section seventeen(b),The pu blisher of this newslet ter is cont racted to receive twel ve th0us and d0l lars from a third party, not an officer, director or affiliate shareh 0lder for the circul ation of this re port. Be aware of an inher ent conf lict of int erest resu lting from su ch compen sation due to the fact that this is a paid a vertisement and is not with out b ias.The pa rty that pa ys us has a pos ition in the st=ck they will sell at any time wi hout notice. This could have a nega tive im pact on the price of the st0ck, causing you to lose mony.Their intent ion is to sell now. All fa ctual inf ormation in this report was gathered from public sources,including but not limited to Company Press Releases. Use of the info rmation in this email cons titutes your accep tance of these terms. |
From: M. D. S. <dt...@dt...> - 2003-08-24 17:46:16
|
It is pretty apparent that the SF lists for grub aren't the currently active ones. Where is the current grub development list, and how can I join it? -drew -- M. Drew Streib <dt...@dt...> Independent Rambler, Software/Standards/Freedom/Law -- http://dtype.org/ |
From: <9h...@ms...> - 2003-05-24 03:39:11
|
From: <tr...@sp...> - 2003-01-09 03:35:52
|
Ok, so if someone were to want to start using the results for a search engine (possibly myself), I think a couple things would have to happen. 1. The client would have to be a bit smarter. Since the point of this is to not have an office building full of computers grinding away at processing these results, the client should do as much processing as possible. For example, process and index a page and only send back things like keyword counts, page position, relevancy, weight, etc. as predefined by some set of rules. Now saying that, this is where people can hack it and send back whatever they want. Time to close the source? Or possibly and hopefully, make it plug-in capable. The plug-ins would be used to do different things with the results and the plug-ins could be closed leaving the client open. This would also allow for extensibility of the client to do other things. Another way to prevent the hack would be to have multiple clients return results from teh same urls and compare. 2. The search engine server can take the client results and store them and use them however it sees fit. Obviously with the goal of having better results than google (impossible?). The search engine server should be able to grab results via web service like the Google API. Then any engine can grab the results and process them and come up with the best scheme to see what's relevant. If nothing else, this would be a very interesting project to actually make grub commercially viable and possibly get a lot of new attention seeing as how it might actually be useful. Travis Reeder ----- Original Message ----- From: "Kord Campbell" <ko...@gr...> To: <gru...@li...> Cc: <gru...@li...> Sent: Monday, December 30, 2002 4:06 PM Subject: [Grub-develop] Indexing plans for the data? Users want to know. > Hi, > > I copied the general list on this email as I thought everyone > might get something out of the explanation that I give in > response to Travis' concerns. > > 1. Is there any indexing happening right now? > > First, and as many of you may know, we do NOT index the results > from the crawls that are done by the clients. However, we do > keep the status info of the URLs and the returned data for the > last 24 hour crawl cycle. > > 1a. What is being done with the client results? > > The URL meta data (update rate, update time, down rate, etc.) > is available through a XML interface with our SQL server, and > the crawl data is available via an ftp site. We have, on > occasion, had people request access to this data. If anyone > wishes access to these resources, we will try to oblige. Of > course people wishing to pull a full feed from us or do 1,000s > of queries to the database (small server here folks) will need > to discuss other options with us. > > Please also keep in mind that we are still in TESTING, and that > the results returned right now are NOT 100% reliable. This means > if someone were using our data, we couldn't guarantee that the > data was good, and that the crawl rate would be stable. > > Time will fix this, of course. ;) > > 2. What database platform are you using? > > MySQL. It's quite fast - seriously. > > 3. What rules you are setting for ranking keywords, ranking pages, etc? > > Again, we are a CRAWLING engine, not a search engine. When the > time comes, we expect other search engines to pull data from > the service. 
This means they don't have to crawl their own set > of URLs, which decreases crawl bandwidth on the net, and increases > the crawl rate of the sites - which also increases the quality and > relevance of a search done on those sites. > > If anyone has any questions or comments about any of this, please > feel free to post to the list! > > Happy holidays! > > Kord > > > > > Message: 1 > > Date: Sun, 29 Dec 2002 16:01:58 -0700 (MST) > > From: tr...@sp... > > To: gru...@li... > > Subject: [Grub-develop] Search page > > > > What's the plans for this area? Is anybody working on indexing and getting the actual search page going? I'm finding it kind of useless to be running the client for no purpose. Like what's the point of running it right now if nobody can reap the benefits? > > > > So here's some questions: > > 1. Is there any indexing happening right now? What is being done with the client results? > > 2. What database platform are you using? > > 3. What rules you are setting for ranking keywords, ranking pages, etc? > > > > Travis Reeder > > Space Program > > http://www.spaceprogram.com > > > > > > > > --__--__-- > > > > _______________________________________________ > > Grub-develop mailing list > > Gru...@li... > > https://lists.sourceforge.net/lists/listinfo/grub-develop > > > > > > End of Grub-develop Digest > > > > -- > -------------------------------------------------------------- > Kord Campbell Grub, Inc. > President 5500 North Western Avenue #101C > Oklahoma City, OK 73118 > ko...@gr... Voice: (405) 848-7000 > http://www.grub.org Fax: (405) 848-5477 > -------------------------------------------------------------- > > > > ------------------------------------------------------- > This sf.net email is sponsored by:ThinkGeek > Welcome to geek heaven. > http://thinkgeek.com/sf > _______________________________________________ > Grub-develop mailing list > Gru...@li... > https://lists.sourceforge.net/lists/listinfo/grub-develop > |
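As a rough sketch of the kind of compact record Travis describes the client sending back (keyword counts, page position, relevancy, weight) in place of raw pages — every field name below is hypothetical, not part of any actual grub protocol:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-page summary a smarter client could return instead of the
// fetched HTML itself. Field names are illustrative only.
struct PageSummary {
    std::string url;                          // page that was crawled
    std::string content_hash;                 // lets the server compare clients' results
    std::map<std::string, uint32_t> keywords; // keyword -> occurrence count
    std::vector<std::string> out_links;       // links discovered on the page
    double relevancy = 0.0;                   // score computed under server-defined rules
    uint32_t fetch_time_ms = 0;               // how long the fetch took
};
```

Carrying a hash of the fetched content in such a record would also make the cross-client comparison Travis mentions cheap to evaluate on the server side.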
From: Kord C. <ko...@gr...> - 2002-12-30 22:03:00
|
Hi, I copied the general list on this email as I thought everyone might get something out of the explanation that I give in response to Travis' concerns. 1. Is there any indexing happening right now? First, and as many of you may know, we do NOT index the results from the crawls that are done by the clients. However, we do keep the status info of the URLs and the returned data for the last 24 hour crawl cycle. 1a. What is being done with the client results? The URL meta data (update rate, update time, down rate, etc.) is available through a XML interface with our SQL server, and the crawl data is available via an ftp site. We have, on occasion, had people request access to this data. If anyone wishes access to these resources, we will try to oblige. Of course people wishing to pull a full feed from us or do 1,000s of queries to the database (small server here folks) will need to discuss other options with us. Please also keep in mind that we are still in TESTING, and that the results returned right now are NOT 100% reliable. This means if someone were using our data, we couldn't guarantee that the data was good, and that the crawl rate would be stable. Time will fix this, of course. ;) 2. What database platform are you using? MySQL. It's quite fast - seriously. 3. What rules you are setting for ranking keywords, ranking pages, etc? Again, we are a CRAWLING engine, not a search engine. When the time comes, we expect other search engines to pull data from the service. This means they don't have to crawl their own set of URLs, which decreases crawl bandwidth on the net, and increases the crawl rate of the sites - which also increases the quality and relevance of a search done on those sites. If anyone has any questions or comments about any of this, please feel free to post to the list! Happy holidays! Kord > > Message: 1 > Date: Sun, 29 Dec 2002 16:01:58 -0700 (MST) > From: tr...@sp... > To: gru...@li... > Subject: [Grub-develop] Search page > > What's the plans for this area? Is anybody working on indexing and getting the actual search page going? I'm finding it kind of useless to be running the client for no purpose. Like what's the point of running it right now if nobody can reap the benefits? > > So here's some questions: > 1. Is there any indexing happening right now? What is being done with the client results? > 2. What database platform are you using? > 3. What rules you are setting for ranking keywords, ranking pages, etc? > > Travis Reeder > Space Program > http://www.spaceprogram.com > > > > --__--__-- > > _______________________________________________ > Grub-develop mailing list > Gru...@li... > https://lists.sourceforge.net/lists/listinfo/grub-develop > > > End of Grub-develop Digest > -- -------------------------------------------------------------- Kord Campbell Grub, Inc. President 5500 North Western Avenue #101C Oklahoma City, OK 73118 ko...@gr... Voice: (405) 848-7000 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- |
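The URL metadata Kord lists (update rate, update time, down rate) suggests a per-URL record roughly shaped like the one below; this is only a guess for illustration, since the actual SQL schema is never shown in this thread:

```cpp
#include <ctime>
#include <string>

// Assumed shape of the per-URL metadata exposed through the XML interface.
// Field names and types are guesses, not the real grub schema.
struct UrlMetadata {
    std::string url;
    std::time_t last_update  = 0;    // when the content last changed
    std::time_t last_crawled = 0;    // when a client last fetched it
    double      update_rate  = 0.0;  // how frequently the page tends to change
    double      down_rate    = 0.0;  // fraction of fetch attempts that failed
};
```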
From: <tr...@sp...> - 2002-12-29 23:10:09
|
What's the plans for this area? Is anybody working on indexing and getting the actual search page going? I'm finding it kind of useless to be running the client for no purpose. Like what's the point of running it right now if nobody can reap the benefits? So here's some questions: 1. Is there any indexing happening right now? What is being done with the client results? 2. What database platform are you using? 3. What rules you are setting for ranking keywords, ranking pages, etc? Travis Reeder Space Program http://www.spaceprogram.com |
From: Daniel S. <da...@ha...> - 2002-02-08 07:53:34
|
On Thu, 7 Feb 2002, Kord Campbell wrote: > Our crawler uses the cURL libraries, and we've been working on getting it > running in cygwin for the past few weeks. We have (apparently) run into a > problem with the gethostbyname calls that cURL uses. I'll try to reply with information about curl stuff, more genericly. Unfortunately, I don't have any detailed insights in the dungeons of cygwin internals. > When running the crawler with more than one thread, and after a bit of time > passes, the crawler will crash inside the cURL routines, right where cURL > accesses the gethostbyname funtion. > > As we understand it, cygwin does not offer a reentrant version of > gethostbyname (gethostbyname_r coming to mind), and as such may be > susceptible to errors when used with multiple threads. This also apparently > breaks the reentrant capabilities of cURL libraries themselves, when > compiled and used under cygwin. If that is indeed the case, then yes, libcurl will not be working fully re-entrant. > The nut of our question is whether anyone else can confirm or deny any > problems with the gethostbyname function in cygwin, using cURL I would recommend you to take this question to a cygwin forum where people with knowledge about internals like this might be likely to hang out. I am also interested in getting to know if this truly is the case or not. > and if confirmed, what was done to work around this problem? Unregarding of what operating system you use, this could happen. (I here assume that your program is at least somewhat portable.) Not all operating systems provide thread-safe versions of the name resolving functions. What to do? Well, if you can't avoid using libcurl from several simultanous threads you need to protect the name resolving function with a mutex or something, so that only one function call will be used at any given moment. Those thread synchronising mechanisms aren't very portable either though. To make things even worse, it is next to impossible for a configure script or similar to actually find out if a platform has a thread-safe gethostbyname() or not, since several platforms these days actually have a gethostbyname() function (and not gethostbyname_r()) that works in a thread-safe manner! -- Daniel Stenberg -- curl groks URLs -- http://curl.haxx.se/ |
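A minimal sketch of the workaround Daniel describes: serialize every call into the non-reentrant resolver behind a single lock. The wrapper below is illustrative rather than grub or libcurl code, and because libcurl performs its own lookups internally, in practice the same mutex would have to be held around the libcurl transfer calls themselves:

```cpp
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <pthread.h>
#include <cstring>

// One process-wide lock so only one thread is ever inside gethostbyname(),
// which this thread reports as non-reentrant under cygwin.
static pthread_mutex_t resolver_lock = PTHREAD_MUTEX_INITIALIZER;

// Resolve 'name' to an IPv4 address; returns 0 on success.
int locked_resolve(const char *name, struct in_addr *out)
{
    int rc = -1;
    pthread_mutex_lock(&resolver_lock);
    struct hostent *he = gethostbyname(name);
    if (he != NULL && he->h_addrtype == AF_INET && he->h_addr_list[0] != NULL) {
        // Copy the result while still holding the lock: hostent points to
        // static storage that the next call will overwrite.
        std::memcpy(out, he->h_addr_list[0], sizeof(*out));
        rc = 0;
    }
    pthread_mutex_unlock(&resolver_lock);
    return rc;
}
```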
From: Kord C. <ko...@gr...> - 2002-02-07 19:14:26
|
Our crawler uses the cURL libraries, and we've been working on getting it running in cygwin for the past few weeks. We have (apparently) run into a problem with the gethostbyname calls that cURL uses. When running the crawler with more than one thread, and after a bit of time passes, the crawler will crash inside the cURL routines, right where cURL accesses the gethostbyname function. As we understand it, cygwin does not offer a reentrant version of gethostbyname (gethostbyname_r coming to mind), and as such may be susceptible to errors when used with multiple threads. This also apparently breaks the reentrant capabilities of cURL libraries themselves, when compiled and used under cygwin. The nut of our question is whether anyone else can confirm or deny any problems with the gethostbyname function in cygwin, using cURL, and if confirmed, what was done to work around this problem? Thanks, Kord -------------------------------------------------------------- Kord Campbell Grub.Org Inc. President 6051 N. Brookline #118 Oklahoma City, OK 73112 ko...@gr... Voice: (405) 843-6336 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- |

From: M. D. S. <dt...@dt...> - 2001-09-16 22:23:48
|
I know that this is a known bug, but not sure of its status: The grub client seems to SIGSEGV about 1 drop in 3, and subsequently restarts and continues. I put a strace on the process when running it and have attached the last 1000 lines before the segfault. System Info: Debian GNU/Linux Potato (Stable) Grub Client 0.1.6 (from source tar.gz) Running 15 concurrent wgets, and b/w limiting to 256KB/sec -drew -- M. Drew Streib <dt...@dt...> | http://dtype.org/ FSG <dt...@fr...> | Linux International <dt...@li...> freedb <dt...@fr...> | SourceForge <dt...@so...> |
From: Martin K. <rai...@ya...> - 2001-09-01 14:29:55
|
Has Grub deceased? ===== http://devzero.ath.cx/ Visit the Systems Information Database Have some interesting information? Put it up on the SID. -Martin Klingensmith __________________________________________________ Do You Yahoo!? Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger http://im.yahoo.com |
From: Jesper J. <ju...@ei...> - 2001-06-23 14:26:31
|
Vaclav Barta wrote: > > P.S.: Am I supposed to see my own messages posted to the > Grub-develop list if I'm subsribed to it? Yup, you should get your own messages back, but you really should be using the grub-general list. As far as I can tell that's what everybody else does (although traffic on that list has been extremely low lately?). Best regards, Jesper Juhl PS. I don't know what everybody else thinks, but I personally prefer unified diff format (diff -u) for patches. Could we get some official statement on this? Ozra? Kord? |
From: Vaclav B. <vb...@co...> - 2001-06-23 11:06:01
|
Hi, since gcc-3.0 is out, I think it would be nice if the grub client compiled with it... It doesn't, but the problem is simple to repair - basically gcc-3.0 starts treating namespaces seriously, so it's necessary to add a bunch of qualifications/using directives. Please find attached a patch (of today's CVS sources) doing that. If you look at it before applying (which I would encourage :-) ), you'll also find a couple of comments I'd like to act on - I may have too strict definition of what a bug is, but code like ClientDB::ClientDB(GrubCLog *log) { try { arch = new archive(); } catch (ArException ex) { cout<< "Caught Exception: " << ex.getErrno() << ": " << ex.getDescription() << endl; } just irritates me every time I see it... Bye Vasek P.S.: Am I supposed to see my own messages posted to the Grub-develop list if I'm subsribed to it? I think I haven't seen my last message coming back and it really would be unfortunate if the mail setup was wrong, I was talking to the wall and didn't even know it... |
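For anyone who has not yet met the gcc 3.0 change Vaclav mentions, the kind of edit such a patch makes is roughly the following generic illustration (not a hunk from the actual patch):

```cpp
#include <iostream>
#include <string>

// gcc 2.x accepted unqualified standard-library names; gcc 3.0 places them
// firmly in namespace std. Either qualify each use explicitly...
void report(const std::string &what)
{
    std::cout << "Caught Exception: " << what << std::endl;
}

// ...or pull the specific names in with using-declarations (narrower and
// safer than a blanket "using namespace std;" in a header).
using std::cout;
using std::endl;

void report_again(const std::string &what)
{
    cout << "Caught Exception: " << what << endl;
}

int main()
{
    report("archive open failed");
    report_again("archive open failed");
    return 0;
}
```

The ClientDB snippet he quotes would also conventionally catch by const reference, catch (const ArException &ex), rather than by value, though that is a style point separate from the gcc 3.0 namespace breakage.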
From: Vaclav B. <vb...@co...> - 2001-06-12 20:25:58
|
Igor Stojanovski wrote: > > sources - but I'm not getting very far... :-( I think perhaps > > if I wrote comments and cleaned things up as I go along, I > > would have a track to keep on - but of course I don't want to > > throw those changes away... Ideally, I'd like somebody else to > I would like to see them. As a very small example of what I mean, please find attached changed Coordinator.h & Coordinator.cpp (from today's CVS). Basically I removed the USE_THREADS condition (per the on-line TODO entry "remove threads on the client" - BTW a TODO list posted, say, monthly on Grub-develop would be nice) and added a few (more-or-less critical :-) ) comments. As I expected, a number of problems presented themselves: :-/ - I shouldn't be posting whole files, only diffs - but I don't know how to work on my changes *and* preserve the CVS version at the same time. What is the normal CVS setup for that? Is copying the whole source tree before changing anything (which I forgot to do :-( ) really the simplest way? - The changed client compiles but that really isn't enough - how do I test? I noticed a number of test files/executables in the package - is there some harness to run them? Bye Vasek |
From: Igor S. <oz...@gr...> - 2001-06-11 23:10:29
|
> -----Original Message----- > From: gru...@li... > [mailto:gru...@li...]On Behalf Of Vaclav > Barta > Sent: Monday, June 11, 2001 2:25 PM > To: Gru...@li... > Subject: [Grub-develop] Incremental changes to grub client - any > interest? > > > Hi, > > in the past few weeks (on and off), I've been trying to read grub > sources - but I'm not getting very far... :-( I think perhaps if > I wrote comments and cleaned things up as I go along, I would have > a track to keep on - but of course I don't want to throw those > changes away... Ideally, I'd like somebody else to review them > and merge them to the official tree - any takers? :-) > > Bye > Vasek I would like to see them. ozra. > > _______________________________________________ > Grub-develop mailing list > Gru...@li... > http://lists.sourceforge.net/lists/listinfo/grub-develop |
From: Jesper J. <ju...@ei...> - 2001-06-11 20:28:50
|
Vaclav Barta wrote: > Hi, > > in the past few weeks (on and off), I've been trying to read grub > sources - but I'm not getting very far... :-( I think perhaps if > I wrote comments and cleaned things up as I go along, I would have > a track to keep on - but of course I don't want to throw those > changes away... Ideally, I'd like somebody else to review them > and merge them to the official tree - any takers? :-) > I wouldn't mind taking a look :) - Jesper Juhl |
From: Vaclav B. <vb...@co...> - 2001-06-11 19:37:22
|
Hi, in the past few weeks (on and off), I've been trying to read grub sources - but I'm not getting very far... :-( I think perhaps if I wrote comments and cleaned things up as I go along, I would have a track to keep on - but of course I don't want to throw those changes away... Ideally, I'd like somebody else to review them and merge them to the official tree - any takers? :-) Bye Vasek |
From: Igor S. <oz...@cr...> - 2001-05-15 23:49:03
|
one-two, one-two. |
From: Igor S. <oz...@gr...> - 2000-09-27 01:27:55
|
-------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- -----Original Message----- From: rda...@gi... [mailto:rda...@gi...]On Behalf Of Rodrigo Damazio Sent: Monday, September 25, 2000 9:01 PM To: Igor Stojanovski Subject: Re: FW: Storing and scheduling URLs Igor Stojanovski wrote: > Here is what I was thing about scheduling. It is not too complicated for > implementation. > > Say, these are the fields of the URL table: > > url_id, next_in_list, next_update, other fields... (they are unimportant in > our case) > > ...where next_in_q is a url_id of a next in the list of URLs to be crawled > in the same time interval. next_update is calculated time interval which > tells how much time must elapse before it is scheduled again, and which is > calulated by halving/doubling (or by factor of 1.5) based on how often it > changed. > > For example, let's say the first day grub Clients run we receive four new > URLs (never crawled before), and we stick them in the URL table: > > URL table: > url_id next_in_list next_update in_crawl > U1 U2 8 0 > U2 U3 8 0 > U3 U4 8 0 > U4 (null) 8 0 > > next_update tells that once this URL is crawled; if its contents changed > (which is sure to happen for never-crawled pages), the time will be halved > to 4 for all of them. > > Then in addition we need a table to represent a time line. Let's say that > the smallest time interval is a week, and the week_id is computed as a > number of weeks since the beginning of year 2000: > > Timeline table: > time_id front_url_id > 0 U1 > > (Say that this week is 30th.) When week 30 comes, a queue of URLs is > assembled via the linked lists of URLs that the timeline table points to, > using front_url_id. Here we have time_id = 0, where zero is a special value > meaning new URLs never crawled. Now it is week 30, but since we don't have > an entry for time_id = 30, we go to time_id = 0 to get URLs never crawled > before to attain a crawling queue. It points to U1, and using next_in_list > we can derive the queue: > > QUEUE: U1, U2, U3, U4. > > We also set the in_crawl flag to 1. This may be useful to make a clean-up > of URLs that were never reported back from Clients. Hmmm so far sounds good, I just don't think we need to use the next_id thing - we can just do a query like "give me all URLs with time_id = 0", something like that... > When Clients report back the results, the in_crawl is set back to zero, and > next_update is halved: Note: it's only halved if the content has changed!! [ozra] That's what I am saying, too. > URL table: > url_id next_in_list next_update in_crawl > U1 U2 4 0 > U2 U3 4 0 > U3 U4 4 0 > U4 (null) 4 0 > > We create entry for time_id = 34 (current week plus next_update -- 30 + 4) > and now we have: > > Timeline table: > time_id front_url_id > 0 (null) -- we assume we never got any new URLs > 34 U1 > > ...34 is the week these URLs should be recrawled. > > Now, week 34 comes along, and we derive a queue again, which would look the > same (assuming that we never got any new URLs in meantime): > > QUEUE: U1, U2, U3, U4. > > Say URLs U1 and U3 have been updated, and U2, U4 not. > > Then we would double/half the next_update fields appropriately, and now > there would be two linked lists (U1->U3, and U2->U4). Every URL in the > table belong to a linked list. 
> > URL table: > url_id next_in_list next_update in_crawl > U1 U3 2 0 > U2 U4 8 0 > U3 (null) 2 0 > U4 (null) 8 0 > > Timeline table: > time_id front_url_id > 0 (null) > 36 U1 > 42 U2 > > When week 36 comes, a queue containing U1 and U3 will be crawled, and so on. > I think you got my point. > > When we add a new URLs, it is prepended at (inserted at beginning of the > list of) the list under time_id = 0. > > Deleting URLs from the URL list should not be very difficult. We might add > an extra flag to indicate DELETED, and any time a QUEUE for URLs to be > crawled is created, another queue for URLs to be deleted can be created at > the same time as well by looking at the deleted flag. Which means, that > URLs will not be immediately removed, because they are part of the linked > list. Of course, we could devise a doubly linked list, but this is > absolutely unnecessary. > > At the beginning we will be overwhelmed with new URLs to be crawled, and if > we don't think smart about it, we will not have any updates on previously > crawled URLs, but only crawling new URLs, or the other way around. That is > why all new URLs are put under time_id = 0, so that our algorithm may > combine crawling new URLs and old URLs. It would be great if (for example) > 60% of the crawling is spent on crawling (and finishing) old URLs, and 40% > on newly-found ones. In this way our database will grow and will be > up-to-date. But if crawling old URLs takes around or more then 100%, then > we will have to sacrifice the "up-to-dateness" for crawling new URLs. This is good for the beginning, BUT, say, if we get to have a one-million-URL queue for a certain week, and we only get a small part of that crawled, what happens?? It's a cumulative effect...we have to think of a way to build a queue that doesn't expect a "crawling schedule" to be met by the clients...like my random idea was... [ozra] Well, the queue I am proposing is not schedule-aware, it's just plain and simple. In anyway, you are right, and I think the cumulative effect will occur inevitably. I think I have a solution to this problem. First, let's not forget what our goal is -- the most up-to-date, and later, the most comprehensive search engine on the net (the second is when we get enough Clients). The up-to-date part will be respected from the very beginning. For what I am proposing, first, let's keep in mind three things -- everything I said about the algorithm in my previous email is unchanged (except on how to schedule the URLs for crawling). Second, there will be no list that belongs to a time_id which is in the past, even if it means that when we are backed up we move the whole remaining list to the following day, and third, the Time table in the real workable system is with "resolution" of one day instead of one week (I used week as an example only). We will know (i.e. estimate) the capability our Clients prior to or at the beginning of each day. For this we will use some kind of a prediction function such as moving average (I don't know which is right, I am not a mathematician). We store that value in URLS_PER_DAY variable at beginning of each day. 
Second, we take the average of sizes of each list of URLs, which is same as total number of URLs that do not belong to the new URLs list (for which time_id = 0): Total_number_of_URLs_in_our_database - Number_of_URLs_for_time_id_equals_zero AVG_LIST_SIZE = -------------------------------------------------------------------------- --- Number_of_entries_in_Time_table - 1 ...and store the value in AVG_LIST_SIZE. So we have the two variables -- URLS_PER_DAY and AVG_LIST_SIZE. The goal here is to have these two values as equal as possible. Because this way our database will always be up-to-date (as much as our algorithm permits), and it will grow only when the number of Clients increases (more correctly, the total ability of the Clients to crawl increases, which should be proportional). And we will know that when the value of URLS_PER_DAY increases. When this happens, we will peek into our new (never crawled) URLs, and send them to the Clients to close the gap, i.e., increase the AVG_LIST_SIZE by scheduling more URLs to the queue. Now, here comes a problem -- what if URLS_PER_DAY decreases? Well, perhaps we may randomly pick URLs to delete from our database (and the indexed data relating to them), or something else -- diminishing our database seems a kind of silly to me. But give me some other ideas. Well, in order to avoid such occurrences from happening often, there must be some gap allowed between AVG_LIST_SIZE and URLS_PER_DAY, in that AVG_LIST_SIZE should always be kept certain percentage lower, instead of making them equal. Also, no prediction will be achieved exactly. If the queue has not emptied completely, it will be inserted at the beginning of the list for the next day. If it was emptied earlier, we should take some URLs from the following day and crawl them earlier. The bad cumulative effect should not take place here as we are protected by the AVG_LIST_SIZE / URLS_PER_DAY ratio to understand and to take care of the any extra URLs that may cause that. > To cope with the problem, if it is week 50 and we are backed up to week 35, > our algorithm must be configurable so that we may decrease the amount of new > URLs being crawled. Hmmm it's still cumulative though...we would get to a point where the site might just "stop" by the crawling bottleneck...it's a geometric progression... [ozra] Said above. > Plus if we are so overwhelmed by new coming URLs, we may schedule them in > the future, and not make them due immediately, or even reschedule sublists > of URLs, which is a very efficient and simple operation. Hmmm we really needed to run a simulation on all this , if we use a maldesigned algorythm we can screw up the whole thing... [ozra] That's a good idea. > I am just not sure about one thing -- there will be a lot of simple SQL > statements needed to traverse through the linked lists. I don't know how > much of a burden is this. We will probably need hundreds of thousands to a > million queries to build the URL queue (once grub takes off, of course). That's not too much...it's very possible, depends only on having good enough hardware to run it fast...like good Fibre channel storage with 800Mhz memory and a few Xeon processors... > Besides this, another serious problem with this system may be -- what if we > lose part of the URL database (due to hardware failure, for example)? Then, > chances are, most of the linked lists will be screwed up. Well, that not so > bad of the problem after all (for our system -- losing the URL list by > itself is extremely terrible anyway). 
Why linked lists?? And we have to have a good backup system, we can't risk losing our database...I suggest optical storage systems...those can easily get up to one terabyte... > Give me your thoughts. I like your ideas, we just have to think more about the URL overcrowding...think like this - in average, each new crawled URL will have 5 to 10 new links, and that goes like that in a G.P. again...while our users increase in a A.P.(hey, that sounds like the Malthusian Theory for computing LOL)... > Cheers, > > ozra. Max |
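To make the URLS_PER_DAY / AVG_LIST_SIZE balance concrete — the formula above reads AVG_LIST_SIZE = (Total_number_of_URLs_in_our_database - Number_of_URLs_for_time_id_equals_zero) / (Number_of_entries_in_Time_table - 1) — the daily planning step might be sketched as below. The moving-average window, the 10% headroom, and all names are assumptions for illustration, not existing grub code:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <numeric>

// Sketch of the capacity-balancing idea: estimate how many URLs the clients
// can crawl per day, compare it with the average per-day list size, and pull
// new (never-crawled) URLs only to close the gap.
struct CrawlPlanner {
    std::deque<uint64_t> recent_daily_totals;   // URLs actually crawled, last N days
    uint64_t total_urls = 0;                    // all URLs in the database
    uint64_t new_urls = 0;                      // URLs under time_id = 0
    uint64_t timeline_entries = 0;              // entries in the Time table

    // URLS_PER_DAY: simple moving average of recent client throughput.
    double urls_per_day() const {
        if (recent_daily_totals.empty()) return 0.0;
        double sum = std::accumulate(recent_daily_totals.begin(),
                                     recent_daily_totals.end(), 0.0);
        return sum / recent_daily_totals.size();
    }

    // AVG_LIST_SIZE = (total URLs - never-crawled URLs) / (timeline entries - 1)
    double avg_list_size() const {
        if (timeline_entries <= 1) return 0.0;
        return double(total_urls - new_urls) / double(timeline_entries - 1);
    }

    // How many new URLs to take from the time_id = 0 list today, keeping
    // AVG_LIST_SIZE roughly 10% below the estimated daily capacity.
    uint64_t new_urls_to_schedule() const {
        double target = 0.9 * urls_per_day();
        double gap = target - avg_list_size();
        return gap > 0.0 ? uint64_t(std::min<double>(gap, double(new_urls))) : 0;
    }
};
```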
From: Igor S. <oz...@gr...> - 2000-09-25 19:11:16
|
Here is what I was thing about scheduling. It is not too complicated for implementation. Say, these are the fields of the URL table: url_id, next_in_list, next_update, other fields... (they are unimportant in our case) ...where next_in_q is a url_id of a next in the list of URLs to be crawled in the same time interval. next_update is calculated time interval which tells how much time must elapse before it is scheduled again, and which is calulated by halving/doubling (or by factor of 1.5) based on how often it changed. For example, let's say the first day grub Clients run we receive four new URLs (never crawled before), and we stick them in the URL table: URL table: url_id next_in_list next_update in_crawl U1 U2 8 0 U2 U3 8 0 U3 U4 8 0 U4 (null) 8 0 next_update tells that once this URL is crawled; if its contents changed (which is sure to happen for never-crawled pages), the time will be halved to 4 for all of them. Then in addition we need a table to represent a time line. Let's say that the smallest time interval is a week, and the week_id is computed as a number of weeks since the beginning of year 2000: Timeline table: time_id front_url_id 0 U1 (Say that this week is 30th.) When week 30 comes, a queue of URLs is assembled via the linked lists of URLs that the timeline table points to, using front_url_id. Here we have time_id = 0, where zero is a special value meaning new URLs never crawled. Now it is week 30, but since we don't have an entry for time_id = 30, we go to time_id = 0 to get URLs never crawled before to attain a crawling queue. It points to U1, and using next_in_list we can derive the queue: QUEUE: U1, U2, U3, U4. We also set the in_crawl flag to 1. This may be useful to make a clean-up of URLs that were never reported back from Clients. When Clients report back the results, the in_crawl is set back to zero, and next_update is halved: URL table: url_id next_in_list next_update in_crawl U1 U2 4 0 U2 U3 4 0 U3 U4 4 0 U4 (null) 4 0 We create entry for time_id = 34 (current week plus next_update -- 30 + 4) and now we have: Timeline table: time_id front_url_id 0 (null) -- we assume we never got any new URLs 34 U1 ...34 is the week these URLs should be recrawled. Now, week 34 comes along, and we derive a queue again, which would look the same (assuming that we never got any new URLs in meantime): QUEUE: U1, U2, U3, U4. Say URLs U1 and U3 have been updated, and U2, U4 not. Then we would double/half the next_update fields appropriately, and now there would be two linked lists (U1->U3, and U2->U4). Every URL in the table belong to a linked list. URL table: url_id next_in_list next_update in_crawl U1 U3 2 0 U2 U4 8 0 U3 (null) 2 0 U4 (null) 8 0 Timeline table: time_id front_url_id 0 (null) 36 U1 42 U2 When week 36 comes, a queue containing U1 and U3 will be crawled, and so on. I think you got my point. When we add a new URLs, it is prepended at (inserted at beginning of the list of) the list under time_id = 0. Deleting URLs from the URL list should not be very difficult. We might add an extra flag to indicate DELETED, and any time a QUEUE for URLs to be crawled is created, another queue for URLs to be deleted can be created at the same time as well by looking at the deleted flag. Which means, that URLs will not be immediately removed, because they are part of the linked list. Of course, we could devise a doubly linked list, but this is absolutely unnecessary. 
At the beginning we will be overwhelmed with new URLs to be crawled, and if we don't think smart about it, we will not have any updates on previously crawled URLs, but only crawling new URLs, or the other way around. That is why all new URLs are put under time_id = 0, so that our algorithm may combine crawling new URLs and old URLs. It would be great if (for example) 60% of the crawling is spent on crawling (and finishing) old URLs, and 40% on newly-found ones. In this way our database will grow and will be up-to-date. But if crawling old URLs takes around or more then 100%, then we will have to sacrifice the "up-to-dateness" for crawling new URLs. To cope with the problem, if it is week 50 and we are backed up to week 35, our algorithm must be configurable so that we may decrease the amount of new URLs being crawled. Plus if we are so overwhelmed by new coming URLs, we may schedule them in the future, and not make them due immediately, or even reschedule sublists of URLs, which is a very efficient and simple operation. I am just not sure about one thing -- there will be a lot of simple SQL statements needed to traverse through the linked lists. I don't know how much of a burden is this. We will probably need hundreds of thousands to a million queries to build the URL queue (once grub takes off, of course). Besides this, another serious problem with this system may be -- what if we lose part of the URL database (due to hardware failure, for example)? Then, chances are, most of the linked lists will be screwed up. Well, that not so bad of the problem after all (for our system -- losing the URL list by itself is extremely terrible anyway). Give me your thoughts. Cheers, ozra. -------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- -----Original Message----- From: rda...@ca... [mailto:rda...@ca...]On Behalf Of Rodrigo Damazio Sent: Friday, September 22, 2000 4:27 PM To: Igor Stojanovski Subject: Re: Storing and scheduling URLs Igor Stojanovski wrote: > To Rodrigo: > > We need to add a table(s) to our database that will store the URLs and some > statistics along with it. We need a mechanism which will use the statistics > for dispatching/scheduling URLs to the Clients to crawl. We have already > talked on this issue. > > When Clients connect to the Server, they make a request. The response is a > list of URLs for crawling. This module should offer an interface that will > return a list, and update this action appropriately to the DB. > > I would like you to work on this module. > > You should devise an algorithm that will figure out how to schedule the > URLs. Use some of the ideas we have already presented via our emails. I > pasted an excerpt form an older email at the bottom of the msg. > > You must take into account several things in you model, though: > 1) At the beginning, we will not have many Clients at our disposal, and our > database will be overwhelmed with new URLs to be crawled. You must design > this algorithm so that even when we have millions of URLs that were never > crawled, our Clients will get back to those old pages, so that our database > will stay up-to-date as much as possible. Remember, our goal is to have the > most up-to-date search engine on the net. 
> 2) In future, we will provide means to measure each Client's crawling > performance (pages crawled/day), so that we can assign appropriate number of > URLs to each one of them. Don't worry about this one for now. > 3) Also, we must think security. We may need to introduce a certain amount > of redundancy in order to check whether we get good data from our Clients. > For example, we may have 10% redundancy in crawling. If data does not match > from two Clients, a third Client may be assigned to crawl the page in > question, and to figure out which Client "cheated". Of course, the page may > have changed in the short amount of time the two Clients crawled, and we may > wrongfully conclude that a Client is rogue. Anyway, I say, don't worry > about security for now. Let's leave this for a later stage. > 4) The URL scheduling algorithm must be highly configurable and modular > enough so that we may add new capabilities to it easily. > 5) Many other things I haven't accounted for. Like for example, taking into > account the proximity of Clients to sites in dispatching the URLs... > > >From an old message: > > About dispatching/scheduling URLs to Clients: > > [ozra] Dispatching (term I borrowed from Robert) is a mechanism for > scheduling URLs > to Clients for crawling. Here is my suggestion on how to schedule the URLs. > Every page that is crawled for the first time by our system is automatically > scheduled to be crawled again in (say) two weeks. If in two weeks a Client > crawls the page and finds that the page has changed, the next crawling time > will be set for one week, or half the previous time; if next week the page > changed again, the time will be halfed again to 3 days, and so on. If on > the other hand, a page didn't change, we might perhaps double the next > scheduled time from two weeks to a month, etc. > > [Rodrigo] Hmmm sounds good to me...just change doubling and halving > to multiplying and dividing by 1.5, I guess that's a more proper > value...also, we have to consider the situation where a client starts > crawling a HUGE site(Geocities for instance)...of course no one client will > crawl all of it, so we have to make it schedule the parts it doesn't...and > develop a good schema so that no two clients will be crawling the same > thing, and no pages will be left uncrawled... > > [ozra] Let's not forget that for each URLs that will be crawled, Client > needs to get "permission" for the Server. No exception. > > ---end msg--- > > Give me your thoughts on this. > > Also, which one of your email addresses should I use now? Use this one only...the old one will bump messages... About the algorythm, I agree, and there's one thing I think we should add - make updates(or perhaps even rating) of most visited pages more frequent...so if a page is only visited once a year by someone(through our search of course), it won't be updated every day or anything...to measure how often a page is visited, just add a redirect script instead of putting direct links to the pages... 
Anyway, you're asking for a high complexity algorythm...I'll try to do it...anyway, we have to organize our ideas on it better...we could always start with a random-picking altorythm..something like this - "take the list of all new URLs to be crawled, add it to the list of websites not recently updated, RANDOMLY mix it all, and start sending it to the clients"...also, we gotta use a cyclic queue reading, in a way that an entry is only removed when a client actually returns the crawled page instead of right when it's sent to a client, yet it won't be sent again to another client for a while(until the queue end has been reached, which means all URLs have been sent to the clients, then ir repeats)...btw what will we do if we have an empy queue?? Start updating everything again?? It's not likely to happen in the future but it'll probably happen in the beginning... Oh, actually, one little addition to the process above - the URLs to be updated will be interpolated with the new ones, so do a sort ONLY on the old ones so that the most recently updated will be the last IN the sequence, that is, rearrange the URLs without changing the positions they got from random mixing... Max |
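A compact in-memory rendering of the two tables and the halving/doubling rule from the message above may help picture the design; the SQL linked lists become a map from time_id to a list of url_ids, the names (next_update, in_crawl) follow the email, and the code is an illustration only, not the grub server:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct UrlEntry {
    std::string url;
    uint32_t next_update = 8;   // recrawl interval; 8 (weeks, in the example) for a new URL
    bool in_crawl = false;
};

class Scheduler {
public:
    void add_new_url(uint64_t id, const std::string &url) {
        urls[id] = UrlEntry{url, 8, false};
        timeline[0].push_back(id);              // time_id 0 = never crawled
    }

    // Build this week's queue: everything scheduled for 'week' plus the
    // never-crawled list, marking it all as handed out to clients.
    std::vector<uint64_t> build_queue(uint32_t week) {
        std::vector<uint64_t> queue;
        for (uint32_t key : {week, 0u}) {
            auto it = timeline.find(key);
            if (it == timeline.end()) continue;
            for (uint64_t id : it->second) {
                urls[id].in_crawl = true;
                queue.push_back(id);
            }
            timeline.erase(it);
        }
        return queue;
    }

    // A client reported back: halve the interval if the page changed,
    // double it if it did not, and put the URL back on the timeline.
    void report_crawl(uint64_t id, uint32_t week, bool content_changed) {
        UrlEntry &e = urls[id];
        e.in_crawl = false;
        if (content_changed)
            e.next_update = std::max(1u, e.next_update / 2);
        else
            e.next_update *= 2;
        timeline[week + e.next_update].push_back(id);
    }

private:
    std::unordered_map<uint64_t, UrlEntry> urls;          // the "URL table"
    std::map<uint32_t, std::vector<uint64_t>> timeline;   // the "Timeline table"
};
```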
From: Igor S. <oz...@gr...> - 2000-09-15 17:30:44
|
To Rodrigo:

We need to add a table (or tables) to our database that will store the URLs and some statistics along with them. We need a mechanism which will use those statistics for dispatching/scheduling URLs to the Clients to crawl. We have already talked about this issue. When Clients connect to the Server, they make a request. The response is a list of URLs for crawling. This module should offer an interface that will return such a list and record the action appropriately in the DB. I would like you to work on this module. You should devise an algorithm that will figure out how to schedule the URLs. Use some of the ideas we have already presented via our emails; I pasted an excerpt from an older email at the bottom of this msg. You must take into account several things in your model, though:

1) At the beginning, we will not have many Clients at our disposal, and our database will be overwhelmed with new URLs to be crawled. You must design this algorithm so that even when we have millions of URLs that were never crawled, our Clients will still get back to the old pages, so that our database stays as up-to-date as possible. Remember, our goal is to have the most up-to-date search engine on the net.

2) In future, we will provide means to measure each Client's crawling performance (pages crawled/day), so that we can assign an appropriate number of URLs to each one of them. Don't worry about this one for now.

3) Also, we must think about security. We may need to introduce a certain amount of redundancy in order to check whether we get good data from our Clients. For example, we may have 10% redundancy in crawling. If the data from two Clients does not match, a third Client may be assigned to crawl the page in question, to figure out which Client "cheated". Of course, the page may have changed in the short amount of time between the two Clients' crawls, and we may wrongfully conclude that a Client is rogue. Anyway, I say, don't worry about security for now. Let's leave this for a later stage.

4) The URL scheduling algorithm must be highly configurable and modular enough so that we may add new capabilities to it easily.

5) Many other things I haven't accounted for. Like, for example, taking into account the proximity of Clients to sites when dispatching the URLs...

From an old message:

About dispatching/scheduling URLs to Clients:

[ozra] Dispatching (a term I borrowed from Robert) is a mechanism for scheduling URLs to Clients for crawling. Here is my suggestion on how to schedule the URLs. Every page that is crawled for the first time by our system is automatically scheduled to be crawled again in (say) two weeks. If in two weeks a Client crawls the page and finds that the page has changed, the next crawl will be scheduled in one week, or half the previous interval; if next week the page has changed again, the interval will be halved again to 3 days, and so on. If, on the other hand, the page didn't change, we might perhaps double the next scheduled interval from two weeks to a month, etc.

[Rodrigo] Hmmm, sounds good to me... just change doubling and halving to multiplying and dividing by 1.5, I guess that's a more proper value... also, we have to consider the situation where a client starts crawling a HUGE site (Geocities, for instance)... of course no one client will crawl all of it, so we have to make it schedule the parts it doesn't... and develop a good scheme so that no two clients will be crawling the same thing, and no pages will be left uncrawled...
[ozra] Let's not forget that for each URL that will be crawled, a Client needs to get "permission" from the Server. No exception.

---end msg---

Give me your thoughts on this.

Cheers,
ozra.

--------------------------------------------------------------
Igor Stojanovski            Grub.Org Inc.
Chief Technical Officer     5100 N. Brookline #830
                            Oklahoma City, OK 73112
oz...@gr...                 Voice: (405) 917-9894
http://www.grub.org         Fax: (405) 848-5477
--------------------------------------------------------------
|
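A minimal Python sketch of the interval rule from the pasted excerpt, assuming the two-week starting interval and the factor of 1.5 suggested in the thread; the function name, the min/max clamps and the timedelta arithmetic are illustrative assumptions only.

from datetime import datetime, timedelta

# The two-week starting interval and the 1.5 factor come from the thread;
# the min/max clamps are assumptions so the interval can neither collapse
# to zero nor grow without bound.
INITIAL_INTERVAL = timedelta(days=14)
ADJUST_FACTOR = 1.5
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=90)

def next_crawl_time(page_changed, current_interval, now=None):
    """Return (new_interval, next_scheduled_time) after a page is re-crawled.

    page_changed is True when the content a Client returned differs from
    what is stored in the database."""
    now = now or datetime.utcnow()
    if page_changed:
        new_interval = max(MIN_INTERVAL, current_interval / ADJUST_FACTOR)
    else:
        new_interval = min(MAX_INTERVAL, current_interval * ADJUST_FACTOR)
    return new_interval, now + new_interval

# A page crawled for the first time simply gets INITIAL_INTERVAL; from the
# second crawl onward, feed the stored interval through next_crawl_time()
# and persist both values in the URL's statistics row.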
From: Igor S. <oz...@gr...> - 2000-09-07 17:14:07
|
To Wagner:

In the context of the email that I sent to the grub-database list titled "Pre-computed ranking vs. ranking on-the-fly": even though storing a pre-computed rank value has a great disadvantage over generating it on-the-fly -- it takes a lot of effort to rebuild the database once you change the ranking parameters -- I think we should go with it for now, as we get a lot of performance gain. Therefore, I think your module should implement the second type -- CUMULATIVE.

However, in order to ensure that your module can still be used in the future, it needs to be modular enough that, if we needed to use it just for getting the words from pages and figuring out their types and positions (and not rank/weigh them), we would be able to do so. Here is why. Initially, we want the Ranker to be located at the Server. The Clients will pass the full contents of pages back to the Server, and the Server will use the Ranker to get the words out and figure out their type, position, and weight/rank. This way, a cumulative rank will be generated, upon which the searches will be done.

In later stages of the project, we may actually move the Ranker (your module) to the Client, but its responsibility will be somewhat limited -- it will NOT rank the pages, but only "preprocess" them. This means it will get the words, associate the appropriate type with them (REGULAR, ANCHOR, META, TITLE, ...), record their positions, and send them to the Server. The Server will then do the ranking on the partially processed data, and hence we utilize more of the Clients' processing power. I have actually included this capability in the Client/Server protocol. Another option would be to have the Clients do the ranking themselves, in which case they would be highly configurable from the Server as to which parameters to use for ranking, and what to rank upon.

But let's not worry too much about the later stages. Let's just keep them in mind so that we won't get into too much trouble rewriting code when we get there.

Cheers,
ozra.

--------------------------------------------------------------
Igor Stojanovski            Grub.Org Inc.
Chief Technical Officer     5100 N. Brookline #830
                            Oklahoma City, OK 73112
oz...@gr...                 Voice: (405) 917-9894
http://www.grub.org         Fax: (405) 848-5477
--------------------------------------------------------------
|
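A minimal Python sketch of the split described above, assuming the Ranker's two roles map onto two functions: preprocess() is the part that could later move to the Client, and cumulative_rank() is the Server-side CUMULATIVE ranking. The type names (REGULAR, ANCHOR, META, TITLE) come from the email; the numeric weights, the position bonus and the function names are assumptions for illustration only.

from collections import defaultdict

# Illustrative per-type weights -- only the type names come from the email.
TYPE_WEIGHTS = {"TITLE": 5.0, "META": 3.0, "ANCHOR": 2.0, "REGULAR": 1.0}

def preprocess(page_tokens):
    """'Preprocess only' mode -- what a Client-side Ranker would eventually do:
    emit (word, type, position) tuples without weighing them."""
    return [(word.lower(), word_type, position)
            for word, word_type, position in page_tokens]

def cumulative_rank(preprocessed):
    """CUMULATIVE mode -- collapse the preprocessed tuples into one stored,
    pre-computed weight per word, which is what searches run against."""
    scores = defaultdict(float)
    for word, word_type, position in preprocessed:
        weight = TYPE_WEIGHTS.get(word_type, 1.0)
        weight *= 1.0 + 1.0 / (1 + position)   # assumed bonus: earlier words count a bit more
        scores[word] += weight
    return dict(scores)

# The same preprocess() output can be ranked on the Server now, or produced on
# the Client and shipped over later -- the modularity asked for above.
tokens = [("Grub", "TITLE", 0), ("crawler", "REGULAR", 12), ("grub", "ANCHOR", 30)]
print(cumulative_rank(preprocess(tokens)))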