Please answer it
Can I call a web crawler a search engine?
WebCrawler, the Web’s first comprehensive full-text search engine, is a tool that assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers’ queries from the index.
How accurate is this statement?
Because from what I have read on the Internet, everywhere it is written that a web crawler extracts information from the Web and stores it in the search engine's database.
My question is: who indexes it in the search engine, then?
Who performs the search when a user submits a query to the search engine?
Are a web crawler and a search engine the same thing?
I understand those terms as follows:
A search engine is a big database which provides a search interface, like Google, Yahoo, etc.
This engine is "fed" by the data the web crawler collects and sends back to its mother ship (the search engine).
Therefore, Google is a search engine which has a web crawler (Googlebot) which collects data for it.
You can say a web crawler is part of a web search engine.
Here is a picture showing the different parts of a global search engine: http://jaeksoft.github.io/opensearchserver/assets/tutorial/schema3_en.png
And here are two definitions, for "crawlers" and "index":
The crawler is responsible for giving data to the index. Then the "querying" part of the search engines can be used to access this data.
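To make the split between the three parts concrete, here is a minimal sketch in Python. It is not how OpenSearchServer or any real search engine is implemented: the "web" is a small hypothetical in-memory dictionary, and the names `WEB`, `crawl`, `build_index`, and `search` are made up for illustration. The crawler only collects pages, the indexer turns them into a word-to-URL lookup, and the query part answers searches from that lookup without touching the pages again.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.com/": ("welcome to the video site",
                            ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("funny cat video", ["http://example.com/b"]),
    "http://example.com/b": ("cooking video tutorial", ["http://example.com/"]),
}

def crawl(start_url):
    """Crawler: follow links breadth-first and yield (url, text) pairs."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        yield url, text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)

def build_index(documents):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents:
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Query part: return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(crawl("http://example.com/"))
print(sorted(search(index, "cat video")))
```

So in this sketch the crawler does not index or search anything itself; it only hands documents to the indexer, and the query part works purely from the index.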
Can anyone tell me approximately how much time the crawler needs to index pages from a site with about 1,000,000 videos?
It depends on your network speed.
OpenSearchServer limits crawling to one thread per domain, so if you are crawling a single domain it will be a little slower.
Thank you for your reply,
I have this config in the crawler:
Number of URLs to crawl: 50000
Fetch interval between re-fetches: 30
Number of simultaneous threads: 10
Maximum number of URLs per host: 500
Delay between each successive access, in seconds: 10
It has been running for two days and there is about 150 MB of data in the index. Is this good, or do I have to tweak the configuration?
The goal is to fetch all info about these 1,000,000 videos.
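A rough back-of-the-envelope estimate may help here. It assumes that all videos live on a single host, that each video is one page, that only one thread works on that host (as mentioned in the reply above), and that the configured delay between accesses dominates the fetch time. The function name `estimate_crawl_days` is made up for illustration, and how the other settings (such as the maximum number of URLs per host) interact with this in OpenSearchServer is not taken into account.

```python
SECONDS_PER_DAY = 86_400

def estimate_crawl_days(pages, delay_seconds, threads_per_host=1):
    """Lower bound on crawl time when the per-host delay is the bottleneck."""
    total_seconds = pages * delay_seconds / threads_per_host
    return total_seconds / SECONDS_PER_DAY

# 1,000,000 pages with a 10-second delay between accesses to the same host
print(round(estimate_crawl_days(1_000_000, 10), 1))  # 115.7

# Same site with a 1-second delay
print(round(estimate_crawl_days(1_000_000, 1), 1))   # 11.6
```

Under these assumptions, a 10-second delay means roughly 116 days for one million pages from a single host, however many simultaneous threads are configured, so the delay setting is probably the first thing to look at.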