Readme

Junjie Peng

A.What's DoSearch?
DoSearch is simple search engine for web pages, written by Junjie Peng
junjie.peng.hn@gmail.com. It can crawl pages from the world wide web,
build indexes and search the specified query and return the result quickly.

B.What can DoSearch do?
In one word, DoSearch can crawl pages from the world wide web, build indexes
and search the specified query and return the result quickly.
Let's browse it's functions one by one:

  1. Crawling
    1.1 Crawl task can be configured.
    Regexes can be used to guide DoSearch crawling on only urls specified, and
    storing only pages specified. Besides, you can specify the mount of urls
    to crawl and store, the delay between accessing two pages.
    A crawl task may be in the form below:
    1351662624174 : 150000 : 100000 : 2000 : ^http://movie.douban.com/((doulist|subject)/\d+/|tag/.*) : ^http://movie.douban.com/subject/\d+/

1.2 Restore task from last terminated.
When crawling, a directory named by id will be create, and a file of
UrlsCrawled.txt will be created in the directory to track urls crawled, and
the other file of UrlsToCrawl.txt to track urls to crawled in future. Progress
will be updated crawling per 50 urls, which is able to configured. If you plan
to crawl a huge mount of urls, but the computer is down in some time, the
progress kept will be sharply useful this time.

1.3 Crawl the pages can be stored priorly.
A priority queue is kept in DoSearch, which guarantees the pages you hope to
store will be crawled first. This function is especially useful to help you
get the result at the first time.

2.Indexing
2.1. Support huge data.
For huge amount of pages to process, a technology named multi-ways merge
is introduced. When processed some pages, a segment of index file is created,
after all, many index files will be created, and then they are merged into one
big index file.

2.2. Break Chinese term just by single characters.
In the current version, word-breaking is still very poor, we just break
Chinese one character by another. Through this, the mount of terms is big,
many high-frequency term occurs, and a chain for one term may be very long.

2.3. Search the index fast.
In the current version, to search the chain of a given term in middle of
more than 200,000 terms, in a CPU of 1.9GMHZ, it just cost less than 50ms,
with very small memory.

3.Ranking
3.1 Boolean Model is applied first
The documents which contains all the terms will be chosen. If a term
in the query has not indexes, it will be omitted simply.

3.2 Merge improved efficiently
For multi-ways merge, the ways are sorted in ascending order first. Then merged
two by two. While for 2-ways merge, it's divided into three situations:
a. N is near to M (N,M is the length of chains respectively).
Two chains will by merged one doc id by another, Which cost O(M + N).
b. N is greatly bigger than M.
Each doc id in M will be searched half-to-half in N, Which cost O(M * lg(N)).
c. N is 10 times bigger than M.
Each doc id in M will be searched by jump-read in N, Which cost O(2 * sqrt(M * N)).
For 2 given chains, DoSearch will choose which method is the optimal, then act on it.

3.3 TF-IDF Model
After filtered by Boolean Model, TF-IDF Model and Order-Weighted model are
applied to compute the weights for each doc filtered. TF-IDF info has been
computed early in the phrase of index building.

3.4 Order-Weighted model
With Order-Weighted model, "good boy" and "boy good" can be distinguished.

C.How to run DoSearch?
How to run DoSearch depends on what you plan to do.
1.To Crawl.
java -Xms64m -Xmx1024m -cp dosearch.jar com.pjj.se.CrawSystem

2.To Search
java -jar dosearch.jar "Query"
Please make sure that all data files are in the proper place. They are
doc.content, doc.index, doc.url, term.index, term.content.


Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks