[CLucene-dev] Distributed Indexing

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi,

I have to develop a distributed search engine for my company. I’m very 
interested with the Lucene index format, and I want to use it. The main 
problem is how to distribute the index in the different machines.

The solution is not just copy the index, because I have to manage 50Gb 
of data. I want to distribute the index in a more efficient way, I´ve 
thought the best was doing it by document, by field or by term.

For example, I can distribute half the data by document, having 
documents in one computer and other documents in another (that´s the 
simplest way, but you have to send a query to all the servers).

Another way is distribute the index by field (I work only with XML data, 
so a field will be a Xml element), having one field in one server, and 
another one in other. This version seems more efficient because you only 
have to send the query depending what field the user want to search.

The last solution is distribute the index by term. Then you only have to 
send the query to those servers containing this term. Using the field 
and term solution, I don’t want to distribute the documents with the 
index (because on document can contain a lot of terms, and I don’t want 
to distribute the same document in all the computers), just the 
documentID, and have the pair <documentID, document> in another computer.

I’m planning to have an indexing server, which will be in charge of 
indexing the data and then distribute it over the n backend servers 
(depending if the distribution is per document, per field or per term; 
the indexing server will have to look the lucene index and distribute 
only the needed parts). Then, the search server will know these 
distribution and will only send the queries to the appropriate servers.

I want a include replication in order to increase the fiability as well.

Any comments, suggestions and/or problems. Do you think It is a 
realistic/good solution?

Thanks,

Albert