|
From: Albert V. P. <av...@im...> - 2003-09-22 08:02:25
|
Hi, I have to develop a distributed search engine for my company. I’m very interested with the Lucene index format, and I want to use it. The main problem is how to distribute the index in the different machines. The solution is not just copy the index, because I have to manage 50Gb of data. I want to distribute the index in a more efficient way, I´ve thought the best was doing it by document, by field or by term. For example, I can distribute half the data by document, having documents in one computer and other documents in another (that´s the simplest way, but you have to send a query to all the servers). Another way is distribute the index by field (I work only with XML data, so a field will be a Xml element), having one field in one server, and another one in other. This version seems more efficient because you only have to send the query depending what field the user want to search. The last solution is distribute the index by term. Then you only have to send the query to those servers containing this term. Using the field and term solution, I don’t want to distribute the documents with the index (because on document can contain a lot of terms, and I don’t want to distribute the same document in all the computers), just the documentID, and have the pair <documentID, document> in another computer. I’m planning to have an indexing server, which will be in charge of indexing the data and then distribute it over the n backend servers (depending if the distribution is per document, per field or per term; the indexing server will have to look the lucene index and distribute only the needed parts). Then, the search server will know these distribution and will only send the queries to the appropriate servers. I want a include replication in order to increase the fiability as well. Any comments, suggestions and/or problems. Do you think It is a realistic/good solution? Thanks, Albert |