From: Yonik S. <yo...@ap...> - 2008-03-12 14:25:00
On Tue, Mar 11, 2008 at 5:49 PM, Doug Cutting <cu...@ap...> wrote:
> > An example application can be an online email system.
> > The keys of a user's emails are prefixed by the user name,
> > so a user's emails are located together on the ring. When
> > a user searches his/her emails, the query is only sent to
> > servers which cover that range, instead of the entire ring.
>
> We could just define doc ids as 128-bit numbers rather than strings.
> Then user-provided hash values wouldn't be a special case. A
> constructor could convert string ids to 128-bit ids, and also store the
> original string in a field named "id".

Allowing the user to specify a hash value doesn't seem so different
from allowing them to specify a numeric id... it's just 32 bits vs
128 bits. Ning's use case doesn't seem to require collision-free
hashes. Perhaps there are other advantages to a numeric id that also
serves as the hash though.

-Yonik

> 128-bits is big enough that we shouldn't see collisions.
>
> http://en.wikipedia.org/wiki/Birthday_attack
>
> By my reading, it would take over a trillion collections of a trillion
> documents each before a collision is probable.
From: Doug C. <cu...@ap...> - 2008-03-11 21:49:17
Ning Li wrote:
>> > 1 Consistent hashing uses hash values because hash values
>> > distribute uniformly on the ring. Can we support
>> > application-specified keys for the ring?
>>
>> Seems like we could allow the user to specify their own hash value.
>> What's the usecase here?
>
> An example application can be an online email system.
> The keys of a user's emails are prefixed by the user name,
> so a user's emails are located together on the ring. When
> a user searches his/her emails, the query is only sent to
> servers which cover that range, instead of the entire ring.

We could just define doc ids as 128-bit numbers rather than strings.
Then user-provided hash values wouldn't be a special case. A
constructor could convert string ids to 128-bit ids, and also store the
original string in a field named "id".

128-bits is big enough that we shouldn't see collisions.

http://en.wikipedia.org/wiki/Birthday_attack

By my reading, it would take over a trillion collections of a trillion
documents each before a collision is probable.

Doug
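A rough sanity check of that estimate, using the standard birthday
approximation P(collision) ~ n^2 / 2^(b+1) for b-bit ids; this is only
illustrative arithmetic, not part of any proposed API:

import static java.lang.Math.pow;

public class BirthdayCheck {
    public static void main(String[] args) {
        double n = 1e12;                       // one trillion docs in one collection
        double space = pow(2, 128);            // 128-bit id space
        double pPerCollection = n * n / (2 * space);
        System.out.println("P(collision) within one collection ~ " + pPerCollection);   // ~1.5e-15
        System.out.println("Expected collisions over 1e12 collections ~ "
                + pPerCollection * 1e12);                                               // ~1.5e-3
    }
}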
From: Doug C. <cu...@ap...> - 2008-03-11 21:29:59
Ning Li wrote:
> Well, does it look all right? Feel free to add more details!

This looks great! Thanks for writing it up!

Doug
From: Yonik S. <yo...@ap...> - 2008-03-11 14:11:59
On Mon, Mar 10, 2008 at 11:39 AM, Ning Li <nin...@gm...> wrote:
> On Fri, Mar 7, 2008 at 8:36 PM, Yonik Seeley <yo...@ap...> wrote:
> > On Thu, Mar 6, 2008 at 5:47 PM, Ning Li <nin...@gm...> wrote:
> > > 5 A document database?
> > > - We store documents anyway.
> > > - We don't support sub-document updates.
> >
> > Field updates? We could if we store all the fields. Solr has a patch
> > for this, but it might be more efficient to implement in Lucene. It
> > requires being able to get the *latest* stored fields for a doc, even
> > if they are uncommitted.
>
> Let's not worry about performance for now. As you pointed
> out, if we update one stored field for a doc, we have to figure
> out the "latest" of all the other stored fields for the doc - but
> it's impossible because of distributed update and eventual
> consistency. Well, we can keep a revision number for each
> stored field, but...

Ah, right... I was talking more about just retrieving the latest stored
fields in a particular lucene index. It's something we will need to do
for replication anyway.

> > > Here are a few comments on the features:
> > > 1 Consistent hashing uses hash values because hash values
> > > distribute uniformly on the ring. Can we support
> > > application-specified keys for the ring?
> >
> > Seems like we could allow the user to specify their own hash value.
> > What's the usecase here?
>
> An example application can be an online email system.
> The keys of a user's emails are prefixed by the user name,
> so a user's emails are located together on the ring. When
> a user searches his/her emails, the query is only sent to
> servers which cover that range, instead of the entire ring.

Great example! This could really increase scalability for some systems.

> > > The difference
> > > is that the distribution may not be uniform so we need
> > > to rebalance sometimes (remove a virtual node and insert
> > > it somewhere else).
> >
> > I'll refer back again to my comments on separating replication (the
> > range of node X is replicated on nodes X-1 and X-2) from key
> > partitioning (the range of node X is 0-1000 + 5000-6000 for example).
> > One can change the key partitioning w/o touching the replication configuration.
>
> I think your point is that we need re-balancing in any case?

More about what rebalancing means too... when rebalancing, can you
leave all the nodes in place (the replication configuration) and just
change what keys map to a node?

-Yonik
From: Ning L. <nin...@gm...> - 2008-03-11 14:04:42
BTW, I submitted a contrib package to Hadoop on (batch) building/updating
a Lucene index using Map/Reduce. It's not competing with Bailey. :)
It's something I think could be useful to some people.

Ning
From: Ning L. <nin...@gm...> - 2008-03-10 15:38:54
On Fri, Mar 7, 2008 at 8:36 PM, Yonik Seeley <yo...@ap...> wrote:
> On Thu, Mar 6, 2008 at 5:47 PM, Ning Li <nin...@gm...> wrote:
> > 5 A document database?
> > - We store documents anyway.
> > - We don't support sub-document updates.
>
> Field updates? We could if we store all the fields. Solr has a patch
> for this, but it might be more efficient to implement in Lucene. It
> requires being able to get the *latest* stored fields for a doc, even
> if they are uncommitted.

Let's not worry about performance for now. As you pointed
out, if we update one stored field for a doc, we have to figure
out the "latest" of all the other stored fields for the doc - but
it's impossible because of distributed update and eventual
consistency. Well, we can keep a revision number for each
stored field, but...

> > Here are a few comments on the features:
> > 1 Consistent hashing uses hash values because hash values
> > distribute uniformly on the ring. Can we support
> > application-specified keys for the ring?
>
> Seems like we could allow the user to specify their own hash value.
> What's the usecase here?

An example application can be an online email system.
The keys of a user's emails are prefixed by the user name,
so a user's emails are located together on the ring. When
a user searches his/her emails, the query is only sent to
servers which cover that range, instead of the entire ring.

> > The difference
> > is that the distribution may not be uniform so we need
> > to rebalance sometimes (remove a virtual node and insert
> > it somewhere else).
>
> I'll refer back again to my comments on separating replication (the
> range of node X is replicated on nodes X-1 and X-2) from key
> partitioning (the range of node X is 0-1000 + 5000-6000 for example).
> One can change the key partitioning w/o touching the replication configuration.

I think your point is that we need re-balancing in any case?

> > 1 On the assumption that an application specifies document
> > version number. It greatly simplifies things, but is
> > it practical?
>
> I think so...
> If the application can't provide it, the server (or client proxy)
> could perhaps provide it via a timestamp.

It's hard for the servers to sync their clocks, so timestamp is not
reliable...

Ning
From: Yonik S. <yo...@ap...> - 2008-03-08 01:36:13
On Thu, Mar 6, 2008 at 5:47 PM, Ning Li <nin...@gm...> wrote:
> 5 A document database?
> - We store documents anyway.
> - We don't support sub-document updates.

Field updates? We could if we store all the fields. Solr has a patch
for this, but it might be more efficient to implement in Lucene. It
requires being able to get the *latest* stored fields for a doc, even
if they are uncommitted.

> Here are a few comments on the features:
> 1 Consistent hashing uses hash values because hash values
> distribute uniformly on the ring. Can we support
> application-specified keys for the ring?

Seems like we could allow the user to specify their own hash value.
What's the usecase here?

> The difference
> is that the distribution may not be uniform so we need
> to rebalance sometimes (remove a virtual node and insert
> it somewhere else).

I'll refer back again to my comments on separating replication (the
range of node X is replicated on nodes X-1 and X-2) from key
partitioning (the range of node X is 0-1000 + 5000-6000 for example).
One can change the key partitioning w/o touching the replication configuration.

> 1 On the assumption that an application specifies document
> version number. It greatly simplifies things, but is
> it practical?

I think so...
If the application can't provide it, the server (or client proxy)
could perhaps provide it via a timestamp.

-Yonik
From: Yonik S. <yo...@ap...> - 2008-03-08 01:20:09
Thanks Ning, great writeup!
I've just skimmed it so far, but will be re-reading in the near future.

-Yonik
From: Ning L. <nin...@gm...> - 2008-03-07 23:29:54
DESIGN
1 System composition and functionality
The system consists of a "master", "nodes", and "clients".
We view the docid hash value space as a "ring". Each "node"
maintains an "index" on a "range" of documents on the ring.
The "master" keeps track of the mapping between the nodes
and the ranges they index.
The "node" here refers to a virtual/logical node. A physical
"host" hosts a number of nodes.
The system supports the following operations:
- addDoc(doc), deleteDoc(doc), updateDoc(doc)
- search(query)
An "application" conducts the operations through a "client".
A client communicates with the master to obtain the mapping
between the nodes and the ranges, and communicates with the
nodes directly to add/delete/update a document or conduct
a search.
2 Partitioning
We choose to use consistent hashing for load balancing.
A node X is assigned a "position" on the ring. The "next"
node/position on the ring is X+1. The range of node X is
[X, X+1).
This is similar to range partitioning.
3 Replication
We replicate to achieve fault tolerance. N-way replication
means that data is stored/indexed on N nodes. Assuming N=3,
the range of node X is now [X, X+3). The range [X, X+1) is
now also served by nodes X-2 and X-1, in addition to node X.
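A minimal sketch of the ring lookup in sections 2 and 3, assuming node
positions are kept in a sorted map; the class and method names are
illustrative only, not existing code:

import java.util.*;

class RingSketch {
    // node position on the ring -> node id
    private final TreeMap<Long, String> nodes = new TreeMap<>();

    void addNode(long position, String nodeId) {
        nodes.put(position, nodeId);
    }

    // Nodes whose range covers docHash with N-way replication: the node
    // owning [X, X+1) plus the N-1 nodes preceding it on the ring.
    List<String> replicasFor(long docHash, int n) {
        List<String> result = new ArrayList<>();
        if (nodes.isEmpty()) return result;
        Map.Entry<Long, String> owner = nodes.floorEntry(docHash);
        if (owner == null) owner = nodes.lastEntry();     // wrap around the ring
        long pos = owner.getKey();
        for (int i = 0; i < n && i < nodes.size(); i++) {
            result.add(nodes.get(pos));
            Map.Entry<Long, String> prev = nodes.lowerEntry(pos);
            if (prev == null) prev = nodes.lastEntry();   // wrap around the ring
            pos = prev.getKey();
        }
        return result;
    }
}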
4 search(query)
Based on the mapping between the nodes and the ranges,
a client sends a search request to a number of nodes. The
client specifies a search range for each node the request
is sent to, so that 1) the search range for a node is
covered by the node's range, 2) the search ranges together
cover the ring exactly once, and 3) the number of nodes
the request is sent to is minimized.
For example, if we have 7 nodes and 3-way replication,
a search can be sent to node X with search range [X, X+3),
node X+3 with search range [X+3, X+6), and node X+6 with
search range [X+6, X+7), where X+7 == X.
An alternative is to cover the ring R times (e.g. R=2)
to avoid latencies of re-querying.
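A sketch of the greedy plan above, assuming the sorted node positions
are available as an array; names are illustrative only:

import java.util.*;

class SearchPlanSketch {
    // Walk the sorted node positions, letting each chosen node answer for up
    // to n ranges. Returns {node position, end of search range} pairs; the
    // search range is [position, end), and the last range wraps around the
    // ring (e.g. [X+6, X+7) with X+7 == X in the 7-node, 3-way example).
    static List<long[]> searchPlan(long[] positions, int n) {
        List<long[]> plan = new ArrayList<>();
        int count = positions.length;
        int i = 0;
        while (i < count) {
            int covered = Math.min(n, count - i);          // the last node may cover fewer
            long end = positions[(i + covered) % count];   // wraps around the ring
            plan.add(new long[] { positions[i], end });
            i += covered;
        }
        return plan;
    }
}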
5 addDoc(doc)/deleteDoc(doc)/updateDoc(doc)
We assume each document addition, deletion or update comes
with an application-specified revision number for the
document. Without this assumption, we'd have to do
something much more complicated such as vector clocks.
Each node maintains a light-weight log which is flushed
to disk only periodically. A log is a sequence of log
entries. Each log entry is <{ADD|DEL}, docid, revision>.
A write request is sent to multiple nodes, all of whose
ranges must include the document.
When a node receives a write request for a document with
revision R, it checks that revision R is newer than the
current revision stored/indexed on the node. Then the node
stores and indexes the new revision in memory and puts a
log entry in the log.
A W-way write means at least W nodes have to
successfully serve the write request for the write to be
considered "complete". 1 < W <= N. W > 1 is to provide
fault tolerance to write operations.
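A sketch of the per-node write check just described; class and field
names are illustrative, and the actual in-memory store/index step is
only marked by a comment:

import java.util.*;

class WriteLogSketch {
    enum Op { ADD, DEL }

    static class LogEntry {
        final Op op; final String docId; final long revision;
        LogEntry(Op op, String docId, long revision) {
            this.op = op; this.docId = docId; this.revision = revision;
        }
    }

    private final Map<String, Long> currentRevision = new HashMap<>();
    private final List<LogEntry> log = new ArrayList<>();   // flushed to disk only periodically

    // Returns true if the write was accepted; a client counts these acks and
    // considers the write "complete" once W nodes have accepted it.
    synchronized boolean apply(Op op, String docId, long revision) {
        Long current = currentRevision.get(docId);
        if (current != null && revision <= current) {
            return false;                 // stale: preserves monotonic write consistency
        }
        currentRevision.put(docId, revision);
        log.add(new LogEntry(op, docId, revision));
        // ...store and index the new revision in memory here...
        return true;
    }
}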
6 Consistency
Each node syncs with the nodes that have overlapping ranges
- a.k.a. log propagation. But since in most cases W+R <= N,
the system provides eventual consistency.
Since a node checks that a revision for a write request
is newer than the current revision stored/indexed on the
node, monotonic write consistency is supported.
The client implementation should support session
consistency, and it would be nice to also support
read-your-writes and monotonic read consistency.
Now back to log propagation. Node X serves range [X, X+3)
and periodically syncs:
range [X, X+1) with node X-2
range [X, X+2) with node X-1
range [X+1, X+3) with node X+1
range [X+2, X+3) with node X+2
Here is how changes to range [X, X+1) on node X are
propagated to node X-2. Assume that during the last sync,
changes up to log entry E on node X were propagated to
node X-2, and that the current log entry on node X is F.
- Now log entries (E, F] are scanned sequentially, and
those in the range [X, X+1) are sent to node X-2.
- Node X-2 checks the log entries received from node X.
For a log entry <{ADD|DEL}, docid, revision>, if the
current revision on node X-2 is older, node X-2 puts
the docid in a request list. Then node X-2 sends the
request list of docids to node X.
- Node X ships to node X-2 the stored copy of the
documents on the request list.
- Upon receiving the documents from node X, node X-2
processes them as normal write requests, which means
node X-2 will check the revisions again to ensure
monotonic write consistency, and log them.
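A sketch of the request-list step of this propagation, under the
simplifying assumption that the shared range does not wrap around the
ring; names are illustrative only:

import java.util.*;

class SyncSketch {
    // One log entry as seen by the pulling node; ADD/DEL handling is elided here.
    static class Entry {
        final String docId; final long hash; final long revision;
        Entry(String docId, long hash, long revision) {
            this.docId = docId; this.hash = hash; this.revision = revision;
        }
    }

    // The pulling node (e.g. X-2) scans the entries received from its peer (X),
    // keeps only those in the shared range [lo, hi), and requests the docs
    // whose revision is newer than its own copy.
    static List<String> requestList(List<Entry> entriesFromPeer, long lo, long hi,
                                    Map<String, Long> myRevisions) {
        List<String> wanted = new ArrayList<>();
        for (Entry e : entriesFromPeer) {
            if (e.hash < lo || e.hash >= hi) continue;    // outside the shared range
            Long mine = myRevisions.get(e.docId);
            if (mine == null || mine < e.revision) {
                wanted.add(e.docId);                      // our copy is older: request it
            }
        }
        return wanted;   // the peer then ships these docs, applied as normal writes
    }
}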
7 Add node / remove node
A new position is assigned when a new node is added. Assume
a new node X' is assigned the position X' between X and X+1.
The neighbouring nodes will be affected and switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
Node X' prepares range [X', X+3) by copying index files
from one or more neighbouring nodes. Then node X' starts
to sync with the nodes that have overlapping ranges.
A flavor of two-phase commit is used to guarantee that
no writes or write propagations are lost in the process.
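A sketch of how the served ranges above could be recomputed after a node
is added or removed, assuming N-way replication; the index copying and
the two-phase-commit handshake are not shown, and all names are
illustrative:

import java.util.*;

class MembershipSketch {
    // With N-way replication, the node at positions[i] serves
    // [positions[i], positions[(i + N) % size]). Recomputing this map after
    // inserting X' reproduces the range shifts listed above.
    static TreeMap<Long, Long> servedRanges(TreeSet<Long> positions, int n) {
        TreeMap<Long, Long> ranges = new TreeMap<>();
        List<Long> sorted = new ArrayList<>(positions);
        int size = sorted.size();
        for (int i = 0; i < size; i++) {
            ranges.put(sorted.get(i), sorted.get((i + n) % size));  // wraps around the ring
        }
        return ranges;
    }
}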
Well, does it look all right? Feel free to add more details!
Cheers,
Ning
On Thu, Mar 6, 2008 at 5:47 PM, Ning Li <nin...@gm...> wrote:
> Here are the main features discussed so far:
> 1 Load balancing by using consistent hashing.
> 2 Fault tolerance by replications.
> 3 Online update by making all replicas updateable.
> 4 Eventual consistency.
> - An update is sent to W replicas before it completes.
> - Assume an application specifies doc version number.
> - Using the terms in [1], we should support session
> consistency and monotonic write consistency, and
> maybe read-your-writes and monotonic read consistency.
> 5 A document database?
> - We store documents anyway.
> - We don't support sub-document updates.
> - Do we support document versioning? Other features?
>
> Here are a few comments on the features:
> 1 Consistent hashing uses hash values because hash values
> distribute uniformly on the ring. Can we support
> application-specified keys for the ring? The difference
> is that the distribution may not be uniform so we need
> to rebalance sometimes (remove a virtual node and insert
> it somewhere else).
> 1 On the assumption that an application specifies document
> version number. It greatly simplifies things, but is
> it practical?
> 2 How much of a document database do we want it to be? I'm not
> sure if CouchDB is a typical document database...
>
>
> Design for the features is to come...
>
> Ning
>
From: Ning L. <nin...@gm...> - 2008-03-06 22:47:51
Here are the main features discussed so far:
1 Load balancing by using consistent hashing.
2 Fault tolerance by replications.
3 Online update by making all replicas updateable.
4 Eventual consistency.
- An update is sent to W replicas before it completes.
- Assume an application specifies doc version number.
- Using the terms in [1], we should support session
consistency and monotonic write consistency, and
maybe read-your-writes and monotonic read consistency.
5 A document database?
- We store documents anyway.
- We don't support sub-document updates.
- Do we support document versioning? Other features?
Here are a few comments on the features:
1 Consistent hashing uses hash values because hash values
distribute uniformly on the ring. Can we support
application-specified keys for the ring? The difference
is that the distribution may not be uniform so we need
to rebalance sometimes (remove a virtual node and insert
it somewhere else).
1 On the assumption that an application specifies document
version number. It greatly simplifies things, but is
it practical?
2 How much of a document database do we want it to be? I'm not
sure if CouchDB is a typical document database...
Design for the features is to come...
Ning
From: Ning L. <nin...@gm...> - 2008-03-06 16:30:47
> > 3. Implement the "distributed" system using threads per node.
>
> I think last week's discussions gave us enough of a design to start
> implementing this. Should we attempt to summarize that design anywhere
> before trying to code it?

Good idea. I'll start an email thread to summarize the high-level
features and/or assumptions and then the design. After we agree on it,
we can post it on wiki and start the implementation, hopefully soon! :)

Ning
From: Doug C. <cu...@ap...> - 2008-03-05 22:41:53
Doug Cutting wrote:
> We might get started by using threads and method calls instead of RPC.
> This should help us get our design straight before we invest in a "real"
> implementation. So we might:
>
> 1. Write a simple client API: addDoc(), removeDoc(), updateDoc(), query().

I finally committed the abstract API. I'm out of town Thursday and
Friday, but with luck may get some time to start a naive, in-memory,
synchronous implementation that, e.g., uses Java Collections as its
index, that we can test against.

> 2. Write a multi-threaded test program that uses this, simulating a
> large, active index. It randomly adds, removes and updates randomly
> created documents, and periodically queries and checks that results are
> correct.

I hope to start this next week. (That line sounds familiar...)

> 3. Implement the "distributed" system using threads per node.

I think last week's discussions gave us enough of a design to start
implementing this. Should we attempt to summarize that design anywhere
before trying to code it?

Sorry this has been so slow to get started! The good news is that I've
cleared lots of other stuff off my plate recently and hope to spend more
time on Bailey.

Doug
From: Yonik S. <yo...@ap...> - 2008-03-01 01:54:14
On Fri, Feb 29, 2008 at 8:50 PM, Yonik Seeley <yo...@ap...> wrote:
> On Fri, Feb 29, 2008 at 7:21 PM, Ning Li <nin...@gm...> wrote:
> > I'm confused with which is clockwise and which is counter-clockwise now. :(
>
> I noticed the switch, too. It just depends if your reference is the
> node or the key.
> a doc that falls on node A will be replicated in B and C (that's
> clockwise, like dynamo).
> If you are looking for a replica of C, it will be at A and B
> (counter-clockwise).

Ignore this previous part, I think I confused myself :-)

> I don't think it really matters what the direction ends up being
> though. A node will be replicated to N unique hosts in a consistent
> (but arbitrary) direction.
>
> Personally, it makes more sense to think of it as HostA containing
> NodeA, NodeB, NodeC (or IndexA, IndexB, IndexC). It's replicas of
> nodes rather than overlapping ranges.
>
> -Yonik
From: Yonik S. <yo...@ap...> - 2008-03-01 01:50:00
On Fri, Feb 29, 2008 at 7:21 PM, Ning Li <nin...@gm...> wrote:
> I'm confused with which is clockwise and which is counter-clockwise now. :(

I noticed the switch, too. It just depends if your reference is the
node or the key.
a doc that falls on node A will be replicated in B and C (that's
clockwise, like dynamo).
If you are looking for a replica of C, it will be at A and B
(counter-clockwise).

I don't think it really matters what the direction ends up being
though. A node will be replicated to N unique hosts in a consistent
(but arbitrary) direction.

Personally, it makes more sense to think of it as HostA containing
NodeA, NodeB, NodeC (or IndexA, IndexB, IndexC). It's replicas of
nodes rather than overlapping ranges.

-Yonik
From: Ning L. <nin...@gm...> - 2008-03-01 00:21:05
On Fri, Feb 29, 2008 at 6:54 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> >> You've switched to counter-clockwise replication, but I think that's
> >
> > Oops, I always used clockwise replication. :)
>
> Just to be clear: when a new document D arrives, the question is what
> nodes on the ring do we send it to? The N clockwise (greater) or
> counterclockwise from (less than) D? The most recent messages assume
> counter-clockwise (less than). A document between C and D would be in
> the indexes on nodes A B and C. The Dynamo paper says "clockwise", but
> I think that direction makes things more confusing and that we should
> instead use counterclockwise. Note that the index at node C contains
> documents in range C-F, three nodes clockwise, but the Dynamo paper uses
> the term clockwise when talking about assigning instances to nodes, not
> the range that each node contains. Are we in agreement about this?

I'm confused with which is clockwise and which is counter-clockwise now. :(
I thought node C serves D-E and E-F in addition to C-D, so it's clockwise.
You are saying because range C-D is on nodes A, B and C, so it should be
called counter-clockwise. :) Sounds reasonable.

In any case, I always thought and I think we agree that node B serves
range B-E, node C serves range C-F, etc. So we had the same assumption
in the most recent discussions.

Ning
From: Doug C. <cu...@ap...> - 2008-02-29 23:54:43
Ning Li wrote:
>> You've switched to counter-clockwise replication, but I think that's
>
> Oops, I always used clockwise replication. :)

Just to be clear: when a new document D arrives, the question is what
nodes on the ring do we send it to? The N clockwise (greater) or
counterclockwise from (less than) D? The most recent messages assume
counter-clockwise (less than). A document between C and D would be in
the indexes on nodes A B and C. The Dynamo paper says "clockwise", but
I think that direction makes things more confusing and that we should
instead use counterclockwise. Note that the index at node C contains
documents in range C-F, three nodes clockwise, but the Dynamo paper uses
the term clockwise when talking about assigning instances to nodes, not
the range that each node contains. Are we in agreement about this?

Doug
From: Yonik S. <yo...@ap...> - 2008-02-29 22:46:26
On Fri, Feb 29, 2008 at 4:57 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > It depends on the documents I guess... if they are big, putting them
> > in the index can be a burden because they get copied on every segment
> > merge, and loading the other stored fields takes longer.
>
> Didn't Mike change that? Segments can now point to fields in a separate
> file, according to:
>
> http://lucene.apache.org/java/docs/fileformats.html#Segments%20File
>
> I think that's so that they don't have to be copied with every merge.

Ah right... they won't be copied at all for a single indexing session.
So full index builds won't be impacted much, just merges due to
incremental adds/changes.

-Yonik
From: Ning L. <nin...@gm...> - 2008-02-29 22:38:02
On Fri, Feb 29, 2008 at 5:18 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > I think this is how it works with replication=3:
> > Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> > Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> > on range D-E, and syncs with nodes D & E on range E-F.
>
> You've switched to counter-clockwise replication, but I think that's

Oops, I always used clockwise replication. :)

> generally more intuitive anyway. I also think of syncing as directional
> and pulled, not pushed. So I think I'd state it (equivalently) as:
>
> C syncs C-D from A, C-E from B, D-F from D, and E-F from F.

It's not node F, but node E:
C syncs C-D from A, C-E from B, D-F from D, and E-F from E.

> In other words, from each host it overlaps it syncs the overlapping
> range. Numbers might be simpler:
>
> X has X to X+1 and syncs:

It should be "X has X to X+3 and syncs".

> X to X+1 from X-2
> X to X+2 from X-1
> X+1 to X+3 from X+1
> X+2 to X+3 from X+2

The rest looks correct.

Ning
From: Doug C. <cu...@ap...> - 2008-02-29 22:23:38
Yonik Seeley wrote:
> For this phase it doesn't even seem like facets are required.
> Wouldn't simple queries exercise the read-side of things enough for
> testing and developing the distributed strategy?

Perhaps. But facets are something that we need to be sure to account
for correctly. They're harder to handle than duplicate hits, which can
simply be skipped. I was concerned that there are optimizations we
might make that wouldn't work for facets, so putting them in from the
start would ensure we didn't make those cheats. Overkill? Perhaps.

Doug
From: Ning L. <nin...@gm...> - 2008-02-29 22:19:36
On Fri, Feb 29, 2008 at 4:24 PM, Doug Cutting <cu...@ap...> wrote:
> Finally, as you observed, there would be hiccups whenever a node fails.
> Hiccups affect a small percentage of BigTable clients, only those
> touching the tablet on the failed node. But, in distributed search,
> every query touches a large portion of the nodes. So, in a 1000 node
> cluster, a failure might delay .1% of BigTable users, but might delay
> 33% of distributed search users (assuming 3-way replication). So search
> can be much more sensitive to this.
>
> So I'm not convinced that the BigTable model is as appropriate for
> distributed full-text search as consistent hashing. Thoughts?

I used BigTable as an example for single-write-replica. I didn't intend
to use BigTable's single-read/write-replica model for distributed
full-text search. I was thinking more like a single-write-multi-read-replica
model (using consistent hashing).

I considered your proposal as a multi-write/read-replica model. As I
said before, the multi-write/read-replica model is more powerful than
the single-write-multi-read-replica model. I was just worried about its
complexity. But now I think it can be achieved without being too
complicated. :)

Ning
From: Doug C. <cu...@ap...> - 2008-02-29 22:18:16
Ning Li wrote:
> I think this is how it works with replication=3:
> Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> on range D-E, and syncs with nodes D & E on range E-F.

You've switched to counter-clockwise replication, but I think that's
generally more intuitive anyway. I also think of syncing as directional
and pulled, not pushed. So I think I'd state it (equivalently) as:

C syncs C-D from A, C-E from B, D-F from D, and E-F from F.

In other words, from each host it overlaps it syncs the overlapping
range. Numbers might be simpler:

X has X to X+1 and syncs:
X to X+1 from X-2
X to X+2 from X-1
X+1 to X+3 from X+1
X+2 to X+3 from X+2

Does that sound right?

Doug
From: Doug C. <cu...@ap...> - 2008-02-29 21:57:30
Yonik Seeley wrote:
> It depends on the documents I guess... if they are big, putting them
> in the index can be a burden because they get copied on every segment
> merge, and loading the other stored fields takes longer.

Didn't Mike change that? Segments can now point to fields in a separate
file, according to:

http://lucene.apache.org/java/docs/fileformats.html#Segments%20File

I think that's so that they don't have to be copied with every merge.

> There are also two levels of "Document"... things like PDF, Word, etc,
> also need to be parsed and have fields extracted make a lucene-style
> Document. I assume that's out of scope for this project though.

Yes, I think an application could implement that with, e.g., a binary
field for the raw data, another field for the mime type, and a third
for the extracted text to index. The raw data and text might be
compressed.

Doug
From: Doug C. <cu...@ap...> - 2008-02-29 21:24:43
Ning Li wrote:
> On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
>> A unique updateable index per document would be nice, but I'm not yet
>> entirely convinced it is practical.
>
> Not if short glitches are not acceptable. In BigTable, a tablet is served
> by a single tablet server. I wonder if they find it to be a problem.

BigTable points towards a different architecture, where all modifications
are logged to a shared filesystem, and a single node handles both updates
and searches for that range of ids. Perhaps we should consider this more
seriously.

We want to scale flexibly both in collection size and in search traffic.
If search traffic is low, then indexes might be large, and if search
traffic is high, indexes might be smaller and replication might be
higher. But, with no search node replication, system performance tops
out at the rate that a node can process queries on a tiny index, which
is not infinite. So you'd probably want to add read-only replicas onto
the BigTable model. But then, when you have lots of writes, you don't
fully utilize your cluster, and our writes are much more compute
intensive than BigTable writes. I think configuring a cluster in this
model would be more complicated and less fluid.

Finally, as you observed, there would be hiccups whenever a node fails.
Hiccups affect a small percentage of BigTable clients, only those
touching the tablet on the failed node. But, in distributed search,
every query touches a large portion of the nodes. So, in a 1000 node
cluster, a failure might delay .1% of BigTable users, but might delay
33% of distributed search users (assuming 3-way replication). So search
can be much more sensitive to this.

So I'm not convinced that the BigTable model is as appropriate for
distributed full-text search as consistent hashing. Thoughts?

Doug
From: Doug C. <cu...@ap...> - 2008-02-29 21:10:21
Ning Li wrote:
> A copy of a document will be stored in a stored field, right?

If we don't keep complete documents in the log, then I figured that all
fields would be stored, so that a document retrieved from the index
could be directly added to another index.

> First, keeping both a doc and its inverted form in an index
> means storing the doc and indexing the doc are done in the
> same "transaction". A traditional document database often
> store a doc first and then index it later (hopefully soon).

It should be possible to configure Lucene so that adding a document
never triggers any blocking merges. So folks would have to wait for the
single-document inversion, but no more. Today, it won't yet be visible
to search for a bit, until a commit, but perhaps in a future version of
Lucene it might be possible to search unflushed segments.

> Second, a traditional document database often supports
> updating a doc's "metadata" such as author or date. We
> don't support this or we say a document is name-value
> pairs and we reconstruct from stored fields and support
> such update?

I assume that incrementally updateable fields must not be indexed? In
any case, I hope that indexing is fast enough that updating by
add+delete is okay. CouchDB seems to get away with it...

> #5 with non-sync light-weight logging? Should work.

Yes.

Doug
From: Ning L. <nin...@gm...> - 2008-02-29 19:17:15
On Fri, Feb 29, 2008 at 11:23 AM, Ning Li <nin...@gm...> wrote:
> On Fri, Feb 29, 2008 at 9:08 AM, Yonik Seeley <yo...@ap...> wrote:
> > On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> > > I'd imagined each node periodically querying its neighbors for changes
> > > in the range they share. We shouldn't rely on clock synchronization, so
> > > each node would keep the last revision of each neighbor that it has
> > > sync'd with. So, the first time they connect, they pass revision zero
> > > and receive all updates for their overlap. The next time they only need
> > > to retrieve updates since the last.
> >
> > Sounds good!
> >
> > We'll need to make sure that it handles this case:
> > Say C syncs from B, then B syncs from A and get some new docs. Now A
> > goes away for a while... it seems like C will need to get the new docs
> > from B next time it syncs. If the docs that B got from A are treated
> > like any other new additions, this should work right?
>
> I think this is how it works with replication=3:
> Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> on range D-E, and syncs with nodes D & E on range E-F.
>
> Now node D goes away. Node C continues to serve range C-F until it gets
> range F-G from node E or node F and starts to serve range C-G. Then,
> node C syncs with nodes A & B on range C-D (same as before), syncs with
> nodes B and E on range D-F, and syncs with nodes E & F on range F-G.

I see the following as a possible implementation:

Each node logs its changes in a light-weight log. A log is a sequence
of log entries, with the entry number monotonically increasing.

Let's say node X periodically syncs with node Y. A sync works as
follows: Node X starts with entry p in node Y's log where node X synced
with node Y last time. After node X finishes syncing (node X puts
actual changes into its own log), it sets its sync entry w/ node Y to
entry q.

When node D goes away, node C copies range F-G from node E or node F,
let's say node E. For the copy, node E's own log is at entry s and
node E is synced with node F at node F's entry t. Node C does the copy
but does not log the copy. After node C finishes the copy and before it
switches from serving range C-F to serving range C-G, it sets its sync
entry with node E to min(its current sync entry with node E, entry s)
and sets its sync entry with node F to entry t.

Is this efficient enough?

Ning
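A sketch of the sync-entry bookkeeping described above; the class and
method names are made up for illustration and none of this is existing
code:

import java.util.*;

class SyncState {
    // peer node id -> entry number in that peer's log up to which we have synced
    private final Map<String, Long> syncEntry = new HashMap<>();

    // After a normal sync with peer Y, record the last log entry processed.
    void afterSync(String peerY, long lastEntryProcessed) {
        syncEntry.put(peerY, lastEntryProcessed);
    }

    // After copying a range from peer E (while E's own log was at entry s and
    // E was synced with F at F's entry t): roll our pointer for E back if
    // needed and adopt E's pointer for F, so no log entries are skipped when
    // regular syncing resumes.
    void afterRangeCopy(String peerE, long entryS, String peerF, long entryT) {
        long currentE = syncEntry.getOrDefault(peerE, 0L);
        syncEntry.put(peerE, Math.min(currentE, entryS));
        syncEntry.put(peerF, entryT);
    }

    long entryFor(String peer) {
        return syncEntry.getOrDefault(peer, 0L);
    }
}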