From: Randy M. <rwm...@gm...> - 2009-04-19 00:54:48
Hi Andrew,

Since no one has commented yet....

On Fri, Apr 3, 2009 at 1:03 PM, Andrew Booth <ab...@pt...> wrote:
> In a lab setup we recently noticed that the TIPC name tables on a few
> TIPC nodes are out of sync. The question I have is about recovery from
> this state. We're not sure exactly how we got in this state yet, we're
> still gathering data. It's possible that packets were leaking between
> two networks that weren't supposed to talk.

Any progress on determining the root cause? That really should be your
primary focus. You might look at what happens when tipc can't allocate
memory for a packet. I've upped min_free_kbytes to ensure that we don't
over-commit memory. If you want, tell us more about your network (tipc
version, linux version, physical network, machine type, memory, disk,
etc).

>
> Now that the node status appears to have stabilized, we notice that the
> name tables are out of sync between some nodes. Connection attempts to
> those mismatched name table entries consistently fail.
>
> As I understand it, there is no user command to sync the tables, because
> they aren't supposed to get out of sync. The only way to sync name
> tables on two nodes is to withdraw one of them from the network and then
> bring it back.

Right.

> I'm considering implementing a user command to "republish" the name
> table to all nodes. Before I dig too deeply though, I thought I'd ask
> whether this was likely to be a large or small project.

Small but not tiny, I'd say.

> What I had in mind was a tipc-config command that would resend the name
> table information to all known nodes. On receipt of the name table
> updates, the receiver could create missing entries or correct a name to
> address/port assignment. This approach could correct missing or
> incorrect name table entries at a remote node, it would not correct
> extra name table entries at the remote node.
I guess you could do it as an emergency measure, but you really should
design your apps so that you can reset a node if such problems occur and,
as I said above, find and fix the root cause of the problem.

> Any thoughts on the complexity of this endeavor? I'm hoping its as
> simple as:
> * add a hook for the command
> * acquire appropriate locks
> * send a message to each known node about each name table entry, as if
> it was a first time publish
> * release appropriate locks
> The first issue I suspect is that if the receiver knows about the name
> but has a mismatch in address/port, that it may reject the notification
> rather than update its assignment.

This sounds reasonable. Have you checked out the code or started to
implement this at all?

>
> Another approach would be to provide a tipc-config type command to
> explicitly set or clear name table entries. This would possibly be
> simpler than the above, but would also have much more chance of being
> used incorrectly.

Yeah, I don't really like this approach. I think one should implement a
manual re-learn without dropping good data, then, once you get some
experience with that approach, implement it as a periodic audit driven
either by a timer or by failed sends/connects.

Has anyone else had these problems?

--
../Randy/..
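
To make the "re-learn without dropping good data" idea concrete, here is a
rough user-space sketch of the receiver-side merge and of an audit pass. It
is only a toy model and not TIPC kernel code: NameTable, apply_republish and
audit are hypothetical names, and it assumes one binding per name, which is
simpler than a real name table. It shows the behaviour discussed above: a
republish creates missing entries and corrects mismatched address/port
assignments, but leaves extra entries at the remote node in place.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy model: a name is (type, instance) and a binding is
# (publishing node, port reference). Assumes one binding per name.
Name = Tuple[int, int]
Binding = Tuple[str, int]


@dataclass
class NameTable:
    """Toy stand-in for one node's view of the names its peers published."""
    entries: Dict[Name, Binding]

    def apply_republish(self, published: Dict[Name, Binding]) -> Dict[str, int]:
        """Merge a peer's republished names; never delete existing entries."""
        stats = {"created": 0, "corrected": 0, "unchanged": 0}
        for name, binding in published.items():
            current = self.entries.get(name)
            if current is None:
                stats["created"] += 1      # entry was missing at the receiver
            elif current != binding:
                stats["corrected"] += 1    # address/port mismatch, overwrite it
            else:
                stats["unchanged"] += 1
                continue
            self.entries[name] = binding
        return stats


def audit(node: str, published: Dict[Name, Binding],
          remote: NameTable) -> Dict[str, List[Name]]:
    """Compare what `node` really publishes with what `remote` believes."""
    missing = [n for n in published if n not in remote.entries]
    mismatched = [n for n, b in published.items()
                  if n in remote.entries and remote.entries[n] != b]
    # Entries the remote attributes to `node` that `node` no longer publishes.
    extra = [n for n, b in remote.entries.items()
             if b[0] == node and n not in published]
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


if __name__ == "__main__":
    published = {(1000, 1): ("node-a", 101), (1000, 2): ("node-a", 102)}
    remote = NameTable({(1000, 2): ("node-a", 999),    # wrong port
                        (1000, 3): ("node-a", 103)})   # stale extra entry
    print(audit("node-a", published, remote))
    print(remote.apply_republish(published))
    print(audit("node-a", published, remote))
```

In this model the second audit still reports (1000, 3) as extra, which is
the limitation of a republish-only command. A periodic audit, run from a
timer or after failed sends/connects, would be the place to withdraw such
stale entries.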