From: Stephens, A. <all...@wi...> - 2009-04-20 14:19:20
Hi there:

Personally, I'm not in favor of introducing the sort of name table
resynching that Andrew proposes, if we can avoid it. It is much more
desirable to determine what allowed the name tables to get out of synch
and fix that instead.

One major problem I see with Andrew's proposal is that it only deals
with missing or out-of-date name table entries, but doesn't purge
obsolete name table entries. Such entries are problematic for two
reasons:

a) they may cause applications to send messages to ports that no longer
   exist in the network, and
b) they may prevent new name table entries from being added in the
   future (although this is relatively unlikely).

I'd be much happier seeing a proposal that gets rid of these stale
entries, too.

FYI, Andrew has sent me some Wireshark traces and other info that lead
me to suspect that the source of his problem is a breakdown in the
behavior of TIPC's broadcast link. (These weren't sent to the mailing
list since the files involved were rather large.)

It's also worth pointing out that he's running a network with dual
Ethernet LANs, which provides redundant links between nodes. Since
changes were made to the broadcast link in TIPC 1.7.6 to address other
issues that arose in this type of network, it's possible that the
changes I made were insufficient to avoid (or, gulp, actually
introduced) the problems Andrew is seeing. As well, if the issue is
limited to networks running redundant links, it may explain why no one
else has reported having name table inconsistency problems yet, since
this kind of network seems to be less commonly used than the
single-link-between-nodes variety.

Regards,
Al

> -----Original Message-----
> From: Randy MacLeod [mailto:rwm...@gm...]
> Sent: Saturday, April 18, 2009 8:55 PM
> To: Andrew Booth
> Cc: tipc-discussion
> Subject: Re: [tipc-discussion] Synching TIPC Name Tables
>
> Hi Andrew,
>
> Since no one has commented yet....
>
> On Fri, Apr 3, 2009 at 1:03 PM, Andrew Booth <ab...@pt...> wrote:
> > In a lab setup we recently noticed that the TIPC name tables on a
> > few TIPC nodes are out of sync. The question I have is about
> > recovery from this state. We're not sure exactly how we got in this
> > state yet; we're still gathering data. It's possible that packets
> > were leaking between two networks that weren't supposed to talk.
>
> Any progress on determining the root cause? That really should be
> your primary focus.
> You might look at what happens when tipc can't allocate memory for a
> packet.
> I've upped min_free_kbytes to ensure that we don't over-commit memory.
> If you want, tell us more about your network (tipc version, linux
> version, physical network, machine type, memory, disk, etc).
>
> > Now that the node status appears to have stabilized, we notice that
> > the name tables are out of sync between some nodes. Connection
> > attempts to those mismatched name table entries consistently fail.
> >
> > As I understand it, there is no user command to sync the tables,
> > because they aren't supposed to get out of sync. The only way to
> > sync name tables on two nodes is to withdraw one of them from the
> > network and then bring it back.
>
> Right.
>
> > I'm considering implementing a user command to "republish" the name
> > table to all nodes. Before I dig too deeply though, I thought I'd
> > ask whether this was likely to be a large or small project.
>
> Small but not tiny, I'd say.
>
> > What I had in mind was a tipc-config command that would resend the
> > name table information to all known nodes. On receipt of the name
> > table updates, the receiver could create missing entries or correct
> > a name to address/port assignment. This approach could correct
> > missing or incorrect name table entries at a remote node, but it
> > would not correct extra name table entries at the remote node.
>
> I guess you could do it as an emergency measure, but you really
> should design your apps so that you can reset a node if such problems
> occur and, as I said above, find and fix the root cause of the
> problem.
>
> > Any thoughts on the complexity of this endeavor? I'm hoping it's as
> > simple as:
> > * add a hook for the command
> > * acquire appropriate locks
> > * send a message to each known node about each name table entry, as
> >   if it was a first time publish
> > * release appropriate locks
> > The first issue I suspect is that if the receiver knows about the
> > name but has a mismatch in address/port, it may reject the
> > notification rather than update its assignment.
>
> This sounds reasonable. Have you checked out the code or started to
> implement this at all?
>
> > Another approach would be to provide a tipc-config type command to
> > explicitly set or clear name table entries. This would possibly be
> > simpler than the above, but would also have much more chance of
> > being used incorrectly.
>
> Yeah, I don't really like this approach.
> I think one should implement a manual re-learn without dropping good
> data; then, once you get some experience with that approach,
> implement it as a periodic audit driven either by a timer or by
> failed sends/connects.
>
> Has anyone else had these problems?
>
> --
> ../Randy/..
>
> _______________________________________________
> tipc-discussion mailing list
> tip...@li...
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
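
To make the stale-entry objection concrete, here is a minimal Python
sketch of the two behaviours discussed in this thread. It is only an
illustration under loud assumptions: the name table is modelled as a
plain dict mapping a hypothetical (type, instance) name to a
(node, port) binding, and the function names republish_merge and
resync_with_purge are invented for this example. None of it reflects
TIPC's real data structures, locking, or message formats.

```python
# Simplified model, NOT TIPC's real data structures (assumption):
# a name table is a dict mapping (type, instance) -> (node, port).


def republish_merge(remote_table, published):
    """Republish as proposed: add missing names, correct wrong bindings."""
    merged = dict(remote_table)
    merged.update(published)  # inserts missing names, overwrites mismatches
    return merged


def resync_with_purge(remote_table, published, publisher_node):
    """Same merge, but first drop names bound to publisher_node that the
    publisher did not include in what it just republished."""
    merged = {
        name: binding
        for name, binding in remote_table.items()
        if binding[0] != publisher_node or name in published
    }
    merged.update(published)
    return merged


# Hypothetical node "1.1.1" currently publishes two names.
published = {(1000, 1): ("1.1.1", 101), (1000, 2): ("1.1.1", 102)}

# The remote copy has a wrong port for one name, lacks another, and still
# holds a stale name that "1.1.1" withdrew earlier.
remote = {
    (1000, 1): ("1.1.1", 999),  # mismatched port
    (1000, 3): ("1.1.1", 103),  # stale: no longer published
    (2000, 1): ("1.1.2", 201),  # owned by a different node, must survive
}

merged_only = republish_merge(remote, published)
resynced = resync_with_purge(remote, published, "1.1.1")

print((1000, 3) in merged_only)  # True: stale entry survives a republish
print((1000, 3) in resynced)     # False: the purge removes it
```

In this toy model the republish fixes the mismatched port and adds the
missing name, but the withdrawn name (1000, 3) stays in the remote
table, which is the case that could still direct messages to a port
that no longer exists. The purge variant can only remove it because the
receiver treats the republished set as the complete list for that one
node; entries owned by other nodes are left alone.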