From: Andrew B. <ab...@pt...> - 2009-04-03 17:36:41
|
Hi,

In a lab setup we recently noticed that the TIPC name tables on a few TIPC nodes are out of sync. The question I have is about recovery from this state. We're not sure exactly how we got into this state yet; we're still gathering data. It's possible that packets were leaking between two networks that weren't supposed to talk.

Now that the node status appears to have stabilized, we notice that the name tables are out of sync between some nodes. Connection attempts to those mismatched name table entries consistently fail.

As I understand it, there is no user command to sync the tables, because they aren't supposed to get out of sync. The only way to sync name tables on two nodes is to withdraw one of them from the network and then bring it back.

I'm considering implementing a user command to "republish" the name table to all nodes. Before I dig too deeply, though, I thought I'd ask whether this was likely to be a large or small project.

What I had in mind was a tipc-config command that would resend the name table information to all known nodes. On receipt of the name table updates, the receiver could create missing entries or correct a name-to-address/port assignment. This approach could correct missing or incorrect name table entries at a remote node; it would not correct extra name table entries at the remote node.

Any thoughts on the complexity of this endeavor? I'm hoping it's as simple as:
* add a hook for the command
* acquire appropriate locks
* send a message to each known node about each name table entry, as if it was a first-time publish
* release appropriate locks

The first issue I suspect is that if the receiver knows about the name but has a mismatch in address/port, it may reject the notification rather than update its assignment.

Another approach would be to provide a tipc-config type command to explicitly set or clear name table entries. This would possibly be simpler than the above, but would also have a much greater chance of being used incorrectly.
Comments welcome, Andrew |
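The merge behavior Andrew proposes — create missing entries, correct mismatched ones, but leave extras alone — can be sketched as a toy user-space model. This is a Python simulation, not TIPC code; the table shapes and names are illustrative only:

```python
# Toy model of the proposed "republish" reconciliation. A name table maps
# (type, instance) -> (node, port ref). The receiver creates missing
# entries and corrects mismatched bindings, but it cannot detect entries
# it holds that the publisher never re-sends, so extras survive.

def apply_republish(local_table, published_entries):
    """Merge republished name->binding pairs into local_table in place."""
    created = corrected = 0
    for name, binding in published_entries.items():
        if name not in local_table:
            local_table[name] = binding      # missing entry: create it
            created += 1
        elif local_table[name] != binding:
            local_table[name] = binding      # stale binding: overwrite it
            corrected += 1
    return created, corrected

# Publisher's view: named port (50000, 3004) lives at node 1.1.4, ref 1234.
publisher = {(50000, 3004): ("1.1.4", 1234)}

# Receiver's view: same name with a stale ref, plus an obsolete extra name.
receiver = {(50000, 3004): ("1.1.4", 9999),
            (51000, 4000): ("1.1.4", 42)}

created, corrected = apply_republish(receiver, publisher)
# The mismatched binding is corrected, but the extra (51000, 4000) entry
# remains -- exactly the limitation noted above.
```

This makes the gap in the proposal concrete: a merge-only republish can never shrink a remote table.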
From: Randy M. <rwm...@gm...> - 2009-04-19 00:54:48
|
Hi Andrew,

Since no one has commented yet....

On Fri, Apr 3, 2009 at 1:03 PM, Andrew Booth <ab...@pt...> wrote:
> In a lab setup we recently noticed that the TIPC name tables on a few
> TIPC nodes are out of sync. The question I have is about recovery from
> this state. We're not sure exactly how we got in this state yet, we're
> still gathering data. It's possible that packets were leaking between
> two networks that weren't supposed to talk.

Any progress on determining the root cause? That really should be your
primary focus. You might look at what happens when TIPC can't allocate
memory for a packet. I've upped min_free_kbytes to ensure that we don't
over-commit memory. If you want, tell us more about your network (TIPC
version, Linux version, physical network, machine type, memory, disk, etc.).

> Now that the node status appears to have stabilized, we notice that the
> name tables are out of sync between some nodes. Connection attempts to
> those mismatched name table entries consistently fail.
>
> As I understand it, there is no user command to sync the tables, because
> they aren't supposed to get out of sync. The only way to sync name
> tables on two nodes is to withdraw one of them from the network and then
> bring it back.

Right.

> I'm considering implementing a user command to "republish" the name
> table to all nodes. Before I dig too deeply though, I thought I'd ask
> whether this was likely to be a large or small project.

Small but not tiny, I'd say.

> What I had in mind was a tipc-config command that would resend the name
> table information to all known nodes. On receipt of the name table
> updates, the receiver could create missing entries or correct a name to
> address/port assignment. This approach could correct missing or
> incorrect name table entries at a remote node, it would not correct
> extra name table entries at the remote node.
I guess you could do it as an emergency measure, but you really should
design your apps so that you can reset a node if such problems occur
and, as I said above, find and fix the root cause of the problem.

> Any thoughts on the complexity of this endeavor? I'm hoping its as
> simple as:
> * add a hook for the command
> * acquire appropriate locks
> * send a message to each known node about each name table entry, as if
>   it was a first time publish
> * release appropriate locks
> The first issue I suspect is that if the receiver knows about the name
> but has a mismatch in address/port, that it may reject the notification
> rather than update its assignment.

This sounds reasonable. Have you checked out the code or started to
implement this at all?

> Another approach would be to provide a tipc-config type command to
> explicitly set or clear name table entries. This would possibly be
> simpler than the above, but would also have much more chance of being
> used incorrectly.

Yeah, I don't really like this approach. I think one should implement a
manual re-learn without dropping good data, then, once you get some
experience with that approach, implement it as a periodic audit driven
either by a timer or by failed sends/connects.

Has anyone else had these problems?

--
../Randy/.. |
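Randy's suggestion — a re-learn triggered either by a timer or by failed sends/connects — could be driven by something like the following toy trigger. This is a hypothetical sketch; the class, thresholds, and method names are made up for illustration and are not part of TIPC or tipc-config:

```python
# Sketch of a periodic-audit trigger: fire a name table re-learn either
# when a time budget expires or when enough sends/connects have failed.
import time

class AuditTrigger:
    def __init__(self, period_s=300.0, fail_threshold=5):
        self.period_s = period_s            # timer-driven audit interval
        self.fail_threshold = fail_threshold  # failure-driven audit limit
        self.failures = 0
        self.last_audit = time.monotonic()

    def record_failure(self):
        """Call on each failed send/connect to a named port."""
        self.failures += 1

    def should_audit(self, now=None):
        """True when either trigger condition is met; resets state if so."""
        now = time.monotonic() if now is None else now
        if (now - self.last_audit >= self.period_s
                or self.failures >= self.fail_threshold):
            self.last_audit = now
            self.failures = 0
            return True
        return False

trig = AuditTrigger(period_s=300.0, fail_threshold=3)
```

The point of the two conditions is exactly Randy's "timer or failed sends/connects": a quiet node still audits eventually, while a node seeing connect failures audits right away.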
From: Stephens, A. <all...@wi...> - 2009-04-20 14:19:20
|
Hi there:

Personally, I'm not in favor of introducing the sort of name table
resynching that Andrew proposes, if we can avoid it. It is much more
desirable to determine what the problem is that allowed the name tables
to get out of synch and fix that instead.

One major problem I see with Andrew's proposal is that it only deals
with missing or out-of-date name table entries, but doesn't purge
obsolete name table entries. Such entries are problematic for two
reasons: a) they may cause applications to send messages to ports that
no longer exist in the network, and b) they may prevent new name table
entries from being added in the future (although this is relatively
unlikely). I'd be much happier seeing a proposal that gets rid of these
stale entries, too.

FYI, Andrew has sent me some Wireshark traces and other info that leads
me to suspect that the source of his problem is a breakdown in the
behavior of TIPC's broadcast link. (These weren't sent to the mailing
list since the files involved were rather large.) It's also worth
pointing out that he's running a network with dual Ethernet LANs, which
provides redundant links between nodes. Since changes were made to the
broadcast link in TIPC 1.7.6 to address other issues that arose in this
type of network, it's possible that the changes I made were insufficient
to avoid (or, gulp, actually introduced) the problems Andrew is seeing.

As well, if the issue is limited to networks running redundant links, it
may explain why no one else has reported having name table inconsistency
problems yet, since this kind of network seems to be less commonly used
than the single-link-between-nodes variety.

Regards,
Al
|
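Al's objection is that a merge-only republish never removes obsolete entries. A full resync would need something closer to mark-and-sweep: treat the republished set as authoritative for the publishing node, and drop any local entry attributed to that node that was not re-sent. The sketch below is a toy Python model of that idea, not TIPC code; a real implementation would have to do this per publishing node, under the appropriate locks:

```python
# Mark-and-sweep resync: the republished set is authoritative for one
# publishing node. Entries owned by other nodes are left untouched.

def full_resync(local_table, publisher_node, published_entries):
    """Replace every entry owned by publisher_node with the republished set."""
    # Sweep: drop stale entries this node no longer publishes.
    stale = [name for name, (node, _ref) in local_table.items()
             if node == publisher_node and name not in published_entries]
    for name in stale:
        del local_table[name]
    # Mark: install the authoritative set.
    for name, ref in published_entries.items():
        local_table[name] = (publisher_node, ref)

receiver = {(50000, 3004): ("1.1.4", 9999),   # stale ref, gets corrected
            (51000, 4000): ("1.1.4", 42),     # obsolete, gets purged
            (60000, 1):    ("1.1.7", 7)}      # another node's entry, kept
full_resync(receiver, "1.1.4", {(50000, 3004): 1234})
```

Unlike the merge-only approach, this addresses both of Al's failure modes: messages sent to dead ports, and stale entries blocking future publications.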
From: Andrew B. <ab...@pt...> - 2009-04-20 17:42:17
|
Hi Allan,

Agreed, finding the root cause is probably a better approach.

I've pretty much given up on the idea of re-publishing the name table
for now; if the broadcast link is confused, then the name publications
might not get sent anyway. Also, as you mentioned, Allan, my proposal
would not fix all possible name table discrepancies anyway.

We'll keep working here to try to reproduce the issue and discover the
root cause.

Allan and Randy, thanks for the feedback,
Andrew
|
From: Andrew B. <ab...@pt...> - 2009-05-28 18:39:05
|
Hi Allan,

We've again got a lab system in a state where the name tables on two
cards are out of sync. This issue is currently ongoing; we can poke the
cards for more diagnostics as required. A similar issue came up about a
month ago. Any ideas about how to troubleshoot this would be appreciated.

I'll provide context for the question, but context always takes a
while, so I apologize in advance for the long email. The basic sections
of this email are:
* a description of the IP and physical network, plus O/S and TIPC versions
* a description of the problem that is currently manifesting
* a description of the problems we saw a month ago, and some of the
  investigations we have done since

Note that the system in question is a development chassis and hence is
subjected to some unusual stresses. It is not inconceivable that the
issue manifested when someone had a process spinning the CPU or sending
lots of UDP or TIPC traffic. Certainly, since the TIPC name table issue
manifested, there has been a large amount of logging and network traffic
related to recovery attempts.

Thanks for any ideas,
Andrew

===== Network =====

The TIPC network consists of 18 CPU cards and two switch cards in a
chassis. IP communication between cards takes place across the chassis
midplane. The midplane supports two disjoint Ethernet fabrics. Each CPU
card has two Ethernet ports on the midplane, with each Ethernet port
attached to one midplane fabric. Each of the switch cards manages
traffic on one of the midplane fabrics. The switches are configured
(via VLANs) so that TIPC traffic does not leave the chassis.

All that to say: the TIPC network consists of 20 nodes, and each node
can communicate with any other node using either of two disjoint IP
subnets. The TIPC address of the card in slot x is 1.1.x. This set of
IP networks within the chassis may come up again; call it the System
Area Network (SAN).
Some of the CPU cards have additional logical Ethernet interfaces for
non-TIPC IP communication outside the chassis. This traffic also uses
the midplane Ethernet fabrics, but is VLAN-tagged within the chassis to
distinguish it from SAN traffic.

This setup of cards within a chassis is replicated several times in a
lab. The chassis are connected to a common lab IP network.

===== CPU card details =====

The CPU cards come in three varieties:

* Routing CPUs: 1.8 GHz dual-core AMD Opteron with 2 or 4 GB RAM.
  These cards have Compact Flash cards for the operating system and
  software. Logging is forwarded to Management CPUs over UDP.
  /proc/sys/vm/min_free_kbytes is 3816.
* Management CPUs: as Routing CPUs, but with a hard drive for logging.
* Line CPUs: 800 MHz PPC440GX with 512 MB RAM.
  These cards have Compact Flash cards for the operating system and
  software. Logging is forwarded to Management CPUs over UDP.
  /proc/sys/vm/min_free_kbytes is 2698.

In each case they are running Linux kernel 2.6.20 and TIPC 1.7.6.

===== Issue Summary: May 28, 2009 =====

Originally noticed (about May 21, 2009): applications on other cards
cannot open a TIPC connection to a named port on slot 4. On further
investigation, there is a name table mismatch between slot 4 and the
other cards: each card recognizes that named port (50000, 3004) is on
1.1.4, but the associated port number is different on slot 4 than on
the other cards.
Perform the following test:
* log in to slot 4
* open a new TIPC named port (51000, 4000)
* tipc-config -nt on slot 4 shows the new named port
* tipc-config -nt on slot 3 does not show the new named port

A tipc-config -ls=broadcast-link on slot 4, before and after:

Slot 4, before opening the named port:
  Link <broadcast-link>
  Window:20 packets
  RX packets:368 fragments:0/0 bundles:0/0
  TX packets:20 fragments:0/0 bundles:2/29
  RX naks:0 defs:0 dups:0
  TX naks:0 acks:25 dups:0
  Congestion bearer:0 link:0
  Send queue max:22 avg:16

Slot 4, after opening the named port:
  Link <broadcast-link>
  Window:20 packets
  RX packets:368 fragments:0/0 bundles:0/0
  TX packets:20 fragments:0/0 bundles:2/30
  RX naks:0 defs:0 dups:0
  TX naks:0 acks:25 dups:0
  Congestion bearer:0 link:0
  Send queue max:22 avg:16

Note that nothing in the TX stats changes except bundles (2/29 to
2/30). This trend continues if we restart the tipc service a few times.

Tracing the code, the printed stats appear to come from
tipc_bclink_stats(). Given that bcl->stats.sent_bundles is unchanged
while bcl->stats.sent_bundled increases, we must be calling
link_bundle_buf() without creating a new bundle. There is only one
place this can happen, and it gives some clues about the internal data:
the call path through tipc_link_send_buf().

The following conditions must hold, since we have to reach the call to
link_bundle_buf() at the end:
  !(queue_size >= queue_limit)
  !(size > max_packet)
  (tipc_bearer_congested(l_ptr->b_ptr, l_ptr) || link_congested(l_ptr))
  ((msg_user(msg) != CHANGEOVER_PROTOCOL) && (msg_user(msg) != MSG_FRAGMENTER))

In this branch, we must hit the first call to link_bundle_buf() -- the
one under the comment "/* Try adding message to an existing bundle */"
-- so that it can increment l_ptr->stats.sent_bundled, since the second
call would increase l_ptr->stats.sent_bundles, and that is not shown by
the tipc-config -ls output. Here, l_ptr->next_out must be true, so we
call link_bundle_buf(l_ptr, l_ptr->last_out, buf). This in turn must
return 1, since we incremented l_ptr->stats.sent_bundled.
Hence tipc_bearer_resolve_congestion(l_ptr->b_ptr, l_ptr) also gets
called. If we assume that tipc_link_send_buf() was called from
tipc_bclink_send_msg() (since it is a name table update), then we also
know that bclink->bcast_nodes.count is not zero. I'm not sure where
things break down after that.

===== Issue Summary from April 2009 =====

I'll give point-form notes here. We had a similar issue in April 2009.
The card in slot 20 (a Management CPU) could not publish name
publications. Also, there were several TIPC links to Line CPUs that
were in the DEFUNCT state and would not recover. Unfortunately, after
we had poked at the system for a bit, the affected system lost power
during a lightning storm.

On follow-up tests we were able to cause strange behavior on the Line
CPUs as follows:
* pick a Line CPU, say slot 7
* from a Management CPU (say slot 3), bombard the Line CPU with
  Ethernet packets on fabric 1
* from the Management CPU, ping the Line CPU on fabric 0
* note that the pings stop being answered
* stop the flood of packets
* note that there could be name table mismatches between slot 7 and others
* note that there could be defunct links between slot 7 and others

We changed the logging configuration so TIPC status messages (such as
"link failed") were not echoed to the console, and the situation
improved: there was an interruption of service to slot 7 during the
Ethernet storm, but the system recovered afterwards.

The best guess at the moment is that printk() can be a slow function
call when it echoes to the console. This logging slowed things down
enough that the Ethernet receive buffer could overflow, at which point
the Ethernet driver would reset and lose packets in the receive queue.
This affected traffic on both Ethernet interfaces.
|
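The bundling-counter reasoning in Andrew's May 28 message can be captured in a toy model: appending a message to an existing bundle bumps sent_bundled only, while starting a new bundle bumps both counters, so "sent_bundled rising, sent_bundles flat" implies every name table message is being folded into a bundle that never leaves a congested link's send queue. The class below is a simplified illustration of that counter behavior, not the actual TIPC kernel code:

```python
# Toy model of broadcast-link TX counters under congestion. In the real
# code the stats are bcl->stats.sent_bundles / sent_bundled; here the
# structure is deliberately simplified to show only the counter logic.

class BcastLink:
    def __init__(self):
        self.sent_packets = 0   # messages sent directly (link not congested)
        self.sent_bundles = 0   # new bundle buffers created
        self.sent_bundled = 0   # messages folded into bundles
        self.queue = []         # bundle buffers waiting on the congested link

    def send(self, msg, congested):
        if not congested:
            self.sent_packets += 1          # normal transmit path
            return
        if self.queue:
            self.queue[-1].append(msg)      # append to an existing bundle:
            self.sent_bundled += 1          # sent_bundled rises, bundles flat
        else:
            self.queue.append([msg])        # start a new bundle:
            self.sent_bundles += 1          # both counters rise
            self.sent_bundled += 1

link = BcastLink()
link.send("publish A", congested=True)
link.send("publish B", congested=True)
# bundles:1/2 -- further congested sends keep raising only sent_bundled,
# matching the observed bundles:2/29 -> 2/30 pattern.
```

Under this model, a name table publication on a persistently congested broadcast link is queued but never transmitted, which would explain why slot 3 never learns about slot 4's new named port.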