From: Nayman Felix-Q. <Fel...@mo...> - 2010-08-24 16:39:18
Al,

So we ran the test you suggested, using "Al's patch without Surya's", and we were able to reproduce our name table distribution problem.

From reading your post and the comment about link changeover, I had a question. Are you saying that this problem is related to having redundant links? We recently (2 months ago) changed to using redundant links in an active/standby configuration. This coincided with our switch to TIPC 1.7.6 and with the onset of all the problems we've recently seen. So I'm wondering: if we went back to using only 1 link between nodes instead of 2, do you think our name table distribution problem would go away?

Thanks,
Felix

-----Original Message-----
From: Stephens, Allan [mailto:all...@wi...]
Sent: Monday, August 16, 2010 1:07 PM
To: Nayman Felix-QA5535; Suryanarayana Garlapati; tip...@li...
Subject: RE: [tipc-discussion] TIPC Name table distribution problem in TIPC 1.7.6

Hi Felix:

I'm finally back from vacation and can respond to your recent emails (at least to some degree).

I'd be interested in knowing whether you experience any problems if you try running TIPC 1.7.7-rc1 with Al's patch (presumably the one entitled "Prevent broadcast link stalling in dual LAN environment") but not Surya's patch. From the data you supplied below, it is conceivable that this might be sufficient to resolve your issue. I did incorporate a portion of Surya's patch into the TIPC 1.7 stream, but I haven't yet brought in the part that changes the default link window size or the part that resets the link's reset_checkpoint when a link changeover is aborted; the delay is because I'm not yet convinced that either of these changes is necessary and/or correct. If you encounter problems when Surya's changes are missing, this would provide evidence that they are actually needed.

I don't have any guidance to provide on what link window size values should be used for either the unicast links or the broadcast link; I suspect that the value will depend heavily on the type of hardware you're running in your system and the nature of the traffic load you're passing between nodes, and that you'll need to experiment to see what values work best for your system. It looks like Peter was running his traffic over high speed Ethernet interfaces and found that he needed to increase his unicast window size to prevent the links from declaring congestion prematurely; presumably the larger window sizes helped improve his link throughput. I've got no idea where the broadcast link window size of 224 came from; as with the unicast links, you're probably best off experimenting to see what values work best in your system.

I'm continuing to investigate the entire broadcast link and dual link areas to see what issues remain unresolved, as there are still some known issues that look to be problematic. I suspect that there will be a few more patches added to TIPC 1.7 before things are totally stabilized.

Regards,
Al
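For anyone repeating these experiments, the window values Al discusses can be inspected and changed at runtime with tipc-config, so trying different sizes should not require a rebuild. A minimal sketch, assuming the -lw option takes the <link-name>/<value> form described in the TIPC 1.7 tipc-config help; verify the exact syntax against tipc-config --help on your build:

    # query the current broadcast link window
    tipc-config -lw=broadcast-link

    # experimentally raise the broadcast link window, e.g. to the 50 from Surya's patch
    tipc-config -lw=broadcast-link/50

    # unicast link windows are set per link name, e.g. Peter's 1024
    tipc-config -lw=1.1.1:eth0-1.1.2:eth0/1024

Settings applied this way are not persisted across a TIPC restart, which makes them convenient for the kind of trial-and-error tuning Al recommends.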
> -----Original Message-----
> From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> Sent: Sunday, August 15, 2010 10:41 PM
> To: Suryanarayana Garlapati; tip...@li...
> Subject: Re: [tipc-discussion] TIPC Name table distribution problem in TIPC 1.7.6
>
> All,
>
> So we've run a number of tests:
> 1) TIPC 1.7.6. Result: We were able to continuously reproduce the issue.
> 2) Surya's patch on top of 1.7.6. Result: We could not reproduce the issue.
> 3) Surya's patch on top of 1.7.6 with the window size set to 20 instead of 50. Result: We were able to reproduce the issue.
> This means that, at least for our test run, the window size change was the main reason why we didn't see the name table distribution problem.
> 4) Surya's and Al's patches on top of 1.7.6. Result: We could not reproduce the issue.
> 5) Surya's and Al's patches on top of 1.7.6 with the window size set to 20 instead of 50. Result: We could not reproduce the issue.
> 6) TIPC 1.7.7-rc1 with Surya's and Al's patches. Result: We could not reproduce the issue.
>
> So it looks like the combination of Al's patch and Surya's patch prevents our name table distribution problem no matter whether the window size is 20 or 50, but Surya's patch alone does not work when we reduce the window size down to 20.
>
> I saw that in TIPC 1.7.6 support for window sizes as high as 8192 was added. We are considering increasing our window size for both the broadcast link and the unicast links from the current default size of 50. I'm wondering if there is a recommended maximum value. From Peter Litov's post it appears that he's tried 1024 for the unicast link window size. Are there any issues with this value? Has it been found to improve throughput and alleviate congestion issues? Are there any drawbacks or side effects? He mentions a broadcast link window size of 224. I'm not sure why 224 is a magic number, but are there any recommendations for the broadcast link window size?
>
> Thanks for any feedback,
> Felix
>
> -----Original Message-----
> From: Suryanarayana Garlapati [mailto:SGa...@go...]
> Sent: Tuesday, August 03, 2010 12:45 AM
> To: Nayman Felix-QA5535; tip...@li...
> Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
>
> Hi Felix,
> This is an issue of broadcast link congestion. Please try my patch and let me know whether it works. I had faced a similar issue, and with my patch it was fixed.
>
> Regards
> Surya
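The congestion Surya refers to is a property of the link send window: a link may only have a window's worth of unacknowledged packets outstanding, and further sends are refused until acknowledgements drain the send queue. A minimal sketch of that mechanism in C; the struct and function names here are illustrative, not the actual TIPC 1.7 source:

    /* Simplified model of TIPC-style link send-window congestion.
     * Illustrative only -- the real TIPC link structures differ. */
    struct link_model {
            unsigned int window;          /* max unacked packets, e.g. 20 or 50 */
            unsigned int out_queue_size;  /* packets queued/sent but not acked  */
    };

    /* A link must reject new traffic while its queue exceeds the window. */
    static int link_congested(const struct link_model *l)
    {
            return l->out_queue_size >= l->window;
    }

    /* Acknowledgements from peers drain the queue. If a peer stops
     * acking -- for example after a mishandled link changeover -- the
     * queue never drops below the window and the link stays congested
     * indefinitely. */
    static void link_ack(struct link_model *l, unsigned int acked)
    {
            l->out_queue_size -= (acked < l->out_queue_size) ?
                                 acked : l->out_queue_size;
    }

In the stats Felix posts below, the broadcast link's send queue peaked around 130 against a window of 20, which is consistent with a queue that built up and then stopped draining.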
> > -----Original Message-----
> > From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> > Sent: Tuesday, August 03, 2010 1:06 AM
> > To: Suryanarayana Garlapati; tip...@li...
> > Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
> >
> > All,
> >
> > So I tried to disable both bearers and the broadcast-link is still up?? How is that possible?
> > From the stats, it appears that we can't send out any messages on the broadcast link. I tried, and nothing got sent, so it appears that the link is permanently congested with respect to sending messages. The broadcast link does appear to be receiving messages, though. When I enabled both bearers, the broadcast link did not recover.
> >
> > We have redundant links between nodes in an active/standby configuration, with one link having a priority of 10 and the other a priority of 9. How does the broadcast link choose a bearer? If we only had one bearer, would we not see this problem?
> >
> > bash-3.1# tipc-config -b
> > Bearers:
> > eth:bond0
> > eth:bond1
> > bash-3.1# tipc-config -bd eth:bond0
> > bash-3.1# tipc-config -bd eth:bond1
> > bash-3.1# tipc-config -ls=broadcast-link
> > Link statistics:
> > Link <broadcast-link>
> > Window:20 packets
> > RX packets:14407 fragments:0/0 bundles:39/62
> > TX packets:20 fragments:0/0 bundles:111/2314
> > RX naks:0 defs:0 dups:0
> > TX naks:0 acks:903 dups:0
> > Congestion bearer:0 link:0 Send queue max:131 avg:75
> >
> > bash-3.1# tipc-config -l
> > Links:
> > broadcast-link: up
> >
> > Thanks,
> > Felix
> >
> > -----Original Message-----
> > From: Nayman Felix-QA5535
> > Sent: Monday, August 02, 2010 10:14 AM
> > To: 'Suryanarayana Garlapati'; tip...@li...
> > Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
> >
> > Surya,
> >
> > Thanks for the quick response. Here are the broadcast link stats from that node:
> >
> > mpug@ATCA35_pl0_1:/usr/vob/mp/common/tools/linux$ tipc-config -ls=broadcast-link
> > Link statistics:
> > Link <broadcast-link>
> > Window:20 packets
> > RX packets:14375 fragments:0/0 bundles:39/62
> > TX packets:20 fragments:0/0 bundles:110/2301
> > RX naks:0 defs:0 dups:0
> > TX naks:0 acks:901 dups:0
> > Congestion bearer:0 link:0 Send queue max:130 avg:74
> >
> > Wouldn't I expect to see the link congestion counter at a non-zero value with a send queue max of 130? Yes, we are using redundant links between nodes, but with a different Ethernet bearer for each link.
> >
> > I assume that the name table distributions are sent over the broadcast link, and that is why you suspect this is the problem. Would there be any sign of trouble in the syslog when this happens (kernel messages from TIPC)? I'll look through the syslog for clues.
> >
> > Thanks,
> > Felix
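That assumption is the crux of the failure mode. TIPC distributes name table updates between nodes as NAME_DISTRIBUTOR messages, and if those messages ride the broadcast link in this version, as Felix assumes and the symptoms suggest, a permanently congested broadcast link silently halts name table propagation while ordinary unicast traffic keeps flowing. A conceptual sketch, reusing the illustrative link_model and link_congested from the sketch above; the names are simplified, not the actual tipc_name_distr.c code:

    /* Conceptual sketch: a publication that cannot be transmitted on the
     * broadcast link never reaches remote name tables, so the publisher's
     * names simply fail to appear elsewhere -- the symptom in this thread. */
    struct publication_item {
            unsigned int type;      /* e.g. 75                    */
            unsigned int lower;     /* low end of instance range  */
            unsigned int upper;     /* high end of instance range */
            unsigned int port_ref;  /* publishing port reference  */
    };

    static int distribute_publication(struct link_model *bclink,
                                      const struct publication_item *publ)
    {
            if (link_congested(bclink))
                    return -1;  /* update blocked; remote tables go stale */
            /* ... build a NAME_DISTRIBUTOR message for publ and queue it
             * on the broadcast link ... */
            bclink->out_queue_size++;
            return 0;
    }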
> > -----Original Message-----
> > From: Suryanarayana Garlapati [mailto:SGa...@go...]
> > Sent: Monday, August 02, 2010 9:09 AM
> > To: Nayman Felix-QA5535; tip...@li...
> > Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
> >
> > Hi Felix,
> > I suspect that broadcast link congestion is happening here. Please use the following patch on top of 1.7.6-RC3 and let me know whether it solves your problem.
> > By the way, are you using any redundant links for the bearer?
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > diff -uNr findings_tipc/net/tipc/tipc_bcast.c patch_tipc/net/tipc/tipc_bcast.c
> > --- findings_tipc/net/tipc/tipc_bcast.c	2008-10-01 23:27:21.000000000 +0530
> > +++ patch_tipc/net/tipc/tipc_bcast.c	2009-07-30 10:43:35.000000000 +0530
> > @@ -50,7 +50,7 @@
> >
> >  #define MAX_PKT_DEFAULT_MCAST 1500	/* bcast link max packet size (fixed) */
> >
> > -#define BCLINK_WIN_DEFAULT 20	/* bcast link window size (default) */
> > +#define BCLINK_WIN_DEFAULT 50	/* bcast link window size (default) */
> >
> >  #define BCLINK_LOG_BUF_SIZE 0
> >
> > diff -uNr findings_tipc/net/tipc/tipc_node.c patch_tipc/net/tipc/tipc_node.c
> > --- findings_tipc/net/tipc/tipc_node.c	2008-10-01 23:27:21.000000000 +0530
> > +++ patch_tipc/net/tipc/tipc_node.c	2009-07-30 10:45:21.000000000 +0530
> > @@ -313,7 +313,7 @@
> >  	for (i = 0; i < TIPC_MAX_BEARERS; i++) {
> >  		l_ptr = n_ptr->links[i];
> >  		if (l_ptr != NULL) {
> > -			l_ptr->reset_checkpoint = l_ptr->next_in_no;
> > +			l_ptr->reset_checkpoint = 1;
> >  			l_ptr->exp_msg_count = 0;
> >  			tipc_link_reset_fragments(l_ptr);
> >  		}
> > @@ -361,6 +361,8 @@
> >  		tipc_bclink_acknowledge(n_ptr, mod(n_ptr->bclink.acked + 10000));
> >  		tipc_bclink_remove_node(n_ptr->elm.addr);
> >  	}
> > +
> > +	memset(&n_ptr->bclink,0,sizeof(n_ptr->bclink));
> >
> >  #ifdef CONFIG_TIPC_MULTIPLE_LINKS
> >  	node_abort_link_changeover(n_ptr);
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > Regards
> > Surya
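Read as three separate changes, the patch does the following: it raises the default broadcast link window from 20 to 50, it pins reset_checkpoint to 1 instead of the last received sequence number when a node's links are reset, and it zeroes the node's broadcast-link state when contact with the node is lost. The third change is the one most directly aimed at the stuck-congestion symptom; a sketch of its intent, with simplified types that mirror the patch context but are not the actual tipc_node.h definitions:

    #include <string.h>

    /* Illustrative sketch of the memset in Surya's patch: when a peer is
     * declared down, discard the record of which broadcast packets it had
     * acknowledged. If stale state survives, the sender can keep waiting
     * for acks that will never come, the send queue never drains, and the
     * broadcast link reports permanent congestion -- matching Felix's
     * "TX packets:20 ... Send queue max:130" stats. */
    struct bclink_peer_state {
            unsigned int acked;    /* last bcast seqno this peer acked */
            unsigned int last_in;  /* last bcast seqno seen from peer  */
            /* ... deferred-queue and gap bookkeeping ... */
    };

    struct node_model {
            struct bclink_peer_state bclink;
    };

    static void node_lost_contact_model(struct node_model *n_ptr)
    {
            /* Forget everything the lost peer had (or had not) acked so it
             * cannot hold the broadcast send queue hostage. */
            memset(&n_ptr->bclink, 0, sizeof(n_ptr->bclink));
    }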
> > > -----Original Message-----
> > > From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> > > Sent: Monday, August 02, 2010 12:42 AM
> > > To: tip...@li...
> > > Subject: [tipc-discussion] TIPC Name table distribution problem in TIPC 1.7.6
> > >
> > > All,
> > >
> > > Just started running TIPC 1.7.6 on one of our lab systems, and I'm seeing a problem where it appears that one of the nodes in the system is not distributing its name table entries to the rest of the nodes in the system.
> > >
> > > As an example, I have a process with a TIPC name type of 75, and on the node with TIPC address 1.1.2 those entries are present:
> > >
> > > mpug@ATCA35_pl0_1:~$ tipc-config -nt=75
> > > Type       Lower      Upper      Port Identity          Publication  Scope
> > > 75         1          1          <1.1.3:1384022146>     1384022147   cluster
> > >                                  <1.1.7:282460177>      282460178    cluster
> > >                                  <1.1.6:2055446601>     2055446602   cluster
> > >                                  <1.1.5:2320122096>     2320122097   cluster
> > >                                  <1.1.4:83304688>       83304689     cluster
> > >                                  <1.1.2:1118560329>     1118560330   cluster
> > >                                  <1.1.1:2317705313>     2317705314   cluster
> > > <cut off for the sake of brevity>
> > >
> > > If I view the same entry on any other node in the system, I see the following:
> > >
> > > appadm@ATCA35_cm1:~$ tipc-config -nt=75
> > > Type       Lower      Upper      Port Identity          Publication  Scope
> > > 75         1          1          <1.1.3:1384022146>     1384022147   cluster
> > >                                  <1.1.7:282460177>      282460178    cluster
> > >                                  <1.1.6:2055446601>     2055446602   cluster
> > >                                  <1.1.5:2320122096>     2320122097   cluster
> > >                                  <1.1.4:83304688>       83304689     cluster
> > >                                  <1.1.1:2317705313>     2317705314   cluster
> > > <cut off for the sake of brevity>
> > >
> > > The entries for 1.1.2 are not there. In fact, none of the processes running on the 1.1.2 node are visible in the TIPC name table outside of that node. Therefore, we cannot send any messages to that node. (We use the topology service to verify that a TIPC name/domain combination we're sending to is available.) I've currently left the system in this state. Any idea on why this could be happening? Or what we can look for to debug this problem? We were not seeing this problem with 1.5.12.
> > >
> > > Thanks,
> > >
> > > Felix
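Felix's parenthetical about the topology service also points at a useful way to watch for this failure directly: subscribe to the name type on each node and note which publications never arrive. A minimal, self-contained sketch using the documented topology server API from <linux/tipc.h>; type 75 mirrors Felix's example, and the subscription fields are passed in host byte order as in the classic TIPC demo programs:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    int main(void)
    {
            struct sockaddr_tipc topsrv;
            struct tipc_subscr subscr;
            struct tipc_event event;
            int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

            if (sd < 0) {
                    perror("socket");
                    return 1;
            }

            memset(&topsrv, 0, sizeof(topsrv));
            topsrv.family = AF_TIPC;
            topsrv.addrtype = TIPC_ADDR_NAME;
            topsrv.addr.name.name.type = TIPC_TOP_SRV;   /* topology server */
            topsrv.addr.name.name.instance = TIPC_TOP_SRV;

            if (connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0) {
                    perror("connect");
                    return 1;
            }

            memset(&subscr, 0, sizeof(subscr));
            subscr.seq.type = 75;               /* name type from Felix's example */
            subscr.seq.lower = 1;
            subscr.seq.upper = 1;
            subscr.timeout = TIPC_WAIT_FOREVER;
            subscr.filter = TIPC_SUB_SERVICE;   /* report availability changes */

            if (send(sd, &subscr, sizeof(subscr), 0) != sizeof(subscr)) {
                    perror("send");
                    return 1;
            }

            /* On a healthy cluster a TIPC_PUBLISHED event for {75,1} arrives
             * from every node, including 1.1.2; a node whose publication never
             * shows up here is failing to distribute its name table. */
            while (recv(sd, &event, sizeof(event), 0) == sizeof(event)) {
                    printf("%s: {%u,%u} port <%u:%u>\n",
                           event.event == TIPC_PUBLISHED ? "published" : "withdrawn",
                           event.s.seq.type, event.s.seq.lower,
                           event.port.node, event.port.ref);
            }
            return 0;
    }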