From: Jon M. <jon...@er...> - 2010-08-24 17:18:02
Hi Felix,

I think the problem you are encountering is a different bug, related to dual links, one which was identified by Laser last month. Apparently the following patch solved that problem, so you should try it.

Regards
///Jon

Hi Laser,

I think your solution would work, but it looks a little complicated. I would suggest you try something like this:

+		/* Synchronize broadcast link information */
+
+		if (!tipc_node_is_up(l_ptr->owner)) {
+			l_ptr->owner->bclink.last_sent =
+				l_ptr->owner->bclink.last_in =
+				msg_last_bcast(msg);
+			l_ptr->owner->bclink.oos_state = 0;
+		}
+
 		link_state_event(l_ptr, msg_type(msg));

 		l_ptr->peer_session = msg_session(msg);
 		l_ptr->peer_bearer_id = msg_bearer_id(msg);

-		/* Synchronize broadcast link information */
-
-		if (!tipc_node_has_redundant_links(l_ptr->owner)) {
-			l_ptr->owner->bclink.last_sent =
-				l_ptr->owner->bclink.last_in =
-				msg_last_bcast(msg);
-			l_ptr->owner->bclink.oos_state = 0;
-		}
 		break;
 	case STATE_MSG:

This would ensure that the synchronization only happens when there is no working link at all, and we are just about to go to one working link. This is the only case we need to care about.

Regards
///jon

Jon Maloy M.Sc. EE
Researcher
Ericsson Canada
Broadband and Systems Research
8400 Decarie, H4P 2N2, Montreal, Quebec, Canada
Phone + 1 514 345-7900 x42056
Mobile + 1 514 591-5578
jon...@er...
www.ericsson.com

> -----Original Message-----
> From: Laser [mailto:got...@gm...]
> Sent: July-16-10 10:54
> To: tipc-discussion
> Subject: Re: [tipc-discussion] Issue with TIPC broadcast packets
>
> Hi,
>
> I suspect that there is an issue with how the "synchronization of broadcast links" is done. Please find the proposed fix and the explanation below. Let me know your comments.
>
> The fix is in tipc_link.c (/vxworks-6.3/target/src/tipc/tipc_link.c):
>
> -	/* Synchronize broadcast link information */
> -
> -	if (!tipc_node_has_redundant_links(l_ptr->owner)) {
> +	/* Synchronize broadcast link information,
> +	 * 1) When the link owner has no redundant links
> +	 * 2) There is a single link up and the current link's
> +	 *    state is not RESET_RESET.
> +	 */
> +
> +	if (!tipc_node_has_redundant_links(l_ptr->owner) &&
> +	    !((l_ptr->owner->working_links == 1) &&
> +	      (l_ptr->state == RESET_RESET))) {
>
> In the existing code, the synchronization of broadcast link info is done when there are no redundant working links (i.e. number of working links <= 1). In our case the link over Ethernet was working and the link over HDLC was being set up. This matched the condition and the link info got synchronized (last_in set to 0 from 1). But this should not have happened. So the code was modified to avoid the scenario where the number of working links is 1 and the current link is in RESET_RESET state.
>
> Whenever a link receives a RESET packet it goes to RESET_RESET state, so the fix takes care of RESET packets in all states. Had we received an ACTIVATE message (as there is a fall-through for RESET and ACTIVATE in the code), the state of the current link would be WORKING_WORKING and the number of links would be two (Ethernet and HDLC).
>
> Even if HDLC were the only link available and running, an old RESET packet received after a broadcast packet has been received would still get dropped, as the packet's session id would be the same as the session id in the link:
>
>	if (less_eq(msg_session(msg), l_ptr->peer_session)) {
>		/* dbg("Duplicate RESET-set-break: %u<->%u\n",
>		       msg_session(msg), l_ptr->peer_session); */
>		break;	/* duplicate: ignore */
>
> Regards,
> Laser
>
> On Thu, Jul 15, 2010 at 9:03 PM, Laser <got...@gm...> wrote:
> > Hi,
> >
> > We are using TIPC 1.7.6 in VxWorks 6.3. We are facing an issue with TIPC packets over broadcast links. The setup has 2 units, say A and B, running TIPC over Ethernet and HDLC bearers. There are a few other units connected to A and B, running TIPC over HDLC.
> >
> > The sequence of events is:
> > 1. TIPC is enabled over Ethernet and HDLC in A and B. The link priority of Ethernet is configured higher than HDLC.
> > 2. A TIPC session gets established between A and B over Ethernet.
> > 3. A creates a port and binds it with an address in TIPC_ZONE_SCOPE. This is advertised to all units by a broadcast packet.
> > 4. A sends a STATE packet to B with "last sent broadcast" as 1, and B acknowledges the broadcast packet in a STATE packet to A. This happens twice.
> > 5. B now sends a TIPC BCAST packet (user type 5) to A requesting the broadcast packet with sequence number 1. This happens every time A sends a STATE to B.
> > 6. A TIPC session gets established between the two units over HDLC.
> > 7. A does not send any broadcast packet to B, and B continues sending TIPC BCAST packets. The reason would be that A has already discarded that broadcast packet, as B acknowledged it earlier.
> > 8. Later, the application in A sends a TIPC multicast packet to all other units.
> > 9. B continues to send TIPC BCAST packets. For every such packet A now responds with 2 copies of the TIPC multicast packet sent at step 8.
> > 10. This goes on for a while until the retransmit packet count in A reaches 100, and then A resets the link with B.
> >
> > The issue is at step 5. On analysis, it was found that a RESET packet is received by B from A some time after step 4, and this resets the last_in counter in B from 1 to zero. The code snippet around tipc_link.c line #2302 is given below:
> >
> >	/* Synchronize broadcast link information */
> >
> >	if (!tipc_node_has_redundant_links(l_ptr->owner)) {
> >		l_ptr->owner->bclink.last_sent =
> >			l_ptr->owner->bclink.last_in =
> >			msg_last_bcast(msg);
> >		l_ptr->owner->bclink.oos_state = 0;
> >	}
> >	break;
> >
> > Here we suspect that the RESET packet was sent by A before step 3, over HDLC, and because of congestion in the HDLC medium the packet arrived late. This packet has the "bcast packets sent" counter as zero, and B also sets last_in to 0 (this is verified by a debug print). So B starts sending TIPC BCAST packets. At step 8, A sends a multicast packet (another packet over the broadcast link), and as B has not "received" the first packet itself (last_in is zero), it defers it. A retransmits the packet 100 times and resets the link.
> >
> > I would like to know how to handle this issue. Supposing that the congestion over HDLC may not be avoided, could we keep track of a timestamp in the RESET packets and discard packets with stale timestamps? Are there any alternatives? Please clarify.
> >
> > Thanks in Advance,
> > Laser
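A note on the duplicate-RESET check quoted above: less_eq() compares the link session numbers with wrap-around (serial-number) arithmetic, so a late or replayed RESET carrying an old session id is recognized as "not newer" and ignored. The sketch below is illustrative only and is not the TIPC source; it assumes 16-bit session values and a simple serial-number comparison.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch: "left" counts as not newer than "right"
 * when their difference, taken modulo 2^16, is small. */
static bool less_eq(uint16_t left, uint16_t right)
{
	return (uint16_t)(right - left) < 0x8000;
}

/* A RESET whose session id is not newer than the session already
 * recorded for the peer is treated as a stale duplicate and dropped,
 * so it cannot re-synchronize the broadcast-link counters the way
 * Laser describes. */
static bool reset_is_stale(uint16_t msg_session, uint16_t peer_session)
{
	return less_eq(msg_session, peer_session);
}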
Jon Maloy M.Sc. EE
Researcher
Ericsson Canada
Broadband and Systems Research
8400 Decarie, H4P 2N2, Montreal, Quebec, Canada
Phone + 1 514 345-7900 x42056
Mobile + 1 514 591-5578
jon...@er...
www.ericsson.com

> -----Original Message-----
> From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> Sent: August-24-10 12:39
> To: Stephens, Allan; tip...@li...
> Subject: Re: [tipc-discussion] TIPC Name table distribution problem in TIPC 1.7.6
>
> Al,
>
> So we ran the test you suggested, to just use "Al's patch without Surya's", and we were able to reproduce our name table distribution problem.
>
> From reading your post and the comment about link changeover, I had a question. Are you saying that this problem is related to having redundant links? We recently (2 months ago) changed to use redundant links in an active/standby configuration. This coincided in time with us switching to TIPC 1.7.6 and seeing all of the problems that we've recently seen. So I'm wondering: if we went back to using only 1 link between nodes instead of 2, do you think that our name table distribution problem will go away?
>
> Thanks,
> Felix
>
> -----Original Message-----
> From: Stephens, Allan [mailto:all...@wi...]
> Sent: Monday, August 16, 2010 1:07 PM
> To: Nayman Felix-QA5535; Suryanarayana Garlapati; tip...@li...
> Subject: RE: [tipc-discussion] TIPC Name table distribution problem in TIPC 1.7.6
>
> Hi Felix:
>
> I'm finally back from vacation and can respond to your recent emails (at least to some degree).
>
> I'd be interested in knowing whether you experience any problems if you try running TIPC 1.7.7-rc1 with Al's patch (presumably the one entitled "Prevent broadcast link stalling in dual LAN environment") but not Surya's patch. From the data you supplied below it is conceivable that this might be sufficient to resolve your issue. I did incorporate a portion of Surya's patch into the TIPC 1.7 stream, but I haven't yet brought in the part that changes the default link window size or which resets the link's reset_checkpoint when a link changeover is aborted; the delay is because I'm not yet convinced that either of these changes is necessary and/or correct. If you encounter problems when Surya's changes are missing, this would provide evidence that they are actually needed.
>
> I don't have any guidance to provide on what link window size values should be used for either the unicast links or for the broadcast link; I suspect that the value will depend heavily on the type of hardware you're running in your system and the nature of the traffic load you're passing between nodes, and that you'll need to experiment to see what values work best for your system. It looks like Peter was running his traffic over high-speed Ethernet interfaces and found that he needed to increase his unicast window size to prevent the links from declaring congestion prematurely; presumably the larger window sizes helped improve his link throughput values. I've got no idea where the broadcast link window size of 224 came from; as with the unicast links, you're probably best off to experiment to see what values work best in your system.
>
> I'm continuing to investigate the entire broadcast link and dual link areas to see what issues remain unresolved, as there are still some known issues that look to be problematic. I suspect that there will be a few more patches added to TIPC 1.7 before things are totally stabilized.
>
> Regards,
> Al
>
> > -----Original Message-----
> > From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> > Sent: Sunday, August 15, 2010 10:41 PM
> > To: Suryanarayana Garlapati; tip...@li...
> > Subject: Re: [tipc-discussion] TIPC Name table distribution problem in TIPC 1.7.6
> >
> > All,
> >
> > So we've run a number of tests:
> > 1) TIPC 1.7.6 - we were able to continuously reproduce the issue.
> > 2) Surya's patch on top of 1.7.6. Result: We could not reproduce the issue.
> > 3) Surya's patch on top of 1.7.6 with the window size set to 20 instead of 50. Result: We were able to reproduce the issue. This means that, at least for our test run, the window size change was the main reason why we didn't see the name table distribution problem.
> > 4) Surya's and Al's patch on top of 1.7.6. Result: We could not reproduce the issue.
> > 5) Surya's and Al's patch on top of 1.7.6 with the window size set to 20 instead of 50. Result: We could not reproduce the issue.
> > 6) TIPC 1.7.7-rc1 with Surya's and Al's patch. Result: We could not reproduce the issue.
> >
> > So it looks like a combination of Al's patch and Surya's patch seems to prevent our name table distribution problem from happening no matter if the window size is 20 or 50, but Surya's patch alone does not work when we reduce the window size down to 20.
> >
> > I saw that in TIPC 1.7.6 support for window sizes as high as 8192 was added. We are considering increasing our window size for both the broadcast link and unicast links from the current default size of 50. I'm wondering if there is a recommended maximum value. From Peter Litov's post it appears that he's tried 1024 for the unicast link window size. Are there any issues with this value? Has it been found to improve throughput and alleviate congestion issues? Are there any drawbacks or side effects? He mentions a broadcast link window size of 224. I'm not sure why 224 is a magic number, but are there any recommendations for the broadcast link window size?
> >
> > Thanks for any feedback,
> > Felix
> >
> > -----Original Message-----
> > From: Suryanarayana Garlapati [mailto:SGa...@go...]
> > Sent: Tuesday, August 03, 2010 12:45 AM
> > To: Nayman Felix-QA5535; tip...@li...
> > Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
> >
> > Hi Felix,
> > This is an issue of broadcast link congestion only. Please try my patch and let me know whether it works. I had faced a similar issue, and with my patch the issue was fixed.
> >
> > Regards
> > Surya
> >
> > > -----Original Message-----
> > > From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> > > Sent: Tuesday, August 03, 2010 1:06 AM
> > > To: Suryanarayana Garlapati; tip...@li...
> > > Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
> > >
> > > All,
> > >
> > > So I tried to disable both bearers and the broadcast-link is still up?? How is that possible? From the stats, it appears that we can't send out any messages on the broadcast link. I tried and nothing got sent, so it appears that the link is permanently congested with respect to sending messages.
> > > The broadcast link does appear to be receiving messages though. When I enabled both bearers the broadcast link did not recover.
> > >
> > > We have redundant links between nodes in an active/standby configuration, with 1 link having a priority of 10 and the other a priority of 9. How does the broadcast link choose a bearer? If we only had one bearer would we not see this problem?
> > >
> > > bash-3.1# tipc-config -b
> > > Bearers:
> > >     eth:bond0
> > >     eth:bond1
> > > bash-3.1# tipc-config -bd eth:bond0
> > > bash-3.1# tipc-config -bd eth:bond1
> > > bash-3.1# tipc-config -ls=broadcast-link
> > > Link statistics:
> > > Link <broadcast-link>
> > >   Window:20 packets
> > >   RX packets:14407 fragments:0/0 bundles:39/62
> > >   TX packets:20 fragments:0/0 bundles:111/2314
> > >   RX naks:0 defs:0 dups:0
> > >   TX naks:0 acks:903 dups:0
> > >   Congestion bearer:0 link:0  Send queue max:131 avg:75
> > >
> > > bash-3.1# tipc-config -l
> > > Links:
> > >     broadcast-link: up
> > >
> > > Thanks,
> > > Felix
> > >
> > > -----Original Message-----
> > > From: Nayman Felix-QA5535
> > > Sent: Monday, August 02, 2010 10:14 AM
> > > To: 'Suryanarayana Garlapati'; tip...@li...
> > > Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
> > >
> > > Surya,
> > >
> > > Thanks for the quick response. Here are the broadcast link stats from that node:
> > >
> > > mpug@ATCA35_pl0_1:/usr/vob/mp/common/tools/linux$ tipc-config -ls=broadcast-link
> > > Link statistics:
> > > Link <broadcast-link>
> > >   Window:20 packets
> > >   RX packets:14375 fragments:0/0 bundles:39/62
> > >   TX packets:20 fragments:0/0 bundles:110/2301
> > >   RX naks:0 defs:0 dups:0
> > >   TX naks:0 acks:901 dups:0
> > >   Congestion bearer:0 link:0  Send queue max:130 avg:74
> > >
> > > Wouldn't I expect to see link congestion set to a non-zero value with a send queue max size of 130? Yes, we are using redundant links between nodes, but with different Ethernet bearers for each link.
> > >
> > > I assume that the name table distributions are sent over the broadcast link, and that is why you suspect that this is the problem. Would there be any sign of trouble in the syslog when this happens (kernel messages from TIPC)? I'll look through the syslog for clues.
> > >
> > > Thanks,
> > > Felix
> > >
> > > -----Original Message-----
> > > From: Suryanarayana Garlapati [mailto:SGa...@go...]
> > > Sent: Monday, August 02, 2010 9:09 AM
> > > To: Nayman Felix-QA5535; tip...@li...
> > > Subject: RE: TIPC Name table distribution problem in TIPC 1.7.6
> > >
> > > Hi Felix,
> > > I suspect that broadcast link congestion is happening here. Please use the following patch on top of 1.7.6-RC3 and let me know whether it solves your problem.
> > > By the way, are you using any redundant links for the bearer?
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > +++++++++ > > > + > > > +++++++++++++++++++++ > > > diff -uNr findings_tipc/net/tipc/tipc_bcast.c > > > patch_tipc/net/tipc/tipc_bcast.c > > > --- findings_tipc/net/tipc/tipc_bcast.c 2008-10-01 > > 23:27:21.000000000 > > > +0530 > > > +++ patch_tipc/net/tipc/tipc_bcast.c 2009-07-30 > > 10:43:35.000000000 > > > +0530 > > > @@ -50,7 +50,7 @@ > > > > > > #define MAX_PKT_DEFAULT_MCAST 1500 /* bcast link > max packet size > > > (fixed) */ > > > > > > -#define BCLINK_WIN_DEFAULT 20 /* bcast link window > > size > > > (default) */ > > > +#define BCLINK_WIN_DEFAULT 50 /* bcast link window > > size > > > (default) */ > > > > > > #define BCLINK_LOG_BUF_SIZE 0 > > > > > > diff -uNr findings_tipc/net/tipc/tipc_node.c > > > patch_tipc/net/tipc/tipc_node.c > > > --- findings_tipc/net/tipc/tipc_node.c 2008-10-01 > > 23:27:21.000000000 > > > +0530 > > > +++ patch_tipc/net/tipc/tipc_node.c 2009-07-30 > 10:45:21.000000000 > > > +0530 > > > @@ -313,7 +313,7 @@ > > > for (i = 0; i < TIPC_MAX_BEARERS; i++) { > > > l_ptr = n_ptr->links[i]; > > > if (l_ptr != NULL) { > > > - l_ptr->reset_checkpoint = l_ptr->next_in_no; > > > + l_ptr->reset_checkpoint = 1; > > > l_ptr->exp_msg_count = 0; > > > tipc_link_reset_fragments(l_ptr); > > > } > > > @@ -361,6 +361,8 @@ > > > tipc_bclink_acknowledge(n_ptr, > > mod(n_ptr->bclink.acked + 10000)); > > > tipc_bclink_remove_node(n_ptr->elm.addr); > > > } > > > + > > > + memset(&n_ptr->bclink,0,sizeof(n_ptr->bclink)); > > > > > > #ifdef CONFIG_TIPC_MULTIPLE_LINKS > > > node_abort_link_changeover(n_ptr); > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > +++++++++ > > > + > > > +++++++++++++++++++++ > > > > > > > > > > > > Regards > > > Surya > > > > > > > -----Original Message----- > > > > From: Nayman Felix-QA5535 [mailto:Fel...@mo...] > > > > Sent: Monday, August 02, 2010 12:42 AM > > > > To: tip...@li... > > > > Subject: [tipc-discussion] TIPC Name table distribution > problem in > > > TIPC > > > > 1.7.6 > > > > > > > > All, > > > > > > > > > > > > > > > > Just started running TIPC 1.7.6 on one of our lab > systems and I'm > > > > seeing a problem where it appears that one of the nodes in the > > > > system is > > not > > > > distributing its name table entries to the rest of the > > nodes in the > > > > system. 
> > > > As an example, I have a process with a tipc name type of 75, and on the node with a tipc address of 1.1.2 those entries are present:
> > > >
> > > > mpug@ATCA35_pl0_1:~$ tipc-config -nt=75
> > > > Type       Lower      Upper      Port Identity            Publication  Scope
> > > > 75         1          1          <1.1.3:1384022146>       1384022147   cluster
> > > >                                  <1.1.7:282460177>        282460178    cluster
> > > >                                  <1.1.6:2055446601>       2055446602   cluster
> > > >                                  <1.1.5:2320122096>       2320122097   cluster
> > > >                                  <1.1.4:83304688>         83304689     cluster
> > > >                                  <1.1.2:1118560329>       1118560330   cluster
> > > >                                  <1.1.1:2317705313>       2317705314   cluster
> > > > <cut off for the sake of brevity>
> > > >
> > > > If I view the same entry on any other node in the system, I see the following:
> > > >
> > > > appadm@ATCA35_cm1:~$ tipc-config -nt=75
> > > > Type       Lower      Upper      Port Identity            Publication  Scope
> > > > 75         1          1          <1.1.3:1384022146>       1384022147   cluster
> > > >                                  <1.1.7:282460177>        282460178    cluster
> > > >                                  <1.1.6:2055446601>       2055446602   cluster
> > > >                                  <1.1.5:2320122096>       2320122097   cluster
> > > >                                  <1.1.4:83304688>         83304689     cluster
> > > >                                  <1.1.1:2317705313>       2317705314   cluster
> > > > <cut off for the sake of brevity>
> > > >
> > > > The entries for 1.1.2 are not there. In fact, none of the processes running on the 1.1.2 node are visible in the tipc name table outside of that node. Therefore, we cannot send any messages to that node (we use the topology service to verify that a tipc name, domain combo that we're sending to is available). I've currently left the system in this state. Any idea on why this could be happening? Or what we can look for to debug this problem? We were not seeing this problem with 1.5.12.
> > > >
> > > > Thanks,
> > > > Felix
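A note on the availability check Felix mentions ("we use the topology service to verify that a tipc name, domain combo that we're sending to is available"): below is a minimal Linux sketch of such a check. It is illustrative only and is not code from this thread; the name type 75 and instance 1 are taken from the listings above, the 5000 ms timeout is an arbitrary assumption, and the structures come from <linux/tipc.h>.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Illustrative only: ask the TIPC topology server whether name
 * {type 75, instance 1} is currently published anywhere, waiting
 * at most 5 seconds for a TIPC_PUBLISHED event. */
int main(void)
{
	struct sockaddr_tipc topsrv;
	struct tipc_subscr subscr;
	struct tipc_event event;
	int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

	if (sd < 0) {
		perror("socket");
		return 1;
	}

	/* Connect to the built-in topology server {TIPC_TOP_SRV, TIPC_TOP_SRV} */
	memset(&topsrv, 0, sizeof(topsrv));
	topsrv.family = AF_TIPC;
	topsrv.addrtype = TIPC_ADDR_NAME;
	topsrv.addr.name.name.type = TIPC_TOP_SRV;
	topsrv.addr.name.name.instance = TIPC_TOP_SRV;

	if (connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0) {
		perror("connect to topology server");
		return 1;
	}

	/* Subscribe to publications of name {75, 1..1} */
	memset(&subscr, 0, sizeof(subscr));
	subscr.seq.type = 75;		/* name type from the listings above */
	subscr.seq.lower = 1;
	subscr.seq.upper = 1;
	subscr.timeout = 5000;		/* ms; assumption, tune as needed */
	subscr.filter = TIPC_SUB_SERVICE;

	if (send(sd, &subscr, sizeof(subscr), 0) != sizeof(subscr)) {
		perror("send subscription");
		return 1;
	}

	/* A TIPC_SUBSCR_TIMEOUT event here would mean no matching
	 * publication appeared, e.g. because a node never distributed it. */
	if (recv(sd, &event, sizeof(event), 0) != sizeof(event)) {
		perror("recv event");
		return 1;
	}

	if (event.event == TIPC_PUBLISHED)
		printf("name {75,1} is available\n");
	else
		printf("name {75,1} not seen (event %u)\n", event.event);

	close(sd);
	return 0;
}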