Re: [Aoetools-discuss] Use of rtt in AOE protocol
From: Sam H. <sa...@co...> - 2009-11-16 15:50:40
I'm going to break this down so others can digest the conversation. I'm not being condescending by explaining the basics; I just want this to be well understood.

The current (aoe6-73) driver implements Van Jacobson's congestion avoidance (CA) algorithm. Prior to this we tried several different custom approaches (including a dynamic min timer), but none of them performed as well as CA in the majority of cases.

The role of the retransmit algorithm is to estimate with high certainty that a packet has been lost. It does this by using past round trip times as an estimate of the future. As it's only an estimate, both false negatives (waiting too long to retransmit) and false positives (retransmitting too soon) are possible. An algorithm's efficacy is gauged by these metrics.

I believe you're suggesting that the existence of false positives indicates a problem, but that isn't necessarily true. The algorithm must adapt to accommodate changing network and load conditions and can, as a result, occasionally be wrong. Eliminating false positives is ideal, but not generally possible.

I originally added a min timer for the exact reason you did. We saw a benefit and it lived in the driver for a while, but it wasn't until last year that we realized it was helping because we weren't managing our outstanding window properly. The min timer has the downside that it prevents the algorithm from responding quickly to packet loss. Set it sufficiently high and you end up hurting throughput in loss situations (because you wait too long to retransmit). It was tossed when we implemented CA because we now manage the command window better, making false retransmits cost less.

This isn't to say that CA is the best we can do; it's just the best we can do right now. We've discussed doing finer-grained estimation by, e.g., separating fast responses (cache access) from long ones (disk access), as well as considering reads/writes separately.
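To make the estimation idea concrete, here is a minimal sketch of a round-trip-time estimator in the style usually attributed to Van Jacobson. It is the textbook formulation (smoothed RTT plus a multiple of the smoothed deviation), not the aoe driver's actual code; the class name, the `is_timed_out` helper, and the gain values 1/8, 1/4 and 4 are assumptions taken from the common TCP description.

```python
class RttEstimator:
    """Textbook Jacobson-style RTT estimator (illustrative, not the aoe driver)."""

    def __init__(self, alpha=1 / 8, beta=1 / 4, k=4):
        self.alpha = alpha    # gain for the smoothed RTT (assumed value)
        self.beta = beta      # gain for the smoothed deviation (assumed value)
        self.k = k            # deviation multiplier for the timeout (assumed value)
        self.srtt = None      # smoothed round trip time
        self.rttvar = None    # smoothed mean deviation

    def update(self, sample):
        """Fold one measured round trip time into the estimate; return the timeout."""
        if self.srtt is None:
            # First measurement seeds both values.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            err = sample - self.srtt
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(err)
            self.srtt = self.srtt + self.alpha * err
        return self.timeout()

    def timeout(self):
        """Retransmit timeout: past RTTs used as a prediction of the next one."""
        return self.srtt + self.k * self.rttvar


def is_timed_out(est, elapsed):
    """True if a request outstanding for `elapsed` would be retransmitted."""
    return elapsed > est.timeout()


if __name__ == "__main__":
    est = RttEstimator()
    # A run of fast (cache-hit style) responses, in milliseconds.
    for _ in range(20):
        est.update(1.0)
    print("timeout after fast responses:", est.timeout())
    # A slow (disk-bound style) response now exceeds the timeout even though
    # nothing was lost: a false positive in the sense used above.
    print("10 ms response looks lost:", is_timed_out(est, 10.0))
    est.update(10.0)
    print("timeout after adapting:", est.timeout())
```

The demo shows why mixed cache and disk response times can trigger retransmits without any packet loss, and why the estimate recovers once it has seen the slower sample.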
Multipathing could be handled more intelligently. There are ways to confuse the algorithm as it stands, but it performs very well in common configurations.

If you'd like to test/tune, you need a lossy network as well as one that performs well. Tally the total successful transactions (command+response) along with the false retransmissions and balance that with throughput analysis (the best gauge of false negatives). Tweak, retest, analyze, repeat. We've discovered that there's always a way to tweak and optimize one case, but it usually has an adverse effect on something else.

A long-term goal of ours has been to come up with a suite of tests and hardware setups that we could use to judge various algorithm attempts. One day we may be able to do that, but for now CA is pretty darn good.

Cheers,
Sam

> Hi all,
>
> I notice there is use of a network congestion algorithm in the AOE driver
> which tries to limit the number of outstanding requests on the wire. An
> implementation like this makes a lot of sense for congested networks and
> cases where more than one host may be accessing a particular target.
>
> However, I am seeing a lot of cases (depending on the load) where this
> algorithm is being confused by the time it takes for the response to come
> back from the target. Since there is no network-level ACK in the AOE
> protocol, the actual time it takes for a response to come back depends
> entirely on how long it gets wedged in the target buffer queues, disk
> speed, etc. A request that happens to hit cache can come back very
> quickly, while other requests can be quite slow.
>
> I am not sure how to fix this. I wrote a simple patch to the Linux AOE
> driver to specify "aoe_minout", and saw a little bit of a concurrent
> latency improvement, but overall throughput wasn't noticeably changed
> (based on not so accurate munin graphs).
> However, with this or the stock
> driver, /dev/etherd/err shows a lot of "retransmit" and matching
> "unexpected rsp" errors, which are probably not helping performance if
> this is actually resulting in multiple rewrites to disk.
>
> It would seem to me that the best solution to this would be to actually
> have ACKs in the protocol, but that would be a pretty drastic change.
> Hmm... Comments? Ideas?
>
> Simon-
>
> Example /dev/etherd/err snippet, which scrolls at about 15 lines per
> second during a typical backup run:
>
> retransmit e6.1 oldtag=3ac07753@12332781a newtag=3cae781a s=0015c5e92481 d=003048d6026e nout=25
> retransmit e6.1 oldtag=3ac37753@12332781a newtag=3caf781a s=0015c5e92481 d=003048d6026e nout=26
> retransmit e6.1 oldtag=3ac17753@12332781a newtag=3cb0781a s=0015c5e92481 d=003048d6026e nout=26
> retransmit e6.1 oldtag=3ac47753@12332781a newtag=3cb1781a s=0015c5e92481 d=003048d6026e nout=27
> retransmit e6.1 oldtag=3ac27753@12332781b newtag=3cb2781b s=0015c5e92481 d=003048d6026e nout=27
> retransmit e6.1 oldtag=3ac57756@12332781b newtag=3cb3781b s=0015c5e92481 d=003048d6026e nout=28
> retransmit e6.1 oldtag=3cb0781a@1233278fc newtag=3d9b78fc s=0015c5e92481 d=003048d6026e nout=15
> retransmit e6.1 oldtag=3cb3781b@1233278fc newtag=3d9c78fc s=0015c5e92481 d=003048d6026e nout=16
> retransmit e6.1 oldtag=3cae781a@123327906 newtag=3d9d7906 s=0015c5e92481 d=003048d6026e nout=16
> retransmit e6.1 oldtag=3cb1781a@123327906 newtag=3d9e7906 s=0015c5e92481 d=003048d6026e nout=17
> retransmit e6.1 oldtag=3caf781a@123327906 newtag=3d9f7906 s=0015c5e92481 d=003048d6026e nout=17
> retransmit e6.1 oldtag=3cb2781b@123327912 newtag=3da07912 s=0015c5e92481 d=003048d6026e nout=17
> unexpected rsp e6.1 tag=3ac47753@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3ac37753@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3ac57756@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3caf781a@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3cb1781a@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3cb3781b@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3cae781a@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3ac17753@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3cb0781a@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3ac07753@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3ac27753@1233279b3 s=003048d6026e d=0015c5e92481
> unexpected rsp e6.1 tag=3cb2781b@1233279b3 s=003048d6026e d=0015c5e92481
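For anyone wanting to tally false retransmissions from an err log like the one quoted above, the following is a rough sketch. It assumes the line layout shown in the excerpt, and it assumes that an "unexpected rsp" whose tag matches the oldtag of an earlier "retransmit" line is a late response to a request that was not actually lost. Both the interpretation and the `tally` function name are illustrative assumptions, not something defined by the driver.

```python
import re

# Patterns follow the line layout seen in the quoted excerpt (an assumption
# about the format in general).
RETRANSMIT = re.compile(r"retransmit \S+ oldtag=([0-9a-f]+)@\S+ newtag=([0-9a-f]+)")
UNEXPECTED = re.compile(r"unexpected rsp \S+ tag=([0-9a-f]+)@")


def tally(lines):
    """Count retransmits, unexpected responses, and likely false retransmits."""
    retransmits = 0
    unexpected = 0
    old_tags = set()    # tags of requests that were given up on and resent
    late_tags = set()   # tags of responses that arrived with no matching request
    for line in lines:
        m = RETRANSMIT.search(line)
        if m:
            retransmits += 1
            old_tags.add(m.group(1))
            continue
        m = UNEXPECTED.search(line)
        if m:
            unexpected += 1
            late_tags.add(m.group(1))
    return {
        "retransmits": retransmits,
        "unexpected": unexpected,
        # A resent request whose original answer still showed up later was
        # probably slow rather than lost (assumed interpretation).
        "likely_false_retransmits": len(old_tags & late_tags),
    }


if __name__ == "__main__":
    sample = [
        "retransmit e6.1 oldtag=3ac07753@12332781a newtag=3cae781a "
        "s=0015c5e92481 d=003048d6026e nout=25",
        "retransmit e6.1 oldtag=3cae781a@123327906 newtag=3d9d7906 "
        "s=0015c5e92481 d=003048d6026e nout=16",
        "unexpected rsp e6.1 tag=3ac07753@1233279b3 s=003048d6026e d=0015c5e92481",
    ]
    print(tally(sample))
```

These counts only cover the false-positive side; as noted in the reply, they need to be weighed against throughput measurements, which are the better gauge of false negatives.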