#6 RX overrun errors ignored

Status: open
Owner: None
Priority: 5
Updated: 2012-05-08
Created: 2006-02-15
Creator: Brian Behlendorf
Private: No

Hi Folks,

I recently uncovered another issue which I thought
would be of interest to you. While trying to get to
the bottom of a mysterious networking issue, I
discovered that the e1000 driver does not track RX
descriptor overflow errors. So I crafted a patch to
ensure stats for those errors were tracked and re-ran
my tests. Long story short, I was seeing boatloads of
them, and simply increasing the RX descriptor ring size
resolved the issue.

Since the driver wasn't tracking these stats, ethtool
always reported zero for rx_over_errors, which meant it
took me quite a while to suspect this and run the issue
to ground. Can we get the attached patch, or a version
of it, into the next release?
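
For illustration, the gist of the change is roughly the
following (a sketch only, not the attached patch itself; the
macro and field names are approximations of the e1000 driver's
and may not match your tree exactly):

    /* In the interrupt handler, after reading the interrupt
     * cause register: if the hardware reports a receiver
     * overrun (RXO), count it so it shows up in
     * rx_over_errors instead of silently staying at zero. */
    u32 icr = E1000_READ_REG(&adapter->hw, ICR);

    if (icr & E1000_ICR_RXO)        /* RX descriptor ring overrun */
            adapter->net_stats.rx_over_errors++;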

Discussion

  • Tracks RX overrun stats

     
  • Andreas Dilger
    2006-02-15


    As a secondary goal, it might be nice to have the driver
    increase the size of the RX descriptor ring buffer (to some
    upper limit) if this is happening, instead of requiring it
    to be increased manually (which can only be done if the
    module is unloaded and reloaded).

     
  • user_id=631160

    Okay, thanks for the report and the patch. The reasons I
    don't like this patch are:
    a) it puts more code in the interrupt handler, which is a
    performance issue;
    b) this functionality is already there: when the RNBC
    (Receive No Buffer Count) counter goes up, it implies you
    need to increase your allocation. RNBC does not imply that
    you've dropped packets unless rx_missed_errors goes up too.
    W.r.t. the second comment (about needing to reload the
    module), you can increase your receive descriptor count at
    runtime with ethtool -G ethX rx <count> (see the usage
    sketch at the end of this thread).

     
  • user_id=422580

    Thanks for the prompt reply; let me follow up on your comments.

    Concerning putting more code in the interrupt handler, I
    understand your wanting to keep it as minimal as possible.
    But I can assure you it doesn't cause a detectable
    performance impact on our systems. We're currently running
    with the patch on several thousand nodes.

    As for the functionality, I see you're right that you could
    derive the same information from other stats.
    Unfortunately, since I'm not really a network guy, I took
    the 0 value in rx_over_errors as truth until forced to
    investigate further. The 0 value there is deceptive, and I'm
    probably not the only one who has been misled. This seems
    like a pretty minimal change to potentially save end users
    some aggravation.