#394 ixgbe driver sets affinity_hint for wrong NUMA node

closed
Todd Fujinaka
None
in-kernel_driver
1
2014-12-14
2014-01-21
Arkadiusz B
No

Hello,
I'm using a SuperMicro X9DRH-7TF server with two Intel® Xeon® E5-2640 v2 processors, NUMA enabled, and an Intel X520-SR2 network adapter. The ixgbe driver version is 3.19.1.

The ixgbe driver creates an MSI-X interrupt for every available CPU. As a result, irqbalance does not touch the IRQs whose affinity_hint points at CPUs on the wrong NUMA node, and it produces the following warnings:

Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 111 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 112 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 113 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 114 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 115 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 116 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 117 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 118 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 127 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 128 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 129 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 130 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 131 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 132 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 133 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 134 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 144 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 145 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 146 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 147 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 148 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 149 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 150 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 151 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 160 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 161 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 162 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 163 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 164 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 165 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 166 affinity_hint subset empty
Jan 21 11:03:45 [daemon.warning] /usr/sbin/irqbalance: irq 167 affinity_hint subset empty

Both interfaces are on numa node 0:
cat '/sys/bus/pci/devices/0000:02:00.0/numa_node'
0
cat '/sys/bus/pci/devices/0000:02:00.1/numa_node'
0

The nodes use the following CPU mappings:
cat /sys/devices/system/node/node0/cpumap
00000000,00000000,00000000,00ff00ff
cat /sys/devices/system/node/node1/cpumap
00000000,00000000,00000000,ff00ff00

For example:
cat /proc/irq/144/affinity_hint
00000100
cat /proc/irq/144/smp_affinity
00ff00ff

The affinity_hint mask points at a CPU on the other NUMA node.
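
A quick way to see the full extent of the mismatch is to print every MSI-X vector's affinity_hint next to the device's node cpumap. A minimal sketch, assuming the PCI address above and a kernel that exposes the per-device msi_irqs directory in sysfs (variable names are only illustrative):

DEV=0000:02:00.0
NODE=$(cat /sys/bus/pci/devices/$DEV/numa_node)
echo "node $NODE cpumap: $(cat /sys/devices/system/node/node$NODE/cpumap)"
for irq in $(ls /sys/bus/pci/devices/$DEV/msi_irqs); do
    echo "irq $irq affinity_hint: $(cat /proc/irq/$irq/affinity_hint)"
done

Any hint whose set bits fall outside the node cpumap belongs to one of the vectors irqbalance warns about.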

Discussion

  • Todd Fujinaka
    2014-01-21

    • assigned_to: Todd Fujinaka
     
  • Todd Fujinaka
    2014-01-21

    Part of the problem is that you're using irqbalance, which sets affinities automatically, while also trying to set them manually. If you want to set affinities manually, I would disable irqbalance.
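
    For reference, a minimal sketch of what this manual approach might look like, assuming the PCI address from the report and that irqbalance has been stopped first. The variable names and the service command are only illustrative; smp_affinity accepts the same comma-separated hex mask format as the cpumap files shown above:

    service irqbalance stop
    DEV=0000:02:00.0
    NODE=$(cat /sys/bus/pci/devices/$DEV/numa_node)
    MASK=$(cat /sys/devices/system/node/node$NODE/cpumap)
    for irq in $(ls /sys/bus/pci/devices/$DEV/msi_irqs); do
        echo $MASK > /proc/irq/$irq/smp_affinity    # keep each vector on the device's local node
    done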

     
  • Arkadiusz B
    2014-01-22

    I don't want to set affinities manually. irqbalance works well on non-NUMA systems, and it works well on NUMA systems with other devices, e.g. Mellanox. Only cards using the ixgbe driver have this problem.

    I created a patch that allocates only as many MSI-X vectors as there are CPUs on the device's NUMA node, and maps affinity_hint to CPUs on that node only. The patch is attached.

     
    Last edit: Arkadiusz B 2014-01-22
  • Emil Tantilov
    2014-01-22

    By default the ixgbe driver loads with the number of queues equal to the number of CPUs, and because ideally we want to spread the queues across all CPUs, some queues will end up on a CPU from the opposite node. The driver handles this by allocating each queue's memory from the node that is local to the CPU handling its IRQ. This goes along with the set_irq_affinity script provided with the ixgbe driver.

    The affinity_hints are provided on driver load, and irqbalance should be able to override them by setting /proc/irq/<irq>/smp_affinity. Is irqbalance not setting smp_affinity correctly because of the affinity_hint, or is this just a warning?

     
    • Arkadiusz B
      2014-01-23

      Doesn't this solution mean the device has to communicate with CPUs on the other NUMA node?

      irqbalance doesn't touch smp_affinity when this warning occurs; it is left at the default (all CPUs on the device's NUMA node).

       
      • Todd Fujinaka
        2014-01-23

        The set_irq_affinity script is not meant to be more than a simple way to spread the queues amongst all the available cores; it does not distinguish between packages.

        As I said before, the driver doesn't and shouldn't have to know about the system configuration. It provides hints to irqbalance, so you should be directing these questions to the irqbalance maintainers.

         
        • Arkadiusz B
          2014-01-24

          Yes, but if the device is physically attached to the first NUMA node and half of the queues are allocated on the other node, doesn't that cause cross-traffic between the NUMA nodes?

           
  • Todd Fujinaka
    2014-01-24

    I think we should close this as a bug report, and you should pose the question on open mailing lists such as e1000-devel. Cross-node traffic is something you need to manage, but there are degenerate cases where costly memory accesses increase when you put all the queues on one node.

    I am also forwarding this question to the TME in charge of performance, and Emil is following up with the maintainer of irqbalance.

     
  • Neil Horman
    2014-01-24

    Hey there, I'm Neil, and I'm the irqbalance maintainer. Emil got in touch with me and asked me to look at this.

    It's a recent irqbalance version, I presume? If so, the answer is in the irqbalance man page:

    --hintpolicy=[exact | subset | ignore]
    Set the policy for how irq kernel affinity hinting is treated.
    Can be one of:
    exact - irq affinity hint is applied unilaterally and never violated
    subset - irq is balanced, but the assigned object will be a subset of
    the affinity hint
    ignore - irq affinity hint value is completely ignored

    If they want irqbalance to ignore the affinity hint provided by the ixgbe driver,
    they should add --hintpolicy=ignore to IRQBALANCE_ARGS in
    /etc/sysconfig/irqbalance, or in the unit file if using systemd (see the example
    at the end of this comment).

    If your version of irqbalance is sufficiently recent and you want finer-grained
    control than what's described above, you can also check out the --policyscript
    option. That runs a user-specified script for each discovered irq, and the
    script can return on stdout a series of key=value pairs that specify per-irq
    configuration overrides. It doesn't support affinity-hint honoring levels yet,
    but it certainly could. Open a feature request on the irqbalance GitHub project
    page if that's something you're interested in.
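
    A minimal sketch of the sysconfig change described above, assuming a RHEL-style layout (on other distributions, edit the irqbalance unit file or init script instead):

    # add to /etc/sysconfig/irqbalance (or merge into an existing IRQBALANCE_ARGS line)
    IRQBALANCE_ARGS="--hintpolicy=ignore"

    # then restart the service so the option takes effect
    service irqbalance restart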

     
  • Arkadiusz B
    2014-01-25

    Thank you for your response. Yes, I'm using the latest irqbalance and it works as expected. Hint policy ignore won't resolve the problem of cross-NUMA traffic.

    I have a question. Assume that there are two NUMA nodes. The Ethernet adapter is attached to the first node and the queues are spread over both nodes. All cores/threads on the first node are under heavy load, and the data to be sent through this adapter lives on the second NUMA node. Won't this situation cause problems (timeouts, dropped packets, etc.)?

    Does the attached patch make sense if I want to avoid that situation?

    I have had a lot of problems with the NUMA architecture and cross-node traffic before, so I want to be sure that resources on the second NUMA node won't cause many more problems in the future.

     
  • Neil Horman
    2014-01-26

    "Hint policy ignore won't resolve the problem of cross NUMA traffic."

    It will if that's the only problem you're having. irqbalance, if it can ignore affinity_hint and is running on a system with a properly populated ACPI SLIT table (see the check at the end of this comment), will keep each irq on the local NUMA node of the device. It might help to understand here that the affinity_hints from the ixgbe driver, IIRC, do not honor NUMA node locality. Because ixgbe creates a queue per CPU, it sets the affinity hint for the irq tied to each queue to a unique CPU, ignoring any NUMA locality. That's a perfectly reasonable thing to do, as it prioritizes parallel operation over NUMA locality. If that's not the right choice for you, however, the answer is to ignore the affinity hint and let irqbalance perform its default operation, which is to spread irqs in the same way but keep them all local to the device's reported NUMA node.

    "I have a question. Assume that there are two NUMA nodes. Ethernet ..."

    No, it won't cause packet drops, at least not in and of itself. The answer to your question is really your choice. Transmits from the remote node will certainly be delayed slightly by the additional PCI bus traversals they require, but whether that additional delay is better or worse than moving the processes to the more local NUMA node (incurring additional competition for CPU time and memory) is up to you and your workload requirements.

    "Does attached patch make sense if I want to avoid that situation?"

    Honestly, no. It's certainly functional, but from a philosophical point of view I don't like it. As discussed above, choosing between NUMA node locality and maximum parallelism is really a policy decision, which places it in the realm of user space. Adding it to the kernel enforces a policy that isn't in the best interests of all users. You can get the same effect by simply ignoring the affinity_hint in irqbalance.

    In fact, affinity_hint is becoming something of a holdover from before irqbalance was rewritten to parse sysfs properly. A few years ago, irqbalance wasn't really MSI-aware and had a very hard time determining how to balance these interrupts. At the time, Intel solved this problem by creating the affinity_hint interface to drive more correct mappings. Since then irqbalance has been rewritten; it can now gather information about irqs from sysfs and make better decisions than the driver can. I should probably change the irqbalance hint policy default to ignore soon.
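
    One way to confirm that the ACPI SLIT table Neil mentions is populated is to look at the node distances the kernel reports (a quick check, not from this thread; numactl may need to be installed):

    cat /sys/devices/system/node/node*/distance
    numactl --hardware

    Distinct local and remote distances (e.g. 10 vs 21) indicate the firmware is exporting a usable SLIT.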

     
    • Arkadiusz B
      2014-01-27

      Thank you for the explanation. I'll do some tests with the ignore policy.

       
  • Todd Fujinaka
    2014-02-05

    • status: open --> closed