From: Alexander S. <al...@mo...> - 2018-01-16 13:51:28
Hello,

I'm having a hard time scaling performance over multiple LUNs. What happens is that a single core gets hammered with softirq (SI) load, indirectly by the SRP initiator it seems, and so becomes a serious bottleneck:

cat /proc/interrupts | grep -i mlx
  28:  244  0  0  12331  0  0        0  0  PCI-MSI 524288-edge  mlx4-async@pci:0000:01:00.0
  29:    1  0  0      0  0  0  8117957  0  PCI-MSI 524289-edge  mlx4-1@0000:01:00.0
  30:    0  0  0      0  0  0        0  0  PCI-MSI 524290-edge  mlx4-2@0000:01:00.0
  31:    0  0  0      0  0  0        0  0  PCI-MSI 524291-edge  mlx4-3@0000:01:00.0
  32:    0  0  0      0  0  0        0  0  PCI-MSI 524292-edge  mlx4-4@0000:01:00.0
  33:    0  0  0      0  0  0        0  0  PCI-MSI 524293-edge  mlx4-5@0000:01:00.0
  34:    0  0  0      0  0  0        0  0  PCI-MSI 524294-edge  mlx4-6@0000:01:00.0
  35:    0  0  0      0  0  0        0  0  PCI-MSI 524295-edge  mlx4-7@0000:01:00.0
  36:    0  0  0      0  0  0        0  0  PCI-MSI 524296-edge  mlx4-8@0000:01:00.0

As I understand it, with multiple channels the load should be spread across IRQs 29-36, which could then be assigned to different cores?

The setup, for both initiator and target: KVM VM with ConnectX-3 VF, Debian 9 (now with backports kernel), 8 cores, 4 GB RAM. Exported disks are 4x 10 GB from a 900P, limited to 50k r/w IOPS each, virtio-blk with io-thread on the host. The host itself is a Supermicro X10 with E5 2683v3 running Proxmox.

The disks scale: when I mount all 8 volumes in a single VM, performance in mdraid 10 is absolutely on point (> 350k reads, 200k writes, which is quite amazing given that this is all on a single 900P). Load on the target is always fine. Benchmarking a single device imported via SRP on the initiator is also on point (SI starts to get high, but it still does not saturate the core).
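To make the /proc/interrupts output above easier to read at a glance, here is a minimal sketch that reports which CPU has taken the most interrupts for each mlx4 IRQ, plus a helper that builds the hex mask expected by /proc/irq/<irq>/smp_affinity. The function names irq_hot_cpu and cpu_mask are made up for this example. It only diagnoses and pins IRQs; it is an assumption, not something shown in this post, that ib_srp will actually use the other completion vectors once they are pinned.

```shell
#!/bin/sh
# Sketch: report, for each mlx4 IRQ, which CPU has taken the most interrupts.
# Input is /proc/interrupts-style text on stdin (with or without the header row).
irq_hot_cpu() {
    awk '/mlx4/ {
        irq = $1; sub(/:$/, "", irq)
        total = 0; max = 0; hot = -1
        # Per-CPU counters are the purely numeric fields right after the IRQ label.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) {
            total += $i
            if ($i + 0 > max) { max = $i + 0; hot = i - 2 }
        }
        # hot_cpu=-1 means the IRQ has never fired on any CPU.
        printf "%s %s hot_cpu=%d count=%d total=%d\n", irq, $NF, hot, max, total
    }'
}

# Hex affinity mask with only the given CPU set, in the format expected by
# /proc/irq/<irq>/smp_affinity.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}
```

Usage would be "irq_hot_cpu < /proc/interrupts". Pinning, for example, IRQ 30 to CPU 2 would then be, as root, "echo $(cpu_mask 2) > /proc/irq/30/smp_affinity". If irqbalance is running it may rewrite the mask afterwards.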
Saturation starts when I RAID-0 three or more LUNs, as the core (in this case CPU6) gets saturated. In top it looks like this, which also reflects the output from /proc/interrupts:

top - 14:40:31 up 32 min, 3 users, load average: 1.17, 0.28, 0.24
Tasks: 211 total, 5 running, 112 sleeping, 0 stopped, 0 zombie
%Cpu0 : 10.7 us, 64.4 sy, 0.0 ni, 23.5 id, 0.7 wa, 0.0 hi,  0.0 si, 0.7 st
%Cpu1 :  5.0 us, 46.6 sy, 0.0 ni, 47.7 id, 0.0 wa, 0.0 hi,  0.7 si, 0.0 st
%Cpu2 :  3.3 us, 29.9 sy, 0.0 ni, 64.5 id, 1.0 wa, 0.0 hi,  0.3 si, 1.0 st
%Cpu3 :  5.0 us, 38.1 sy, 0.0 ni, 55.0 id, 1.0 wa, 0.0 hi,  0.3 si, 0.7 st
%Cpu4 :  3.0 us, 37.5 sy, 0.0 ni, 58.5 id, 0.3 wa, 0.0 hi,  0.0 si, 0.7 st
%Cpu5 :  5.0 us, 38.7 sy, 0.0 ni, 55.3 id, 0.0 wa, 0.0 hi,  0.0 si, 1.0 st
%Cpu6 :  4.3 us, 32.2 sy, 0.0 ni,  7.8 id, 0.4 wa, 0.0 hi, 53.7 si, 1.6 st
%Cpu7 :  3.3 us, 27.4 sy, 0.0 ni, 67.3 id, 1.0 wa, 0.0 hi,  0.0 si, 1.0 st

Optimizations done so far:

1. Backports kernel to get block-mq + recent SCST
   - 4.14.0-0.bpo.2-amd64 #1 SMP Debian 4.14.7-1~bpo9+1 (2017-12-22) x86_64 GNU/Linux

2. Use block-mq and multiple channels for ib_srp
   - booting the kernel with "pti=off scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1"
   - options ib_srp cmd_sg_entries=255 indirect_sg_entries=255 ch_count=6
   - set scheduler to mq-deadline on the block devices
   I verified that block-mq is working and that multiple channels are used on the target via scstadm --list_session, by the bytes_written on each session.

3. Removed dm-multipath, mdraid etc. from the I/O path to narrow things down; tried a non-journaling filesystem

4. Tried ib_srp-backport
   - built fine on Debian against the kernel from backports
   - unfortunately this ended with write IOPS going totally south, stalling on high io-wait
   - read IOPS and general CPU utilization improved a lot
   - on a single LUN performance is still on point, and CPU utilization is much better, but still on one core
   - trouble starts with mdraid 0 over 4 LUNs: it completely stalls and never reaches the ca. 170k write IOPS I get with the in-tree ib_srp.ko

5. Multiple others, all of which helped somewhat but didn't solve the problem
   - pinning CPU cores of the VMs
   - etc.

So, any ideas how to get this sorted? Happy to give more details / outputs, build against patched sources etc.

Best regards,
Alex
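For anyone wanting to double-check the block-mq and ib_srp settings listed under items 1 and 2, the following is a rough sketch that reads them back from sysfs on the initiator. The helper names and the SYSFS_ROOT override are hypothetical and exist only for this example. My understanding is that ib_srp exposes one blk-mq hardware queue per channel, so the queue count would be expected to match ch_count; treat that as an assumption to verify rather than a guarantee.

```shell
#!/bin/sh
# Sketch: read back the block-mq / ib_srp settings from sysfs.
# SYSFS_ROOT is a hypothetical override (not a kernel or distro variable)
# so the helpers can be tried against a fake directory tree.

# Active I/O scheduler: the kernel marks it with [brackets] in queue/scheduler.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${SYSFS_ROOT:-/sys}/block/$1/queue/scheduler"
}

# Number of blk-mq hardware queues: one subdirectory per queue under mq/.
# Prints 0 if the device has no mq/ directory (i.e. is not using blk-mq).
hw_queue_count() {
    ls -d "${SYSFS_ROOT:-/sys}/block/$1/mq"/*/ 2>/dev/null | wc -l | tr -d ' '
}

# Value of an ib_srp module parameter as currently loaded, e.g. ch_count.
srp_param() {
    cat "${SYSFS_ROOT:-/sys}/module/ib_srp/parameters/$1"
}
```

With an SRP-imported disk at, say, sdb, "active_scheduler sdb" should print mq-deadline, and "hw_queue_count sdb" can be compared against "srp_param ch_count".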