Linux

How to override the IRQ affinity of NVMe devices

  • May 13, 2021

I am trying to move all interrupts onto cores 0-3 so that the remaining cores stay free for high-speed, low-latency virtualization.

I wrote a quick script to set the IRQ affinity to 0-3:

#!/bin/bash
# Pin every IRQ under /proc/irq to cores 0-3.
while IFS= read -r LINE; do
   echo "0-3 -> \"$LINE\""
   sudo bash -c "echo 0-3 > \"$LINE\""
done <<< "$(find /proc/irq/ -name smp_affinity_list)"

This seems to work for USB and network devices, but not for the NVMe devices. They all produce this error:

bash: line 1: echo: write error: Input/output error

They stubbornly keep spreading their interrupts evenly across nearly all of my cores.

If I check the current affinity of these devices:

$ cat /proc/irq/81/smp_affinity_list 
0-1,16-17
$ cat /proc/irq/82/smp_affinity_list
2-3,18-19
$ cat /proc/irq/83/smp_affinity_list
4-5,20-21
$ cat /proc/irq/84/smp_affinity_list
6-7,22-23
...

It appears that "something" is taking full control of spreading the IRQs across cores and not letting me change it.

Moving these onto other cores is absolutely critical, because I run heavy IO inside VMs pinned to those cores and the NVMe drives generate an enormous number of interrupts. This isn't Windows; I should be able to decide what my machine does.

What is controlling the IRQ affinity of these devices, and how can I override it?


I am using a Ryzen 3950X CPU on a Gigabyte Aorus X570 Master motherboard, with 3 NVMe drives attached to the motherboard's M.2 slots.

(Update: I am now on a 5950X and still have exactly the same problem.)

Kernel: 5.12.2-arch1-1

The NVMe-related output of lspci -v:

01:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
   Subsystem: Phison Electronics Corporation E12 NVMe Controller
   Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 14
   Memory at fc100000 (64-bit, non-prefetchable) [size=16K]
   Capabilities: [80] Express Endpoint, MSI 00
   Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
   Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
   Capabilities: [f8] Power Management version 3
   Capabilities: [100] Latency Tolerance Reporting
   Capabilities: [110] L1 PM Substates
   Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
   Capabilities: [200] Advanced Error Reporting
   Capabilities: [300] Secondary PCI Express
   Kernel driver in use: nvme

04:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
   Subsystem: Phison Electronics Corporation E12 NVMe Controller
   Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0, IOMMU group 25
   Memory at fbd00000 (64-bit, non-prefetchable) [size=16K]
   Capabilities: [80] Express Endpoint, MSI 00
   Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
   Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
   Capabilities: [f8] Power Management version 3
   Capabilities: [100] Latency Tolerance Reporting
   Capabilities: [110] L1 PM Substates
   Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
   Capabilities: [200] Advanced Error Reporting
   Capabilities: [300] Secondary PCI Express
   Kernel driver in use: nvme

05:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
   Subsystem: Phison Electronics Corporation E12 NVMe Controller
   Flags: bus master, fast devsel, latency 0, IRQ 40, NUMA node 0, IOMMU group 26
   Memory at fbc00000 (64-bit, non-prefetchable) [size=16K]
   Capabilities: [80] Express Endpoint, MSI 00
   Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
   Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
   Capabilities: [f8] Power Management version 3
   Capabilities: [100] Latency Tolerance Reporting
   Capabilities: [110] L1 PM Substates
   Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
   Capabilities: [200] Advanced Error Reporting
   Capabilities: [300] Secondary PCI Express
   Kernel driver in use: nvme

$ dmesg | grep -i nvme
[    2.042888] nvme nvme0: pci function 0000:01:00.0
[    2.042912] nvme nvme1: pci function 0000:04:00.0
[    2.042941] nvme nvme2: pci function 0000:05:00.0
[    2.048103] nvme nvme0: missing or invalid SUBNQN field.
[    2.048109] nvme nvme2: missing or invalid SUBNQN field.
[    2.048109] nvme nvme1: missing or invalid SUBNQN field.
[    2.048112] nvme nvme0: Shutdown timeout set to 10 seconds
[    2.048120] nvme nvme1: Shutdown timeout set to 10 seconds
[    2.048127] nvme nvme2: Shutdown timeout set to 10 seconds
[    2.049578] nvme nvme0: 8/0/0 default/read/poll queues
[    2.049668] nvme nvme1: 8/0/0 default/read/poll queues
[    2.049716] nvme nvme2: 8/0/0 default/read/poll queues
[    2.051211]  nvme1n1: p1
[    2.051260]  nvme2n1: p1
[    2.051577]  nvme0n1: p1 p2

What is controlling the IRQ affinity of these devices?

Since v4.8, the Linux kernel automatically uses MSI/MSI-X interrupts in the NVMe driver, and, via IRQD_AFFINITY_MANAGED, automatically manages the affinity of those MSI/MSI-X interrupts in the kernel.

See these commits:

  1. 90c9712fbb388077b5e53069cae43f1acbb0102a - NVMe: Always use MSI/MSI-X interrupts
  2. 9c2555835bb3d34dfac52a0be943dcc4bedd650f - genirq: Introduce IRQD_AFFINITY_MANAGED flag

Judging by your kernel version and the device capabilities shown in the lspci -v output, this is clearly what is happening here.
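
As a quick way to confirm this on a running system, here is a minimal sketch (it assumes the per-IRQ effective_affinity_list files, which recent x86 kernels expose) that lists the nvme queue interrupts together with the affinity the kernel has chosen for them:

#!/bin/bash
# List every IRQ registered by the nvme driver and the affinity the kernel
# assigned to it. Managed IRQs reject writes to smp_affinity with EIO,
# but their effective affinity is still readable.
for irq in $(grep -i 'nvme' /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    printf 'IRQ %s -> ' "$irq"
    cat "/proc/irq/$irq/effective_affinity_list"
done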

And how can I override it?

Short of disabling the flag and recompiling the kernel, one option may be to disable MSI/MSI-X at the level of your PCI bridge (rather than the devices themselves):

echo 0 > /sys/bus/pci/devices/$bridge/msi_bus
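
Purely as a sketch, here is one way that might look end to end. It assumes the controller address 0000:01:00.0 from the lspci output above, that the controller sits behind a PCIe root port, and that writing 0 to msi_bus disables MSI/MSI-X for the devices behind that bridge:

#!/bin/bash
# Resolve the upstream PCI bridge of one NVMe controller (address assumed
# from the lspci output above) and disable MSI/MSI-X for devices behind it.
dev=0000:01:00.0
bridge=$(basename "$(dirname "$(readlink -f "/sys/bus/pci/devices/$dev")")")
echo "bridge for $dev is $bridge"
echo 0 | sudo tee "/sys/bus/pci/devices/$bridge/msi_bus"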

Note that disabling MSI/MSI-X has performance implications. See this kernel documentation for more details.

Rather than disabling MSI/MSI-X entirely, a better approach is to keep MSI-X while enabling polling mode in the NVMe driver; see Andrew H's answer.
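
For reference, a minimal sketch of that polling approach, assuming the in-tree nvme driver's poll_queues module parameter (available since roughly kernel 5.0):

# Request polled I/O queues at boot, either on the kernel command line:
#   nvme.poll_queues=4
# or via modprobe configuration (regenerate the initramfs afterwards):
echo 'options nvme poll_queues=4' | sudo tee /etc/modprobe.d/nvme-poll.conf

# After a reboot, the poll-queue count reported in dmesg should be non-zero,
# e.g. "nvme nvme0: 8/0/4 default/read/poll queues":
dmesg | grep 'default/read/poll queues'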

Quoted from: https://serverfault.com/questions/1052448