Poor sequential write performance on RAID-5 when not writing huge amounts of data

April 29, 2022

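I had some trouble getting acceptable read/write performance out of my RAID5 + crypt + ext4 setup and was finally able to narrow it down to the following question:
Hardware
Hard disks: 4x WD RED 3 TB (WDC WD30EFRX-68EUZN0) as /dev/sd[efgh]
sde and sdf are connected via controller A over a 3 Gbps SATA link (even though 6 Gbps would be available)
sdg and sdh are connected via controller B over a 6 Gbps SATA link
Single-disk performance
Write test, four runs per disk (everything as I would expect)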
# dd if=/dev/zero of=/dev/sd[efgh] bs=2G count=1 oflag=dsync
sde: 2147479552 bytes (2.1 GB) copied, xxx s, [127, 123, 132, 127] MB/s
sdf: 2147479552 bytes (2.1 GB) copied, xxx s, [131, 130, 118, 137] MB/s
sdg: 2147479552 bytes (2.1 GB) copied, xxx s, [145, 145, 145, 144] MB/s
sdh: 2147479552 bytes (2.1 GB) copied, xxx s, [126, 132, 132, 132] MB/s
Read test with hdparm and dd (everything as I would expect)
# hdparm -tT /dev/sd[efgh]
# echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/sd[efgh] bs=2G count=1 iflag=fullblock

(sde)
Timing cached reads:   xxx MB in  2.00 seconds = [13983.68, 14136.87] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [143.16, 143.14] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [140, 141] MB/s

(sdf)
Timing cached reads:   xxx MB in  2.00 seconds = [14025.80, 13995.14] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [140.31, 140.61] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [145, 141] MB/s

(sdg)
Timing cached reads:   xxx MB in  2.00 seconds = [14005.61, 13801.93] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [153.11, 151.73] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [154, 155] MB/s

(sdh)
Timing cached reads:   xxx MB in  2.00 seconds = [13816.84, 14335.93] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [142.50, 142.12] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [140, 140] MB/s
Partitions on sd[efgh]
4x 32 GiB for testing
# gdisk -l /dev/sd[efgh]
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
 MBR: protective
 BSD: not present
 APM: not present
 GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sde: 5860533168 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5860533134
Partitions will be aligned on 2048-sector boundaries
Total free space is 5793424237 sectors (2.7 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name
  1            2048        67110911   32.0 GiB    FD00  Linux RAID
RAID array
# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 --chunk=256K /dev/sd[efgh]1
(some tests later ...)
# mdadm --grow --verbose /dev/md0 --layout=right-asymmetric
# mdadm --detail /dev/md0
/dev/md0:
   Version : 1.2
 Creation Time : Sat Dec 10 03:07:56 2016
    Raid Level : raid5
    Array Size : 100561920 (95.90 GiB 102.98 GB)
 Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
  Raid Devices : 4
 Total Devices : 4
   Persistence : Superblock is persistent

   Update Time : Sat Dec 10 23:56:53 2016
         State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
 Spare Devices : 0

        Layout : right-asymmetric
    Chunk Size : 256K

          Name : vm:0  (local to host vm)
          UUID : 80d0f886:dc380755:5387f78c:1fac60da
        Events : 158

   Number   Major   Minor   RaidDevice State
      0       8       65        0      active sync   /dev/sde1
      1       8       81        1      active sync   /dev/sdf1
      2       8       97        2      active sync   /dev/sdg1
      4       8      113        3      active sync   /dev/sdh1
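Current situation
I expected the array's sequential read and write speed to be roughly 350 - 400 MB/s. Reading or writing the entire volume does in fact give perfect results in this range: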
# echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/md0 bs=256K
102975406080 bytes (103 GB) copied, 261.373 s, 394 MB/s

# dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync
102975406080 bytes (103 GB) copied, 275.562 s, 374 MB/s
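The 350 - 400 MB/s expectation is consistent with how RAID-5 stripes data: with four members, three chunks of every stripe carry data, so sequential throughput is bounded by roughly three times the rate of a single disk. A minimal sketch of that arithmetic follows; the function name `raid5_seq_estimate` is made up for illustration, and the per-disk figures are picked from the single-disk write tests above.

```shell
# Rough upper bound for RAID-5 sequential throughput:
# (members - 1) data chunks per stripe, each at the single-disk rate.
raid5_seq_estimate() {
    members=$1        # number of disks in the array
    per_disk_mbs=$2   # sequential MB/s of one disk
    echo $(( (members - 1) * per_disk_mbs ))
}

raid5_seq_estimate 4 118   # slowest single-disk write measured above
raid5_seq_estimate 4 132   # a typical single-disk write measured above
```

This ignores parity computation and controller limits, so it is only an estimate of the ceiling.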
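However, write performance depends heavily on the amount of data written. As expected, the transfer rate rises with the amount of data, but it drops sharply at 2 GiB and recovers only slowly as the size increases further: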
# dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync count=x
count=1: 262144 bytes (262 kB) copied, xxx s, [3.6, 7.6, 8.9, 8.9] MB/s
count=2: 524288 bytes (524 kB) copied, xxx s, [3.1, 17.7, 15.3, 15.7] MB/s
count=4: 1048576 bytes (1.0 MB) copied, xxx s, [13.2, 23.9, 26.9, 25.4] MB/s
count=8: 2097152 bytes (2.1 MB) copied, xxx s, [24.3, 46.7, 45.9, 42.8] MB/s
count=16: 4194304 bytes (4.2 MB) copied, xxx s, [5.1, 77.3, 42.6, 73.2, 79.8] MB/s
count=32: 8388608 bytes (8.4 MB) copied, xxx s, [68.6, 101, 99.7, 101] MB/s
count=64: 16777216 bytes (17 MB) copied, xxx s, [52.5, 136, 159, 159] MB/s
count=128: 33554432 bytes (34 MB) copied, xxx s, [38.5, 175, 185, 189, 176] MB/s
count=256: 67108864 bytes (67 MB) copied, xxx s, [53.5, 244, 229, 238] MB/s
count=512: 134217728 bytes (134 MB) copied, xxx s, [111, 288, 292, 288] MB/s
count=1K: 268435456 bytes (268 MB) copied, xxx s, [171, 328, 319, 322] MB/s
count=2K: 536870912 bytes (537 MB) copied, xxx s, [228, 337, 330, 334] MB/s
count=4K: 1073741824 bytes (1.1 GB) copied, xxx s, [338, 348, 348, 343] MB/s <-- ok!
count=8K: 2147483648 bytes (2.1 GB) copied, xxx s, [168, 147, 138, 139] MB/s <-- bad!
count=16K: 4294967296 bytes (4.3 GB) copied, xxx s, [155, 160, 178, 144] MB/s
count=32K: 8589934592 bytes (8.6 GB) copied, xxx s, [256, 238, 264, 246] MB/s
count=64K: 17179869184 bytes (17 GB) copied, xxx s, [298, 285] MB/s
count=128K: 34359738368 bytes (34 GB) copied, xxx s, [347, 336] MB/s
count=256K: 68719476736 bytes (69 GB) copied, xxx s, [363, 356] MB/s <-- getting better
(Below 2 GiB, the first measurement of each series suggests that some read cache is involved.)
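A series like the one above could be scripted roughly as follows. This is only a sketch: `sweep_counts` is a hypothetical helper name, and the command inside it is the same dd invocation used in the table. Writing to a block device such as /dev/md0 destroys its contents, so the example call is left commented out.

```shell
# Sweep dd's count= for a fixed 256K block size, one result line per run.
# WARNING: pointing this at a block device overwrites the data on it.
sweep_counts() {
    target=$1; shift
    for count in "$@"; do
        # dd prints its statistics on stderr; the last line holds the rate.
        result=$(dd if=/dev/zero of="$target" bs=256K count="$count" \
                    conv=fdatasync 2>&1 | tail -n 1)
        echo "count=$count: $result"
    done
}

# Example (destructive): sweep_counts /dev/md0 1 2 4 8 16 32 64
```

Running each count several times, as in the table, helps separate the first-run outliers from the steady values.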
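When transferring 2 GiB or more, I observed something strange in iotop:
Phase 1: At first, both "Total DISK WRITE" and "Actual DISK WRITE" are at about 400 MB/s. dd shows an IO value of about 85 %, while everything else is at 0 %. This phase lasts longer for larger transfers.
Phase 2: A few seconds (about 16 s) before the transfer completes, a kworker jumps in and /steals/ 30 - 50 percentage points of IO from dd. The split fluctuates between 30:50 % and 50:30 %. At the same time, "Total DISK WRITE" drops to 0 B/s and "Actual DISK WRITE" jumps around between 20 and 70 MB/s. This phase seems to last for some time.
Phase 3: In the last 3 seconds, "Actual DISK WRITE" jumps to > 400 MB/s while "Total DISK WRITE" stays at 0 B/s. Both dd and kworker are listed with an IO value of 0 %.
Phase 4: The IO value of dd rises to 5 % for a single second. At the same moment the transfer completes.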
More tests
# dd if=/dev/zero of=/dev/md0 bs=256K count=32K oflag=direct
8589934592 bytes (8.6 GB) copied, 173.083 s, 49.6 MB/s

# dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=direct
17179869184 bytes (17 GB) copied, 47.792 s, 359 MB/s

# dd if=/dev/zero of=/dev/md0 bs=768M count=16K oflag=direct
50734301184 bytes (51 GB) copied, 136.347 s, 372 MB/s <-- peak performance

# dd if=/dev/zero of=/dev/md0 bs=1G count=16K oflag=direct
41875931136 bytes (42 GB) copied, 112.518 s, 372 MB/s <-- peak performance

# dd if=/dev/zero of=/dev/md0 bs=2G count=16 oflag=direct
34359672832 bytes (34 GB) copied, 103.355 s, 332 MB/s

# dd if=/dev/zero of=/dev/md0 bs=256K count=32K oflag=dsync
8589934592 bytes (8.6 GB) copied, 498.77 s, 17.2 MB/s

# dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=dsync
17179869184 bytes (17 GB) copied, 58.384 s, 294 MB/s

# dd if=/dev/zero of=/dev/md0 bs=1G count=8 oflag=dsync
8589934592 bytes (8.6 GB) copied, 26.4799 s, 324 MB/s

# dd if=/dev/zero of=/dev/md0 bs=2G count=8 oflag=dsync
17179836416 bytes (17 GB) copied, 192.327 s, 89.3 MB/s

# dd if=/dev/zero of=/dev/md0 bs=256K; echo "sync"; sync
102975406080 bytes (103 GB) copied, 275.378 s, 374 MB/s
sync
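bs=256K oflag=direct -> 100 % IO, no kworker present, poor performance
bs=1G oflag=direct -> < 5 % IO, no kworker present, good performance
bs=2G oflag=direct -> > 80 % IO, kworker jumps in every now and then, acceptable performance
oflag=dsync -> < 5 % IO, kworker jumps in every now and then; huge block sizes are needed to get acceptable speeds, but > 2G causes a massive performance drop.
echo "sync"; sync -> same as conv=fdatasync; sync returns immediately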
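One possible factor in the poor bs=256K oflag=direct result, offered here as an assumption rather than something measured above: a RAID-5 write that covers less than one full stripe forces md to read existing data or parity before it can update the parity chunk. With a 256K chunk and four members, a full stripe holds three data chunks. The helper name `full_stripe_kib` is hypothetical.

```shell
# Data held by one full RAID-5 stripe: chunk size times (members - 1).
full_stripe_kib() {
    chunk_kib=$1   # chunk size in KiB (256 for this array)
    members=$2     # number of disks in the array
    echo $(( chunk_kib * (members - 1) ))
}

full_stripe_kib 256 4   # KiB of data per stripe for the array above
```

A single 256K direct write therefore covers only one third of a stripe, whereas the very large block sizes span many full stripes per request.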
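Questions
What is this mysterious phase 2, in which the two processes seem to be fighting over IO?
Who is transferring the data to the hardware in phase 3?
Most importantly: how can I minimize the strange effects and get the full 400 MB/s that the array seems capable of delivering? (Or am I even asking an XY question?)
Bonus
There is a long story of trial and error leading up to the current state. I switched the scheduler from cfq to noop and reduced the RAID chunk size from 512k to 256k, which gave better results. Changing to --layout=right-asymmetric did not change anything. Temporarily disabling the hard drives' write cache made things worse.
The crypt layer mentioned in the first sentence is currently absent altogether and will be reintroduced later.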
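For readers who want to check which scheduler is active on each member disk, the kernel lists the available schedulers in /sys/block/DEVICE/queue/scheduler and marks the active one with square brackets. A small sketch, with `active_scheduler` as a hypothetical helper name:

```shell
# Extract the bracketed (active) scheduler name from a scheduler file.
active_scheduler() {
    # $1: path to a scheduler file, e.g. /sys/block/sde/queue/scheduler
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Show the scheduler for each RAID member (read-only).
for disk in sde sdf sdg sdh; do
    f=/sys/block/$disk/queue/scheduler
    if [ -r "$f" ]; then
        echo "$disk: $(active_scheduler "$f")"
    fi
done

# To switch, as root: echo noop > /sys/block/sde/queue/scheduler
```

The set of scheduler names on offer depends on the kernel version; cfq and noop are the ones named in the question.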
# uname -a
Linux vm 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

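What you are seeing is an artifact of your dd command line, specifically of the conv=fdatasync option. From the man page:
Each CONV symbol may be:
…
**fdatasync:** physically write output file data before finishing
…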
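conv=fdatasync basically instructs dd to issue a single, final fdatasync system call before returning. However, *writes are cached while dd is running.* Your I/O phases can be explained as follows:
dd quickly writes into the page cache without actually touching the disks
the page cache becomes nearly full, and a kernel kworker thread starts flushing it to disk. While the page cache is being flushed, dd is briefly paused (causing high iowait); once some page cache has been freed, dd can resume
the difference between TOTAL and ACTUAL disk writes in iotop depends on how the page cache is filled and flushed, respectively
the cycle repeats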
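The fill-and-flush cycle described above can be watched from a second terminal through the Dirty and Writeback counters in /proc/meminfo. A minimal sketch follows; `dirty_writeback` is a hypothetical helper name, while the two field names are the ones the kernel reports.

```shell
# Print the page-cache write-back counters from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
dirty_writeback() {
    awk '/^(Dirty|Writeback):/ { print $1, $2, $3 }' "${1:-/proc/meminfo}"
}

# Sample once per second while dd runs elsewhere (Ctrl-C to stop):
# while sleep 1; do dirty_writeback | tr '\n' ' '; echo; done
```

Dirty should climb while dd fills the page cache, and Writeback should rise once the kernel starts flushing. The points at which flushing begins and at which writers are throttled are governed by the vm.dirty_background_ratio and vm.dirty_ratio sysctls.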
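In short, there is no problem here. If you want to observe uncached behavior, replace conv=fdatasync with **oflag=direct**: with this flag, you bypass the page cache entirely.
To observe cached but synchronized behavior, replace conv=fdatasync with **oflag=sync**: with this flag, every block is written synchronously, so dd waits for each block to reach the disk before writing the next.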
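To compare the three write modes on the same transfer, the command lines can be generated side by side. This sketch only prints the commands instead of running them, because each one overwrites the target device; `dd_mode_commands` is a hypothetical helper name.

```shell
# Print (do not run) one dd command per write mode for the same transfer.
dd_mode_commands() {
    target=$1; bs=$2; count=$3
    for mode in conv=fdatasync oflag=direct oflag=sync; do
        echo "dd if=/dev/zero of=$target bs=$bs count=$count $mode"
    done
}

dd_mode_commands /dev/md0 256K 8K
```

Each printed line can then be pasted into a root shell once the target is confirmed to hold no data worth keeping.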
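Further optimization can be obtained by fine-tuning the I/O stack (i.e. I/O scheduler, merging behavior, stripe cache, etc.), but that is another question entirely.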
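As one example of such a knob, md exposes the RAID-5 stripe cache size in /sys/block/md0/md/stripe_cache_size. The kernel's md documentation gives its memory cost as page size times number of disks times the configured value. A small sketch of that calculation, with `stripe_cache_bytes` as a hypothetical helper name and a 4096-byte page assumed:

```shell
# Memory used by the md stripe cache, per the kernel's md documentation:
#   page_size * nr_disks * stripe_cache_size
stripe_cache_bytes() {
    page_size=$1; nr_disks=$2; entries=$3
    echo $(( page_size * nr_disks * entries ))
}

stripe_cache_bytes 4096 4 256   # 256 is the md default

# Current value on a running array: cat /sys/block/md0/md/stripe_cache_size
```

Whether a larger value helps this particular workload is not something the measurements above establish; it would have to be tested.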

Source: https://serverfault.com/questions/820031
