Memory bandwidth on a 24-core AMD server
I need some help determining whether the memory bandwidth I'm seeing under Linux on my server is normal. Here are the server specs:

    HP ProLiant DL165 G7
    2x AMD Opteron 6164 HE 12-Core
    40 GB RAM (10 x 4 GB DDR3-1333)
    Debian 6.0
Using `mbw` on this server I get the following numbers:

    foo1:~# mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.58047  MiB: 1024.00000  Copy: 1764.082 MiB/s
    1    Method: MEMCPY   Elapsed: 0.58012  MiB: 1024.00000  Copy: 1765.152 MiB/s
    2    Method: MEMCPY   Elapsed: 0.58010  MiB: 1024.00000  Copy: 1765.201 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.58023  MiB: 1024.00000  Copy: 1764.811 MiB/s
    0    Method: DUMB     Elapsed: 0.36174  MiB: 1024.00000  Copy: 2830.778 MiB/s
    1    Method: DUMB     Elapsed: 0.35869  MiB: 1024.00000  Copy: 2854.817 MiB/s
    2    Method: DUMB     Elapsed: 0.35848  MiB: 1024.00000  Copy: 2856.481 MiB/s
    AVG  Method: DUMB     Elapsed: 0.35964  MiB: 1024.00000  Copy: 2847.310 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.23546  MiB: 1024.00000  Copy: 4348.860 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.23544  MiB: 1024.00000  Copy: 4349.230 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.23544  MiB: 1024.00000  Copy: 4349.359 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.23545  MiB: 1024.00000  Copy: 4349.149 MiB/s
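For readers unfamiliar with what such a tool measures, here is a minimal single-threaded sketch in Python of two of the three copy styles: one large copy of the whole buffer, and the same data copied in 262144-byte blocks. It is an illustration only, not mbw itself. The function name `copy_bandwidth` and the 64 MiB default size are assumptions made for this example, it reports the best run rather than an average, and interpreter overhead means its numbers are not comparable with the mbw figures.

```python
import time

def copy_bandwidth(size_mib=64, block_bytes=262144, runs=3):
    """Return (whole_copy_rate, block_copy_rate) in MiB/s, single-threaded."""
    size = size_mib * 1024 * 1024
    src = bytearray(size)
    dst = bytearray(size)
    src_view, dst_view = memoryview(src), memoryview(dst)

    best_whole = best_block = float("inf")
    for _ in range(runs):
        # One large copy of the entire buffer.
        start = time.perf_counter()
        dst_view[:] = src_view
        best_whole = min(best_whole, time.perf_counter() - start)

        # The same data copied block by block.
        start = time.perf_counter()
        for off in range(0, size, block_bytes):
            dst_view[off:off + block_bytes] = src_view[off:off + block_bytes]
        best_block = min(best_block, time.perf_counter() - start)

    # Best (shortest) time per style; the first run also pays for page faults.
    return size_mib / best_whole, size_mib / best_block

if __name__ == "__main__":
    whole, block = copy_bandwidth()
    print(f"whole-buffer copy: {whole:.1f} MiB/s")
    print(f"block copy:        {block:.1f} MiB/s")
```

Because a single thread does all the copying, a test like this can only exercise what one core can pull through the memory controller, which matters on a machine with many cores and several memory channels.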
On one of my other servers (based on an Intel Xeon E3-1270):

    foo2:~# mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.18960  MiB: 1024.00000  Copy: 5400.901 MiB/s
    1    Method: MEMCPY   Elapsed: 0.18922  MiB: 1024.00000  Copy: 5411.690 MiB/s
    2    Method: MEMCPY   Elapsed: 0.18944  MiB: 1024.00000  Copy: 5405.491 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.18942  MiB: 1024.00000  Copy: 5406.024 MiB/s
    0    Method: DUMB     Elapsed: 0.14838  MiB: 1024.00000  Copy: 6901.200 MiB/s
    1    Method: DUMB     Elapsed: 0.14818  MiB: 1024.00000  Copy: 6910.561 MiB/s
    2    Method: DUMB     Elapsed: 0.14820  MiB: 1024.00000  Copy: 6909.628 MiB/s
    AVG  Method: DUMB     Elapsed: 0.14825  MiB: 1024.00000  Copy: 6907.127 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.04362  MiB: 1024.00000  Copy: 23477.623 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.04262  MiB: 1024.00000  Copy: 24025.151 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.04258  MiB: 1024.00000  Copy: 24048.849 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.04294  MiB: 1024.00000  Copy: 23847.599 MiB/s
For reference, this is what I get on my Intel-based laptop:

    laptop:~$ mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.40566  MiB: 1024.00000  Copy: 2524.269 MiB/s
    1    Method: MEMCPY   Elapsed: 0.38458  MiB: 1024.00000  Copy: 2662.638 MiB/s
    2    Method: MEMCPY   Elapsed: 0.38876  MiB: 1024.00000  Copy: 2634.043 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.39300  MiB: 1024.00000  Copy: 2605.600 MiB/s
    0    Method: DUMB     Elapsed: 0.30707  MiB: 1024.00000  Copy: 3334.745 MiB/s
    1    Method: DUMB     Elapsed: 0.30425  MiB: 1024.00000  Copy: 3365.653 MiB/s
    2    Method: DUMB     Elapsed: 0.30342  MiB: 1024.00000  Copy: 3374.849 MiB/s
    AVG  Method: DUMB     Elapsed: 0.30491  MiB: 1024.00000  Copy: 3358.328 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.07875  MiB: 1024.00000  Copy: 13003.670 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.08374  MiB: 1024.00000  Copy: 12228.034 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.07635  MiB: 1024.00000  Copy: 13411.216 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.07961  MiB: 1024.00000  Copy: 12862.006 MiB/s
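So according to `mbw`, **my laptop is 3 times faster than the server!!!** Please help me make sense of this. I also tried mounting a RAM disk and benchmarking it with `dd`, and I got a similar difference, so I don't think `mbw` is the culprit. I checked the BIOS settings, and the memory appears to be running at full speed. According to the hosting company, the modules are all fine.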
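To put a number on the gap, the following sketch divides the AVG copy rates reported by the three mbw runs. The figures are copied from the output above; the host labels and the helper name `speedup_over` are made up for this example. The ratio depends on the method: the laptop comes closest to 3x on MCBLOCK and is well below that on MEMCPY and DUMB.

```python
# AVG "Copy" rates in MiB/s, copied from the three mbw runs.
AVG_MIB_S = {
    "foo1": {"MEMCPY": 1764.811, "DUMB": 2847.310, "MCBLOCK": 4349.149},
    "foo2": {"MEMCPY": 5406.024, "DUMB": 6907.127, "MCBLOCK": 23847.599},
    "laptop": {"MEMCPY": 2605.600, "DUMB": 3358.328, "MCBLOCK": 12862.006},
}

def speedup_over(baseline, other, results=AVG_MIB_S):
    """Per-method ratio of `other` to `baseline` (values > 1 mean `other` is faster)."""
    return {
        method: results[other][method] / results[baseline][method]
        for method in results[baseline]
    }

if __name__ == "__main__":
    for host in ("foo2", "laptop"):
        ratios = speedup_over("foo1", host)
        line = ", ".join(f"{m} {r:.2f}x" for m, r in ratios.items())
        print(f"{host} vs foo1: {line}")
```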
Could this be related to NUMA? Node interleaving appears to be disabled on this server. Would enabling it (and thereby turning NUMA off) make a difference?

    foo1:~# numactl --hardware
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5
    node 0 size: 8190 MB
    node 0 free: 7898 MB
    node 1 cpus: 6 7 8 9 10 11
    node 1 size: 12288 MB
    node 1 free: 12073 MB
    node 2 cpus: 18 19 20 21 22 23
    node 2 size: 12288 MB
    node 2 free: 12034 MB
    node 3 cpus: 12 13 14 15 16 17
    node 3 size: 8192 MB
    node 3 free: 8032 MB
    node distances:
    node   0   1   2   3
      0:  10  20  20  20
      1:  20  10  20  20
      2:  20  20  10  20
      3:  20  20  20  10
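The `numactl --hardware` output already hints at an uneven memory layout. The sketch below parses the node sizes from that text and, assuming every module is 4 GB as listed in the specs, estimates how many modules sit behind each NUMA node. The function names and the rounding approach are assumptions for illustration, not part of numactl.

```python
import re

# Relevant lines of the `numactl --hardware` output shown above.
NUMACTL_OUTPUT = """\
available: 4 nodes (0-3)
node 0 size: 8190 MB
node 1 size: 12288 MB
node 2 size: 12288 MB
node 3 size: 8192 MB
"""

def node_sizes_mb(text):
    """Map NUMA node id -> size in MB, parsed from `numactl --hardware` text."""
    pattern = r"^node (\d+) size: (\d+) MB$"
    return {int(n): int(mb) for n, mb in re.findall(pattern, text, re.M)}

def modules_per_node(sizes, module_mb=4096):
    """Estimate the module count per node, assuming equally sized modules."""
    return {node: round(mb / module_mb) for node, mb in sizes.items()}

if __name__ == "__main__":
    sizes = node_sizes_mb(NUMACTL_OUTPUT)
    print("total MB:", sum(sizes.values()))
    print("estimated 4 GB modules per node:", modules_per_node(sizes))
```

Under that assumption the nodes hold 2, 3, 3 and 2 modules, so two of the four nodes have an odd number of modules.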
Update:
I disabled NUMA (`numa=off` at Linux boot) and disabled ECC in the BIOS. No change; the results are the same as above.
Update 2:
Here is the memory layout according to `dmidecode`:

    PROC 1 DIMM 1
    PROC 1 DIMM 4
    PROC 1 DIMM 7
    PROC 1 DIMM 10
    PROC 1 DIMM 12
    PROC 2 DIMM 1
    PROC 2 DIMM 4
    PROC 2 DIMM 7
    PROC 2 DIMM 10
    PROC 2 DIMM 12
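These are all 4 GB Samsung modules (part number M393B5270CH0-CH9).
I looked at the HP documentation on how memory should be populated in this server, and if I understand it correctly, the module currently in DIMM 12 should have been placed in the DIMM 3 slot. Could such a misconfiguration explain the results I'm getting?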
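Update 3:
I have now removed 2 modules, leaving 4 x 4 GB per CPU (4-4) in slots 1-4-7-10.
Unfortunately, I don't see any difference in the benchmarks. Shouldn't the server now be able to use all four channels? I also tried the `stream` benchmark with multiple threads, and the results were very disappointing. The only thing I can think of is to ask the hosting company to replace the entire server...
Update 4: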
I must have done something wrong when I tested the last setup (32 GB) with `stream` yesterday, because today I'm seeing good results:

    foo1:~# ./stream
    -------------------------------------------------------------
    STREAM version $Revision: 5.9 $
    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 2000000, Offset = 0
    Total memory required = 45.8 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 24
    -------------------------------------------------------------
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 703 microseconds.
       (= 703 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function      Rate (MB/s)   Avg time     Min time     Max time
    Copy:       36873.0022       0.0009       0.0009       0.0010
    Scale:      34699.5160       0.0009       0.0009       0.0010
    Add:        30868.8427       0.0016       0.0016       0.0017
    Triad:      25558.7904       0.0019       0.0019       0.0020
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------
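As a sanity check on how to read these STREAM numbers, the sketch below recomputes the banner's memory total and the minimum times implied by the reported rates. It assumes the usual STREAM accounting, in which Copy and Scale touch two arrays per iteration and Add and Triad touch three, and 1 MB/s means 10^6 bytes per second. The helper names are made up for this example.

```python
N = 2_000_000   # "Array size" from the STREAM banner
WORD = 8        # bytes per DOUBLE PRECISION word

# Arrays read or written per kernel (assumed STREAM accounting).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Rates in MB/s copied from the output above.
REPORTED_MB_S = {
    "Copy": 36873.0022,
    "Scale": 34699.5160,
    "Add": 30868.8427,
    "Triad": 25558.7904,
}

def total_memory_mib(n=N, word=WORD, arrays=3):
    """Memory for the three STREAM arrays, in MiB."""
    return arrays * n * word / 2**20

def implied_min_time(kernel, n=N, word=WORD):
    """Seconds per pass implied by the reported rate for one kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * n * word
    return bytes_moved / (REPORTED_MB_S[kernel] * 1e6)

if __name__ == "__main__":
    print(f"total memory: {total_memory_mib():.1f} MiB")
    for kernel in REPORTED_MB_S:
        print(f"{kernel}: {implied_min_time(kernel):.4f} s")
```

The recomputed values agree with the banner (45.8 MB) and with the Min time column, so each pass lasts only about a millisecond. A larger array size would give longer, less timer-sensitive measurements.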
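(I have given up on `mbw`, since it only runs single-threaded. It still gives the same poor results on this server.) So the problem must have been that the last two 4 GB modules forced the server to run in single-channel mode, as @chx points out below. The only remaining question is whether it is possible to use 40 GB and still get the full bandwidth. Could I use 2 x 8 GB + 6 x 4 GB? Does it matter which channel I put the larger modules in?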
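By using 5-5 modules per CPU instead of 4-4 or 8-8, you are forcing the system to run in single-channel (!) mode. That's the reason. Try removing 1-1 and report back.
The 6164 is a G34 socket CPU, which is capable of quad-channel operation if the memory modules are set up correctly. Your setup is the worst possible one.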
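To see how much is at stake between single-channel and quad-channel operation, here is a rough theoretical peak calculation. It assumes DDR3-1333 (1333 million transfers per second) on a 64-bit (8-byte) channel, four channels per G34 socket, and two sockets. These are upper bounds that real workloads do not reach, and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(mt_per_s=1333, bus_bytes=8, channels=4, sockets=2):
    """Return (per_channel, total) theoretical peak bandwidth in GB/s."""
    per_channel = mt_per_s * 1e6 * bus_bytes / 1e9
    return per_channel, per_channel * channels * sockets

if __name__ == "__main__":
    for channels in (1, 4):
        per_channel, total = peak_bandwidth_gb_s(channels=channels)
        print(f"{channels} channel(s) per socket: "
              f"{per_channel:.2f} GB/s per channel, {total:.2f} GB/s total")
```

Under these assumptions each channel tops out at roughly 10.7 GB/s, so a two-socket system moves from about 21 GB/s with one channel per socket to about 85 GB/s with four.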