Raid

RAID performance suddenly degraded

  • December 13, 2017

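We recently noticed that our database queries were taking much longer than usual to execute. After some investigation, it looks like our disk reads are very slow.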

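We've run into a similar problem in the past, caused by the RAID controller starting a relearn cycle on the BBU and switching to write-through. That doesn't appear to be the case this time.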

Over the course of a few days, I ran bonnie++ several times. The results are below:

[Image: bonnie++ output]

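Read speeds of 22-82 MB/s look pretty abysmal. Running dd against the raw device for a few minutes shows read speeds ranging from 15.8 MB/s to 225 MB/s (see the update below). iotop doesn't indicate any other processes competing for IO, so I'm not sure why the read speed is so variable.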

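The RAID card is a MegaRAID SAS 9280 with 12 SAS drives (15k, 300GB) in RAID10, with an XFS filesystem (the operating system is on two SSDs configured in RAID1). I'm not seeing any SMART alerts, and the array doesn't appear to be degraded.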

I've also run xfs_check, and there don't appear to be any XFS consistency issues.

What should the next investigative step be here?

Server specs

Ubuntu 12.04.5 LTS
128GB RAM
Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz

Output of xfs_repair -n

Phase 1 - find and verify superblock...
Phase 2 - using internal log
       - scan filesystem freespace and inode maps...
       - found root inode chunk
Phase 3 - for each AG...
       - scan (but don't clear) agi unlinked lists...
       - process known inodes and perform inode discovery...
       - agno = 0
       - agno = 1
       - agno = 2
       - agno = 3
       - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
       - setting up duplicate extent list...
       - check for inodes claiming duplicate blocks...
       - agno = 1
       - agno = 3
       - agno = 2
       - agno = 0
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
       - traversing filesystem ...
       - traversal finished ...
       - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Output of megacli -AdpAllInfo -aAll

                   Versions
               ================
Product Name    : LSI MegaRAID SAS 9280-4i4e
Serial No       : SV24919344
FW Package Build: 12.12.0-0124

                   Mfg. Data
               ================
Mfg. Date       : 12/06/12
Rework Date     : 00/00/00
Revision No     : 04B
Battery FRU     : N/A

               Image Versions in Flash:
               ================
FW Version         : 2.130.363-1846
BIOS Version       : 3.25.00_4.12.05.00_0x05180000
Preboot CLI Version: 04.04-020:#%00009
WebBIOS Version    : 6.0-51-e_47-Rel
NVDATA Version     : 2.09.03-0039
Boot Block Version : 2.02.00.00-0000
BOOT Version       : 09.250.01.219

               Pending Images in Flash
               ================
None

               PCI Info
               ================
Controller Id   : 0000
Vendor Id       : 1000
Device Id       : 0079
SubVendorId     : 1000
SubDeviceId     : 9282

Host Interface  : PCIE

ChipRevision    : B4

Link Speed       : 0
Number of Frontend Port: 0
Device Interface  : PCIE

Number of Backend Port: 8
Port  :  Address
0        5003048001c1e47f
1        0000000000000000
2        0000000000000000
3        0000000000000000
4        0000000000000000
5        0000000000000000
6        0000000000000000
7        0000000000000000

               HW Configuration
               ================
SAS Address      : 500605b005a6cbc0
BBU              : Present
Alarm            : Present
NVRAM            : Present
Serial Debugger  : Present
Memory           : Present
Flash            : Present
Memory Size      : 512MB
TPM              : Absent
On board Expander: Absent
Upgrade Key      : Absent
Temperature sensor for ROC    : Absent
Temperature sensor for controller    : Absent


               Settings
               ================
Current Time                     : 14:58:51 7/11, 2016
Predictive Fail Poll Interval    : 300sec
Interrupt Throttle Active Count  : 16
Interrupt Throttle Completion    : 50us
Rebuild Rate                     : 30%
PR Rate                          : 30%
BGI Rate                         : 30%
Check Consistency Rate           : 30%
Reconstruction Rate              : 30%
Cache Flush Interval             : 4s
Max Drives to Spinup at One Time : 4
Delay Among Spinup Groups        : 2s
Physical Drive Coercion Mode     : Disabled
Cluster Mode                     : Disabled
Alarm                            : Enabled
Auto Rebuild                     : Enabled
Battery Warning                  : Enabled
Ecc Bucket Size                  : 15
Ecc Bucket Leak Rate             : 1440 Minutes
Restore HotSpare on Insertion    : Disabled
Expose Enclosure Devices         : Enabled
Maintain PD Fail History         : Enabled
Host Request Reordering          : Enabled
Auto Detect BackPlane Enabled    : SGPIO/i2c SEP
Load Balance Mode                : Auto
Use FDE Only                     : No
Security Key Assigned            : No
Security Key Failed              : No
Security Key Not Backedup        : No
Default LD PowerSave Policy      : Controller Defined
Maximum number of direct attached drives to spin up in 1 min : 120
Auto Enhanced Import             : No
Any Offline VD Cache Preserved   : No
Allow Boot with Preserved Cache  : No
Disable Online Controller Reset  : No
PFK in NVRAM                     : No
Use disk activity for locate     : No
POST delay           : 90 seconds
BIOS Error Handling              : Stop On Errors
Current Boot Mode         :Normal
               Capabilities
               ================
RAID Level Supported             : RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
Supported Drives                 : SAS, SATA

Allowed Mixing:

Mix in Enclosure Allowed
Mix of SAS/SATA of HDD type in VD Allowed

               Status
               ================
ECC Bucket Count                 : 0

               Limitations
               ================
Max Arms Per VD          : 32
Max Spans Per VD         : 8
Max Arrays               : 128
Max Number of VDs        : 64
Max Parallel Commands    : 1008
Max SGE Count            : 80
Max Data Transfer Size   : 8192 sectors
Max Strips PerIO         : 42
Max LD per array         : 16
Min Strip Size           : 8 KB
Max Strip Size           : 1.0 MB
Max Configurable CacheCade Size: 0 GB
Current Size of CacheCade      : 0 GB
Current Size of FW Cache       : 350 MB

               Device Present
               ================
Virtual Drives    : 2
 Degraded        : 0
 Offline         : 0
Physical Devices  : 16
 Disks           : 14
 Critical Disks  : 0
 Failed Disks    : 0

               Supported Adapter Operations
               ================
Rebuild Rate                    : Yes
CC Rate                         : Yes
BGI Rate                        : Yes
Reconstruct Rate                : Yes
Patrol Read Rate                : Yes
Alarm Control                   : Yes
Cluster Support                 : No
BBU                             : Yes
Spanning                        : Yes
Dedicated Hot Spare             : Yes
Revertible Hot Spares           : Yes
Foreign Config Import           : Yes
Self Diagnostic                 : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares               : Yes
Deny SCSI Passthrough           : No
Deny SMP Passthrough            : No
Deny STP Passthrough            : No
Support Security                : No
Snapshot Enabled                : No
Support the OCE without adding drives : Yes
Support PFK                     : Yes
Support PI                      : No
Support Boot Time PFK Change    : No
Disable Online PFK Change       : No
PFK TrailTime Remaining         : 0 days 0 hours
Support Shield State            : No
Block SSD Write Disk Cache Change: No

               Supported VD Operations
               ================
Read Policy          : Yes
Write Policy         : Yes
IO Policy            : Yes
Access Policy        : Yes
Disk Cache Policy    : Yes
Reconstruction       : Yes
Deny Locate          : No
Deny CC              : No
Allow Ctrl Encryption: No
Enable LDBBM         : No
Support Breakmirror  : No
Power Savings        : No

               Supported PD Operations
               ================
Force Online                            : Yes
Force Offline                           : Yes
Force Rebuild                           : Yes
Deny Force Failed                       : No
Deny Force Good/Bad                     : No
Deny Missing Replace                    : No
Deny Clear                              : No
Deny Locate                             : No
Support Temperature                     : Yes
NCQ                                     : No
Disable Copyback                        : No
Enable JBOD                             : No
Enable Copyback on SMART                : No
Enable Copyback to SSD on SMART Error   : Yes
Enable SSD Patrol Read                  : No
PR Correct Unconfigured Areas           : Yes
Enable Spin Down of UnConfigured Drives : Yes
Disable Spin Down of hot spares         : No
Spin Down time                          : 30
T10 Power State                         : No
               Error Counters
               ================
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0

               Cluster Information
               ================
Cluster Permitted     : No
Cluster Active        : No

               Default Settings
               ================
Phy Polarity                     : 0
Phy PolaritySplit                : 0
Background Rate                  : 30
Strip Size                       : 256kB
Flush Time                       : 4 seconds
Write Policy                     : WB
Read Policy                      : Adaptive
Cache When BBU Bad               : Disabled
Cached IO                        : No
SMART Mode                       : Mode 6
Alarm Disable                    : Yes
Coercion Mode                    : None
ZCR Config                       : Unknown
Dirty LED Shows Drive Activity   : No
BIOS Continue on Error           : 0
Spin Down Mode                   : None
Allowed Device Type              : SAS/SATA Mix
Allow Mix in Enclosure           : Yes
Allow HDD SAS/SATA Mix in VD     : Yes
Allow SSD SAS/SATA Mix in VD     : No
Allow HDD/SSD Mix in VD          : No
Allow SATA in Cluster            : No
Max Chained Enclosures           : 16
Disable Ctrl-R                   : Yes
Enable Web BIOS                  : Yes
Direct PD Mapping                : No
BIOS Enumerate VDs               : Yes
Restore Hot Spare on Insertion   : No
Expose Enclosure Devices         : Yes
Maintain PD Fail History         : Yes
Disable Puncturing               : No
Zero Based Enclosure Enumeration : No
PreBoot CLI Enabled              : Yes
LED Show Drive Activity          : Yes
Cluster Disable                  : Yes
SAS Disable                      : No
Auto Detect BackPlane Enable     : SGPIO/i2c SEP
Use FDE Only                     : No
Enable Led Header                : No
Delay during POST                : 0
EnableCrashDump                  : No
Disable Online Controller Reset  : No
EnableLDBBM                      : No
Un-Certified Hard Disk Drives    : Allow
Treat Single span R1E as R10     : No
Max LD per array                 : 16
Power Saving option              : Don't Auto spin down Configured Drives
Max power savings option is  not allowed for LDs. Only T10 power conditions are to be used.
Default spin down time in minutes: 30
Enable JBOD                      : No
TTY Log In Flash                 : No
Auto Enhanced Import             : No
BreakMirror RAID Support         : No
Disable Join Mirror              : No
Enable Shield State              : No
Time taken to detect CME         : 60s

Output of megacli -AdpBbuCmd -GetBbuStatus -aAll

BBU status for Adapter: 0

BatteryType: iBBU
Voltage: 4068 mV
Current: 0 mA
Temperature: 30 C
Battery State: Optimal
BBU Firmware Status:

 Charging Status              : Charging
 Voltage                                 : OK
 Temperature                             : OK
 Learn Cycle Requested                   : No
 Learn Cycle Active                      : No
 Learn Cycle Status                      : OK
 Learn Cycle Timeout                     : No
 I2c Errors Detected                     : No
 Battery Pack Missing                    : No
 Battery Replacement required            : No
 Remaining Capacity Low                  : No
 Periodic Learn Required                 : No
 Transparent Learn                       : No
 No space to cache offload               : No
 Pack is about to fail & should be replaced : No
 Cache Offload premium feature required  : No
 Module microcode update required        : No


GasGuageStatus:
 Fully Discharged        : No
 Fully Charged           : No
 Discharging             : Yes
 Initialized             : Yes
 Remaining Time Alarm    : No
 Discharge Terminated    : No
 Over Temperature        : No
 Charging Terminated     : No
 Over Charged            : No
 Relative State of Charge: 88 %
 Charger System State: 49169
 Charger System Ctrl: 0
 Charging current: 512 mA
 Absolute state of charge: 87 %
 Max Error: 4 %

Exit Code: 0x00
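
For comparing this kind of output over time, the learn-cycle fields can be pulled out of the BBU status text with a small parser. This is a minimal sketch in Python, assuming the output is a series of `Key : Value` lines as shown above. The function names and the `SAMPLE` string are illustrative and not part of megacli, and the sketch parses text only, it does not invoke the tool.

```python
def parse_bbu_status(text):
    """Collect 'Key : Value' lines from megacli BBU output into a dict.

    Lines without a value (section headers such as 'BBU Firmware Status:')
    are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def learn_cycle_suspected(fields):
    """True if the BBU reports a learn cycle as active or requested."""
    return (
        fields.get("Learn Cycle Active") == "Yes"
        or fields.get("Learn Cycle Requested") == "Yes"
    )


# A few lines copied from the output above.
SAMPLE = """\
Battery State: Optimal
 Learn Cycle Requested                   : No
 Learn Cycle Active                      : No
 Relative State of Charge: 88 %
"""
```

With the values posted here, `learn_cycle_suspected(parse_bbu_status(SAMPLE))` is false, which matches the observation that a relearn cycle does not seem to be the cause this time.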

Output of megacli -LDInfo -Lall -aAll

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 111.281 GB
Sector Size         : 512
Mirror Data         : 111.281 GB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Is VD Cached: No


Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Is VD Cached: No
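
One way to rule out the write-through fallback mentioned at the top of the question is to compare the Default and Current cache policy of each virtual drive. The sketch below assumes the `-LDInfo` layout shown above. It also assumes that a fallback would appear as a Current policy starting with `WriteThrough`, which is not visible in this output. The function names are illustrative.

```python
import re


def cache_policies(ldinfo_text):
    """Return a list of (vd_number, default_policy, current_policy)."""
    results = []
    vd = None
    default = None
    for line in ldinfo_text.splitlines():
        header = re.match(r"Virtual Drive:\s*(\d+)", line)
        if header:
            vd = int(header.group(1))
            default = None
        elif line.startswith("Default Cache Policy:"):
            default = line.split(":", 1)[1].strip()
        elif line.startswith("Current Cache Policy:"):
            current = line.split(":", 1)[1].strip()
            results.append((vd, default, current))
    return results


def fell_back_to_write_through(default, current):
    """True if a write-back volume is currently running write-through.

    Assumption: the controller reports the fallback as 'WriteThrough'.
    """
    return (default or "").startswith("WriteBack") and current.startswith(
        "WriteThrough"
    )


# Lines copied from the output above (other fields omitted).
SAMPLE = """\
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Virtual Drive: 1 (Target Id: 1)
Default Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
"""
```

For the two virtual drives listed here, Default and Current match and both are write-back, so neither volume has fallen back.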

**Update:** At Andrew's suggestion, I ran dd for a few minutes to see what kind of throughput I'd get on raw disk reads:

dd if=/dev/sdb of=/dev/null bs=256k
19701+0 records in
19700+0 records out
5164236800 bytes (5.2 GB) copied, 202.553 s, 25.5 MB/s

Results from other runs, with widely varying throughput:

18706857984 bytes (19 GB) copied, 1181.51 s, 15.8 MB/s
20923023360 bytes (21 GB) copied, 388.137 s, 53.9 MB/s
21205876736 bytes (21 GB) copied, 55.5997 s, 381 MB/s
25391005696 bytes (25 GB) copied, 153.903 s, 165 MB/s
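
To put a number on how variable these runs are, the byte count and elapsed time in each dd summary line can be turned back into a rate. This is a small Python sketch that assumes the GNU dd summary format shown above, where MB means 10^6 bytes. The names `throughput_mb_s` and `RUNS` are illustrative.

```python
import re

# Matches e.g. "5164236800 bytes (5.2 GB) copied, 202.553 s, 25.5 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([\d.]+) s,")


def throughput_mb_s(summary_line):
    """Recompute throughput in MB/s (10^6 bytes per second) from a dd summary."""
    match = DD_SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a dd summary line: %r" % summary_line)
    return int(match.group(1)) / float(match.group(2)) / 1e6


# The five summary lines posted above.
RUNS = [
    "5164236800 bytes (5.2 GB) copied, 202.553 s, 25.5 MB/s",
    "18706857984 bytes (19 GB) copied, 1181.51 s, 15.8 MB/s",
    "20923023360 bytes (21 GB) copied, 388.137 s, 53.9 MB/s",
    "21205876736 bytes (21 GB) copied, 55.5997 s, 381 MB/s",
    "25391005696 bytes (25 GB) copied, 153.903 s, 165 MB/s",
]

rates = [throughput_mb_s(line) for line in RUNS]
spread = max(rates) / min(rates)  # fastest run divided by slowest run
```

For these five runs the fastest read is roughly 24 times the slowest, on the same device with no competing IO reported by iotop.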

**Update 2:** Output of megacli -PDlist -aall: https://gist.github.com/danpelota/3fca1e5f90a1f358c2d52a49bfb08ef0

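As Michal pointed out in his comment, the problem was a "pre-failure" disk. There were no red flags in the diagnostics from the MegaRAID controller or from smartctl (SMART Health Status: OK), but running smartctl on each disk showed a huge non-medium error count on one of them (I wrote a quick bash script to loop over each disk ID). Here are the relevant bits of the full output: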

# Ran this for each individual disk on the /dev/sdb array:
smartctl -a -d megaraid,18  /dev/sdb

Error counter log:
          Errors Corrected by           Total   Correction     Gigabytes    Total
              ECC          rereads/    errors   algorithm      processed    uncorrected
          fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    7950078        0         0   7950078    7950078        660.801           0
write:         0        0         0         0          0        363.247           0
verify:       12        0         0        12         12          0.002           0

Non-medium error count:  3253718

Every drive other than this one (disk ID 18) showed a non-medium error count of 0. I identified the disk, replaced it with a new one, and was back to 3gbps read speeds.
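
The per-disk loop described above could look something like the following. This is a sketch in Python rather than the original bash script, which was not posted. It reuses the `smartctl -a -d megaraid,N /dev/sdb` invocation shown earlier. The function names are illustrative, and the device IDs passed to `scan` are an assumption that would need to come from the controller's physical drive list.

```python
import re
import subprocess

NON_MEDIUM = re.compile(r"^Non-medium error count:\s*(\d+)", re.MULTILINE)


def non_medium_errors(smartctl_output):
    """Extract the non-medium error count, or None if the line is absent."""
    match = NON_MEDIUM.search(smartctl_output)
    return int(match.group(1)) if match else None


def scan(device_ids, block_device="/dev/sdb"):
    """Run smartctl once per MegaRAID device ID and collect the counters.

    Needs root and smartmontools installed; nothing runs until this is called.
    """
    counts = {}
    for dev_id in device_ids:
        result = subprocess.run(
            ["smartctl", "-a", "-d", "megaraid,%d" % dev_id, block_device],
            capture_output=True,
            text=True,
            check=False,  # smartctl uses non-zero exit codes as status bits
        )
        counts[dev_id] = non_medium_errors(result.stdout)
    return counts


def suspects(counts):
    """Device IDs whose non-medium error count is present and non-zero."""
    return sorted(dev_id for dev_id, count in counts.items() if count)
```

Calling `suspects(scan(ids))` with the IDs of the drives in the array would return only the drives with a non-zero counter, which in this case was disk ID 18.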

According to the smartmontools wiki:

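The error logs displayed (if available) are shown on separate lines:

  • Write error counters
  • Read error counters
  • Verify error counters (only displayed if non-zero)
  • Non-medium error counter (a single number). This represents the number of recoverable events other than write, read or verify errors.
  • Error events are held in the "Last n error events" log page. The number of error event records held (i.e. "n") is vendor specific (e.g. up to 23 records are held on the Hitachi 10K300 model disk). The contents of each error event record are in ASCII and vendor specific. The parameter code associated with each error event record indicates the relative time at which the error event occurred. A higher parameter code indicates that the error event occurred later in time. If this log page is not supported by the device, "Error Events logging not supported" is output. If this log page is supported and there are error event records, each one is prefixed by "Error event <n>:", where <n> is the parameter code.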
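
The read, write and verify rows of the error counter log can also be parsed so that drives can be compared side by side. The sketch below assumes the seven-column layout printed by smartctl in the output earlier in this post. The column names in `COLUMNS` are labels invented for this example, not identifiers from smartmontools.

```python
# Labels for the seven numeric columns of the "Error counter log" table,
# in the order they are printed. These names are made up for this sketch.
COLUMNS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)


def parse_error_counters(text):
    """Return {'read': {...}, 'write': {...}, 'verify': {...}} for rows found."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        values = rest.split()
        if sep and label in ("read", "write", "verify") and len(values) == len(COLUMNS):
            rows[label] = {
                name: float(value) if name == "gigabytes_processed" else int(value)
                for name, value in zip(COLUMNS, values)
            }
    return rows


def corrected_per_gb(row):
    """Corrected errors per gigabyte processed (0.0 if nothing was processed)."""
    gigabytes = row["gigabytes_processed"]
    return row["total_corrected"] / gigabytes if gigabytes else 0.0


# The three rows from the failing drive's output above.
SAMPLE = """\
read:    7950078        0         0   7950078    7950078        660.801           0
write:         0        0         0         0          0        363.247           0
verify:       12        0         0        12         12          0.002           0
"""
```

The meaning of the corrected-error counters varies by vendor, so comparing drives of the same model within the array is probably more informative than applying a fixed threshold.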

Source: https://serverfault.com/questions/789155