MegaCli reports an inconsistent number of physical disks

February 28, 2017

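First, the short version of my problem. A drive in my RAID array has a blinking red light. MegaCli reports no failed disks or warnings, but some MegaCli commands show 24 disks while others show only 23. I also see the following error recurring every day: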

Event Description: Controller encountered a fatal error and was reset

Are these things related? Is there a problem here?

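Now the longer version. I have inherited responsibility for a server hosted in a data center (let's call it my_server), and I believe it has an LSI MegaRAID SAS 9265-8i in a RAID 50 (RAID 5+0) configuration. I received an email from the data center saying that a red light is blinking on one of this server's hard drives. Unfortunately I know almost nothing about RAID arrays, so I have had to muddle through with the MegaRAID SAS Software User Guide and various online tutorials.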

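I ssh'd into the server to try to diagnose the problem. Below is a sample shell session that shows what I tried and gives some relevant information about the system.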

First I check some basic system information:

$ cat /etc/issue
CentOS release 6.4 (Final)
Kernel \r on an \m

$ uname -a
Linux my_server 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Next I verify the RAID array and the MegaCli version:

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -aALL | grep "Product Name"
Product Name    : LSI MegaRAID SAS 9265-8i

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 | grep 'RAID Level'
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -v

     MegaCLI SAS RAID Management Tool  Ver 8.04.07 May 28, 2012

   (c)Copyright 2011, LSI Corporation, All Rights Reserved.

Exit Code: 0x00

Now some summary information about the drives in the array:

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -A8 "Device Present"

                   Device Present
                   ================
   Virtual Drives    : 1
     Degraded        : 0
     Offline         : 0
   Physical Devices  : 27
     Disks           : 24
     Critical Disks  : 0
     Failed Disks    : 0

Everything looks fine here. Then I check for SMART alerts:

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep 'S.M.A.R.T.'
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
[...]
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No

No SMART alerts, so after reading some tutorials I ran a few other commands:

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -lall -a0 | grep Drives
Number Of Drives    : 23

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL | grep -Pi 'SPAN|Span\ Ref|Number\ of'
Number of DISK GROUPS: 1
Number of Spans: 1
SPAN: 0
Span Reference: 0x00
Number of PDs: 23
Number of VDs: 1
Number of dedicated Hotspares: 0
Number Of Drives    : 23
Span Depth          : 1
Drive's postion: DiskGroup: 0, Span: 0, Arm: 0
Drive's postion: DiskGroup: 0, Span: 0, Arm: 1
Drive's postion: DiskGroup: 0, Span: 0, Arm: 2
Drive's postion: DiskGroup: 0, Span: 0, Arm: 3
[...]
Drive's postion: DiskGroup: 0, Span: 0, Arm: 20
Drive's postion: DiskGroup: 0, Span: 0, Arm: 21
Drive's postion: DiskGroup: 0, Span: 0, Arm: 22

Now I'm a bit confused, because some commands (e.g. adpallinfo and pdlist) show 24 disks present, while others (e.g. ldinfo and CfgDsply) show only 23.

Finally I generated an event log file and looked for signs of trouble:

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpeventlog -getevents -f lsi-events.log -a0 -nolog
$ cat lsi-events.log | grep -P -i 'fail|error|warn'

[...]
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset


$ cat lsi-events.log | grep -B6 -A3 -P -i 'fail|error|warn'

[...]
seqNum: 0x000f8644
Time: Sun Feb 26 07:32:16 2017

Code: 0x00000159
Class: 2
Locale: 0x20
Event Description: Controller encountered a fatal error and was reset
Event Data:
===========
None

And I looked for messages related to slot 23:

$ cat lsi-events.log | grep -P -i 's23' | tail -30

Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Global Hot Spare PD 1f(e0x21/s23) (global,rev) disabled
Event Description: State change on PD 1f(e0x21/s23) from HOT SPARE(2) to UNCONFIGURED_GOOD(0)
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: State change on PD 1f(e0x21/s23) from UNCONFIGURED_GOOD(0) to HOT SPARE(2)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)

I contacted the data center and was told that the blinking light is on drive 10, so I looked at that drive:

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv [33:10] -a0

Enclosure Device ID: 33
Slot Number: 10
Drive's postion: DiskGroup: 0, Span: 0, Arm: 10
Enclosure position: 1
Device Id: 18
WWN: 5000C500344D5940
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Emulated Drive: No
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: 0006
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c500344d5941
SAS Address(1): 0x5000c500344d5942
Connected Port Number: 0(path0) 1(path1) 
Inquiry Data: SEAGATE ST32000444SS    00069WM6369D            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :26C (78.80 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Port-1 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

Exit Code: 0x00

I also tried smartctl:

$ sudo smartctl -a -d megaraid,18 /dev/sdc

smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.11.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               SEAGATE 
Product:              ST32000444SS    
Revision:             0006
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Logical Unit id:      0x5000c500344d5943
Serial number:        9WM6369D0000914458SC
Device type:          disk
Transport protocol:   SAS
Local Time is:        Tue Feb 28 17:18:33 2017 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     26 C
Drive Trip Temperature:        68 C
Manufactured in week 21 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  41
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  41
Elements in grown defect list: 0
Vendor (Seagate) cache information
 Blocks sent to initiator = 3508224337
 Blocks received from initiator = 38846232
 Blocks read from cache and sent to initiator = 44013719
 Number of read and write commands whose size <= segment size = 2649500
 Number of read and write commands whose size > segment size = 4
Vendor (Seagate/Hitachi) factory information
 number of hours powered up = 45862.30
 number of minutes until next internal SMART test = 46

Error counter log:
          Errors Corrected by           Total   Correction     Gigabytes    Total
              ECC          rereads/    errors   algorithm      processed    uncorrected
          fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   22540834        0         0  22540834   22540834        230.346           0
write:         0        0         0         0          0         20.012           0
verify: 161330204        1         0  161330205   161330205       1896.577           0

Non-medium error count:        0

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged
Long (extended) Self Test duration: 18500 seconds [308.3 minutes]

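The discrepancy you see between the logical drive view and the physical device view is because the drive in slot 23 is configured as a global hot spare. It is therefore not assigned to any logical drive, and it can step in as a spare for any LD that goes into a degraded state. So you have 24 physical drives: 23 assigned to LD 0, plus one global hot spare.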
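The 24 = 23 + 1 arithmetic can be checked by tallying the "Firmware state" field of the -PDList output instead of counting by eye. A minimal Python sketch, assuming the output has been captured as text; the function name is hypothetical, and the sample is a hand-written, abbreviated stand-in (including the "Hotspare, Spun down" wording), not real output from this server:

```python
from collections import Counter

def tally_firmware_states(pdlist_text):
    """Count drives by the leading part of their 'Firmware state' field."""
    states = Counter()
    for line in pdlist_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Firmware state":
            # "Online, Spun Up" -> "Online"; "Hotspare, Spun down" -> "Hotspare"
            states[value.split(",")[0].strip()] += 1
    return states

# Illustrative sample only: 23 online array members plus one hot spare in slot 23.
sample = "".join(
    f"Slot Number: {slot}\nFirmware state: Online, Spun Up\n\n" for slot in range(23)
) + "Slot Number: 23\nFirmware state: Hotspare, Spun down\n"

counts = tally_firmware_states(sample)
print(counts["Online"], counts["Hotspare"], sum(counts.values()))
```

If the tally of online drives matches "Number Of Drives" from -LDInfo and the remainder are hot spares, the two views agree.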
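As for the blinking red light on the drive, you should check with the DC which slot it is in, and then look at the detailed status of that drive with MegaCli -PDInfo -PhysDrv [E:S] -a0, where E is the enclosure number and S is the slot number. A blinking red light is usually a sign of a PFA/SMART predicted failure, although your mileage may vary.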
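One thing to watch when filling in [E:S]: the event log names drives as "PD 1f(e0x21/s23)", with the enclosure in hex, while -PhysDrv was given the decimal Enclosure Device ID (33 in the PDInfo output, and 0x21 = 33). A small Python sketch of the conversion; the function names are hypothetical, and treating the leading "1f" as a hex device ID and the slot as decimal are assumptions based on the log lines in the question:

```python
import re

# Matches references such as "PD 1f(e0x21/s23)" in MegaCli event descriptions.
PD_REF = re.compile(r"PD ([0-9a-fA-F]+)\(e0x([0-9a-fA-F]+)/s(\d+)\)")

def decode_pd_ref(text):
    """Return device ID, enclosure and slot as integers, or None if no match."""
    m = PD_REF.search(text)
    if m is None:
        return None
    return {
        "device_id": int(m.group(1), 16),  # assumed hex: 1f -> 31
        "enclosure": int(m.group(2), 16),  # hex by its 0x prefix: 21 -> 33
        "slot": int(m.group(3)),           # assumed decimal
    }

def physdrv_arg(ref):
    """Format the [E:S] argument expected by -PhysDrv."""
    return f"[{ref['enclosure']}:{ref['slot']}]"

line = "Event Description: Inserted: PD 1f(e0x21/s23)"
ref = decode_pd_ref(line)
print(ref, physdrv_arg(ref))
```

The decoded device ID is also the number smartctl takes in -d megaraid,N, as the question's use of megaraid,18 for the drive with Device Id 18 shows.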
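As a side note, grepping for literal strings in the output of human-readable commands such as MegaCli is a habit that will eventually get you into trouble.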
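One alternative to grep is to split the event log into records and filter on a field. A rough Python sketch, assuming each event starts with a "seqNum:" line as in the excerpt in the question; the function names are hypothetical, the reading of Class as a severity level where higher means worse is an assumption, and in the sample the first record's seqNum, time, code, class and locale are invented for illustration:

```python
def parse_events(log_text):
    """Split MegaCli event-log text into one dict per event, keyed by field name."""
    events, current = [], None
    for line in log_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # rows without a colon, e.g. "===========" or a bare "None"
        key, value = key.strip(), value.strip()
        if key == "seqNum":  # a new record begins here
            current = {}
            events.append(current)
        if current is not None:
            current[key] = value
    return events

def at_least(events, min_class):
    """Keep events whose Class field is >= min_class (assumed severity scale)."""
    return [e for e in events if int(e.get("Class", "0")) >= min_class]

SAMPLE = """\
seqNum: 0x000f8643
Time: Sun Feb 26 07:30:00 2017

Code: 0x00000000
Class: 0
Locale: 0x02
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Data:
===========
None

seqNum: 0x000f8644
Time: Sun Feb 26 07:32:16 2017

Code: 0x00000159
Class: 2
Locale: 0x20
Event Description: Controller encountered a fatal error and was reset
Event Data:
===========
None
"""

events = parse_events(SAMPLE)
serious = at_least(events, 2)
for e in serious:
    print(e["Time"], e["Event Description"])
```

Filtering on a field keeps each description attached to its timestamp and sequence number, which grep with -B/-A only approximates.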

Source: https://serverfault.com/questions/835073
