MegaCli reports inconsistent physical disk counts
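First, the short version of my problem. A drive in my RAID array has a blinking red light. MegaCli reports no failed disks and no warnings, yet some MegaCli commands show 24 disks while others show only 23. I also see the following error repeated every day: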
    Event Description: Controller encountered a fatal error and was reset
Are these things related? Is there actually a problem here?
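Now the longer version. I have inherited responsibility for a server hosted in a data center (let's call it my_server), and I believe it has an LSI MegaRAID SAS 9265-8i in a RAID 50 (RAID 5+0) configuration. I received an email from the data center saying that a red light is blinking on one of this server's hard drives. Unfortunately, I know almost nothing about RAID arrays, so I have had to feel my way through the MegaRAID SAS Software User Guide and various online tutorials. I ssh'd into the server to try to diagnose the problem. Below is a sample shell session that shows what I tried and gives some relevant information about the system.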
First I check some basic system information:
    $ cat /etc/issue
    CentOS release 6.4 (Final)
    Kernel \r on an \m

    $ uname -a
    Linux my_server 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Next I verify the RAID array and the MegaCli version:
    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -aALL | grep "Product Name"
    Product Name : LSI MegaRAID SAS 9265-8i

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 | grep 'RAID Level'
    RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -v

    MegaCLI SAS RAID Management Tool Ver 8.04.07 May 28, 2012

    (c)Copyright 2011, LSI Corporation, All Rights Reserved.

    Exit Code: 0x00
Now some summary information about the drives in the array:
    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -A8 "Device Present"
    Device Present
    ================
    Virtual Drives : 1
      Degraded : 0
      Offline : 0
    Physical Devices : 27
      Disks : 24
      Critical Disks : 0
      Failed Disks : 0
Everything looks fine here. Then I check for SMART alerts:
    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep 'S.M.A.R.T.'
    Drive has flagged a S.M.A.R.T alert : No
    Drive has flagged a S.M.A.R.T alert : No
    [...]
    Drive has flagged a S.M.A.R.T alert : No
    Drive has flagged a S.M.A.R.T alert : No
No SMART alerts, so after reading a few tutorials I ran some other commands:
    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -lall -a0 | grep Drives
    Number Of Drives : 23

    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL | grep -Pi 'SPAN|Span\ Ref|Number\ of'
    Number of DISK GROUPS: 1
    Number of Spans: 1
    SPAN: 0
    Span Reference: 0x00
    Number of PDs: 23
    Number of VDs: 1
    Number of dedicated Hotspares: 0
    Number Of Drives : 23
    Span Depth : 1
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 0
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 1
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 2
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 3
    [...]
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 20
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 21
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 22
Now I'm a bit confused, because some commands (e.g. adpallinfo and pdlist) show 24 disks, while others (e.g. ldinfo and CfgDsply) show only 23.
Finally I generated an event log file and looked for signs of trouble:
    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpeventlog -getevents -f lsi-events.log -a0 -nolog

    $ cat lsi-events.log | grep -P -i 'fail|error|warn'
    [...]
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset
    Event Description: Controller encountered a fatal error and was reset

    $ cat lsi-events.log | grep -B6 -A3 -P -i 'fail|error|warn'
    [...]
    seqNum: 0x000f8644
    Time: Sun Feb 26 07:32:16 2017

    Code: 0x00000159
    Class: 2
    Locale: 0x20
    Event Description: Controller encountered a fatal error and was reset
    Event Data:
    ===========
    None
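One rough way to check whether these resets really recur daily is to count them per calendar day. The sketch below is only an illustration: `resets_per_day` is a hypothetical helper name, and it assumes every event record in the log carries a `Time:` line in the format shown in the excerpt above, ahead of its `Event Description:` line. Records without a `Time:` line are counted under "unknown".

```shell
#!/usr/bin/env bash
# Hypothetical helper: count controller fatal-error resets per calendar day.
# Assumption: each record looks like the excerpt above, i.e. a "seqNum:" line,
# then "Time: <weekday> <month> <day> <hh:mm:ss> <year>", then the description.
resets_per_day() {
  awk '
    $1 == "seqNum:" { day = "unknown" }           # a new record starts
    $1 == "Time:"   { day = $3 " " $4 " " $6 }    # e.g. "Feb 26 2017"
    /Event Description: Controller encountered a fatal error/ {
      if (!(day in count)) order[++n] = day       # keep first-seen order
      count[day]++
    }
    END { for (i = 1; i <= n; i++) print order[i] ": " count[order[i]] }
  ' "$@"
}

# Usage, with the file written by -adpeventlog above:
#   resets_per_day lsi-events.log
```

Days that show exactly one reset at about the same time would support the "once a day" impression; several resets clustered on a single day would point somewhere else.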
And I looked for messages related to slot 23:
    $ cat lsi-events.log | grep -P -i 's23' | tail -30
    Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
    Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
    Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Inserted: PD 1f(e0x21/s23)
    Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
    Event Description: Global Hot Spare PD 1f(e0x21/s23) (global,rev) disabled
    Event Description: State change on PD 1f(e0x21/s23) from HOT SPARE(2) to UNCONFIGURED_GOOD(0)
    Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
    Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
    Event Description: State change on PD 1f(e0x21/s23) from UNCONFIGURED_GOOD(0) to HOT SPARE(2)
    Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
    Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
I contacted the data center and was told that the blinking light is on drive 10, so I looked at that drive:
    $ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv [33:10] -a0

    Enclosure Device ID: 33
    Slot Number: 10
    Drive's postion: DiskGroup: 0, Span: 0, Arm: 10
    Enclosure position: 1
    Device Id: 18
    WWN: 5000C500344D5940
    Sequence Number: 2
    Media Error Count: 0
    Other Error Count: 0
    Predictive Failure Count: 0
    Last Predictive Failure Event Seq Number: 0
    PD Type: SAS
    Raw Size: 1.819 TB [0xe8e088b0 Sectors]
    Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
    Coerced Size: 1.818 TB [0xe8d00000 Sectors]
    Emulated Drive: No
    Firmware state: Online, Spun Up
    Commissioned Spare : No
    Emergency Spare : No
    Device Firmware Level: 0006
    Shield Counter: 0
    Successful diagnostics completion on : N/A
    SAS Address(0): 0x5000c500344d5941
    SAS Address(1): 0x5000c500344d5942
    Connected Port Number: 0(path0) 1(path1)
    Inquiry Data: SEAGATE ST32000444SS 00069WM6369D
    FDE Capable: Not Capable
    FDE Enable: Disable
    Secured: Unsecured
    Locked: Unlocked
    Needs EKM Attention: No
    Foreign State: None
    Device Speed: 6.0Gb/s
    Link Speed: 6.0Gb/s
    Media Type: Hard Disk Device
    Drive Temperature :26C (78.80 F)
    PI Eligibility: No
    Drive is formatted for PI information: No
    PI: No PI
    Port-0 :
    Port status: Active
    Port's Linkspeed: 6.0Gb/s
    Port-1 :
    Port status: Active
    Port's Linkspeed: 6.0Gb/s
    Drive has flagged a S.M.A.R.T alert : No

    Exit Code: 0x00
I also tried smartctl:
    $ sudo smartctl -a -d megaraid,18 /dev/sdc
    smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.11.1.el6.x86_64] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

    Vendor: SEAGATE
    Product: ST32000444SS
    Revision: 0006
    User Capacity: 2,000,398,934,016 bytes [2.00 TB]
    Logical block size: 512 bytes
    Logical Unit id: 0x5000c500344d5943
    Serial number: 9WM6369D0000914458SC
    Device type: disk
    Transport protocol: SAS
    Local Time is: Tue Feb 28 17:18:33 2017 CST
    Device supports SMART and is Enabled
    Temperature Warning Enabled
    SMART Health Status: OK

    Current Drive Temperature: 26 C
    Drive Trip Temperature: 68 C
    Manufactured in week 21 of year 2011
    Specified cycle count over device lifetime: 10000
    Accumulated start-stop cycles: 41
    Specified load-unload count over device lifetime: 300000
    Accumulated load-unload cycles: 41
    Elements in grown defect list: 0
    Vendor (Seagate) cache information
      Blocks sent to initiator = 3508224337
      Blocks received from initiator = 38846232
      Blocks read from cache and sent to initiator = 44013719
      Number of read and write commands whose size <= segment size = 2649500
      Number of read and write commands whose size > segment size = 4
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 45862.30
      number of minutes until next internal SMART test = 46

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:   22540834        0         0  22540834   22540834        230.346           0
    write:         0        0         0         0          0         20.012           0
    verify: 161330204        1         0  161330205   161330205       1896.577           0

    Non-medium error count: 0

    [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
    No self-tests have been logged
    Long (extended) Self Test duration: 18500 seconds [308.3 minutes]
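The difference you see between the logical drive view and the physical device view is because the drive in slot 23 is configured as a global hot spare. It is therefore not assigned to any logical drive, and it can step in as a replacement for any LD that becomes degraded. So you have 24 physical drives: 23 assigned to LD 0, plus one global hot spare.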
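To see that split directly, one option is to tally the `Firmware state:` field that each drive reports in the `-PDList` output. This is a minimal sketch rather than a definitive tool: `pd_state_tally` is a hypothetical helper name, and the exact wording of the hot spare state (assumed here to start with "Hotspare") may differ between MegaCli versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally physical drives by firmware state.
# Input: the text printed by `MegaCli64 -PDList -aALL` (stdin or a file).
pd_state_tally() {
  awk -F': *' '
    $1 == "Firmware state" {
      state = $2
      sub(/[ \t\r]+$/, "", state)   # trim trailing whitespace
      sub(/,.*/, "", state)         # drop the ", Spun Up" style suffix
      count[state]++
    }
    END { for (s in count) print count[s], s }
  ' "$@" | sort -k2
}

# Usage:
#   sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | pd_state_tally
```

If the explanation above holds, the tally should show 23 drives in the Online state and one in a hot spare state, which together account for the 24 disks that adpallinfo reports.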
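As for the blinking red light on the drive, you should confirm with the DC which slot it is in, then look at the detailed status of that drive with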
    MegaCli -PDInfo -PhysDrv [E:S] -a0
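where E is the enclosure number and S is the slot number. A blinking red light is usually a sign of a PFA/SMART predicted failure, although your mileage may vary. As a side note, using grep to match literal strings in the human-readable output of commands such as MegaCli is a habit that will eventually get you into trouble.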
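A less fragile alternative is to split each line at its first colon and compare the key exactly, rather than matching substrings anywhere in the line. The sketch below assumes the `-PDList` / `-PDInfo` field names shown earlier in the question and that every drive record begins with an `Enclosure Device ID` line; `pd_table` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one tab-separated row per drive with
# slot, firmware state, media errors, other errors, predictive failures,
# and the SMART alert flag. Keys are matched exactly, not by substring.
pd_table() {
  awk '
    function emit() {
      if (slot != "")
        printf "%s\t%s\t%s\t%s\t%s\t%s\n", slot, state, media, other, pred, smart
      slot = state = media = other = pred = smart = ""
    }
    {
      sub(/\r$/, "")                      # tolerate CRLF line endings
      i = index($0, ":")                  # split at the FIRST colon only
      if (i == 0) next
      key = substr($0, 1, i - 1)
      val = substr($0, i + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
    }
    key == "Enclosure Device ID"                 { emit() }   # new drive record
    key == "Slot Number"                         { slot  = val }
    key == "Firmware state"                      { state = val }
    key == "Media Error Count"                   { media = val }
    key == "Other Error Count"                   { other = val }
    key == "Predictive Failure Count"            { pred  = val }
    key == "Drive has flagged a S.M.A.R.T alert" { smart = val }
    END { emit() }
  ' "$@"
}

# Usage:
#   sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | pd_table
```

Exact key matching means a pattern such as `Drives` or `S.M.A.R.T.` (where each dot matches any character) cannot silently pick up unrelated lines, and a renamed field shows up as an empty column instead of a misleading match.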