Filesystems

Is filesystem corruption after a sudden power loss on an SSD's ext3 partition "expected behavior"?

  • August 9, 2015

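My company makes an embedded Debian Linux device that boots from an ext3 partition on an internal SSD. Because the device is an embedded "black box", it is usually shut down the rude way: power is simply cut via an external switch.

This is normally okay, because ext3's journaling keeps things in order, so apart from occasionally losing part of a log file, things have kept running fine.

Recently, however, we have seen a number of units where, after many hard power cycles, the ext3 partition starts to develop structural problems. In particular, when we run e2fsck on the ext3 partition, it finds a number of issues like those shown in the output listing at the bottom of this question. Running e2fsck until it stops reporting errors (or reformatting the partition) clears the problems.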
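For illustration, "running e2fsck until it stops reporting errors" could be scripted roughly as below. This is only a sketch, not the procedure actually used on the device: the function name fsck_until_clean, the FSCK override variable, and the default limit of 5 passes are assumptions made for the example. It relies on e2fsck's documented exit status, where 0 means no errors were found and values of 8 or more indicate an operational or usage error.

```shell
#!/bin/sh
# Sketch: re-run e2fsck on an unmounted partition until a pass finds no errors.
# FSCK, fsck_until_clean, and the pass limit are hypothetical names/values.
FSCK="${FSCK:-e2fsck}"

fsck_until_clean() {
    dev="$1"
    max_passes="${2:-5}"
    pass=1
    while [ "$pass" -le "$max_passes" ]; do
        # -f forces a check even if the filesystem is marked clean;
        # -y answers "yes" to every repair prompt.
        status=0
        "$FSCK" -f -y "$dev" || status=$?
        # Exit status 0: this pass found no errors.
        if [ "$status" -eq 0 ]; then
            echo "clean after $pass pass(es)"
            return 0
        fi
        # Status 8 or higher: e2fsck itself failed, so retrying is pointless.
        if [ "$status" -ge 8 ]; then
            echo "e2fsck failed with status $status" >&2
            return "$status"
        fi
        pass=$((pass + 1))
    done
    echo "still reporting errors after $max_passes passes" >&2
    return 1
}
```

With the partition unmounted, it would be called as `fsck_until_clean /dev/sda3`. Answering "yes" to everything with -y is a blunt choice; on a unit whose data matters, reviewing the prompts by hand is safer.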

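My question is: what are the implications of seeing problems like this on an ext3/SSD system that has been subjected to many sudden, unexpected shutdowns?

My feeling is that this may be a sign of a software or hardware problem in our system, since my understanding is that (barring a bug or a hardware fault) ext3's journaling is supposed to prevent this kind of filesystem-integrity error. (Note: I understand that user data is not journaled, so mangled, missing, or truncated user files can happen; I am talking specifically about filesystem-metadata errors like those shown below.)

My co-worker, on the other hand, says this is known and expected behavior, because SSD controllers sometimes reorder write commands, and that can confuse the ext3 journal. In particular, he believes that even with correctly functioning hardware and bug-free software, the ext3 journal only makes filesystem corruption less likely, not impossible, so we should not be surprised to see problems like this from time to time.

Which of us is right?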

Embedded-PC-failsafe:~# ls
Embedded-PC-failsafe:~# umount /mnt/unionfs
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Invalid inode number for '.' in directory inode 46948.
Fix<y>? yes

Directory inode 46948, block 0, offset 12: directory corrupted
Salvage<y>? yes

Entry 'status_2012-11-26_14h13m41.csv' in /var/log/status_logs (46956) has deleted/unused inode 47075.  Clear<y>? yes
Entry 'status_2012-11-26_10h42m58.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47076.  Clear<y>? yes
Entry 'status_2012-11-26_11h29m41.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47080.  Clear<y>? yes
Entry 'status_2012-11-26_11h42m13.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47081.  Clear<y>? yes
Entry 'status_2012-11-26_12h07m17.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47083.  Clear<y>? yes
Entry 'status_2012-11-26_12h14m53.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47085.  Clear<y>? yes
Entry 'status_2012-11-26_15h06m49.csv' in /var/log/status_logs (46956) has deleted/unused inode 47088.  Clear<y>? yes
Entry 'status_2012-11-20_14h50m09.csv' in /var/log/status_logs (46956) has deleted/unused inode 47073.  Clear<y>? yes
Entry 'status_2012-11-20_14h55m32.csv' in /var/log/status_logs (46956) has deleted/unused inode 47074.  Clear<y>? yes
Entry 'status_2012-11-26_11h04m36.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47078.  Clear<y>? yes
Entry 'status_2012-11-26_11h54m45.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47082.  Clear<y>? yes
Entry 'status_2012-11-26_12h12m20.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47084.  Clear<y>? yes
Entry 'status_2012-11-26_12h33m52.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47086.  Clear<y>? yes
Entry 'status_2012-11-26_10h51m59.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47077.  Clear<y>? yes
Entry 'status_2012-11-26_11h17m09.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47079.  Clear<y>? yes
Entry 'status_2012-11-26_12h54m11.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47087.  Clear<y>? yes

Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Couldn't fix parent of inode 46948: Couldn't find parent directory entry

Pass 4: Checking reference counts
Unattached inode 46945
Connect to /lost+found<y>? yes

Inode 46945 ref count is 2, should be 1.  Fix<y>? yes
Inode 46953 ref count is 5, should be 4.  Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(208264--208266) -(210062--210068) -(211343--211491) -(213241--213250) -(213344--213393) -213397 -(213457--213463) -(213516--213521) -(213628--213655) -(213683--213688) -(213709--213728) -(215265--215300) -(215346--215365) -(221541--221551) -(221696--221704) -227517
Fix<y>? yes

Free blocks count wrong for group #6 (17247, counted=17611).
Fix<y>? yes

Free blocks count wrong (161691, counted=162055).
Fix<y>? yes

Inode bitmap differences:  +(47089--47090) +47093 +47095 +(47097--47099) +(47101--47104) -(47219--47220) -47222 -47224 -47228 -47231 -(47347--47348) -47350 -47352 -47356 -47359 -(47457--47488) -47985 -47996 -(47999--48000) -48017 -(48027--48028) -(48030--48032) -48049 -(48059--48060) -(48062--48064) -48081 -(48091--48092) -(48094--48096)
Fix<y>? yes

Free inodes count wrong for group #6 (7608, counted=7624).
Fix<y>? yes

Free inodes count wrong (61919, counted=61935).
Fix<y>? yes


embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****

embeddedrootwrite: ********** WARNING: Filesystem still has errors **********

embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks

Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory entry for '.' in ... (46948) is big.
Split<y>? yes

Missing '..' in directory inode 46948.
Fix<y>? yes

Setting filetype for entry '..' in ... (46948) to 2.
Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Pass 4: Checking reference counts
Inode 2 ref count is 12, should be 13.  Fix<y>? yes

Pass 5: Checking group summary information

embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****
embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks
Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite: clean, 657/62592 files, 87882/249937 blocks

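You're both wrong (maybe?)... ext3 is coping as well as it can with having its underlying storage removed so abruptly.

Your SSD probably has some type of onboard cache. You don't mention the make or model of the SSD in use, but this sounds like a consumer-grade SSD rather than an enterprise or industrial-grade model.

Either way, the cache is used to coalesce writes and prolong the life of the drive. If writes are in transit, the sudden loss of power is definitely the source of your corruption. True enterprise and industrial SSDs have supercapacitors that hold power long enough to move data from the cache to nonvolatile storage, much as battery-backed and flash-backed RAID controller caches do.

If your drive has no supercapacitor, the in-flight transactions are lost, hence the filesystem corruption. ext3 is probably being told that everything is on stable storage, when the data has in fact only reached the cache.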
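One way to test the cache theory on an affected unit is to inspect the drive's volatile write cache and turn it off with hdparm, then see whether the corruption stops. The sketch below is an assumption-laden illustration rather than a guaranteed fix: the function name disable_write_cache and the HDPARM override variable are invented for the example, some drives ignore the request or forget it after a power cycle, and disabling the cache usually costs write performance.

```shell
#!/bin/sh
# Sketch: report, then disable, a drive's volatile write cache via hdparm.
# HDPARM and disable_write_cache are hypothetical names for this example.
HDPARM="${HDPARM:-hdparm}"

disable_write_cache() {
    dev="$1"
    # -W with no value reports the current write-caching setting.
    "$HDPARM" -W "$dev" || return 1
    # -W0 asks the drive to turn its write cache off.
    "$HDPARM" -W0 "$dev" || return 1
    echo "write cache disable requested for $dev"
}
```

It would be run as root against the whole drive, for example `disable_write_cache /dev/sda`, assuming the drive is /dev/sda as the /dev/sda3 partition name suggests. Because the setting may not survive a power cycle, it would likely need to be reapplied at every boot. Mounting ext3 with the barrier=1 option, so the journal issues cache flushes, is a related thing to check, though it only helps if the drive honors flush commands.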

Source: https://serverfault.com/questions/454775