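What causes "Input/output error" when reading from NFS v4 on CentOS?
When opening perfectly good files from an attached NFS mount, we occasionally (and temporarily) see applications such as nginx and php-fpm fail:
Example php-fpm error: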
2017/05/20 22:53:09 [error] 55#0: *6575 FastCGI sent in stderr: "PHP message: PHP Warning: getimagesize(/www/newspaperfoundation.org/html/wp-content/blogs.dir/22/files/2017/05/19-highest-honors-1.jpg): failed to open stream: Input/output error in /www/newspaperfoundation.org/html/wp-content/plugins/mashsharer/includes/header-meta-tags.php on line 271" while reading response header from upstream, client: 192.168.255.34, server: www.dailyrepublic.com, request: "GET /solano-news/fairfield/highest-honors-commends-students-with-4-0-and-higher-grade-point-average/ HTTP/1.1", upstream: "fastcgi://172.17.0.3:9001", host: "www.dailyrepublic.com"
Example nginx error:
2017/05/20 23:22:32 [crit] 56#0: *712 open() "/www/newspaperfoundation.org/html/wp-content/blogs.dir/24/files/2017/05/Tandem1W-550x550.jpg" failed (5: Input/output error), client: 192.168.255.34, server: www.davisenterprise.com, request: "GET /files/2017/05/Tandem1W-550x550.jpg HTTP/1.1", host: "www.davisenterprise.com", referrer: "http://www.davisenterprise.com/"
During the transient errors, I can ls the file and see that it exists with the correct permissions. After a while, the image eventually becomes readable again. Other files are served fine, without Input/output errors. I can't find much logging about this issue, but after enabling rpcdebug I see many messages like these whenever the error occurs:
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:07 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:07 tomentella kernel: nfsv4 compound op #1/5: 22 (OP_PUTFH)
May 20 16:10:07 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #1: 22: status 0
May 20 16:10:07 tomentella kernel: nfsv4 compound op #2/5: 18 (OP_OPEN)
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:08 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:08 tomentella kernel: nfsv4 compound op #1/4: 22 (OP_PUTFH)
May 20 16:10:08 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:08 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 4 #1: 22: status 0
May 20 16:10:08 tomentella kernel: nfsv4 compound op #2/4: 15 (OP_LOOKUP)
In particular, I think I only see this message for the files that fail:
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
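The numeric status values in this kind of debug output are easier to reason about once they are translated into names. The script below is only a sketch: the regular expression is fitted to the nfsd debug lines quoted in this post, the function name summarize_statuses is made up for illustration, and the lookup table holds just a handful of the NFSv4 status codes defined in RFC 7530. In that table 10011 is NFS4ERR_EXPIRED, which as far as I can tell means the server considers the client's state or lease to have expired.

```python
import re
from collections import Counter

# A few NFSv4 status codes from RFC 7530; deliberately incomplete.
NFS4_STATUS = {
    0: "NFS4_OK",
    5: "NFS4ERR_IO",
    13: "NFS4ERR_ACCESS",
    70: "NFS4ERR_STALE",
    10008: "NFS4ERR_DELAY",
    10011: "NFS4ERR_EXPIRED",
    10013: "NFS4ERR_GRACE",
    10022: "NFS4ERR_STALE_CLIENTID",
}

# Matches per-operation result lines such as:
#   nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
STATUS_RE = re.compile(r"nfsv4 compound op \S+ opcnt \d+ #\d+: (\d+): status (\d+)")


def summarize_statuses(lines):
    """Count (opcode, status name) pairs found in nfsd debug log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            opcode = int(match.group(1))
            status = int(match.group(2))
            name = NFS4_STATUS.get(status, f"unknown({status})")
            counts[(opcode, name)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical log path): python decode_status.py < /var/log/messages
    for (opcode, name), n in sorted(summarize_statuses(sys.stdin).items()):
        print(f"op {opcode}: {name} x{n}")
```

Opcode 18 is OP_OPEN and 22 is OP_PUTFH, as the debug lines themselves show, so a pile of (18, NFS4ERR_EXPIRED) entries would line up with the failing opens.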
Any ideas about what could be causing the Input/output errors? The clients mount the share with:
mount.nfs4 -v -o proto=tcp $NFSMASTERHOST:/srv/data /srv/data
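This is CentOS 7 with updated packages. The error is "new", and there have been very few recent changes to the server. I suspect my recent update of the system packages may have been the trigger.
Because the problem comes and goes for certain images, I was able to watch the logs and compare good and bad cases to some extent. Here is an example of one image going from OK to bad, obtained by grepping for that image's name: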
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open_confirm on file Ron-Thomas-web-150x150.jpg
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_close on file Ron-Thomas-web-150x150.jpg
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
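The contrast in that excerpt (an open followed by open_confirm and close while things work, then nothing but repeated opens once they stop working) can be checked mechanically across a whole log. A rough sketch, assuming the log format shown in this post; the function name unfinished_opens and the idea of counting opens since the last close are my own illustration, not something nfsd reports. A file that is legitimately still open will also show up, so treat a high count as a hint rather than proof.

```python
import re
from collections import defaultdict

# "nfsd4_open filename X" deliberately does not match "nfsd4_open_confirm on file X".
OPEN_RE = re.compile(r"NFSD: nfsd4_open filename (\S+)")
CLOSE_RE = re.compile(r"NFSD: nfsd4_close on file (\S+)")


def unfinished_opens(lines):
    """Return {filename: number of nfsd4_open lines seen since the last nfsd4_close}."""
    pending = defaultdict(int)
    for line in lines:
        opened = OPEN_RE.search(line)
        if opened:
            pending[opened.group(1)] += 1
            continue
        closed = CLOSE_RE.search(line)
        if closed:
            # A close resets the counter for that file.
            pending[closed.group(1)] = 0
    return {name: count for name, count in pending.items() if count > 0}


if __name__ == "__main__":
    import sys

    # Print the files with the most opens that were never followed by a close.
    for name, count in sorted(unfinished_opens(sys.stdin).items(), key=lambda kv: -kv[1]):
        print(f"{count:6d}  {name}")
```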
Here is the nfsstat output:
tomentella ★ ~ $ nfsstat
Server rpc stats:
calls      badcalls   badclnt    badauth    xdrcall
94437487   6          6          0          0

Server nfs v4:
null         compound
503       0% 94436978 99%

Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit
0         0% 0         0% 0         0% 11213689  3% 2631554   0% 3377      0%
create       delegpurge   delegreturn  getattr      getfh        link
579       0% 0         0% 0         0% 88581315 31% 32460559 11% 0         0%
lock         lockt        locku        lookup       lookup_root  nverify
365       0% 0         0% 365       0% 30058556 10% 0         0% 0         0%
open         openattr     open_conf    open_dgrd    putfh        putpubfh
2771686   0% 0         0% 74326     0% 0         0% 92969992 32% 0         0%
putrootfh    read         readdir      readlink     remove       rename
2435      0% 1999675   0% 1917567   0% 350       0% 12404     0% 5072      0%
renew        restorefh    savefh       secinfo      setattr      setcltid
1226801   0% 0         0% 5072      0% 0         0% 18315216  6% 121025    0%
setcltidconf verify       write        rellockowner bc_ctl       bind_conn
121105    0% 0         0% 115189    0% 365       0% 0         0% 0         0%
exchange_id  create_ses   destroy_ses  free_stateid getdirdeleg  getdevinfo
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
getdevlist   layoutcommit layoutget    layoutreturn secinfononam sequence
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
set_ssv      test_stateid want_deleg   destroy_clid reclaim_comp
0         0% 0         0% 0         0% 0         0% 0         0%

Client rpc stats:
calls      retrans    authrefrsh
0          0          0
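The problem appears to be related to duplicate local IPs behind the Docker hosts. Docker assigns the same internal IP (for example 172.17.0.4) to two containers, and the NFS server cannot tell which client it should be responding to, in some cases taking out both clients at once. This is apparently a long-standing issue in the RHEL implementation: I was able to find a bug report documenting it on CentOS 6, and it still affects me on CentOS 7.3.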
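If duplicate client addresses are the suspect, a quick sanity check is to list which containers share an address. A minimal sketch, assuming you have already collected a mapping of container names to internal IPs by some other means (for instance by running docker inspect on each host); the container names in the example are hypothetical.

```python
from collections import defaultdict


def find_shared_addresses(container_ips):
    """Return {ip: sorted container names} for every IP used by more than one container."""
    by_ip = defaultdict(list)
    for name, ip in container_ips.items():
        by_ip[ip].append(name)
    return {ip: sorted(names) for ip, names in by_ip.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory: two Docker hosts that both use the default bridge range.
    inventory = {
        "hostA/php-fpm": "172.17.0.3",
        "hostA/nginx": "172.17.0.4",
        "hostB/nginx": "172.17.0.4",
    }
    for ip, names in find_shared_addresses(inventory).items():
        print(f"{ip} is used by: {', '.join(names)}")
```

Any address that shows up here belongs to more than one NFS client as seen from inside the containers, which is the situation described above.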
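I found this question while looking for a fix for my own Input/output errors on a shared NFS mount. I have a shared NFS drive mounted on several machines and read and written from PHP, and I was getting sporadic but frequent errors like these. I'm not certain that what I did is what resolved it, but on the off chance it helps someone else with the same problem...
So, I had created my worker servers by cloning them, which left them all with the same hostname. I didn't think anything of it; as far as I knew, the hostname had no effect on what I was doing. I changed the hostnames to be unique and made sure each /etc/hosts file contained the hostname pointing to 127.0.0.1, and the NFS errors have not appeared since.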
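The second of those two conditions (the machine's own hostname resolving to 127.0.0.1) is easy to verify on each worker; uniqueness still has to be checked by comparing hostnames across machines. A small sketch: the function name and the injectable resolve parameter are illustrative only, and whether loopback resolution itself matters to NFS is not something I have verified.

```python
import ipaddress
import socket


def hostname_is_loopback(hostname=None, resolve=socket.gethostbyname):
    """Return True if the hostname resolves to a loopback address such as 127.0.0.1."""
    hostname = hostname or socket.gethostname()
    try:
        address = resolve(hostname)
    except OSError:
        # Covers socket.gaierror: the name does not resolve at all.
        return False
    return ipaddress.ip_address(address).is_loopback


if __name__ == "__main__":
    name = socket.gethostname()
    state = "resolves to loopback" if hostname_is_loopback(name) else "does NOT resolve to loopback"
    print(f"{name}: {state}")
```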