Vmware-Esxi

MongoDB frequently switches primary

  • June 4, 2015

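We are running a Mongo 2.6 replica set with 3 members: primary, secondary, and arbiter. Almost every day our MongoDB switches which server is the primary, which drops every connection to that database. That would be fine if one of the servers had actually gone down. The problem is that in every case the server reported as "down" does not appear to have gone down at all. It stays up the whole time.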

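Here is what we know:

  1. The mongod process was not restarted or shut down on any of the 3 servers.
  2. The servers kept reporting to New Relic the whole time.
  3. The mongo logs show frequent heartbeat failures.
  4. None of the servers was under particularly high load at any point. I do see a CPU spike about once an hour, at roughly 10 minutes past the hour, but it does not line up exactly with the failures.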

Below is the output of show log rs, run while shell'd into the current primary.

2015-05-17T15:05:49.339+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T15:05:49.358+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T15:05:56.444+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T22:11:36.638+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-17T22:11:37.495+0000 [rsMgr] not electing self, we are not freshest
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-17T22:11:39.140+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T22:11:39.147+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T23:05:47.876+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T10:05:46.821+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-18T10:05:46.822+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-18T10:05:51.014+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T22:12:11.433+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-18T22:12:11.507+0000 [rsMgr] replSet info electSelf 3
2015-05-18T22:12:14.708+0000 [rsMgr] replSet PRIMARY
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-18T22:12:21.610+0000 [rsHealthPoll] replSet member server1:27017 is now in state ROLLBACK
2015-05-18T22:12:23.612+0000 [rsHealthPoll] replSet member server1:27017 is now in state SECONDARY
2015-05-19T22:13:13.004+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x), connection attempt failed
2015-05-19T22:13:24.127+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x) failed, connection attempt failed
2015-05-19T22:13:29.267+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
2015-05-20T22:14:35.832+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state

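You can see that we frequently get heartbeat failures and "down" notifications, but in every case the server goes from down back to up within a few seconds. I am not sure where to look next to work out what might be causing this.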
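One way to look for a pattern in a log like the one above is to bucket the failure lines by hour of day. The following is a minimal sketch, not an official MongoDB tool: the function name failures_by_hour and the list of marker strings are my own assumptions, with the markers copied from the excerpt above.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp and thread name at the start of a log line such as
# "2015-05-17T22:11:36.638+0000 [rsHealthPoll] ...".
# The UTC offset is matched but ignored; the excerpt is all +0000.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d{3}[+-]\d{4} \[(\w+)\] (.*)$"
)

# Substrings copied from the log excerpt that indicate a failure (assumed list).
FAILURE_MARKERS = (
    "is down (or slow to respond)",
    "couldn't connect to",
    "our heartbeat failed",
    "sync source problem",
)


def failures_by_hour(lines):
    """Count failure-type log lines per (hour of day, thread name)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything in another format
        stamp, thread, message = match.groups()
        if any(marker in message for marker in FAILURE_MARKERS):
            hour = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").hour
            counts[(hour, thread)] += 1
    return counts
```

Passing an open log file (or a list of lines) to failures_by_hour returns a Counter keyed by (hour, thread). In the excerpt above, every rsHealthPoll failure falls between 22:11 and 22:14 UTC on consecutive days, which is the kind of regularity that points toward a scheduled job rather than random network trouble.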

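This has been resolved. The root cause was that our hosting provider was taking VMware snapshots as a backup mechanism. These snapshots put the virtual machines into a brief stall; I believe the technical term is a VM stun.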

Once these snapshots were disabled, we had no further problems.
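A pause like this can sometimes be observed from inside the guest by ticking on a short interval and looking for gaps much longer than the interval. The sketch below is only illustrative: find_stalls and watch are hypothetical helper names, and whether the guest's monotonic clock actually shows the gap depends on how the hypervisor handles time during the pause, which I am assuming rather than asserting. A probe running on a separate machine is a more reliable cross-check.

```python
import time


def find_stalls(timestamps, interval, tolerance):
    """Return (start, gap) pairs where consecutive ticks are more than
    interval + tolerance seconds apart."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > interval + tolerance:
            stalls.append((prev, gap))
    return stalls


def watch(interval=0.5, tolerance=1.0, duration=60.0):
    """Tick every `interval` seconds for about `duration` seconds,
    then report any oversized gaps between ticks."""
    ticks = [time.monotonic()]
    deadline = ticks[0] + duration
    while time.monotonic() < deadline:
        time.sleep(interval)
        ticks.append(time.monotonic())
    return find_stalls(ticks, interval, tolerance)
```

Running watch() across the window in which the failovers usually occur and getting back a non-empty list would suggest that the whole machine stopped, not just mongod.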

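I see this a lot, and the cause is always outside the mongod process: DNS resolver problems, TCP/IP stack problems, network links, physical hardware, and so on. Troubleshoot outward from the mongod process. Check the host operating system for network errors, check the physical link (if a physical link is part of the equation), and check the cloud provider's network between the two servers (if you span regions). It is most likely something on the host operating system and has nothing to do with MongoDB itself.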
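To separate network problems from mongod problems, a plain TCP connect from one member to another exercises name resolution and the TCP/IP stack without involving the MongoDB protocol at all. This is a minimal sketch using only the Python standard library; tcp_connect_time is a hypothetical helper name, and the default timeout is an arbitrary choice.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=2.0):
    """Return the seconds taken to resolve `host` and open a TCP connection,
    or None if resolution or the connection fails or times out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution errors.
        return None
```

For example, calling tcp_connect_time("server1", 27017) in a loop from the other replica set member and logging the result with a timestamp shows whether connects fail or slow down at the same moments the heartbeats do.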

Source: https://serverfault.com/questions/693467