Domain-Name-System
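MongoDB replica set cannot resolve DNS when the primary DNS is stopped and the secondary DNS is running
- I have set up a mongod replica set consisting of 3 nodes + 1 delayed hidden node + an arbiter.
- I have set up DNS: primary and secondary internal DNS (BIND) servers, so that I can refer to the nodes by ordinary FQDN names instead of IP addresses.
- I have the secondary DNS to handle requests when (if) the primary server goes down.
Problem:
When I simulate the primary DNS going down, the replica set breaks completely: the primary node can no longer see the other members and steps down to SECONDARY after 5-10 seconds.
This is what my primary node (mongodb-cluster-shard-01-rA.site-aws.com) shows while the primary DNS is down: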
    siteRS0:SECONDARY> rs.status()
    {
        "set" : "siteRS0",
        "date" : ISODate("2014-08-10T03:16:22Z"),
        "myState" : 2,
        "members" : [
            {
                "_id" : 0,
                "name" : "mongodb-cluster-shard-01-rA.site-aws.com:27017",
                "health" : 1,
                "state" : 2,
                "stateStr" : "SECONDARY",
                "uptime" : 1913839,
                "optime" : Timestamp(1407628608, 1),
                "optimeDate" : ISODate("2014-08-09T23:56:48Z"),
                "self" : true
            },
            {
                "_id" : 1,
                "name" : "mongodb-cluster-shard-01-rB.site-aws.com:27017",
                "health" : 0,
                "state" : 8,
                "stateStr" : "(not reachable/healthy)",
                "uptime" : 0,
                "optime" : Timestamp(1407628608, 1),
                "optimeDate" : ISODate("2014-08-09T23:56:48Z"),
                "lastHeartbeat" : ISODate("2014-08-10T03:16:08Z"),
                "lastHeartbeatRecv" : ISODate("2014-08-10T03:15:52Z"),
                "pingMs" : 0,
                "syncingTo" : "mongodb-cluster-shard-01-rA.site-aws.com:27017"
            },
            {
                "_id" : 2,
                "name" : "mongodb-cluster-shard-01-arbiter.site-aws.com:30000",
                "health" : 0,
                "state" : 8,
                "stateStr" : "(not reachable/healthy)",
                "uptime" : 0,
                "lastHeartbeat" : ISODate("2014-08-10T03:16:19Z"),
                "lastHeartbeatRecv" : ISODate("2014-08-10T03:15:45Z"),
                "pingMs" : 0
            },
            {
                "_id" : 3,
                "name" : "mongodb-cluster-shard-01-rC.site-aws.com:27017",
                "health" : 0,
                "state" : 8,
                "stateStr" : "(not reachable/healthy)",
                "uptime" : 0,
                "optime" : Timestamp(1407628608, 1),
                "optimeDate" : ISODate("2014-08-09T23:56:48Z"),
                "lastHeartbeat" : ISODate("2014-08-10T03:16:16Z"),
                "lastHeartbeatRecv" : ISODate("2014-08-10T03:15:52Z"),
                "pingMs" : 0,
                "syncingTo" : "mongodb-cluster-shard-01-rA.site-aws.com:27017"
            },
            {
                "_id" : 4,
                "name" : "mongodb-cluster-shard-01-rA-backup-hidden.site-aws.com:27017",
                "health" : 0,
                "state" : 8,
                "stateStr" : "(not reachable/healthy)",
                "uptime" : 0,
                "optime" : Timestamp(1407628608, 1),
                "optimeDate" : ISODate("2014-08-09T23:56:48Z"),
                "lastHeartbeat" : ISODate("2014-08-10T03:16:00Z"),
                "lastHeartbeatRecv" : ISODate("2014-08-10T03:15:49Z"),
                "pingMs" : 0,
                "syncingTo" : "mongodb-cluster-shard-01-rA.site-aws.com:27017"
            }
        ],
        "ok" : 1
    }
If I look at the log, I see a lot of getaddrinfo failure messages:
    [root@mongodb-cluster-shard-01-rA ec2-user]# tail /mongo/log/mongod.log
    2014-08-10T02:35:13.044+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-arbiter.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:13.469+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rC.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:13.469+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-rC.site-aws.com:27017: couldn't connect to server mongodb-cluster-shard-01-rC.site-aws.com:27017 (0.0.0.0) failed, address resolved to 0.0.0.0
    2014-08-10T02:35:13.968+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rA-backup-hidden.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:13.968+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-rA-backup-hidden.site-aws.com:27017: couldn't connect to server mongodb-cluster-shard-01-rA-backup-hidden.site-aws.com:27017 (0.0.0.0) failed, address resolved to 0.0.0.0
    2014-08-10T02:35:17.059+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rB.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:17.059+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-rB.site-aws.com:27017: couldn't connect to server mongodb-cluster-shard-01-rB.site-aws.com:27017 (0.0.0.0) failed, address resolved to 0.0.0.0
    2014-08-10T02:35:18.476+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rC.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:18.669+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-rC.site-aws.com:27017: couldn't connect to server mongodb-cluster-shard-01-rC.site-aws.com:27017 (0.0.0.0) failed, address resolved to 0.0.0.0
    2014-08-10T02:35:18.976+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rA-backup-hidden.site-aws.com") failed: Name or service not known
    [root@mongodb-cluster-shard-01-rA ec2-user]# tail /mongo/log/mongod.log
    2014-08-10T02:35:17.059+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rB.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:17.059+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-rB.site-aws.com:27017: couldn't connect to server mongodb-cluster-shard-01-rB.site-aws.com:27017 (0.0.0.0) failed, address resolved to 0.0.0.0
    2014-08-10T02:35:18.476+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rC.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:18.669+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-rC.site-aws.com:27017: couldn't connect to server mongodb-cluster-shard-01-rC.site-aws.com:27017 (0.0.0.0) failed, address resolved to 0.0.0.0
    2014-08-10T02:35:18.976+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rA-backup-hidden.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:20.051+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-arbiter.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:20.051+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-arbiter.site-aws.com:30000: couldn't connect to server mongodb-cluster-shard-01-arbiter.site-aws.com:30000 (0.0.0.0) failed, address resolved to 0.0.0.0
    2014-08-10T02:35:23.677+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rC.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:24.066+0000 [rsHealthPoll] getaddrinfo("mongodb-cluster-shard-01-rB.site-aws.com") failed: Name or service not known
    2014-08-10T02:35:24.066+0000 [rsHealthPoll] couldn't connect to mongodb-cluster-shard-01-rB.site-aws.com:27017: couldn't connect to server mongodb-cluster-shard-01-rB.site-aws.com:27017 (0.0.0.0) failed, address resolved to 0.0.0.0
    [root@mongodb-cluster-shard-01-rA ec2-user]#
However, nslookup resolves the FQDN to its IP correctly:
    [root@mongodb-cluster-shard-01-rA ec2-user]# nslookup mongodb-cluster-shard-01-rC.site-aws.com
    Server:   10.233.147.18 (this is secondary dns)
    Address:  10.233.147.18#53

    Name:     mongodb-cluster-shard-01-rC.site-aws.com
    Address:  10.220.153.211
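A note on why this check can mislead: nslookup talks to the DNS servers with its own resolver code, whereas the mongod log above shows the failures coming from getaddrinfo, i.e. the C library resolver. A test that goes through getaddrinfo exercises the same path as mongod. Below is a minimal sketch in Python, whose socket.getaddrinfo wraps the C library call. The function name check_hosts and its resolver parameter are hypothetical, introduced only for this illustration.

```python
import socket


def check_hosts(hostnames, port=27017, resolver=socket.getaddrinfo):
    """Resolve each hostname through the system resolver (getaddrinfo).

    Returns a dict mapping hostname -> sorted list of resolved addresses,
    or a "FAILED: ..." string when resolution raises socket.gaierror.
    The resolver argument is a hypothetical hook so the function can be
    exercised without real DNS.
    """
    results = {}
    for name in hostnames:
        try:
            infos = resolver(name, port, proto=socket.IPPROTO_TCP)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            results[name] = f"FAILED: {exc}"
    return results
```

Running check_hosts(["mongodb-cluster-shard-01-rC.site-aws.com"]) on a replica member while the primary DNS is stopped should show whether the C library path fails even though nslookup succeeds. Note that a fresh Python process initializes its resolver state at startup, so it may behave differently from a mongod that has been running for weeks.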
After I start the primary DNS (.119) again, lookups soon go through the primary DNS:
    [root@mongodb-cluster-shard-01-rA ec2-user]# nslookup mongodb-cluster-shard-01-rC.site-aws.com
    Server:   10.35.147.119
    Address:  10.35.147.119#53
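As soon as the primary DNS is up and running, everything returns to normal: my node becomes PRIMARY again and all is well. So what did I miss or do wrong?
My mongo instances have the following /etc/resolv.conf file: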
    [root@mongodb-cluster-shard-01-rA log]# cat /etc/resolv.conf
    ; generated by /sbin/dhclient-script
    search us-west-2.compute.internal site.com
    nameserver 10.35.147.119
    nameserver 10.233.147.18
    nameserver 172.16.0.23
    nameserver 172.16.0.23
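One thing worth checking in a file like this: according to resolv.conf(5), glibc uses at most MAXNS (currently 3) nameserver entries and queries them in the order listed, so a fourth line is ignored and a duplicated entry wastes one of the three slots. The sketch below parses resolv.conf text and reports which entries would be used, which fall past the limit, and which are duplicated. The function name summarize_nameservers is hypothetical, and the parsing is deliberately simplified.

```python
MAXNS = 3  # glibc's documented limit on nameserver entries, see resolv.conf(5)


def summarize_nameservers(resolv_conf_text):
    """Return (used, ignored, duplicates) for the nameserver lines.

    used       -- the first MAXNS entries, in file order
    ignored    -- entries listed after the first MAXNS
    duplicates -- addresses that appear more than once
    """
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (';' or '#' in the first column).
        if not line or line[0] in ";#":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    used, ignored = servers[:MAXNS], servers[MAXNS:]
    duplicates = sorted({s for s in servers if servers.count(s) > 1})
    return used, ignored, duplicates
```

For the file above this yields 10.35.147.119, 10.233.147.18 and 172.16.0.23 as the servers in use, with the second 172.16.0.23 line both ignored and flagged as a duplicate. The secondary DNS is therefore within the limit here, which points away from a simple ordering mistake.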
Primary DNS /etc/named.conf:
    options {
        #listen-on port 53 { 127.0.0.1; 10.224.3.36};
        listen-on-v6 port 53 { ::1; };
        directory "/var/named";
        dump-file "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
        allow-query { any; };
        recursion no;
        dnssec-enable yes;
        dnssec-validation yes;
        dnssec-lookaside auto;
        /* Path to ISC DLV key */
        bindkeys-file "/etc/named.iscdlv.key";
        managed-keys-directory "/var/named/dynamic";
        notify yes;
        also-notify { 10.233.147.18; };
    };
    logging {
        channel default_debug {
            file "data/named.run";
            severity dynamic;
        };
    };
    zone "site-aws.com" IN {
        type master;
        file "site-aws.com.zone";
        allow-update { none; };
        allow-query { any; };
        allow-transfer {10.233.147.18; };
    };
    include "/etc/named.rfc1912.zones";
    include "/etc/named.root.key";
The "site-aws.com.zone" definition:
    $TTL 86400
    @   IN  SOA  ns1.site-aws.com. root.site-aws.com. (
            2013042203  ;Serial
            300         ;Refresh
            1800        ;Retry
            604800      ;Expire
            86400       ;Minimum TTL
    )
    ; Specify our two nameservers
            IN  NS  ns1.site-aws.com.
    ;       IN  NS  ns2.site-aws.com.
    ; Resolve nameserver hostnames to IP, replace with your two droplet IP addresses.
    ns1     IN  A   10.224.3.36
    ;ns2    IN  A   2.2.2.2
    ; Define hostname -> IP pairs which you wish to resolve
    devops                                      IN  A   10.35.147.119
    mongodb-cluster-shard-01-rA                 IN  A   10.230.9.223
    mongodb-cluster-shard-01-rB                 IN  A   10.17.6.57
    mongodb-cluster-shard-01-rC                 IN  A   10.220.153.211
    mongodb-cluster-shard-01-arbiter            IN  A   10.251.112.114
    mongodb-cluster-shard-01-rA-backup-hidden   IN  A   10.230.20.83
    mongodb-cluster-backup                      IN  A   10.230.20.83
    prod-redis-cluster-01-rA                    IN  A   10.226.207.86
    ns1                                         IN  A   10.35.147.119
    ns2                                         IN  A   10.233.147.18
Secondary DNS /etc/named.conf:
    options {
        #listen-on port 53 { 127.0.0.1; 10.224.3.36};
        listen-on-v6 port 53 { ::1; };
        directory "/var/named";
        dump-file "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
        allow-query { any; };
        recursion no;
        dnssec-enable yes;
        dnssec-validation yes;
        dnssec-lookaside auto;
        /* Path to ISC DLV key */
        bindkeys-file "/etc/named.iscdlv.key";
        managed-keys-directory "/var/named/dynamic";
    };
    logging {
        channel default_debug {
            file "data/named.run";
            severity dynamic;
        };
    };
    zone "site-aws.com" IN {
        type slave;
        file "site-aws.com.zone";
        allow-query { any; };
        allow-transfer {10.35.147.119; }; ## NS1 is allowed for zone transfer when necessary ##
        masters {10.35.147.119; }; ## the master NS1 is defined ##
    };
    include "/etc/named.rfc1912.zones";
    include "/etc/named.root.key";
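The secondary DNS has synced site-aws.com.zone; the file exists.
So the question is: why does the MongoDB replica set behave this way, and how can I make sure that if the primary DNS fails, the replica set (and every other node that refers to internal nodes by FQDN) keeps working?
The problem was in glibc, which caches the /etc/resolv.conf data. I solved it by installing nscd: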
    yum install nscd; chkconfig nscd on; /etc/init.d/nscd start
After that the problem went away. A few related threads:
- https://stackoverflow.com/questions/125466/using-glibc-why-does-my-gethostbyname-fail-after-i-dhcp-has-changed-the-dns-ser
- https://jira.mongodb.org/browse/SERVER-7587
- https://jira.mongodb.org/browse/SERVER-12099
Hope this helps someone in the future.