CoreOS

"EtcdException: Could not get the list of servers" when running the calico rkt container on CoreOS

  • December 4, 2016

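I have two CoreOS stable v1122.2.0 machines, each configured with etcd2 using TLS.

I created the certificates using https://github.com/coreos/etcd/tree/master/hack/tls-setup.

Now I am trying to configure calico-node to run with rkt on my CoreOS master node.

I have the following in my cloud-config: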

write_files:
- path: "/etc/kubernetes/cni/net.d/10-calico.conf"
  content: |
    {
    "name": "calico",
    "type": "flannel",
    "delegate": {
        "type": "calico",
        "etcd_endpoints": "https://10.79.218.2:2379,https://10.79.218.3:2379",
        "log_level": "none",
        "log_level_stderr": "info",
        "hostname": "10.79.218.2",
        "policy": {
            "type": "k8s",
            "k8s_api_root": "http://127.0.0.1:8080/api/v1/"
            }
        }
    }
- path: "/etc/kubernetes/manifests/policy-controller.yaml"
  content: |
    apiVersion: v1
    kind: Pod
    metadata:
      name: calico-policy-controller
      namespace: calico-system
    spec:
      hostNetwork: true
      containers:
        # The Calico policy controller.
        - name: k8s-policy-controller
          image: calico/kube-policy-controller:v0.2.0
          env:
            - name: ETCD_ENDPOINTS
              value: "https://10.79.218.2:2379,https://10.79.218.3:2379"
            - name: K8S_API
              value: "http://127.0.0.1:8080"
            - name: LEADER_ELECTION
              value: "true"
        # Leader election container used by the policy controller.
        - name: leader-elector
          image: quay.io/calico/leader-elector:v0.1.0
          imagePullPolicy: IfNotPresent
          args:
            - "--election=calico-policy-election"
            - "--election-namespace=calico-system"
            - "--http=127.0.0.1:4040"
...
units:
- name: calico-node.service
  enable: true
  command: start
  content: |
    [Unit]
    Description=Calico per-host agent
    Requires=network-online.target
    After=network-online.target

    [Service]
    Slice=machine.slice
    Environment=CALICO_DISABLE_FILE_LOGGING=true
    Environment=HOSTNAME=10.79.218.2
    Environment=IP=10.79.218.2
    Environment=FELIX_FELIXHOSTNAME=10.79.218.2
    Environment=CALICO_NETWORKING=false
    Environment=NO_DEFAULT_POOLS=true
    Environment=ETCD_ENDPOINTS=https://10.79.218.2:2379,https://10.79.218.3:2379
    ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci \
    --volume=modules,kind=host,source=/lib/modules,readOnly=false \
    --mount=volume=modules,target=/lib/modules \
    --trust-keys-from-https quay.io/calico/node:v0.19.0

    KillMode=mixed
    Restart=always
    TimeoutStartSec=0

    [Install]
    WantedBy=multi-user.target

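Please ignore the whitespace indentation; I don't think I copied and pasted it correctly :)

When I try to start the calico-node service, I get the following error: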

Sep 14 05:45:17 localhost systemd[1]: Started Calico per-host agent.
Sep 14 05:45:17 localhost rkt[1644]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
Sep 14 05:45:18 localhost rkt[1644]: image: using image from local store for image name quay.io/calico/node:v0.19.0
Sep 14 05:45:25 localhost rkt[1644]: Traceback (most recent call last):
Sep 14 05:45:25 localhost rkt[1644]:   File "startup.py", line 292, in <module>
Sep 14 05:45:25 localhost rkt[1644]:     client = IPAMClient()
Sep 14 05:45:25 localhost rkt[1644]:   File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 228, in __init__
Sep 14 05:45:25 localhost rkt[1644]:     "%s" % (ETCD_CA_CERT_FILE_ENV, etcd_ca))
Sep 14 05:45:25 localhost rkt[1644]: pycalico.datastore_errors.DataStoreError: Invalid ETCD_CA_CERT_FILE. Certificate Authority cert is required and m
Sep 14 05:45:25 localhost rkt[1644]: Calico node failed to start
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Unit entered failed state.
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Failed with result 'exit-code'.
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Service hold-off time over, scheduling restart.
Sep 14 05:45:25 localhost systemd[1]: Stopped Calico per-host agent.
Sep 14 05:45:25 localhost systemd[1]: Started Calico per-host agent.
Sep 14 05:45:25 localhost rkt[1714]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
Sep 14 05:45:26 localhost rkt[1714]: image: using image from local store for image name quay.io/calico/node:v0.19.0
Sep 14 05:45:28 localhost rkt[1714]: Traceback (most recent call last):
Sep 14 05:45:28 localhost rkt[1714]:   File "startup.py", line 292, in <module>
Sep 14 05:45:28 localhost rkt[1714]:     client = IPAMClient()
Sep 14 05:45:28 localhost rkt[1714]:   File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 228, in __init__
Sep 14 05:45:28 localhost rkt[1714]:     "%s" % (ETCD_CA_CERT_FILE_ENV, etcd_ca))
Sep 14 05:45:28 localhost rkt[1714]: pycalico.datastore_errors.DataStoreError: Invalid ETCD_CA_CERT_FILE. Certificate Authority cert is required and m

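So I get Invalid ETCD_CA_CERT_FILE. I have not actually told Calico which keys to use, so I assume I am missing some configuration.

I have the following relevant keys, among others, in /etc/ssl/etcd: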

8 -rw-------. 1 etcd etcd 1050 Sep 14 05:45 ca.pem
8 -rw-------. 1 etcd etcd  289 Sep 14 05:45 etcd1-key.pem
8 -rw-------. 1 etcd etcd 1058 Sep 14 05:45 etcd1.pem
8 -rw-------. 1 etcd etcd  227 Sep 12 03:49 server1-key.pem
8 -rw-------. 1 etcd etcd  822 Sep 12 03:49 server1.pem

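I tried adding Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem to the calico-node systemd unit file, but got exactly the same result.

Any ideas?

Update

So I tried running Calico manually instead of through systemd. I also exported all the environment variables Calico needs: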

export CALICO_DISABLE_FILE_LOGGING=true
export HOSTNAME=10.79.218.2
export IP=10.79.218.2
export FELIX_FELIXHOSTNAME=10.79.218.2
export CALICO_NETWORKING=false
export NO_DEFAULT_POOLS=true
export ETCD_ENDPOINTS=https://10.79.218.2:2379,https://10.79.218.3:2379
export ETCD_AUTHORITY=10.79.218.2:2379
export ETCD_SCHEME=https
export ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
export ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
export ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem

When I try to run the Calico container with the following command:

/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci \
--volume=modules,kind=host,source=/lib/modules,readOnly=false \
--mount=volume=modules,target=/lib/modules \
--trust-keys-from-https quay.io/calico/node:v0.19.0

I get:

image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
image: using image from local store for image name quay.io/calico/node:v0.19.0
Traceback (most recent call last):
 File "startup.py", line 292, in <module>
  client = IPAMClient()
 File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 221, in __init__
   ETCD_CERT_FILE_ENV, etcd_cert))
pycalico.datastore_errors.DataStoreError: Cannot read ETCD_KEY_FILE and/or ETCD_CERT_FILE. Both must be readable file paths. Values provided: ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem, ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem

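I changed the permissions of the certificate files to 666, but that did not solve the problem. I also know these certificates are valid, because etcd works fine over TLS. So what am I missing?

Update 2

It turns out I had not mounted the certificate directory into the Calico container.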
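The "Both must be readable file paths" error above is consistent with the files simply not being visible from inside the container, whatever their permissions on the host. As a minimal sketch (this is not pycalico's actual code, and the function name unreadable_paths is made up for illustration), a readability check of this kind could look like the following; it only tells you something when run against the same filesystem view the container sees:

```python
import os


def unreadable_paths(paths):
    """Return the paths that are not regular files readable by this process."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.R_OK))]


# Certificate paths taken from the environment variables in the question.
cert_paths = [
    "/etc/ssl/etcd/ca.pem",
    "/etc/ssl/etcd/etcd1.pem",
    "/etc/ssl/etcd/etcd1-key.pem",
]

for path in unreadable_paths(cert_paths):
    print("not readable from this filesystem view: %s" % path)
```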

So now I am running the Calico container with:

/usr/bin/rkt run --volume etcd-ssl,kind=host,source=/etc/ssl/etcd/,readOnly=true --inherit-env --stage1-from-dir=stage1-fly.aci  --volume=modules,kind=host,source=/lib/modules,readOnly=false  --mount=volume=modules,target=/lib/modules  --trust-keys-from-https quay.io/calico/node:v0.19.0 --mount volume=etcd-ssl,target=/etc/ssl/etcd

I get the following output:

image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
image: using image from local store for image name quay.io/calico/node:v0.19.0
Traceback (most recent call last):
 File "startup.py", line 292, in <module>
client = IPAMClient()
 File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 246, in __init__
allow_reconnect=True)
 File "/usr/lib/python2.7/site-packages/etcd/client.py", line 204, in __init__
set(self.machines))
 File "/usr/lib/python2.7/site-packages/etcd/client.py", line 299, in machines
return self.machines
 File "/usr/lib/python2.7/site-packages/etcd/client.py", line 301, in machines
   raise etcd.EtcdException("Could not get the list of servers, "
etcd.EtcdException: Could not get the list of servers, maybe you provided the wrong host(s) to connect to?
Calico node failed to start

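I am a bit closer, but still have no solution.

Update 3

I tried setting ETCD_ENDPOINTS to the etcd server on the CoreOS machine itself by running export ETCD_ENDPOINTS=https://10.79.218.2:2379, and now when I try to run the calico rkt image I get: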

image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
image: using image from local store for image name quay.io/calico/node:v0.19.0
Traceback (most recent call last):
 File "startup.py", line 295, in <module>
main()
 File "startup.py", line 251, in main
warn_if_hostname_conflict(ip)
 File "startup.py", line 192, in warn_if_hostname_conflict
current_ipv4, _ = client.get_host_bgp_ips(hostname)
 File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 132, in wrapped
"running?" % (fn.__name__, e.message))
pycalico.datastore_errors.DataStoreError: get_host_bgp_ips: Error accessing etcd (Connection to etcd failed due to SSLError(CertificateError("hostname '10.79.218.2' doesn't match u'etcd'",),)).  Is etcd running?
Calico node failed to start

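I ran into this problem too, and eventually found the root cause by reading the etcd connection logic and the code of the libraries it uses, along with some pointers from the Calico team in their Slack channel.

The problem is that current versions of Calico (up to at least 0.22.0) use a Python etcd client that does not support IP SANs (Subject Alternative Names) in TLS certificates. This means the certificates you are using cannot be matched correctly against the etcd servers they were issued for.

This is described in this GitHub issue.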
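To make the failure mode concrete, here is a simplified Python sketch of hostname matching that looks only at DNS-type SAN entries, falling back to the common name, next to a variant that also accepts IP-type entries. It is an illustration, not the code of the etcd client or of urllib. The certificate dictionaries are hypothetical: they use the layout returned by ssl.SSLSocket.getpeercert(), the common name "etcd" is inferred from the error message in Update 3, and etcd1.example.com is a placeholder.

```python
def legacy_match(cert, hostname):
    """Simplified stand-in for an older check: DNS SANs only, else commonName."""
    names = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    if not names:
        # No DNS entries: fall back to the subject's commonName.
        names = [
            value
            for rdn in cert.get("subject", ())
            for key, value in rdn
            if key == "commonName"
        ]
    return hostname in names


def san_aware_match(cert, hostname):
    """Variant that also accepts an exact match on IP-type SAN entries."""
    ips = [value for kind, value in cert.get("subjectAltName", ()) if kind == "IP Address"]
    return hostname in ips or legacy_match(cert, hostname)


# Hypothetical certificate whose SAN field lists only IP addresses.
ip_cert = {
    "subject": ((("commonName", "etcd"),),),
    "subjectAltName": (("IP Address", "10.79.218.2"), ("IP Address", "10.79.218.3")),
}

# Hypothetical certificate regenerated with a DNS name instead.
fqdn_cert = {
    "subject": ((("commonName", "etcd"),),),
    "subjectAltName": (("DNS", "etcd1.example.com"),),
}

print(legacy_match(ip_cert, "10.79.218.2"))          # False: IP entries are ignored
print(san_aware_match(ip_cert, "10.79.218.2"))       # True
print(legacy_match(fqdn_cert, "etcd1.example.com"))  # True
```

With the IP-only certificate the legacy check ends up comparing 10.79.218.2 against the common name "etcd", which mirrors the "doesn't match u'etcd'" message in the traceback. Real implementations also handle wildcards and name normalization, which this sketch leaves out.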

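To fix this, you can wait until a new version of the urllib library is released, the etcd client picks it up and publishes a new release, and Calico is then updated to use the new etcd client. Alternatively, you can regenerate the certificates with FQDNs instead of IP addresses in the SAN field. That means you need to make sure your servers are reachable by those names, whether through DNS or a correctly set up /etc/hosts. The OpenSSL configuration used to generate the certificates should contain something like this: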

[alt_names]
DNS.1 = $ENV::FQDN
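Before regenerating certificates, it may be worth confirming that each chosen name actually resolves on every node, whether through DNS or /etc/hosts. The following is a minimal sketch using the standard socket module; the function name resolve_all is made up for illustration.

```python
import socket


def resolve_all(names):
    """Map each name to the sorted addresses it resolves to ([] if it does not resolve)."""
    resolved = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
            resolved[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            resolved[name] = []
    return resolved
```

Calling resolve_all(["etcd1.example.com", "etcd2.example.com"]) on each machine (the names are placeholders) and checking that neither list is empty, and that the addresses are the expected ones, catches a missing /etc/hosts entry before TLS ever comes into play.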

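The link describing how you generated your certificates uses CFSSL, so I suggest reading its documentation on how to switch to host names instead of IP addresses. I believe it may be as simple as modifying the JSON configuration, like this: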

"hosts": [
   "example.com",
   "www.example.com"
],
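As a rough sketch of that change, the snippet below swaps IP entries in the "hosts" list of a CSR-style JSON document for host names. Everything apart from the "hosts" key is an assumption made for illustration: the helper names, the sample document, and the name etcd1.example.com are hypothetical and would need to match your real CSR file and DNS names.

```python
import ipaddress
import json


def is_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def use_fqdns(csr, ip_to_fqdn):
    """Return a copy of csr with every IP in "hosts" replaced by its FQDN.

    Raises KeyError if an IP has no entry in ip_to_fqdn, so that no IP
    slips through unnoticed.
    """
    updated = dict(csr)
    updated["hosts"] = [ip_to_fqdn[h] if is_ip(h) else h for h in csr.get("hosts", [])]
    return updated


# Hypothetical CSR document; only the "hosts" key comes from the answer above.
csr = {"CN": "etcd", "hosts": ["10.79.218.2", "localhost"]}
mapping = {"10.79.218.2": "etcd1.example.com"}  # placeholder name

print(json.dumps(use_fqdns(csr, mapping), indent=2))
```

After regenerating the certificates from the updated file, ETCD_ENDPOINTS would also need to use the same host names, since the client compares the name it connects to against the names in the certificate.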

Source: https://serverfault.com/questions/803101