
SQL Server 2017 crashes during startup on Ubuntu with Pacemaker failover

  • December 22, 2017

I have installed mssql-server and mssql-server-ha on Ubuntu Server 16.04 LTS. I am using drbd, pacemaker, and corosync on two nodes to try to control automatic failover between them. crm status shows 2 errors:

Failed Actions:
* res_mssql_monitor_5000 on hostname2 'invalid parameter' (2): call=57, status=complete, exitreason='2017/11/09 12:33:01 Expected local server name to be res_mssql but it was hostname1',
last-rc-change='Thu Nov  9 12:33:01 2017', queued=0ms, exec=5241ms
* res_mssql_start_0 on hostname2 'unknown error' (1): call=6086, status=complete, exitreason='SQL Server crashed during startup.',
last-rc-change='Thu Nov  9 12:32:39 2017', queued=0ms, exec=24329ms

(Actual hostnames replaced with "hostname1" and "hostname2")

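TL;DR: If anyone has successfully configured a two-node pacemaker/corosync/drbd SQL Server 2017 on Linux setup with a floating IP, I would love to know what I am doing wrong. If you need other configuration or log files, let me know.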


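I don't know where it is picking up the actual hostname1 as the server name while expecting res_mssql. The error above occurs on hostname2, so I think this may have happened when I copied configuration files from hostname1 to hostname2 during the initial setup.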

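My CRM config:

(Note: I haven't tackled the IPaddr2 problem yet. My normal IP addresses are on ens160 and ens192, and later I want to configure an IP alias named ip_mssql for public IP access to the SQL Server.)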

node 1: hostname1 \
  attributes
node 2: hostname2 \
  attributes
primitive ip_mssql IPaddr2 \
  params ip=(virt IP addr) iflabel=ip_mssql \ #I think iflabel is wrong
  op monitor interval=5s nic=ip_mssql \
  meta target-role=Stopped
primitive res_drbd_mssql ocf:linbit:drbd \
  params drbd_resource=mssql \
  op start interval=0 timeout=240s \
  op stop interval=0 timeout=120s
primitive res_fs_mssqlData Filesystem \
  params device="/dev/drbd0" directory="/var/opt/mssql/data" fstype=xfs \
  op start interval=0 timeout=60s \
  op stop interval=0 timeout=120s
primitive res_fs_mssqlLog Filesystem \
  params device="/dev/drbd1" directory="/var/opt/mssql/log" fstype=xfs \
  op start interval=0 timeout=60s \
  op stop interval=0 timeout=120s
primitive res_fs_mssqlTempDB Filesystem \
  params device="/dev/drbd2" directory="/var/opt/mssql/tempDB" fstype=xfs \
  op start interval=0 timeout=60s \
  op stop interval=0 timeout=120s
primitive res_mssql ocf:mssql:fci \
  op monitor interval=5s timeout=30s \
  op start interval=0 timeout=60s \
  op stop interval=0 timeout=60s
group mssqlserver res_fs_mssqlData res_fs_mssqlLog res_fs_mssqlTempDB ip_mssql
ms ms_drbd_mssql res_drbd_mssql \
  meta notify=true master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
colocation col_mssql_drbd inf: mssqlserver ms_drbd_mssql:Master
order ord_mssql inf: ms_drbd_mssql:promote mssqlserver:start
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.14-70404b0 \
  cluster-infrastructure=corosync \
  cluster-name=mssqlserver \
  stonith-enabled=false \
  start-failure-is-fatal=false \
  last-lrm-refresh=1510177588 \
  startup-fencing=true \
  enable-startup-probes=true \
  symmetric-cluster=true \
  stop-orphan-actions=true \
  stonith-action=reboot \
  remove-after-stop=false \
  stop-all-resources=false \
  stop-orphan-resources=true \
  no-quorum-policy=ignore \
  is-managed-default=true

I can start mssql-server manually just fine:

sudo systemctl start mssql-server
sudo systemctl status mssql-server

mssql-server.service - Microsoft SQL Server Database Engine
 Loaded: loaded (/lib/systemd/system/mssql-server.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2017-11-09 12:49:21 CST; 1s ago
  Docs: https://docs.microsoft.com/en-us/sql/linux
Main PID: 3368 (sqlservr)
  Tasks: 62
Memory: 171.0M
  CPU: 1.770s
CGroup: /system.slice/mssql-server.service
      3368 /opt/mssql/bin/sqlservr
      3371 /opt/mssql/bin/sqlservr

Nov 09 12:49:21 hostname2 systemd[1]: Started Microsoft SQL Server Database Engine.

These are the only actual errors I found in /var/opt/mssql/log/errorlog:

2017-11-09 12:49:28.17 spid4s      Service Master Key could not be decrypted using one of its encryptions. See sys.key_encryptions for details.
2017-11-09 12:49:28.17 spid4s      An error occurred during Service Master Key initialization. SQLErrorCode=33095, State=8, LastOsError=0.
2017-11-09 12:49:31.14 spid22s     The Service Broker endpoint is in disabled or stopped state.
2017-11-09 12:49:31.14 spid22s     The Database Mirroring endpoint is in disabled or stopped state.
2017-11-09 12:49:31.17 spid22s     Service Broker manager has started.
2017-11-09 12:49:31.37 spid4s      Recovery is complete. This is an informational message only. No user action is required.

Manual drbd failover works via umount /dev/drbd0 /dev/drbd1 /dev/drbd2 and drbdadm secondary mssql, then reversing the process on the new primary node (drbdadm primary mssql and mount…).
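For reference, the manual failover sequence described above could be sketched as two shell functions, one for each node. This is only an illustration: the function names and the DRY_RUN switch are hypothetical, and since the post elides the mount command, the mount points are assumed to match the Filesystem primitives in the CRM config.

```shell
#!/bin/sh
# Sketch of the manual DRBD failover described in the post.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the current primary: unmount all three volumes, then demote the resource.
drbd_release() {
    run umount /dev/drbd0 /dev/drbd1 /dev/drbd2 &&
    run drbdadm secondary mssql
}

# Run on the node taking over: promote the resource, then mount the volumes.
# Assumption: mount points are copied from the Filesystem primitives in the CRM config.
drbd_takeover() {
    run drbdadm primary mssql &&
    run mount /dev/drbd0 /var/opt/mssql/data &&
    run mount /dev/drbd1 /var/opt/mssql/log &&
    run mount /dev/drbd2 /var/opt/mssql/tempDB
}
```

Sourcing the file and calling drbd_release on the old primary, then drbd_takeover on the new one, prints the commands in order. Running them for real (DRY_RUN=0, as root) only makes sense while Pacemaker is not managing these resources.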

My /etc/drbd.d/mssql.res conf (/etc/drbd.d/global_common.conf is unchanged from the repository version):

resource mssql {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }
  net {
    after-sb-0pri discard-least-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  volume 0 {
    device minor 0;
    disk /dev/VG-SqlData/LV-SqlData;
    meta-disk internal;
  }
  volume 1 {
    device minor 1;
    disk /dev/VG-SqlLogs/LV-SqlLogs;
    meta-disk internal;
  }
  volume 2 {
    device minor 2;
    disk /dev/VG-TempDB/LV-TempDB;
    meta-disk internal;
  }
  syncer {
    rate 35M;
    verify-alg md5;
  }
  on hostname1 {
    address <ip addr1>:7788;
  }
  on hostname2 {
    address <ip addr2>:7788;
  }
}

Try using systemd to start the service instead: crm configure edit res_mssql

Edit the configuration so that it looks like this:

primitive res_mssql systemd:mssql-server \
  op monitor interval=30s timeout=30s \
  op start interval=0 timeout=60s \
  op stop interval=0 timeout=60s

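That should accomplish the same thing. However, I imagine the resource agent needs some additional parameters, and supplying them may be all it takes to get it working the way you are trying to.

I would suggest checking the RA info to see whether you can figure out which parameters you are missing: crm ra info ocf:mssql:fci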
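A rough sketch of how that check might go in practice: compare what the agent accepts with what res_mssql currently has configured, then clear the recorded failures before retrying. The function names and the DRY_RUN switch are hypothetical, and the cleanup step is my own addition rather than something confirmed to fix this particular error.

```shell
#!/bin/sh
# Sketch: inspect the resource agent and retry the resource after editing it.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the parameters the agent accepts next to what is configured right now.
inspect_ra() {
    rsc="${1:-res_mssql}"
    run crm ra info ocf:mssql:fci
    run crm configure show "$rsc"
}

# After editing the primitive, clear the old failed actions and re-check status.
retry_resource() {
    rsc="${1:-res_mssql}"
    run crm resource cleanup "$rsc"
    run crm status
}
```

Calling inspect_ra and then retry_resource (both default to res_mssql) prints the four commands in order. With DRY_RUN=0 they run against the live cluster, so that is best tried on a test pair first.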

Quoted from: https://serverfault.com/questions/882726