Failover

MySQL: Pacemaker cannot start the failed master as a new slave?

  • July 16, 2012
  • pacemaker-1.0.12-1
  • corosync-1.2.7-1.1

I am setting up failover for MySQL replication (1 master and 1 slave), following this guide: https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst

Here is the output of **crm configure show**:

node serving-6192 \
   attributes p_mysql_mysql_master_IP="192.168.6.192"
node svr184R-638.localdomain \
   attributes p_mysql_mysql_master_IP="192.168.6.38"
primitive p_mysql ocf:percona:mysql \
   params config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid" \
      socket="/var/lib/mysql/mysql.sock" replication_user="repl" \
      replication_passwd="x" test_user="test_user" test_passwd="x" \
   op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \
   op monitor interval="2s" role="Slave" timeout="30s" OCF_CHECK_LEVEL="1" \
   op start interval="0" timeout="120s" \
   op stop interval="0" timeout="120s"
primitive writer_vip ocf:heartbeat:IPaddr2 \
   params ip="192.168.6.8" cidr_netmask="32" \
   op monitor interval="10s" \
   meta is-managed="true"
ms ms_MySQL p_mysql \
   meta master-max="1" master-node-max="1" clone-max="2" \
      clone-node-max="1" notify="true" globally-unique="false" \
      target-role="Master" is-managed="true"
colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master
order ms_MySQL_promote_before_vip inf: ms_MySQL:promote writer_vip:start
property $id="cib-bootstrap-options" \
   dc-version="1.0.12-unknown" \
   cluster-infrastructure="openais" \
   expected-quorum-votes="2" \
   no-quorum-policy="ignore" \
   stonith-enabled="false" \
   last-lrm-refresh="1341801689"
property $id="mysql_replication" \
   p_mysql_REPL_INFO="192.168.6.192|mysql-bin.000006|338"
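The p_mysql_REPL_INFO property above packs three values into one string separated by "|". Judging by the mysql_promote excerpt later in this post, these are the master IP, the binlog file and the binlog position. As a minimal sketch only (the function and variable names are made up for illustration and are not taken from the resource agent), the value can be split like this:

```shell
#!/bin/sh
# Value copied from the "crm configure show" output above.
repl_info="192.168.6.192|mysql-bin.000006|338"

# Split the string on "|" into three fields.
# parse_repl_info and the master_* names are hypothetical, for illustration only.
parse_repl_info() {
   IFS='|' read -r master_host master_log_file master_log_pos <<EOF
$1
EOF
}

parse_repl_info "$repl_info"
echo "host=$master_host file=$master_log_file pos=$master_log_pos"
```

This is only meant to make the format of the property easier to read when checking which master and binlog coordinates the cluster has recorded.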

crm_mon:

Last updated: Mon Jul  9 10:30:01 2012
Stack: openais
Current DC: serving-6192 - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ serving-6192 svr184R-638.localdomain ]

Master/Slave Set: ms_MySQL
    Masters: [ serving-6192 ]
    Slaves: [ svr184R-638.localdomain ]
writer_vip    (ocf::heartbeat:IPaddr2):    Started serving-6192

To test failover, I edited /etc/my.cnf on serving-6192 to introduce a syntax error. Failover worked as expected:

  • svr184R-638.localdomain was promoted to master
  • writer_vip switched to svr184R-638.localdomain

Current state:

Last updated: Mon Jul  9 10:35:57 2012
Stack: openais
Current DC: serving-6192 - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ serving-6192 svr184R-638.localdomain ]

Master/Slave Set: ms_MySQL
    Masters: [ svr184R-638.localdomain ]
    Stopped: [ p_mysql:0 ]
writer_vip    (ocf::heartbeat:IPaddr2):    Started svr184R-638.localdomain

Failed actions:
   p_mysql:0_monitor_5000 (node=serving-6192, call=15, rc=7, status=complete): not running
   p_mysql:0_demote_0 (node=serving-6192, call=22, rc=7, status=complete): not running
   p_mysql:0_start_0 (node=serving-6192, call=26, rc=-2, status=Timed Out): unknown exec error

I then removed the syntax error from /etc/my.cnf on serving-6192 and restarted corosync. I expected serving-6192 to start as a new slave, but it did not:

Failed actions:
   p_mysql:0_start_0 (node=serving-6192, call=4, rc=1, status=complete): unknown error
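The rc values in the failed actions are the exit codes of the resource agent. As a rough reading aid, here is a small lookup of the standard OCF exit codes; the function name ocf_rc_name is made up for this sketch. The rc=-2 in the earlier output is reported with status "Timed Out" and is not an OCF exit code, so it falls through to the default branch.

```shell
#!/bin/sh
# Map an OCF resource agent exit code to its symbolic name.
# ocf_rc_name is a hypothetical helper, not part of Pacemaker.
ocf_rc_name() {
   case "$1" in
       0) echo "OCF_SUCCESS" ;;
       1) echo "OCF_ERR_GENERIC" ;;
       2) echo "OCF_ERR_ARGS" ;;
       3) echo "OCF_ERR_UNIMPLEMENTED" ;;
       4) echo "OCF_ERR_PERM" ;;
       5) echo "OCF_ERR_INSTALLED" ;;
       6) echo "OCF_ERR_CONFIGURED" ;;
       7) echo "OCF_NOT_RUNNING" ;;
       8) echo "OCF_RUNNING_MASTER" ;;
       9) echo "OCF_FAILED_MASTER" ;;
       *) echo "not an OCF exit code ($1)" ;;
   esac
}

# The codes that appear in the crm_mon output of this post.
for rc in 7 1 -2; do
   echo "rc=$rc: $(ocf_rc_name "$rc")"
done
```

So rc=1 on the start action only says that the agent returned a generic error; the actual cause has to come from the logs.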

Here is the part of the log that I find suspicious:

Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: rsc:p_mysql:0:4: start
Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: RA output: (p_mysql:0:start:stderr) Error performing operation: The object/attribute does not exist

Jul 09 10:46:32 serving-6192 crm_attribute: [7420]: info: Invoked: /usr/sbin/crm_attribute -N serving-6192 -l reboot --name readable -v 0

/var/log/cluster/corosync.log: http://fpaste.org/AyOZ/

The strange thing is that I can start it manually:

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_config="/etc/my.cnf"
export OCF_RESKEY_pid="/var/run/mysqld/mysqld.pid"
export OCF_RESKEY_socket="/var/lib/mysql/mysql.sock"
export OCF_RESKEY_replication_user="repl"
export OCF_RESKEY_replication_passwd="x"
export OCF_RESKEY_test_user="test_user"
export OCF_RESKEY_test_passwd="x"

sh -x /usr/lib/ocf/resource.d/percona/mysql start: http://fpaste.org/RVGh/

Am I doing something wrong?


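Reply to @Patrick, Fri Jul 13 10:22:10 ICT 2012:

"I'm not sure why it is failing, as your logs don't contain any messages from the resource script (the ocf_log commands)."

I took all of it from /var/log/cluster/corosync.log. Do you have any idea why that might be?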

/etc/corosync/corosync.conf

compatibility: whitetank

totem {
   version: 2
   secauth: off
   threads: 0
   interface {
       member {
           memberaddr: 192.168.6.192
       }
       member {
           memberaddr: 192.168.6.38
       }
       ringnumber: 0
       bindnetaddr: 192.168.6.0
       mcastaddr: 226.94.1.1
       mcastport: 5405
   }
}

logging {
   fileline: off
   to_stderr: yes
   to_logfile: yes
   to_syslog: yes
   logfile: /var/log/cluster/corosync.log
   debug: on
   timestamp: on
   logger_subsys {
       subsys: AMF
       debug: off
   }
}

amf {
   mode: disabled
}

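"Also, the reason the script works when you run it manually is that you aren't setting the variables that tell the script it is a master/slave resource. So when it runs, the script thinks it is just a standalone instance."

Thanks. I appended the following variables to my ~/.bash_profile: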

export OCF_RESKEY_CRM_meta_clone_max="2"
export OCF_RESKEY_CRM_meta_role="Slave"

I loaded them with ". ~/.bash_profile" and started the mysql resource manually:

sh -x /usr/lib/ocf/resource.d/percona/mysql start: http://fpaste.org/EMwa/

It works fine:

mysql> show slave status\G
*************************** 1. row ***************************
              Slave_IO_State: Waiting for master to send event
                 Master_Host: 192.168.6.38
                 Master_User: repl
                 Master_Port: 3306
               Connect_Retry: 60
             Master_Log_File: mysql-bin.000072
         Read_Master_Log_Pos: 1428602
              Relay_Log_File: mysqld-relay-bin.000006
               Relay_Log_Pos: 39370
       Relay_Master_Log_File: mysql-bin.000072
            Slave_IO_Running: Yes
           Slave_SQL_Running: Yes
             Replicate_Do_DB: 
         Replicate_Ignore_DB: 
          Replicate_Do_Table: 
      Replicate_Ignore_Table: 
     Replicate_Wild_Do_Table: 
 Replicate_Wild_Ignore_Table: 
                  Last_Errno: 0
                  Last_Error: 
                Skip_Counter: 0
         Exec_Master_Log_Pos: 1428602
             Relay_Log_Space: 39527
             Until_Condition: None
              Until_Log_File: 
               Until_Log_Pos: 0
          Master_SSL_Allowed: No
          Master_SSL_CA_File: 
          Master_SSL_CA_Path: 
             Master_SSL_Cert: 
           Master_SSL_Cipher: 
              Master_SSL_Key: 
       Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
               Last_IO_Errno: 0
               Last_IO_Error: 
              Last_SQL_Errno: 0
              Last_SQL_Error: 
 Replicate_Ignore_Server_Ids: 
            Master_Server_Id: 123
1 row in set (0.00 sec)

I stopped MySQL, turned on debugging and restarted corosync. Here is the log: http://fpaste.org/mZzS/

As you can see, there is nothing but "unknown error":

Jul 13 10:48:06 serving-6192 crmd: [3341]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
Jul 13 10:48:06 serving-6192 lrmd: [3338]: WARN: Managed p_mysql:1:start process 3416 exited with return code 1.
Jul 13 10:48:06 serving-6192 crmd: [3341]: info: process_lrm_event: LRM operation p_mysql:1_start_0 (call=4, rc=1, cib-update=10, confirmed=true) unknown error

Any ideas?


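Update, Sat Jul 14 17:16:03 ICT 2012:

@Patrick: Thanks for the hint!

The environment variables that Pacemaker uses are here: http://fpaste.org/92yN/

As I suspected while chatting with you, the node serving-6192 is started with OCF_RESKEY_CRM_meta_master_max=1, so because of the following code: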

/usr/lib/ocf/resource.d/percona/mysql:

if ocf_is_ms; then
   mysql_extra_params="--skip-slave-start"
fi

/usr/lib/ocf/lib/heartbeat/ocf-shellfuncs:

ocf_is_ms() {
   [ ! -z "${OCF_RESKEY_CRM_meta_master_max}" ] && [ "${OCF_RESKEY_CRM_meta_master_max}" -gt 0 ]
}

the extra parameter --skip-slave-start is included:

ps -ef | grep mysql

root 18215 1 0 17:12 pts/4 00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --user=mysql --skip-slave-start

mysql 19025 18215 1 17:12 pts/4 00:00:14 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --skip-slave-start --log-error=/var/log/mysqld.log --open-files-limit=8192 --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306

But the SQL thread is still running:

        Slave_IO_Running: Yes
       Slave_SQL_Running: Yes

and replication works fine.
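The ocf_is_ms check quoted above can be tried in isolation to see why a manual run without Pacemaker's meta variables behaves like a standalone instance. This is a minimal sketch: ocf_is_ms is copied from the excerpt above, while extra_params_for is a made-up wrapper that only mimics the if block from the resource agent.

```shell
#!/bin/sh
# Copied from the ocf-shellfuncs excerpt above.
ocf_is_ms() {
   [ ! -z "${OCF_RESKEY_CRM_meta_master_max}" ] && [ "${OCF_RESKEY_CRM_meta_master_max}" -gt 0 ]
}

# Hypothetical wrapper: print the extra mysqld parameter that would be chosen
# for a given master_max value. An empty argument means "variable not set".
extra_params_for() {
   (
       if [ -n "$1" ]; then
           OCF_RESKEY_CRM_meta_master_max=$1
       else
           unset OCF_RESKEY_CRM_meta_master_max
       fi
       if ocf_is_ms; then
           echo "--skip-slave-start"
       else
           echo "(none)"
       fi
   )
}

echo "unset:        $(extra_params_for "")"
echo "master_max=0: $(extra_params_for 0)"
echo "master_max=1: $(extra_params_for 1)"
```

Only the last case yields --skip-slave-start, which matches the ps output above for a start driven by Pacemaker.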

IFS=$'\n' ENV=( $(cat /tmp/16374.env) ); env -i - "${ENV[@]}" sh -x /usr/lib/ocf/resource.d/percona/mysql start: http://fpaste.org/x7xE/
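The command above replays a captured environment into a clean process with env -i, so the agent sees only what Pacemaker would give it. Here is a small, harmless sketch of the same idea in plain POSIX sh. The file contents, the function name run_with_env_file and the variable LEAKY_PARENT_VAR are assumptions made for the demonstration, and the command that gets run is just an echo rather than the resource agent.

```shell
#!/bin/sh
# Build a small sample env file, one NAME=VALUE per line (sample values only).
envfile=$(mktemp)
printf '%s\n' \
   'OCF_RESKEY_CRM_meta_master_max=1' \
   'OCF_RESKEY_config=/etc/my.cnf' > "$envfile"

# Run a command with ONLY the variables listed in the file.
# run_with_env_file is a hypothetical helper name.
run_with_env_file() {
   file=$1
   shift
   (
       set -f          # no globbing while the file is split into words
       IFS='
'                      # split on newlines only, so values may contain spaces
       exec env -i $(cat "$file") "$@"
   )
}

# A variable exported in the calling shell, to show that it does not leak through.
LEAKY_PARENT_VAR=1
export LEAKY_PARENT_VAR

run_with_env_file "$envfile" /bin/sh -c \
   'echo "master_max=${OCF_RESKEY_CRM_meta_master_max:-unset} parent=${LEAKY_PARENT_VAR:-unset}"'

rm -f "$envfile"
```

The child sees master_max=1 from the file but not the variable exported by the calling shell, which is the point of replaying the environment this way.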

I was banging my head against the wall.

Eureka!

We both forgot a very important log file: /var/log/mysqld.log

socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL) by Atomicorp
[Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000082' at position 58569, relay log './mysqld-relay-bin.000002' position: 58715
[Note] Slave I/O thread: connected to master 'repl@192.168.6.38:3306',replication started in log 'mysql-bin.000082' at position 58569
[Warning] Aborted connection 10 to db: 'unconnected' user: 'test_user' host: 'localhost' (init_connect command failed)
[Warning] The MySQL server is running with the --read-only option so it cannot execute this statement
[Note] /usr/libexec/mysqld: Normal shutdown

As you can guess, I track user activity by combining the binlog with init-connect:

init_connect = "INSERT INTO audit.accesslog (connect_time, user_host, connection_id) VALUES (NOW(), CURRENT_USER(), CONNECTION_ID());"

serving-6192 is set to read-only when it starts as a slave. When Pacemaker then runs the monitor operation as test_user:

   # Check for test table
   ocf_run -q $MYSQL $MYSQL_OPTIONS_TEST \
       -e "SELECT COUNT(*) FROM $OCF_RESKEY_test_table"

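the init_connect command fails with the error shown in the log above:

The MySQL server is running with the --read-only option so it cannot execute this statement

The solution is to set init_connect to an empty string before the monitor operation runs (and to remember to set it back when the node is promoted to master).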
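As a minimal sketch of that fix, the helper below only builds the SQL statement for each direction. The names init_connect_sql and AUDIT_INIT_CONNECT are made up for illustration; in the resource agent the statement would presumably be sent the same way as in the set_event_scheduler function in this post, but that wiring is an assumption and is not shown here.

```shell
#!/bin/sh
# The audit statement from my.cnf quoted above (it contains no single quotes).
AUDIT_INIT_CONNECT="INSERT INTO audit.accesslog (connect_time, user_host, connection_id) VALUES (NOW(), CURRENT_USER(), CONNECTION_ID());"

# Print the statement that switches init_connect.
# "on"  restores the audit statement (use when promoting to master)
# other clears it                    (use when running as a read-only slave)
# init_connect_sql is a hypothetical helper, not part of the resource agent.
init_connect_sql() {
   if [ "$1" = "on" ]; then
       printf "SET GLOBAL init_connect='%s';\n" "$AUDIT_INIT_CONNECT"
   else
       printf "SET GLOBAL init_connect='';\n"
   fi
}

init_connect_sql off
init_connect_sql on
```

The output can be piped to the mysql client on the node, for example "init_connect_sql off | mysql", using an account that is allowed to run SET GLOBAL.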

For anyone using the event scheduler: note that it also has to be turned on when a slave is promoted to master:

set_event_scheduler() {
   local es_val
   if ocf_is_true $1; then
       es_val="on"
   else
       es_val="off"
   fi
   ocf_run $MYSQL $MYSQL_OPTIONS_REPL \
       -e "SET GLOBAL event_scheduler=${es_val}"
}

get_event_scheduler() {
   # Check if event-scheduler is set
   local event_scheduler_state

   event_scheduler_state=`$MYSQL $MYSQL_OPTIONS_REPL \
       -e "SHOW VARIABLES" | grep event_scheduler | awk '{print $2}'`

   if [ "$event_scheduler_state" = "ON" ]; then
       return 0
   else
       return 1
   fi
}

mysql_promote() {
   local master_info

   if ( ! mysql_status err ); then
       return $OCF_NOT_RUNNING
   fi
   ocf_run $MYSQL $MYSQL_OPTIONS_REPL \
       -e "STOP SLAVE"

   # Set Master Info in CIB, cluster level attribute
   update_data_master_status
   master_info="$(get_local_ip)|$(get_master_status File)|$(get_master_status Position)"
   ${CRM_ATTR_REPL_INFO} -v "$master_info"
   rm -f $tmpfile

   set_read_only off || return $OCF_ERR_GENERIC
   set_event_scheduler on || return $OCF_ERR_GENERIC

Also don't forget to turn it off on demotion:

   'pre-demote')
       # Is the notification for our set
       notify_resource=`echo $OCF_RESKEY_CRM_meta_notify_demote_resource|cut -d: -f1`
       my_resource=`echo $OCF_RESOURCE_INSTANCE|cut -d: -f1`
       if [ $notify_resource != ${my_resource} ]; then
           ocf_log debug "Notification is not for us"
           return $OCF_SUCCESS
       fi

       demote_host=`echo $OCF_RESKEY_CRM_meta_notify_demote_uname|tr -d " "`
       if [ $demote_host = ${HOSTNAME} ]; then
           ocf_log info "post-demote notification for $demote_host"
           set_read_only on
           set_event_scheduler off
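The notification handler above strips the instance suffix (p_mysql:0 becomes p_mysql) with cut before comparing resource names. The sketch below shows that step on its own, with quoted operands so the comparison does not break when a variable is empty. The function names are made up for illustration and are not part of the resource agent.

```shell
#!/bin/sh
# Strip the clone instance suffix, e.g. "p_mysql:0" -> "p_mysql".
# base_resource_name and same_resource_set are hypothetical helpers.
base_resource_name() {
   echo "$1" | cut -d: -f1
}

# Succeed when both instance names belong to the same resource set.
same_resource_set() {
   [ "$(base_resource_name "$1")" = "$(base_resource_name "$2")" ]
}

for pair in "p_mysql:0 p_mysql:1" "writer_vip p_mysql:0"; do
   set -- $pair
   if same_resource_set "$1" "$2"; then
       echo "$1 and $2: same set"
   else
       echo "$1 and $2: different sets"
   fi
done
```

With the quotes in place an empty notification variable simply compares as "different" instead of producing a test syntax error.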

Cheers,

Source: https://serverfault.com/questions/405982