Backup

pg_dumpall 偶爾掛起

  • July 23, 2018

我有一個 bash shell 腳本,可以從 docker 轉儲我所有的 postgres 數據庫:

function dump_postgres {
   mkdir -p ${BACKUP_DIR}/postgres/
   docker ps -a --format '{{.Names}}\t{{.Ports}}' | grep 5432/tcp | awk '{ print $1 }' | while read -r line; do
       echo "extracting database from container '${line}'"
       docker exec -t ${line} pg_dumpall -v --lock-wait-timeout=600 -c -U postgres > ${BACKUP_DIR}/postgres/${line}.sql
   done
}

dump_postgres >> "${LOG}" 2>> "${ERROR}"

該腳本確定哪些 docker 容器正在偵聽預設的 postgres 埠,並以 sql 格式轉儲數據庫。

我的問題是這個命令在 cron 啟動時每隔一天就會突然停止。它只是停止並且執行轉儲的容器不會退出。stderr 上也沒有輸出。

你知道如何解決這個問題嗎?

編輯:

crontab 看起來像這樣:

30 1 * * * /home/mrbackup/backup.sh

工作安排在每晚 1 點 30 分。3:00 的看門狗腳本已經看到備份失敗並且 pg_dumpall 掛起。

EDIT2: 今晚 quassel 是問題所在,但每隔一天就會發生變化。nextcloud 也有問題。有關流程的更多資訊:

# ps -efH
root       493     1  0 Mai24 ?        00:00:04   /usr/sbin/cron -f
root     12401   493  0 01:30 ?        00:00:00     /usr/sbin/CRON -f
mrbackup 12402 12401  0 01:30 ?        00:00:00       /bin/sh -c /home/mrbackup/backup.sh
mrbackup 12403 12402  0 01:30 ?        00:00:00         /bin/bash /home/mrbackup/backup.sh
mrbackup 12412 12403  0 01:30 ?        00:00:00           /bin/bash /home/mrbackup/backup.sh
mrbackup 12428 12412  0 01:30 ?        00:00:11             docker exec -t quassel_postgres_1 pg_dumpall -v --lock-wait-timeout=600 -c -U postgres

root     12449 11150  0 01:30 ?        00:00:00         pg_dumpall -v --lock-wait-timeout=600 -c -U postgres
root     12467 12449  0 01:30 ?        00:00:00           sh -c "/usr/local/bin/pg_dump"  -v --lock-wait-timeout '600' -Fp 'user=postgres dbname=quassel'
root     12468 12467  0 01:30 ?        00:00:16             /usr/local/bin/pg_dump -v --lock-wait-timeout 600 -Fp user=postgres dbname=quassel

strace顯示pg_dump的大多數子程序都在等待鎖?

futex(0xc42054cd38, FUTEX_WAIT, 0, NULL

其中兩個程序重新啟動系統呼叫:

restart_syscall(<... resuming interrupted futex ...>)

嘗試使用此查詢查看數據庫:

SELECT a.datname,
      c.relname,
      l.mode,
      l.GRANTED,
      a.wait_event_type,
      a.usename,
      a.query, 
      a.query_start,
      age(now(), a.query_start) AS "age", 
      a.pid 
FROM  pg_stat_activity a
 JOIN pg_locks l ON l.pid = a.pid
 JOIN pg_class c ON c.oid = l.relation
WHERE
 NOW() - query_start > '10 second';  
ORDER BY a.query_start;

給了我以下內容:

datname |                 relname                 |      mode       | granted | wait_event_type | usename  |                      query                       |          query_start          |       age       |  pid  
---------+-----------------------------------------+-----------------+---------+-----------------+----------+--------------------------------------------------+-------------------------------+-----------------+-------
quassel | pg_attribute_relid_attnum_index         | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
quassel | pg_attribute_relid_attnam_index         | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
quassel | pg_attribute                            | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
quassel | pg_class                                | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
quassel | pg_init_privs_o_c_o_index               | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
quassel | pg_roles                                | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
quassel | pg_init_privs                           | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
quassel | pg_depend_reference_index               | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095

還有 100 多行

編輯 我已經解決了我最初的問題,但我仍然會對管道破裂的原因感興趣。還有為什麼該過程在這種情況下不會終止。

看起來碼頭集裝箱的管道有時會破壞管道。使用 pg_dumpall 中的 -f 參數解決了這個問題

docker exec -t ${line} pg_dumpall -v --lock-wait-timeout=600 -c -U postgres -f /tmp/export.sql
docker cp ${line}:/tmp/export.sql ${BACKUP_DIR}/postgres/${line}.sql
docker exec -t ${line} rm /tmp/export.sql

引用自:https://serverfault.com/questions/922348