Backup
pg_dumpall 偶爾掛起
我有一個 bash shell 腳本,可以從 docker 轉儲我所有的 postgres 數據庫:
function dump_postgres { mkdir -p ${BACKUP_DIR}/postgres/ docker ps -a --format '{{.Names}}\t{{.Ports}}' | grep 5432/tcp | awk '{ print $1 }' | while read -r line; do echo "extracting database from container '${line}'" docker exec -t ${line} pg_dumpall -v --lock-wait-timeout=600 -c -U postgres > ${BACKUP_DIR}/postgres/${line}.sql done } dump_postgres >> "${LOG}" 2>> "${ERROR}"
該腳本確定哪些 docker 容器正在偵聽預設的 postgres 埠,並以 sql 格式轉儲數據庫。
我的問題是這個命令在 cron 啟動時每隔一天就會突然停止。它只是停止並且執行轉儲的容器不會退出。stderr 上也沒有輸出。
你知道如何解決這個問題嗎?
編輯:
crontab 看起來像這樣:
30 1 * * * /home/mrbackup/backup.sh
工作安排在每晚 1 點 30 分。3:00 的看門狗腳本已經看到備份失敗並且 pg_dumpall 掛起。
EDIT2: 今晚 quassel 是問題所在,但每隔一天就會發生變化。nextcloud 也有問題。有關流程的更多資訊:
# ps -efH root 493 1 0 Mai24 ? 00:00:04 /usr/sbin/cron -f root 12401 493 0 01:30 ? 00:00:00 /usr/sbin/CRON -f mrbackup 12402 12401 0 01:30 ? 00:00:00 /bin/sh -c /home/mrbackup/backup.sh mrbackup 12403 12402 0 01:30 ? 00:00:00 /bin/bash /home/mrbackup/backup.sh mrbackup 12412 12403 0 01:30 ? 00:00:00 /bin/bash /home/mrbackup/backup.sh mrbackup 12428 12412 0 01:30 ? 00:00:11 docker exec -t quassel_postgres_1 pg_dumpall -v --lock-wait-timeout=600 -c -U postgres root 12449 11150 0 01:30 ? 00:00:00 pg_dumpall -v --lock-wait-timeout=600 -c -U postgres root 12467 12449 0 01:30 ? 00:00:00 sh -c "/usr/local/bin/pg_dump" -v --lock-wait-timeout '600' -Fp 'user=postgres dbname=quassel' root 12468 12467 0 01:30 ? 00:00:16 /usr/local/bin/pg_dump -v --lock-wait-timeout 600 -Fp user=postgres dbname=quassel
strace顯示pg_dump的大多數子程序都在等待鎖?
futex(0xc42054cd38, FUTEX_WAIT, 0, NULL
其中兩個程序重新啟動系統呼叫:
restart_syscall(<... resuming interrupted futex ...>)
嘗試使用此查詢查看數據庫:
SELECT a.datname, c.relname, l.mode, l.GRANTED, a.wait_event_type, a.usename, a.query, a.query_start, age(now(), a.query_start) AS "age", a.pid FROM pg_stat_activity a JOIN pg_locks l ON l.pid = a.pid JOIN pg_class c ON c.oid = l.relation WHERE NOW() - query_start > '10 second'; ORDER BY a.query_start;
給了我以下內容:
datname | relname | mode | granted | wait_event_type | usename | query | query_start | age | pid ---------+-----------------------------------------+-----------------+---------+-----------------+----------+--------------------------------------------------+-------------------------------+-----------------+------- quassel | pg_attribute_relid_attnum_index | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095 quassel | pg_attribute_relid_attnam_index | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095 quassel | pg_attribute | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095 quassel | pg_class | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095 quassel | pg_init_privs_o_c_o_index | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095 quassel | pg_roles | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095 quassel | pg_init_privs | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095 quassel | pg_depend_reference_index | AccessShareLock | t | | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
還有 100 多行
編輯 我已經解決了我最初的問題,但我仍然會對管道破裂的原因感興趣。還有為什麼該過程在這種情況下不會終止。
看起來碼頭集裝箱的管道有時會破壞管道。使用 pg_dumpall 中的 -f 參數解決了這個問題
docker exec -t ${line} pg_dumpall -v --lock-wait-timeout=600 -c -U postgres -f /tmp/export.sql docker cp ${line}:/tmp/export.sql ${BACKUP_DIR}/postgres/${line}.sql docker exec -t ${line} rm /tmp/export.sql