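RabbitMQ crashes randomly
I recently deployed a RabbitMQ server on AWS, following the instructions for RPM-based Linux distributions.
RabbitMQ 3.8.14, Erlang 23.3.1
This is a single-node deployment with only one virtual host ("/"), and I am mostly using the default configuration. I also changed the open file limit, which was initially 65536 and is currently 150000, although the number of file descriptors in use stays relatively low.
I have two applications connected to RabbitMQ. Each application authenticates with its own username and password. Both users have full permissions on the virtual host and on the topics they use.
I have set up 7 exchanges and queues. All of them are durable.
The server starts without any problems, and the applications run normally: they can communicate with the server and consume from the topics without issues, but they crash randomly.
In the RabbitMQ log file I can see the following: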
2021-04-21 02:49:27.342 [info] <0.32135.4> connection <0.32135.4> (10.11.234.236:39453 -> 172.24.22.82:5672): user 'mes' authenticated and granted access to vhost '/'
2021-04-21 02:49:27.374 [info] <0.32138.4> connection <0.32138.4> (10.11.252.54:23576 -> 172.24.22.82:5672): user 'mes' authenticated and granted access to vhost '/'
2021-04-21 03:02:37.689 [error] <0.31757.4> closing AMQP connection <0.31757.4> (10.11.161.117:45741 -> 172.24.22.82:5672): {writer,send_failed,{error,timeout}}
2021-04-21 03:02:37.690 [info] <0.32596.4> Closing all channels from connection '10.11.161.117:45741 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:02:48.136 [info] <0.32614.4> accepting AMQP connection <0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672)
2021-04-21 03:02:48.142 [info] <0.32614.4> connection <0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:03:18.346 [error] <0.32614.4> closing AMQP connection <0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:03:18.347 [info] <0.32674.4> Closing all channels from connection '10.11.161.117:2496 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:03:30.140 [info] <0.32694.4> accepting AMQP connection <0.32694.4> (10.11.161.117:54985 -> 172.24.22.82:5672)
2021-04-21 03:03:30.144 [info] <0.32694.4> connection <0.32694.4> (10.11.161.117:54985 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:04:00.387 [error] <0.32694.4> closing AMQP connection <0.32694.4> (10.11.161.117:54985 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:04:00.395 [info] <0.32752.4> Closing all channels from connection '10.11.161.117:54985 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:04:14.035 [info] <0.5.5> accepting AMQP connection <0.5.5> (10.11.161.117:63900 -> 172.24.22.82:5672)
2021-04-21 03:04:14.040 [info] <0.5.5> connection <0.5.5> (10.11.161.117:63900 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:04:44.270 [error] <0.5.5> closing AMQP connection <0.5.5> (10.11.161.117:63900 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:04:44.271 [info] <0.56.5> Closing all channels from connection '10.11.161.117:63900 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:04:44.316 [error] <0.14.5> ** Generic server <0.14.5> terminating ** Last message in was {'$gen_cast',terminate} ** When Server state == {ch,{conf,running,rabbit_framing_amqp_0_9_1,1,<0.5.5>,<0.12.5>,<0.5.5>,<<"10.11.161.117:63900 -> 172.24.22.82:5672">>,undefined,{user,<<"cron">>,[],[{rabbit_auth_backend_internal,none}]},<<"/">>,<<>>,<0.6.5>,[{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"basic.nack">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}],none,0,134217728,undefined,#{},1000000000},{lstate,<0.13.5>,false},none,5514,{5438,{[{5513,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265917}},{5512,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265916}},{5511,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265915}},{5510,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265914}},{5509,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265913}},{5508,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265912}},{5507,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265911}},{5506,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265910}},{5505,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265909}},{5504,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmai...">>,...},...],...}},...} ** Reason for termination == ** noproc
2021-04-21 03:04:44.317 [info] <0.14.5> [{initial_call,{rabbit_channel,init,['Argument__1']}},{pid,<0.14.5>},{registered_name,[]},{error_info,{exit,noproc,[{gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1183}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,236}]}]}},{ancestors,[<0.11.5>,<0.9.5>,<0.4.5>,<0.3.5>,<0.853.0>,<0.852.0>,<0.851.0>,rabbit_sup,<0.274.0>]},{message_queue_len,38},{messages,[{'$gen_cast',{deliver,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,true,{{resource,<<"/">>,queue,<<"q.etm.mes.tmce_results">>},<0.522.0>,265918,true,{basic_message,{resource,<<"/">>,exchange,<<"x.etm.mes.tmce_results">>},[<<"test">>],{content,60,{'P_basic',<<"application/json">>,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<128,0,16,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110>>,rabbit_framing_amqp_0_9_1,[<<"{\"customerId\":5435103,\"campaignRunId\":66836,\"status\":\"Sent\",\"messageKey\":\"66836~5435103\",\"timeSent\":1618988531.764735000}">>]},<<225,81,222,134,43,120,7,4,135,105,190,34,66,200,149,86>>,false}}}},{'EXIT',<0.11.5>,shutdown},{'$gen_cast',{deliver,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,true,{{resource,<<"/">>,queue,<<"q.etm.mes.tmce_results">>},<0.522.0>,265919,true,{basic_message,{resource,<<"/">>,exchange,<<"x.etm.mes.tmce_results">>},[<<"test">>],{content,60,{'P_basic',<<"application/json">>,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<128,0,16,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110>>,rabbit_framing_amqp_0_9_1,[<<"{\"customerId\":8288025,\"campaignRunId\":66836,\"status\":\"Sent\",\"messageKey\":\"66836~8288025\",\"timeSent\":1618988531.764737000}">>]},<<201,223,49,27,101,206,136,61,160,111,163,73,226,167,54,31>>,false}}}},{'$gen_cast',{deliver,<<"TmceResultConsu...">>,...}},...]},...], []
2021-04-21 03:04:44.318 [error] <0.14.5> CRASH REPORT Process <0.14.5> with 0 neighbours exited with reason: no such process or port in call to gen_server2:terminate/3 line 1183
2021-04-21 03:04:44.318 [info] <0.11.5> supervisor: {<0.11.5>,rabbit_channel_sup}, errorContext: shutdown_error, reason: noproc, offender: [{pid,<0.14.5>},{id,channel},{mfargs,{rabbit_channel,start_link,[1,<0.5.5>,<0.12.5>,<0.5.5>,<<"10.11.161.117:63900 -> 172.24.22.82:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"cron">>,[],[{rabbit_auth_backend_internal,none}]},<<"/">>,[{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"basic.nack">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}],<0.6.5>,<0.13.5>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]
2021-04-21 03:04:44.318 [error] <0.11.5> Supervisor {<0.11.5>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.5.5>, <0.12.5>, <0.5.5>, <<"10.11.161.117:63900 -> 172.24.22.82:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"cron">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.6.5>, <0.13.5>) at <0.14.5> exit with reason noproc in context shutdown_error
2021-04-21 03:04:50.625 [info] <0.73.5> accepting AMQP connection <0.73.5> (10.11.161.117:42412 -> 172.24.22.82:5672)
2021-04-21 03:04:50.630 [info] <0.73.5> connection <0.73.5> (10.11.161.117:42412 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:05:20.862 [error] <0.73.5> closing AMQP connection <0.73.5> (10.11.161.117:42412 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:05:20.863 [info] <0.131.5> Closing all channels from connection '10.11.161.117:42412 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:05:32.643 [info] <0.149.5> accepting AMQP connection <0.149.5> (10.11.161.117:10578 -> 172.24.22.82:5672)
2021-04-21 03:05:32.647 [info] <0.149.5> connection <0.149.5> (10.11.161.117:10578 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:06:02.926 [error] <0.149.5> closing AMQP connection <0.149.5> (10.11.161.117:10578 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:06:02.927 [info] <0.206.5> Closing all channels from connection '10.11.161.117:10578 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:06:15.925 [info] <0.228.5> accepting AMQP connection <0.228.5> (10.11.161.117:41072 -> 172.24.22.82:5672)
2021-04-21 03:06:15.929 [info] <0.228.5> connection <0.228.5> (10.11.161.117:41072 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:06:48.903 [error] <0.228.5> closing AMQP connection <0.228.5> (10.11.161.117:41072 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:06:48.904 [info] <0.279.5> Closing all channels from connection '10.11.161.117:41072 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:06:57.332 [info] <0.305.5> accepting AMQP connection <0.305.5> (10.11.161.117:14359 -> 172.24.22.82:5672)
2021-04-21 03:06:57.337 [info] <0.305.5> connection <0.305.5> (10.11.161.117:14359 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:07:27.574 [error] <0.305.5> closing AMQP connection <0.305.5> (10.11.161.117:14359 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:07:27.575 [info] <0.357.5> Closing all channels from connection '10.11.161.117:14359 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:07:34.426 [info] <0.375.5> accepting AMQP connection <0.375.5> (10.11.161.117:48428 -> 172.24.22.82:5672)
2021-04-21 03:07:34.432 [info] <0.375.5> connection <0.375.5> (10.11.161.117:48428 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:08:04.683 [error] <0.375.5> closing AMQP connection <0.375.5> (10.11.161.117:48428 -> 172.24.22.82:5672): {inet_error,enotconn}
2021-04-21 03:08:04.684 [info] <0.432.5> Closing all channels from connection '10.11.161.117:48428 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:08:13.436 [info] <0.449.5> accepting AMQP connection <0.449.5> (10.11.161.117:23879 -> 172.24.22.82:5672)
2021-04-21 03:08:13.442 [info] <0.449.5> connection <0.449.5> (10.11.161.117:23879 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:08:43.668 [error] <0.449.5> closing AMQP connection <0.449.5> (10.11.161.117:23879 -> 172.24.22.82:5672): {inet_error,enotconn}
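After a crash, the applications cannot recover and communicate with RabbitMQ again unless I restart the RabbitMQ server.
Also, when I check the status of
rabbitmq-server.service
I can see that it is active. The same thing happens when I run
rabbitmq-diagnostics is_running
Asking node rabbit@rabbitmq-test1 for its status ...
RabbitMQ on node rabbit@rabbitmq-test1 is fully booted and running
So far I have not found a pattern for when this happens. Any ideas about why it happens?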
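It turned out that this problem had nothing to do with the RabbitMQ configuration. It was caused by slow communication between the application and MongoDB (which we also use), which made the connection between the application and RabbitMQ time out.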
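The post does not show the application code, so the following is only a minimal sketch of one way to keep a slow database call from stalling a message handler indefinitely. It runs the call in a worker thread and stops waiting after a deadline. The names slow_lookup, call_with_deadline and handle_message, and the "ack"/"requeue" outcomes, are hypothetical and not taken from the original application.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for blocking calls (size chosen arbitrarily for the sketch).
_pool = ThreadPoolExecutor(max_workers=4)


def slow_lookup(delay_seconds):
    """Hypothetical stand-in for a slow MongoDB query."""
    time.sleep(delay_seconds)
    return "document"


def call_with_deadline(func, deadline_seconds, *args):
    """Run func in a worker thread.

    Returns (True, result) if it finishes in time, (False, None) otherwise.
    """
    future = _pool.submit(func, *args)
    try:
        return True, future.result(timeout=deadline_seconds)
    except FutureTimeout:
        # A call that is already running cannot be interrupted;
        # we only stop waiting for it here.
        future.cancel()
        return False, None


def handle_message(body, delay_seconds, deadline_seconds=0.5):
    """Hypothetical message handler: decide what to do with one message."""
    ok, _doc = call_with_deadline(slow_lookup, deadline_seconds, delay_seconds)
    # If the database is too slow, give the message back instead of blocking.
    return "ack" if ok else "requeue"


if __name__ == "__main__":
    print(handle_message(b"{}", delay_seconds=0.0))
    print(handle_message(b"{}", delay_seconds=0.5, deadline_seconds=0.05))
```

Whether a deadline like this is enough depends on the client library and on how it handles heartbeats and acknowledgements, which the post does not describe.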
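Most likely you are handling a for loop incorrectly and running out of memory because you are opening many channels or connections while trying to handle the communication between each exchange and the queues bound to it.
When you start your script again, look at how many connections and how many channels you have. If your application receives messages that do not follow any pattern, and you incorrectly loop through every exchange and every queue to route each message, you will exhaust your memory with no pattern that would let you reproduce the problem. Because the messages reaching your publisher follow no rational pattern, it is hard to narrow down where the problem is. Your connection timestamps seem to go from open to close within a very short time, which suggests that you are receiving a lot of messages and that your for loop is creating unnecessary connections for each exchange and queue, because you are not handling the for loop and the resulting connections correctly.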
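To check the connection timestamps rather than eyeball them, the log can be summarized with a short script. This is a sketch that assumes the log line format shown in the question; the function name connection_lifetimes and the SAMPLE list are illustrative only. It pairs each "accepting AMQP connection" line with the matching "closing AMQP connection" line and reports how long the connection lived and why it was closed.

```python
import re
from datetime import datetime

# Matches the "accepting" and "closing" lines in the format shown in the question.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[\w+\] <[\d.]+> "
    r"(?P<event>accepting|closing) AMQP connection <(?P<pid>[\d.]+)> "
    r"\((?P<peer>[\d.]+):\d+ -> [\d.:]+\)(?:: (?P<reason>.*))?$"
)


def connection_lifetimes(lines):
    """Return (peer_ip, lifetime_seconds, close_reason) per closed connection."""
    opened = {}   # connection pid -> time it was accepted
    results = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip every other kind of log entry
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        if m["event"] == "accepting":
            opened[m["pid"]] = ts
        elif m["pid"] in opened:
            seconds = (ts - opened.pop(m["pid"])).total_seconds()
            results.append((m["peer"], round(seconds, 3), m["reason"]))
    return results


# Two entries copied from the log in the question.
SAMPLE = [
    "2021-04-21 03:02:48.136 [info] <0.32614.4> accepting AMQP connection "
    "<0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672)",
    "2021-04-21 03:03:18.346 [error] <0.32614.4> closing AMQP connection "
    "<0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672): {inet_error,enotconn}",
]

if __name__ == "__main__":
    for peer, seconds, reason in connection_lifetimes(SAMPLE):
        print(peer, seconds, reason)
```

In the log from the question, the connections from 10.11.161.117 are each closed with {inet_error,enotconn} roughly 30 seconds after being accepted, which is the kind of regularity this summary makes visible.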