解密 IBM Aspera 日誌
Aspera 日誌包含每個文件傳輸會話的大量不同事件類型。其中一些事件相當神秘:
LOG Sender DS Qs ds/re/a/d/ao/eo/n/ap/rq/rd/ru/cl/di/do=0/0/0/0/0/0/0/0/0/0/0/0/0/0 Rs o/i=1/1 mgmt backlog i/s/n = 0/412/0 1/456/0
有些 - 不那麼神秘,但仍可能需要一些破譯:
2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Statistics [Sender] id=029c1333-c732-4636-83c1-e1acec2864cb delay=64ms rex_delay=72ms solicited_rex=0.01% unsolicited_rex=0.00% ave_xmit_rate 299.53Mbps effective=99.99% effective_rate=299.50Mbps (detail: good_blks 25762789 bl_total 25765926 bl_orig 25762790 bl_rex 3136 dup_last_blks 1) (sndr ctl: sent 1061 rcvd 1061 lost 0 lost 0.00%) (rcvr ctl: sent 9618 rcvd 9617 lost 1 lost 0.01%) (rex ctl: sent 20 rcvd 20 lost 0 lost 0.00%) (progress: tx_bytes 36274005756 file_bytes 55740661756 tx_time 996528996) (estimate transfer size: 0 bytes) sess->instru.num_rex_xfer_not_found 0 sess->instru.num_rex_abrtxfer_not_found 0
…這只是表面上的問題,因為每個文件傳輸會話都包含由某些公共欄位(執行緒 ID、傳輸 ID 和傳輸 UUID)連接的幾種事件類型。一些樣本:
2020-08-21 19:32:07.730 [112c-00001cf0] LOG FASP Session Start uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=started source=D:/Content/somefile.mov dest=/some_destination source_prefix=- local=anon_IP:port peer=dest_IP:33001 tcp_port=22 os="Microsoft Windows *.* " ver=3.9.* lic=9:999:9999 peeros="Isilon OneFS v*.*" peerver=3.* peerlic=99:99:9999 proto_sess=20003 proto_udp=20000 proto_bwmeas=20000 proto_data=2000e 2020-08-21 19:32:07.730 [112c-00001cf0] LOG FASP Session Params uuid=029c1333-c732-4636-83c1-e1acec2864cb userid=0 user="peer_user" targetrate=300000000 minrate=50000000 rate_policy=fair cipher=aes-128-cfb resume=2 create=8 ovr=1 times=0 precalc=no mf=0 mf_path=- mf_suffix=.aspera-inprogress partial_file_suffix=- files_encrypt=yes files_decrypt=no file_csum=NONE dgram_sz=0 prepostcmd=- tcp_mode=no rtt_auto=yes cookie="-" vl_proto_ver=1 peer_vl_proto_ver=1 vl_local=2500000000 vlink_remote=0 vl_sess_id=8 srcbase=- rd_sz=0 wr_sz=0 cluster_num_nodes=1 cluster_node_id=0 cluster_multi_session_threshold=-1 range=0-0 keepalive=no test_login=no proxy_ip=- net_rc_alg=alg_delay exclude_older/newer_than=0/0 2020-08-21 19:32:08.198 [112c-00001cf0] LOG FASP Transfer Start uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=started file="D:/Content/somefile.mov" size=55740661756 start_byte=19466656000 rate=300.00Mbps loss=0.00 rexreqs=0 overhead=0 2020-08-21 19:32:26.433 [112c-00001cf0] LOG Sender bl t/o/r/dl/c/h=448201/448201/0/0/442785/442785 ctl bm/bs=0/0 rex b/r/n=0/0/0 r_ott q/s/n=0/0/146 rate t/m/c=300000000/50000000/300000000 prog t/f/e=616782848/20083438848/18011034 xmtD=0 2020-08-21 19:32:26.433 [112c-00001cf0] LOG Sender DS Qs ds/re/a/d/ao/eo/n/ap/rq/rd/ru/cl/di/do=1/0/0/0/0/0/0/0/0/0/1/0/0/0 Rs o/i=1/1 mgmt backlog i/s/n = 0/412/0 1/456/0
… + 大約 100 個“LOG Sender DS”和“LOG Sender bl”事件在傳輸期間跨越約 16 分鐘,然後,總結:
2020-08-21 19:48:44.231 [112c-00001cf0] LOG FASP Transfer Stop uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=success file="D:/Content/somefile.mov" size=55740661756 start_byte=19466656000 rate=291.35Mbps elapsed=996.02s loss=0.00 rexreqs=0 overhead=0 2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Stop uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=success source=D:/Content/somefile.mov dest=/some_destination source_prefix=- local=local_IP:port peer=dest_IP:33001 tcp_port=22 os="Microsoft Windows *.* " ver=3.9.* lic=9:999:9999 peeros="Isilon OneFS v*.* " peerver=3.* peerlic=99:99:9999 proto_sess=20003 proto_udp=20000 proto_bwmeas=20000 proto_data=2000e 2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Params uuid=029c1333-c732-4636-83c1-e1acec2864cb userid=0 user="peer_user" targetrate=300000000 minrate=50000000 rate_policy=fair cipher=aes-128-cfb resume=2 create=8 ovr=1 times=0 precalc=no mf=0 mf_path=- mf_suffix=.aspera-inprogress partial_file_suffix= files_encrypt=yes files_decrypt=no file_csum=NONE dgram_sz=0 prepostcmd=- tcp_mode=no rtt_auto=yes cookie="-" vl_proto_ver=1 peer_vl_proto_ver=1 vl_local=2500000000 vlink_remote=0 vl_sess_id=8 srcbase=- rd_sz=0 wr_sz=0 cluster_num_nodes=1 cluster_node_id=0 cluster_multi_session_threshold=-1 range=0-0 keepalive=no test_login=no proxy_ip=- net_rc_alg=alg_delay exclude_older/newer_than=0/0
…會話以上面已經發布的“會話統計”事件結束:
2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Statistics [Sender] id=029c1333-c732-4636-83c1-e1acec2864cb delay=64ms rex_delay=72ms solicited_rex=0.01% unsolicited_rex=0.00% ave_xmit_rate 299.53Mbps effective=99.99% effective_rate=299.50Mbps (detail: good_blks 25762789 bl_total 25765926 bl_orig 25762790 bl_rex 3136 dup_last_blks 1) (sndr ctl: sent 1061 rcvd 1061 lost 0 lost 0.00%) (rcvr ctl: sent 9618 rcvd 9617 lost 1 lost 0.01%) (rex ctl: sent 20 rcvd 20 lost 0 lost 0.00%) (progress: tx_bytes 36274005756 file_bytes 55740661756 tx_time 996528996) (estimate transfer size: 0 bytes) sess->instru.num_rex_xfer_not_found 0 sess->instru.num_rex_abrtxfer_not_found 0
我請 IBM 支持人員解釋其中一些含義並得到答复;將發布我學到的內容作為答案。如果您有任何其他資訊 - 請分享!
以下是 IBM 的回复,解釋了一些日誌 + 我們自己的分析破譯“會話統計”事件 - 這使我們能夠解決一個相當具有挑戰性的問題。希望這對某人有幫助!
IBM 的回复(很長):
工程部門並未考慮為客戶破譯大部分日誌消息,它們主要用於內部故障排除,因此我們沒有記錄大量此類消息,並且無法保證這些消息在未來版本中不會更改.
這是我們記錄的一些描述(請注意新版本中的消息不同)
FASP receiver session logging (2.0.2/2.1.x/2.2.x/2.5.x/2.6.x/2.7.x/2.8.x/3.x.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/f/l=2/1/0/0/0 rex_rtt r/s/o=0/0/8 rate_rtt b/r/s/f=0/0/0/1 rex n/s/q/a/r=0/0/0/0/0 bl l/d/o/x/r=0/1/0/0/0 disk b/w=0/1452 rate t/m/c=10000000/200000/10000000 prog t/f/e=60/60/132229 rcvD=1 (2.0.2/2.1.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/f/l=2/1/0/0/0 rex_rtt r/s/o=0/0/8 rate_rtt b/r/s/f=0/0/0/1 rex n/s/q/a/r=0/0/0/0/0 bl l/d/o/x/r=0/1/0/0/0 disk b/w=0/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.2.x/2.5.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/f/l=2/1/0/0/0 rex_rtt r/s/o=0/0/8 ooo_rtt r/s/o=0/0/8 rate_rtt b/r/s/f=0/0/0/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/x/r=0/1/0/0/0 disk b/w=0/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.6.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/d=2/1/0/0 rex_rtt l/h/s/o=0/20/5/8 ooo_rtt l/h/s/o=0/0/0/8 rate_rtt b/l/h/s/f=1/0/20/10/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/1/0/0/1/1/0/0/0/0 disk l/h/b/w=0/6/1/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.7.1/2.7.3) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/d=2/1/0/0 rex_rtt l/h/s/o=0/20/5/8 ooo_rtt l/h/s/o=0/0/0/8 rate_rtt b/l/h/s/r/f=1/0/20/10/0/1 ctl bm/bs=0/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/1/0/0/1/1/0/0/0/0 disk l/h/b/w=0/6/1/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.7.4 and 2.8.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/d/ts=2/1/0/0/2 rex_rtt l/h/s/o=0/20/5/8 ooo_rtt l/h/s/o=0/0/0/8 rate_rtt b/l/h/s/r/f=1/0/20/10/0/1 ctl bm/bs=0/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/1/0/0/1/1/0/0/0/0 disk l/h/b=0/6/1 vlink lq/lo/rq/ro=5/10/0/0 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (3.0.x and later) Receiver Blocks Info (bl) [t/o/r/f/l] (pre-2.7) t = total number of blocks received o = number of original-blocks received r = number of retransmitted-blocks received f = number of times first block was received l = number of times the last block was received Receiver Blocks Info (bl) [t/o/r/d] (2.7 and 2.8.x) t = total number of blocks received o = number of original-blocks received r = number of retransmitted-blocks received d = number of blocks discarded (before getting accepted) (t = o + r + d) Receiver Blocks Info (bl) [t/o/r/d/ts] (3.0.x) t = total number of blocks received o = number of original-blocks received r = number of retransmitted-blocks received d = number of blocks discarded (before getting accepted) ts= total number of blocks sent by sender (only non-zero if peer supports sending it >=3.0.x) (t = o + r + d) Retransmission round-trip time (rex_rtt) [r/s/o] (pre-2.7) r = latest measurement of REX RTT s = smoothed version of REX RTT o = latest computed time-out value (i.e. RTO) Retransmission round-trip time (rex_rtt) [l/h/s/o] (2.7) l = smallest measured REX RTT during last period h = biggest measured REX RTT during last period s = smoothed version of REX RTT o = latest computed time-out value (i.e. RTO) Out-of-order blocks round-trip time (ooo_rtt) [r/s/o] (pre-2.7) r = latest measurement of OOO RTT s = smoothed version of OOO RTT o = latest computed time-out value (i.e. RTO) Out-of-order blocks round-trip time (ooo_rtt) [l/h/s/o] (2.7) l = smallest measured OOO RTT during last period h = biggest measured OOO RTT during last period s = smoothed version of OOO RTT o = latest computed time-out value (i.e. RTO) Network round-trip time (rate_rtt) [b/r/s/f] (pre-2.7) b = base (minimum) network RTT r = latest measurement of network RTT s = smoothed version of (network) RTT (In 2.2.x, "s" is the sum of the base RTT plus the largest queueing component -- the one that currently takes effect) f = FASP rate adaptation mode (meaningful with –Q) Network round-trip time (rate_rtt) [b/l/h/s/r/f] (2.7) b = base (minimum) network RTT l = smallest measured NET RTT during last period h = biggest measured NET RTT during last period s = smoothed version of (network) RTT (In 2.2.x, "s" is the sum of the base RTT plus the largest queueing component -- the one that currently takes effect) r = number of times base RTT was reset f = FASP rate adaptation mode (meaningful with –Q) Control PDU stats (ctl) [bm/bs] (2.7.4 - ) bm = number of control PDUs with badmac bs = number of control PDUs with badseqno Retransmit Requests (rex) [n/s/q/a/r] (2.0.x - 2.5.x) n = number of NACKs outstanding in the NACK-list s = number of NACK requests sent q = number of NACK request messages sent (1 NACK request message carries multiple NACK requests) a = number of ADD failures to NACK list (due to out of space) r = number of REMOVE failures from nack list (implies duplicated retransmissions or buggy behavior) Retransmit Requests (rex) [n/s/q/v/a/r] (2.6.x) n = number of NACKs outstanding in the NACK-list s = number of NACK requests sent q = number of NACK request messages sent (1 NACK request message carries multiple NACK requests) v = number of NACK requests removed because block already revd a = number of ADD failures to NACK list (due to out of space) r = number of REMOVE failures from nack list (implies duplicated retransmissions or buggy behavior) Lost blocks (bl) [l/d/o/x/r] (pre-2.7) l = number of lost original blocks (note: if rex buffer/timer full, this counter is incorrectly adding same lost > originals multiple times) d = number of duplicate blocks received o = number of out-of-order blocks received x = number of lost blocks for which retransmission should have been requested, but were not because the rex table was full r = number of lost blocks due to receiver ring buffer full Lost blocks (bl) [l/d/o/r/a/x/dl/df/dm/ds] (2.7) l = number of lost original blocks (note: if rex buffer/timer full, this counter is incorrectly adding same lost > originals multiple times) d = number of duplicate blocks received o = number of out-of-order blocks received r = number of lost blocks due to receiver ring buffer full a = number of blocks discarded due to aborted transfers x = number of blocks discarded when xfer is not ready dl= number of duplicate last blocks df= number of blocks dropped due to bad format (before being accepted) dm= number of blocks dropped due to bad mac checksum (before being accepted) ds= number of blocks dropped due to bad sequence number (before being accepted) (d (number of blocks discarded) = r + a + x + dl) Disk (disk) [b/w] b = number of buffers occupied in the disk-ring (equiv to disk backlog) w = total number of bytes written (including non-contiguous) Disk (disk) [l/h/b/w] (2.7) l = smallest measured buffer depth during last period h = largest measured buffer depth during last period b = number of buffers occupied in the disk-ring (equiv to disk backlog) w = total number of bytes written (including non-contiguous) Rate (rate) [t/m/c] (2.0.2/2.1.x) t = target rate of the session m = minimum rate of the session c = transfer rate currently computed Rate (rate) [t/m/c/n/vl/vr/r] (2.2.x) t = target rate of the session m = minimum rate of the session c = transfer rate currently computed (minimal of n/vl/vr/r) n = transfer rate computed based on network congestion vl = transfer rate computed based on local virtual link vr = transfer rate computed based on remote virtual link r = transfer rate computed based on receiver storage I/O Vlink (vlink) [lq/lo/rq/ro] (3.0.x) lq = local vlink queueing delay (ms) lo = local vlink occupancy (%) rq = remote vlink queueing delay (ms) ro = remote vlink occupancy (%) Progress (prog) [t/f/e] t = running total of bytes received f = running total of file bytes received e = total elapsed time (microseconds) Number of files received (rcvD)
FASP 發件人會話記錄 (2.0.2/2.1.x/2.2.x/2.5.x/2.6.x/2.7.x/2.8.x)
2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/f/l=3/3/0/0/0 rex b/r/n=0/0/0 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.0.2-2.6.x excluding 2.6.4) 2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/f/l=3/3/0/0/0 rex b/r/n=0/0/0 r_ott q/s/n=2/1/1000 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.6.4/2.7.1) 2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/dl=3/3/0/1 rex b/r/n=0/0/0 r_ott q/s/n=2/1/1000 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.7.1/2.7.3) 2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/dl=3/3/0/1 ctl bm/bs=0/1 rex b/r/n=0/0/0 r_ott q/s/n=2/1/1000 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.7.4 and after) Sender Blocks Info (bl) [t/o/r/f/l] t = total number of blocks sent o = number of original-blocks transmitted r = number of retransmitted-blocks transmitted f = number of times first block was transmitted l = number of times the last block was transmitted Sender Blocks Info (bl) [t/o/r/dl] (2.7.3 and newer) t = total number of blocks sent o = number of original-blocks transmitted r = number of retransmitted-blocks transmitted dl = number of duplicated last block Control PDU stats (ctl) [bm/bs] (2.7.4 - ) bm = number of control PDUs with badmac bs = number of control PDUs with badseqno Retransmissions (rex) [b/r/n] b = number of NACKs received (1 NACK per block) r = number of NACK requests received n = number of outstanding blocks to be retransmitted Reverse path queueing (r_ott) [q/s/n] q = reverse path queueing (ms) s = estimated clock skew (magnified by 1000) n = number of reverse path queueing probes accepted Rate [t/m/c] t = target rate of the session m = minimum rate of the session c = current transfer rate (computed by the receiver) Progress (prog) [t/f/e] t = running total of bytes transmitted f = running total of file bytes transmitted e = total elapsed time (microseconds) Number of files transmitted (xmtD)
DS 發件人日誌記錄
Queues (Qs) [ds/re/a/d/ao/eo/n/ap/rq/rd/ru/di/do] ds = nubmer of DS count re = number of entries in the "retry" queue (used when OS returns too many open file) a = number of entries in the "arguments" queue d = number of entries in the directory queue ao = number of entries in the open asynchronous queue eo = number of entries in the open error queue n = number of entries in the "new file" queue ap = number of entries in the "attribute preparation" queue (in the process of calculating md5 checksum) rq = number of entries for which the "DS request" was sent (not yet responded to) rd = number of entries in the READY queue (DS response received) ru = number of entries in the RUNNING queue cl = number of entries in the CLEAN queue di = number of entries in the DISP queue (no action needed, for example the file was already transferred) do = number of entries in the DONE queue (finished transferring) Requests/Responses (Rs) [o/i] o = number of DS requests sent (cumulative) i = number of DS responses received (cumulative) Management backlog (mgmt backlog) [i/s/n] i = index of management socket s = file descriptor of management socket n = total messages buffered for socket
DS 接收器記錄
Queues (Qs) [ds/n/rq/ao/ap/rd/ru/no/po/pc/do] ds = nubmer of DS count n = number of entries in the "new file" queue rq = number of entries for which the "DS request" was received ao = number of entries in the open asynchronous queue ap = number of entries in the attribute preparation queue rd = number of entries in the READY queue ru = number of entries in the RUNNING queue no = number of entries in the NOXFER DISP queue (no action needed, for example the file was already transferred) po = number of entries in the POST queue pc = number of entries in the post complete queue do = number of entries in the DONE queue (finished transferring) Requests/Responses (Rs) [o/i] o = number of DS requests received (cumulative) i = number of DS responses sent (cumulative)
IBM 尚未提供有關“FASP 會話統計”的資訊,因此這是我團隊的實證分析:
FASP 會話統計
[112c-00001cf0] thread ID - shared across events for the same transfer session [Sender] Role / type. I.e. "sender" means an outbound transfer id=029c1333-***64cb another form of a transfer id ("UUID" in some other events) delay=64ms TBD rex_delay=72ms "retransmission delay", TBD solicited_rex=0.01% TBD unsolicited_rex=0.00% TBD ave_xmit_rate 299.53Mbps self explanatory; formula on how it's calculated - TBD effective=99.99% self explanatory effective_rate=299.50Mbps self explanatory (detail: good_blks 25762789 bl_total 25765926 bl_orig 25762790 bl_rex 3136 dup_last_blks 1) "detail" stanza good_blks : actual blocks to transfer bl_total : total blocks transferred incl. rex : forumula: bl_total = bl_orig + bl_rex bl_orig : "original blocks" = good_blks + dup_last_blks bl_rex : retransmitted blocks* dup_last_blks : "duplicated last blocks" * high ratio of bl_rex to good_blks indicates an issue (sndr ctl: sent 1061 rcvd 1061 lost 0 lost 0.00%) (rcvr ctl: sent 9618 rcvd 9617 lost 1 lost 0.01%) "Sender Control" stanza (TBD) (rex ctl: sent 20 rcvd 20 lost 0 lost 0.00%) "Retransmission Control" stanza (TBD) (progress: tx_bytes 36274005756 file_bytes 55740661756 tx_time 996528996) "Progress" stanza tx_bytes : transfer bytes file_bytes : file bytes tx_time : transfer time (in microseconds) (estimate transfer size: 0 bytes) self explanatory (relationship with blocks and progress bytes TBD sess->instru.num_rex_xfer_not_found 0 TBD sess->instru.num_rex_abrtxfer_not_found 0 TBD
案例:性能不佳的 Aspera 伺服器
解密此特定事件(上面的“LOG FASP 會話統計”)類型中的欄位使我們能夠在 Splunk 中並排比較我們的兩個 Aspera 伺服器及其關鍵統計數據(重傳率和每小時 GBytes)。比較顯示,在 2020 年 8 月的整個月中,其中一個伺服器的重傳塊比率異常高,性能降低了 40%。查看前幾個月的類似統計數據表明,兩台伺服器的重傳率不相上下,低於一個百分比,以及 >90GB/h 的性能。
有問題的伺服器在 8 月初被遷移到不同的網段。除了我們看到的其他錯誤之外,這種分析使我們能夠檢測和衡量問題的影響,並最終將罪魁禍首(網路配置)歸零。一旦問題得到緩解,伺服器的性能就會恢復正常。
如果您將 Aspera 日誌提取到 Splunk 並希望在搜尋時提取欄位並顯示統計資訊的 SPL,我將在 Splunk Answers 中發布一些內容並在此處連結。