I am trying to understand the overhead in [blk_account_io_completion][1] in the Linux block layer, using `perf annotate`. The following snippet is abridged. Can anyone shed light on why the `add` and `test` instructions show such high overhead compared to the neighboring instructions executed around them?
```
         : part_stat_add(cpu, part, sectors[rw], bytes >> 9);
    0.13 : ffffffff813336eb: movsxd r8,r8d
    0.00 : ffffffff813336ee: lea    rdx,[rax*8+0x0]
    0.00 : ffffffff813336f6: mov    rcx,qword ptr [rdi+0x210]
   72.04 : ffffffff813336fd: add    rcx,qword ptr [r8*8-0x7e2df6a0]
    0.22 : ffffffff81333705: add    qword ptr [rcx+rdx*1],rsi
    0.61 : ffffffff81333709: mov    eax,dword ptr [rdi+0x1f4]
   26.52 : ffffffff8133370f: test   eax,eax
    0.00 : ffffffff81333711: je     ffffffff81333733 <blk_account_io_completion+0x83>
```
One possible reason is that these instructions happen to be the ones pointed to by the instruction pointer when the sample was taken. A typical x86 CPU can retire up to 4 instructions per cycle, but when a sample is taken, the program counter points at just 1 of those instructions, not all 4.
Here is an example (see the annotated disassembly at the end of this answer): a simple loop over a bunch of `nop` instructions. Note how the clock ticks cluster on roughly every fourth instruction in the profile, with gaps of 3 instructions between them. You may be seeing a similar effect.
Alternatively, `mov rcx,qword ptr [rdi+0x210]` and `mov eax,dword ptr [rdi+0x1f4]` may miss the cache, and the cycles spent waiting on them get attributed to the next instruction; see here.
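For context on why that hot `add` has two dependent memory reads feeding it: `part_stat_add` updates a per-CPU counter, so the code first loads a base pointer from the partition structure and then a per-CPU offset before touching the counter itself. Below is a minimal user-space sketch of that access pattern; the names and layout are hypothetical stand-ins, and the real kernel macro differs by version:

```c
#include <stdint.h>

#define NR_CPUS 64

/* Hypothetical stand-in for the kernel's __per_cpu_offset[] table
 * (the [r8*8-0x7e2df6a0] operand of the hot add). */
uint64_t per_cpu_offset[NR_CPUS];

struct disk_stats { uint64_t sectors[2]; };

struct hd_struct {
    /* ...other fields... */
    struct disk_stats *dkstats;  /* per-CPU base, loaded by mov rcx,[rdi+0x210] */
};

/* Roughly the shape of part_stat_add(): two dependent loads
 * (base pointer, then per-CPU offset) before the counter update. */
static inline void part_stat_add_sketch(int cpu, struct hd_struct *part,
                                        int rw, uint64_t val)
{
    uintptr_t base = (uintptr_t)part->dkstats;        /* mov rcx,[rdi+0x210]    */
    base += per_cpu_offset[cpu];                      /* add rcx,[r8*8+...]     */
    ((struct disk_stats *)base)->sectors[rw] += val;  /* add [rcx+rdx*1],rsi    */
}
```

If the load of `part->dkstats` misses in cache, the stall surfaces on the `add` that consumes it, which would match the 72.04% of samples attributed to that instruction.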
```
       │ Disassembly of section .text:
       │
       │ 00000000004004ed :
       │    push   %rbp
       │    mov    %rsp,%rbp
       │    movl   $0x0,-0x4(%rbp)
       │  ↓ jmp    25
 14.59 │ d: nop
       │    nop
       │    nop
  0.03 │    nop
 14.58 │    nop
       │    nop
       │    nop
  0.08 │    nop
 13.89 │    nop
       │    nop
  0.01 │    nop
  0.08 │    nop
 13.99 │    nop
       │    nop
  0.01 │    nop
  0.05 │    nop
 13.92 │    nop
       │    nop
  0.01 │    nop
  0.07 │    nop
 14.44 │    addl   $0x1,-0x4(%rbp)
  0.33 │25: cmpl   $0x3fffffff,-0x4(%rbp)
 13.90 │  ↑ jbe    d
       │    pop    %rbp
       │  ← retq
```
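For reference, here is a sketch of a test program that could produce a disassembly like the one above when compiled without optimization; the loop bound and the number of `nop`s are read off the disassembly, and the build and profiling commands are assumed:

```c
/* Assumed reconstruction of the nop-loop test program above.
 * Build (assumed):   gcc -O0 nops.c -o nops
 * Profile (assumed): perf record ./nops && perf annotate main */
int main(void)
{
    /* 0x3fffffff matches the cmpl bound; 20 nops match the loop body. */
    for (unsigned int i = 0; i <= 0x3fffffff; i++)
        __asm__ volatile(
            "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t"
            "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t"
            "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t"
            "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t" "nop");
    return 0;
}
```

On Intel hardware, sampling with a precise event (for example `perf record -e cycles:pp`, which uses PEBS) attributes samples with less skid than plain `cycles`.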