Consider the following code:

    #include <limits>
    #include <cstdint>

    using t = uint32_t; // or uint64_t

    t shift(t x, t y, t n)
    {
        return (x >> n) | (y << (std::numeric_limits<t>::digits - n));
    }

According to godbolt, clang 3.8.1 generates the following assembly code for -O1, -O2 and -O3:

    shift(unsigned int, unsigned int, unsigned int):
            movb    %dl, %cl
            shrdl   %cl, %esi, %edi
            movl    %edi, %eax
            retq

While gcc 6.2 (even with -mtune=haswell) generates:

    shift(unsigned int, unsigned int, unsigned int):
        movl    $32, %ecx
        subl    %edx, %ecx
        sall    %cl, %esi
        movl    %edx, %ecx
        shrl    %cl, %edi
        movl    %esi, %eax
        orl     %edi, %eax
        ret

This seems far less optimized, since shrd is fast on Intel Sandy Bridge and later. Is there any way to rewrite the function to facilitate optimization by compilers (in particular gcc) and to favor the use of the shld/shrd assembly instructions?
Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?
With -march=haswell, it emits BMI2 shlx / shrx, but still not shrd.
No, I can see no way to get gcc to use the shrd instruction.
You can manipulate the output that gcc generates by changing the -mtune and -march options.
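To experiment with those options, here is a minimal sketch that keeps the function from the question in its own translation unit so the generated assembly is easy to read. The file name `shift.cpp` is hypothetical, the command lines in the comments only use the flags discussed in this thread, and the precondition comment is an added note rather than part of the original code.

```cpp
// shift.cpp (hypothetical file name): the function from the question,
// isolated so that the compiler output contains nothing else.
//
// Example invocations, printing the assembly to stdout:
//   g++ -O3 -S -o - shift.cpp                  # baseline x86-64
//   g++ -O3 -mtune=haswell -S -o - shift.cpp   # tuning only
//   g++ -O3 -march=haswell -S -o - shift.cpp   # allows BMI2 shlx/shrx
//   g++ -Os -S -o - shift.cpp                  # optimize for size
#include <cstdint>
#include <limits>

using t = std::uint32_t; // or std::uint64_t

// Precondition (added note): 0 < n < digits. With n == 0 the left shift
// would be by the full width of t, which is undefined behaviour in C++.
t shift(t x, t y, t n) {
    return (x >> n) | (y << (std::numeric_limits<t>::digits - n));
}
```

Comparing the output of each command line shows which instruction sequence a given compiler version picks for the same source.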
> Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

Yes, you can get gcc to generate BMI2 code.
For example, x86-64 gcc 6.2 with -O3 -march=znver1 (AMD Zen) generates the following (Haswell timings):

    code                     critical path latency     reciprocal throughput
    ------------------------------------------------------------------------
    mov     eax, 32          *                         0.25
    sub     eax, edx         1                         0.25
    shlx    eax, esi, eax    1                         0.5
    shrx    esi, edi, edx    *                         0.5
    or      eax, esi         1                         0.25
    ret
    total:                   3                         1.75

Compared to clang 3.8.1:

    code                     critical path latency     reciprocal throughput
    ------------------------------------------------------------------------
    mov     cl, dl           1                         0.25
    shrd    edi, esi, cl     4                         2
    mov     eax, edi         *                         0.25
    ret
    total:                   5                         2.25

Given the dependency chain here, shrd is slower on Haswell, tied on Sandy Bridge, and slower on Skylake.
The reciprocal throughput is also better (lower) for the shrx sequence.
So it depends: on post-BMI processors gcc produces better code, while on pre-BMI processors clang wins.
shrd has wildly varying timings on different processors, so I can see why gcc is not overly fond of it.
Even with -Os (optimize for size), gcc still does not select shrd.
*) Not part of the timing because it is either not on the critical path, or it turns into a zero-latency register rename.
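As a rough illustration of how the latency totals in the two tables are obtained, the sketch below sums only the latencies that count toward the dependency chain and treats every row marked with an asterisk as contributing zero. The `Instr` struct and the `chain_latency` helper are hypothetical names introduced for this example, and the figures are simply the Haswell numbers quoted above, not new measurements.

```cpp
#include <cstddef>

// One row of a timing table.
struct Instr {
    const char* text;  // instruction as it appears in the table
    int latency;       // cycles counted toward the chain; 0 for rows marked '*'
};

// Add up the latencies that count toward the critical path.
template <std::size_t N>
constexpr int chain_latency(const Instr (&seq)[N]) {
    int total = 0;
    for (std::size_t i = 0; i < N; ++i) total += seq[i].latency;
    return total;
}

// gcc 6.2 -O3 -march=znver1 sequence (BMI2 shlx/shrx).
constexpr Instr gcc_bmi2[] = {
    {"mov  eax, 32",       0},  // marked '*'
    {"sub  eax, edx",      1},
    {"shlx eax, esi, eax", 1},
    {"shrx esi, edi, edx", 0},  // marked '*'
    {"or   eax, esi",      1},
};

// clang 3.8.1 sequence (shrd).
constexpr Instr clang_shrd[] = {
    {"mov  cl, dl",        1},
    {"shrd edi, esi, cl",  4},
    {"mov  eax, edi",      0},  // marked '*'
};

static_assert(chain_latency(gcc_bmi2) == 3, "matches the first table's total");
static_assert(chain_latency(clang_shrd) == 5, "matches the second table's total");
```

The single 4-cycle shrd accounts for most of the longer chain, which is why the three 1-cycle dependent instructions of the shrx sequence come out ahead on Haswell.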