c++ - Making g++ use SHLD/SHRD instructions


Consider the following code:

    #include <limits>
    #include <cstdint>

    using t = uint32_t; // or uint64_t

    t shift(t x, t y, t n)
    {
        return (x >> n) | (y << (std::numeric_limits<t>::digits - n));
    }

According to godbolt, clang 3.8.1 generates the following assembly code for -O1, -O2, and -O3:

    shift(unsigned int, unsigned int, unsigned int):
            movb    %dl, %cl
            shrdl   %cl, %esi, %edi
            movl    %edi, %eax
            retq

While gcc 6.2 (even with -mtune=haswell) generates:

    shift(unsigned int, unsigned int, unsigned int):
        movl    $32, %ecx
        subl    %edx, %ecx
        sall    %cl, %esi
        movl    %edx, %ecx
        shrl    %cl, %edi
        movl    %esi, %eax
        orl     %edi, %eax
        ret

This seems far less optimized, since SHRD is fast on Intel Sandy Bridge and later. Is there any way to rewrite the function to facilitate optimization by compilers (and in particular gcc) and to favor the use of the SHLD/SHRD assembly instructions?

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

With -march=haswell, it emits BMI2 shlx / shrx, but still not shrd.
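For reference, the operation in the question is a double-width ("funnel") right shift: the low 32 bits of the 64-bit value y:x shifted right by n. The sketch below spells that out. The name `shift_wide` is hypothetical and not part of the original code, and nothing here claims that any particular compiler turns either form into SHRD. It only shows that the two formulations agree for 1 <= n <= 31, and that the original expression is undefined for n == 0, because it would shift a 32-bit value by 32.

```cpp
#include <cstdint>
#include <limits>

using t = std::uint32_t;

// Original formulation from the question. Only valid for 1 <= n <= 31:
// n == 0 would shift y by 32 bits, which is undefined behavior in C++.
t shift(t x, t y, t n)
{
    return (x >> n) | (y << (std::numeric_limits<t>::digits - n));
}

// Hypothetical alternative (illustrative name): concatenate y:x into a
// 64-bit value, shift right, and keep the low 32 bits.
// Valid for 0 <= n <= 31; n == 0 simply returns x.
t shift_wide(t x, t y, t n)
{
    const std::uint64_t wide = (static_cast<std::uint64_t>(y) << 32) | x;
    return static_cast<t>(wide >> n);
}
```

For example, `shift(0x12345678u, 0x9abcdef0u, 8)` and `shift_wide(0x12345678u, 0x9abcdef0u, 8)` both yield `0xf0123456`.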

No, I can see no way to get gcc to use the SHRD instruction.
You can manipulate the output that gcc generates by changing the -mtune and -march options.

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

Yes, you can get gcc to generate BMI2 code:

E.g.: x86-64 gcc 6.2 -O3 -march=znver1 // AMD Zen
generates the following (timings are for Haswell):

    Code                     Critical path latency     Reciprocal throughput
    ------------------------------------------------------------------------
    mov     eax, 32          *                         0.25
    sub     eax, edx         1                         0.25
    shlx    eax, esi, eax    1                         0.5
    shrx    esi, edi, edx    *                         0.5
    or      eax, esi         1                         0.25
    ret
    Total:                   3                         1.75

Compared to clang 3.8.1:

    mov     cl, dl           1                         0.25
    shrd    edi, esi, cl     4                         2
    mov     eax, edi         *                         0.25
    ret
    Total:                   5                         2.25

Given the dependency chain here, SHRD is slower on Haswell, tied on Sandy Bridge, and slower on Skylake.
The reciprocal throughput is better for the shrx sequence.

So it depends: on post-BMI processors gcc produces better code, on pre-BMI processors clang wins.
SHRD has wildly varying timings on different processors, so I can see why gcc is not overly fond of it.
Even with -Os (optimize for size), gcc still does not select SHRD.

*) Not part of the timing, because it is either not on the critical path or turns into a zero-latency register rename.
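The latency totals in the two tables can be reproduced with a short sketch. It assumes only the Haswell latencies listed above, counting entries marked * as contributing 0 to the critical path. The `Insn` struct and the function and variable names are illustrative, not taken from any tool.

```cpp
#include <vector>

// One row of the timing tables above (Haswell numbers from the answer).
struct Insn {
    const char* text;
    int latency;  // critical-path latency; 0 for entries marked * in the table
};

// Sum the latencies along the dependency chain.
int critical_path_latency(const std::vector<Insn>& seq)
{
    int total = 0;
    for (const Insn& insn : seq) {
        total += insn.latency;
    }
    return total;
}

// gcc 6.2 -O3 -march=znver1 sequence.
const std::vector<Insn> gcc_sequence = {
    {"mov  eax, 32",       0},  // * not on the critical path
    {"sub  eax, edx",      1},
    {"shlx eax, esi, eax", 1},
    {"shrx esi, edi, edx", 0},  // * runs in parallel with the sub/shlx chain
    {"or   eax, esi",      1},
};

// clang 3.8.1 sequence.
const std::vector<Insn> clang_sequence = {
    {"mov  cl, dl",        1},
    {"shrd edi, esi, cl",  4},
    {"mov  eax, edi",      0},  // * zero-latency register rename
};
```

With these inputs, `critical_path_latency(gcc_sequence)` gives 3 cycles and `critical_path_latency(clang_sequence)` gives 5, matching the totals in the tables.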

