Examining Compiler Output 2


In this post I present a short comparison between the machine code generated by GCC 7.2 and Clang 5.0.0 for the evaluation of a floating point polynomial.

For both compilers, only -O2 was used, but -O3 produced the same code.

float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);


evaluate(float, float):
  subss xmm0, xmm1
  movss xmm2, DWORD PTR .LC0[rip]
  movaps xmm1, xmm0
  addss xmm1, xmm2
  mulss xmm1, xmm0
  subss xmm0, xmm2
  mulss xmm0, xmm1
  .long 1065353216


  .long 1065353216 # float 1
  .long 3212836864 # float -1
evaluate(float, float): # @evaluate(float, float)
  subss xmm0, xmm1
  movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
  addss xmm1, xmm0
  mulss xmm1, xmm0
  addss xmm0, dword ptr [rip + .LCPI0_1]
  mulss xmm0, xmm1

Explanation (for the Clang output)

xmm0 = a
xmm1 = b
xmm0 = a - b
xmm1 = 1.0f
xmm1 = a - b + 1.0f
xmm1 = (a - b + 1.0f) * (a - b)
xmm0 = a - b - 1.0f
xmm0 = (a - b - 1.0f) * (a - b + 1.0f) * (a - b)

GCC seems to prefer movaps over movss, even though movss is sufficient in this case. A reason for doing so is that using movaps avoid stalls from partial updates to XMM registers. Clang doesn’t generate movaps, but uses two constants and only addss for them rather than only having one and using subss to subtract one from a register.

After benchmarking these alternatives, they had roughly the same throughput.