# Examining Compiler Output 2

2018-01-04

In this post I present a short comparison between the machine code generated by GCC 7.2 and Clang 5.0.0 for the evaluation of a floating-point polynomial.

For both compilers, only -O2 was used, but -O3 produced the same code.

```
float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}
```

## GCC

```
evaluate(float, float):
subss xmm0, xmm1
movss xmm2, DWORD PTR .LC0[rip]
movaps xmm1, xmm0
addss xmm1, xmm2
mulss xmm1, xmm0
subss xmm0, xmm2
mulss xmm0, xmm1
ret
.LC0:
.long 1065353216
```

## Clang

```
.LCPI0_0:
.long 1065353216 # float 1
.LCPI0_1:
.long 3212836864 # float -1
evaluate(float, float): # @evaluate(float, float)
subss xmm0, xmm1
movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
addss xmm1, xmm0
mulss xmm1, xmm0
addss xmm0, dword ptr [rip + .LCPI0_1]
mulss xmm0, xmm1
ret
```
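
Both listings store their constants as raw `.long` integers, and only Clang annotates them. As a quick check that `1065353216` and `3212836864` really are `1.0f` and `-1.0f`, here is a minimal sketch that reinterprets the bit patterns. It assumes `float` is a 32-bit IEEE 754 single, and `float_from_bits` is a name made up for this illustration.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit IEEE 754 single-precision value. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
   memcpy is used because it is the well-defined way to type-pun in C. */
float float_from_bits(uint32_t bits) {
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Calling `float_from_bits(1065353216u)` gives `1.0f` (bit pattern `0x3F800000`), and `float_from_bits(3212836864u)` gives `-1.0f` (the same pattern with the sign bit set, `0xBF800000`).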

### Explanation (for the Clang output)

```
xmm0 = a
xmm1 = b
xmm0 = a - b
xmm1 = 1.0f
xmm1 = a - b + 1.0f
xmm1 = (a - b + 1.0f) * (a - b)
xmm0 = a - b - 1.0f
xmm0 = (a - b - 1.0f) * (a - b + 1.0f) * (a - b)
```
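
The trace above can also be written back as C, one statement per instruction. This is only a sketch to make the data flow easier to follow: `evaluate_steps` and `check_steps` are hypothetical names, and the local variables merely borrow the register names.

```c
#include <assert.h>

/* The function from the post, unchanged. */
float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}

/* Hypothetical step-by-step version mirroring the Clang listing.
   The locals are named after the registers they stand in for. */
float evaluate_steps(float a, float b) {
    float xmm0 = a - b;    /* subss xmm0, xmm1          */
    float xmm1 = 1.0f;     /* movss xmm1, [.LCPI0_0]    */
    xmm1 = xmm1 + xmm0;    /* addss xmm1, xmm0          */
    xmm1 = xmm1 * xmm0;    /* mulss xmm1, xmm0          */
    xmm0 = xmm0 + -1.0f;   /* addss xmm0, [.LCPI0_1]    */
    xmm0 = xmm0 * xmm1;    /* mulss xmm0, xmm1          */
    return xmm0;
}

/* Sanity check for inputs that do not produce NaN
   (NaN never compares equal to itself). */
void check_steps(float a, float b) {
    assert(evaluate(a, b) == evaluate_steps(a, b));
}
```

For example, with `a = 5.0f` and `b = 2.0f` the difference is `3.0f`, so both versions compute `4.0f * 3.0f * 2.0f`. Adding `-1.0f` instead of subtracting `1.0f` gives the same result in IEEE 754 arithmetic, which is why Clang is free to make that substitution.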

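GCC copies the register with movaps rather than movss, even though only the low 32 bits are needed in this case. A likely reason is that a register-to-register movss writes only the low element and keeps the rest of the destination, so it depends on the destination's previous contents; movaps overwrites the whole register and so avoids stalls from partial updates to XMM registers. Clang doesn't generate a movaps at all. Instead it loads 1.0f into a fresh register and adds a - b to it, then uses a second constant, -1.0f, with addss, whereas GCC keeps a single constant and subtracts it with subss.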
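
The difference between the two copy instructions can be seen from C with SSE intrinsics. The sketch below assumes an x86 target with SSE and uses `_mm_move_ss`, which has the semantics of a register-to-register movss. The compiler is free to pick other instructions for it, so this models the behaviour rather than guaranteeing the encoding. The function names are made up for this illustration.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Models "movss dst, src" between registers: only lane 0 is replaced,
   lanes 1..3 still come from the old value of dst. */
void copy_like_movss(const float dst[4], const float src[4], float out[4]) {
    __m128 d = _mm_loadu_ps(dst);
    __m128 s = _mm_loadu_ps(src);
    _mm_storeu_ps(out, _mm_move_ss(d, s));
}

/* Models "movaps dst, src": all four lanes are overwritten,
   so the old value of dst is irrelevant. */
void copy_like_movaps(const float src[4], float out[4]) {
    _mm_storeu_ps(out, _mm_loadu_ps(src));
}

/* Small demonstration of the two behaviours. */
void demo_partial_update(void) {
    const float old_dst[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float src[4] = {9.0f, 8.0f, 7.0f, 6.0f};
    float out[4];

    copy_like_movss(old_dst, src, out);
    /* Lane 0 comes from src, the rest survive from old_dst. */
    assert(out[0] == 9.0f && out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f);

    copy_like_movaps(src, out);
    /* Every lane comes from src. */
    assert(out[0] == 9.0f && out[1] == 8.0f && out[2] == 7.0f && out[3] == 6.0f);
}
```

Because the movss-style copy needs the old upper lanes, the processor has to wait for whatever last wrote the destination register. The movaps-style copy has no such input, which is the dependency GCC appears to be avoiding.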

When I benchmarked the GCC and Clang versions, they had roughly the same throughput.