# Examining Clang output

2017-11-30In this post I present a short comparison between the machine code generated by Clang 5.0.0 for two different for loops.

## Sum up to N

The first function sums from 1 to N and returns the result.

### C++

```
int sumUpTo(int n) {
int x = 0;
for (int i = 1; i <= n; i++) {
x += i;
}
return x;
}
```

### Output

```
sumUpTo(int):
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea eax, [rcx + 2*rdi - 1]
ret
.LBB0_1:
xor eax, eax
ret
```

As you can see, Clang used a closed formula to evaluate the expression!

### What should it be

```
(1 + n)(n) / 2
= (n^2 + n) / 2
```

### What Clang does

```
(n - 1)(n - 2) / 2 + (2n - 1)
= (n^2 - 3n + 2) / 2 + (4n - 2) / 2
= (n^2 + n) / 2
```

It is important to note that the reason why a compiler would rather evaluate this convoluted expression is that it will not overflow like the first expression would.

## Sum up to N multiplying by 2

### C++

```
int sumTwiceUpTo(int n) {
int x = 0;
for (int i = 1; i <= n; i++) {
x += 2 * i;
}
return x;
}
```

### Output

```
sumTwiceUpTo(int):
test edi, edi
jle .LBB1_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul ecx, eax
and ecx, -2
lea eax, [rcx + 4*rdi - 2]
ret
.LBB1_1:
xor eax, eax
ret
```

### What is is

```
(1 + n)(n)
= (n^2 + n)
```

### What Clang does

```
((n - 1)(n - 2) & 111...0) + 4n - 2
= (n - 1)(n - 2) + 4n - 2
= n^2 + n
```

Again, a closed formula!

It is interesting to note that the and is completely useless here: it is forcing the product (which is always even) into a even number. Also, it requires the multiplication result to be evaluated, so it should add measurable delay.

### Removing the unnecessary instruction

Comparing my modified version with the version generated by Clang on my machine shows that removing the and improves performance, on average, by 16 percent.

### C++ Benchmark Code

```
#include <chrono>
#include <iostream>
#include <sstream>
using namespace std;
int sumTwiceUpTo(int n) {
asm("test edi, edi\n"
"jle .ASMLB1_1\n"
"lea eax, [rdi - 1]\n"
"lea ecx, [rdi - 2]\n"
"imul ecx, eax\n"
"and ecx, -2\n"
"lea eax, [rcx + 4*rdi - 2]\n"
"ret\n"
".ASMLB1_1:\n"
"xor eax, eax\n");
}
int mySumTwiceUpTo(int n) {
asm("test edi, edi\n"
"jle .ASMLB2_1\n"
"lea eax, [rdi - 1]\n"
"lea ecx, [rdi - 2]\n"
"imul ecx, eax\n"
"lea eax, [rcx + 4*rdi - 2]\n"
"ret\n"
".ASMLB2_1:\n"
"xor eax, eax\n");
}
int main() {
auto start = std::chrono::steady_clock::now();
for (int j = 0; j < 1000; j++) {
for (int i = 0; i < 65536; i++) {
sumTwiceUpTo(i);
}
}
auto now = std::chrono::steady_clock::now();
std::chrono::duration<double, std::nano> ns = now - start;
cout << ns.count() << " ns" << '\n';
start = std::chrono::steady_clock::now();
for (int j = 0; j < 1000; j++) {
for (int i = 0; i < 65536; i++) {
mySumTwiceUpTo(i);
}
}
now = std::chrono::steady_clock::now();
ns = now - start;
cout << ns.count() << " ns" << '\n';
}
```

You can compile it with

```
g++ -O1 -masm=intel main.cpp -o main && ./main
```