Examining Compiler Output 2

In this post I present a short comparison between the machine code generated by GCC 7.2 and Clang 5.0.0 for the evaluation of a floating point polynomial.

For both compilers, only -O2 was used, but -O3 produced the same code.

float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}

GCC

evaluate(float, float):
  subss xmm0, xmm1
  movss xmm2, DWORD PTR .LC0[rip]
  movaps xmm1, xmm0
  addss xmm1, xmm2
  mulss xmm1, xmm0
  subss xmm0, xmm2
  mulss xmm0, xmm1
  ret
.LC0:
  .long 1065353216

Clang

.LCPI0_0:
  .long 1065353216 # float 1
.LCPI0_1:
  .long 3212836864 # float -1
evaluate(float, float): # @evaluate(float, float)
  subss xmm0, xmm1
  movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
  addss xmm1, xmm0
  mulss xmm1, xmm0
  addss xmm0, dword ptr [rip + .LCPI0_1]
  mulss xmm0, xmm1
  ret

Explanation (for the Clang output)

xmm0 = a
xmm1 = b
xmm0 = a - b
xmm1 = 1.0f
xmm1 = a - b + 1.0f
xmm1 = (a - b + 1.0f) * (a - b)
xmm0 = a - b - 1.0f
xmm0 = (a - b - 1.0f) * (a - b + 1.0f) * (a - b)

GCC seems to prefer movaps over movss, even though movss is sufficient in this case. A reason for doing so is that using movaps avoid stalls from partial updates to XMM registers. Clang doesn’t generate movaps, but uses two constants and only addss for them rather than only having one and using subss to subtract one from a register.

After benchmarking these alternatives, they had roughly the same throughput.

Examining Compiler Output 1

In this post I present a short comparison between the machine code generated by Clang 5.0.0 for two different for loops.

Sum up to N

The first function sums from 1 to N and returns the result.

C++

int sumUpTo(int n) {
  int x = 0;
  for (int i = 1; i <= n; i++) {
    x += i;
  }
  return x;
}

Output

sumUpTo(int):
  test edi, edi
  jle .LBB0_1
  lea eax, [rdi - 1]
  lea ecx, [rdi - 2]
  imul rcx, rax
  shr rcx
  lea eax, [rcx + 2*rdi - 1]
  ret
.LBB0_1:
  xor eax, eax
  ret

As you can see, Clang used a closed formula to evaluate the expression!

What should it be

  (1 + n)(n) / 2
= (n^2 + n) / 2

What Clang does

  (n - 1)(n - 2) / 2 + (2n - 1)
= (n^2 - 3n + 2) / 2 + (4n - 2) / 2
= (n^2 + n) / 2

It is important to note that the reason why a compiler would rather evaluate this convoluted expression is that it will not overflow like the first expression would.

Sum up to N multiplying by 2

C++

int sumTwiceUpTo(int n) {
  int x = 0;
  for (int i = 1; i <= n; i++) {
    x += 2 * i;
  }
  return x;
}

Output

sumTwiceUpTo(int):
  test edi, edi
  jle .LBB1_1
  lea eax, [rdi - 1]
  lea ecx, [rdi - 2]
  imul ecx, eax
  and ecx, -2
  lea eax, [rcx + 4*rdi - 2]
  ret
.LBB1_1:
  xor eax, eax
  ret

What is is

  (1 + n)(n)
= (n^2 + n)

What Clang does

  ((n - 1)(n - 2) & 111...0) + 4n - 2
= (n - 1)(n - 2) + 4n - 2
= n^2 + n

Again, a closed formula!

It is interesting to note that the and is completely useless here: it is forcing the product (which is always even) into a even number. Also, it requires the multiplication result to be evaluated, so it could add measurable delay.

Wunderlist to Taskwarrior

Description

Wunderlist to Taskwarrior is an application that fetches your tasks from Wunderlist and inserts them into Taskwarrior. It can be regularly ran by the operating system to keep Taskwarrior up-to-date.

Why I Made This

As I cannot update Taskwarrior when I am not near one of my computers I needed a way to update Taskwarrior while on the road. The solution I found uses Wunderlist as an intermediate. I already used Wunderlist for simple lists such as the weekly groceries but I greatly prefer Taskwarrior over it, so I still use Taskwarrior when I am at my computer.

Relevant Features

  • Selective Synchronization. Lists starting with “!” are never synchronized.
  • Data Safety. All communication is done via HTTPS.
  • Persistence. SQLite is used to keep track of what has been synchronized.

I had to decide between a continuous process which would update Taskwarrior or a utility application that would be scheduled to run regularly. I opted for the latter as the application initialization and termination are not so expensive. Currently, I’ve been running it every minute on one of my computers.

The Code

Wunderlist to Taskwarrior was written in Haskell using Stack so it should be easy enough to get a build for your own computer.

This is the GitHub repository.

Minimum Window Substring

Description

This is a post regarding the solution to a substring problem which follows a pattern that is quite common among substring problems: it is solvable with a pair of fast and slow iterators.

If this is not possible, we will return the empty string.

Problem Statement

Given two strings, S and T, find the minimum window in S which will contain all the characters in T (with their frequencies) in linear time.

S T Answer
"" "" ""
"" "A" ""
"A" "" ""
"A" "A" "A"
"A" "AB" "A"
"ADOBECODEBANC" "ABC" "BANC"
"ADOBECODECABANC" "ABC" "CAB"

C++ Solution

string minimum_window_substring(string s, string t) {
  if (t.empty()) {
    return "";
  }
  // Frequency in the token.
  unordered_map<char, size_t> tf;
  for (char c : t) {
    tf[c]++;
  }
  // Frequency in the string.
  unordered_map<char, size_t> sf;
  // How many characters we have to match.
  const auto target = t.size();
  const auto n = s.size();
  size_t best_i = 0;
  const auto no_size = numeric_limits<size_t>::max();
  size_t best_size = no_size;
  size_t slow = 0;
  size_t fast = 0;
  size_t matched = 0;
  // Advance fast until we met the target.
  // Then shrink with slow until we no longer meet the conditions.
  // As this advances each pointer at most N times, this is O(n).
  while (slow < n && fast < n) {
    sf[s[fast]]++;
    if (sf[s[fast]] <= tf[s[fast]]) {
      matched++;
    }
    fast++;
    // Here, the range is [slow, fast).
    while (matched == target && slow < fast) {
      const size_t match_size = fast - slow;
      if (match_size < best_size) {
        best_i = slow;
        best_size = match_size;
      }
      sf[s[slow]]--;
      if (sf[s[slow]] < tf[s[slow]]) {
        matched--;
      }
      slow++;
    }
  }
  if (best_size == no_size) {
    return "";
  }
  return s.substr(best_i, best_size);
}

Hellish Bricks DevBlog 2

This is an update on what I have recently done to Hellish Bricks.

Progression Curves

The game got progression curves for running speed and initial jump speed.

Speed Curve

Jump Speed Curve

These are important so that the character evolves, creating more gameplay possibilities as the game progresses.

Colored Lights

I’ve also enabled lantern and ambient light commands so that the player can change the lantern and the ambient light colors during the game or during initialization, through the boot script.

Default Boot Script

# Set lantern to a light yellow.
lantern #EEEEBB
# Set ambient to a dark red.
ambient #443322

White and Colored Lights Comparison

White Lights

Colored Lights