Examining Clang output

In this post I present a short comparison between the machine code generated by Clang 5.0.0 for two different for loops.

Sum up to N

The first function sums from 1 to N and returns the result.

C++

int sumUpTo(int n) {
  int x = 0;
  for (int i = 1; i <= n; i++) {
    x += i;
  }
  return x;
}

Output

sumUpTo(int):
  test edi, edi
  jle .LBB0_1
  lea eax, [rdi - 1]
  lea ecx, [rdi - 2]
  imul rcx, rax
  shr rcx
  lea eax, [rcx + 2*rdi - 1]
  ret
.LBB0_1:
  xor eax, eax
  ret

As you can see, Clang used a closed formula to evaluate the expression!

What should it be

  (1 + n)(n) / 2
= (n^2 + n) / 2

What Clang does

  (n - 1)(n - 2) / 2 + (2n - 1)
= (n^2 - 3n + 2) / 2 + (4n - 2) / 2
= (n^2 + n) / 2

It is important to note that the reason why a compiler would rather evaluate this convoluted expression is that it will not overflow like the first expression would.

Sum up to N multiplying by 2

C++

int sumTwiceUpTo(int n) {
  int x = 0;
  for (int i = 1; i <= n; i++) {
    x += 2 * i;
  }
  return x;
}

Output

sumTwiceUpTo(int):
  test edi, edi
  jle .LBB1_1
  lea eax, [rdi - 1]
  lea ecx, [rdi - 2]
  imul ecx, eax
  and ecx, -2
  lea eax, [rcx + 4*rdi - 2]
  ret
.LBB1_1:
  xor eax, eax
  ret

What is is

  (1 + n)(n)
= (n^2 + n)

What Clang does

  ((n - 1)(n - 2) & 111...0) + 4n - 2
= (n - 1)(n - 2) + 4n - 2
= n^2 + n

Again, a closed formula!

It is interesting to note that the and is completely useless here: it is forcing the product (which is always even) into a even number. Also, it requires the multiplication result to be evaluated, so it should add measurable delay.

Removing the unnecessary instruction

Comparing my modified version with the version generated by Clang on my machine shows that removing the and improves performance, on average, by 16 percent.

C++ Benchmark Code

#include <chrono>
#include <iostream>
#include <sstream>

using namespace std;

int sumTwiceUpTo(int n) {
  asm("test edi, edi\n"
      "jle .ASMLB1_1\n"
      "lea eax, [rdi - 1]\n"
      "lea ecx, [rdi - 2]\n"
      "imul ecx, eax\n"
      "and ecx, -2\n"
      "lea eax, [rcx + 4*rdi - 2]\n"
      "ret\n"
      ".ASMLB1_1:\n"
      "xor eax, eax\n");
}

int mySumTwiceUpTo(int n) {
  asm("test edi, edi\n"
      "jle .ASMLB2_1\n"
      "lea eax, [rdi - 1]\n"
      "lea ecx, [rdi - 2]\n"
      "imul ecx, eax\n"
      "lea eax, [rcx + 4*rdi - 2]\n"
      "ret\n"
      ".ASMLB2_1:\n"
      "xor eax, eax\n");
}

int main() {
  auto start = std::chrono::steady_clock::now();
  for (int j = 0; j < 1000; j++) {
    for (int i = 0; i < 65536; i++) {
      sumTwiceUpTo(i);
    }
  }
  auto now = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::nano> ns = now - start;
  cout << ns.count() << " ns" << '\n';

  start = std::chrono::steady_clock::now();
  for (int j = 0; j < 1000; j++) {
    for (int i = 0; i < 65536; i++) {
      mySumTwiceUpTo(i);
    }
  }
  now = std::chrono::steady_clock::now();
  ns = now - start;
  cout << ns.count() << " ns" << '\n';
}

You can compile it with

g++ -O1 -masm=intel main.cpp -o main && ./main

Wunderlist to Taskwarrior

Description

Wunderlist to Taskwarrior is an application that fetches your tasks from Wunderlist and inserts them into Taskwarrior. It can be regularly ran by the operating system to keep Taskwarrior up-to-date.

Why I Made This

As I cannot update Taskwarrior when I am not near one of my computers I needed a way to update Taskwarrior while on the road. The solution I found uses Wunderlist as an intermediate. I already used Wunderlist for simple lists such as the weekly groceries but I greatly prefer Taskwarrior over it, so I still use Taskwarrior when I am at my computer.

Relevant Features

  • Selective Synchronization. Lists starting with “!” are never synchronized.
  • Data Safety. All communication is done via HTTPS.
  • Persistence. SQLite is used to keep track of what has been synchronized.

I had to decide between a continuous process which would update Taskwarrior or a utility application that would be scheduled to run regularly. I opted for the latter as the application initialization and termination are not so expensive. Currently, I’ve been running it every minute on one of my computers.

The Code

Wunderlist to Taskwarrior was written in Haskell using Stack so it should be easy enough to get a build for your own computer.

This is the GitHub repository.

Minimum Window Substring

Description

This is a post regarding the solution to a substring problem which follows a pattern that is quite common among substring problems: it is solvable with a pair of fast and slow iterators.

If this is not possible, we will return the empty string.

Problem Statement

Given two strings, S and T, find the minimum window in S which will contain all the characters in T (with their frequencies) in linear time.

S T Answer
"" "" ""
"" "A" ""
"A" "" ""
"A" "A" "A"
"A" "AB" "A"
"ADOBECODEBANC" "ABC" "BANC"
"ADOBECODECABANC" "ABC" "CAB"

C++ Solution

string minimum_window_substring(string s, string t) {
  if (t.empty()) {
    return "";
  }
  // Frequency in the token.
  unordered_map<char, size_t> tf;
  for (char c : t) {
    tf[c]++;
  }
  // Frequency in the string.
  unordered_map<char, size_t> sf;
  // How many characters we have to match.
  const auto target = t.size();
  const auto n = s.size();
  size_t best_i = 0;
  const auto no_size = numeric_limits<size_t>::max();
  size_t best_size = no_size;
  size_t slow = 0;
  size_t fast = 0;
  size_t matched = 0;
  // Advance fast until we met the target.
  // Then shrink with slow until we no longer meet the conditions.
  // As this advances each pointer at most N times, this is O(n).
  while (slow < n && fast < n) {
    sf[s[fast]]++;
    if (sf[s[fast]] <= tf[s[fast]]) {
      matched++;
    }
    fast++;
    // Here, the range is [slow, fast).
    while (matched == target && slow < fast) {
      const size_t match_size = fast - slow;
      if (match_size < best_size) {
        best_i = slow;
        best_size = match_size;
      }
      sf[s[slow]]--;
      if (sf[s[slow]] < tf[s[slow]]) {
        matched--;
      }
      slow++;
    }
  }
  if (best_size == no_size) {
    return "";
  }
  return s.substr(best_i, best_size);
}

Hellish Bricks DevBlog 2

This is an update on what I have recently done to Hellish Bricks.

Progression Curves

The game got progression curves for running speed and initial jump speed.

Speed Curve

Jump Speed Curve

These are important so that the character evolves, creating more gameplay possibilities as the game progresses.

Colored Lights

I’ve also enabled lantern and ambient light commands so that the player can change the lantern and the ambient light colors during the game or during initialization, through the boot script.

Default Boot Script

# Set lantern to a light yellow.
lantern #EEEEBB
# Set ambient to a dark red.
ambient #443322

White and Colored Lights Comparison

White Lights

Colored Lights

Hellish Bricks DevBlog 1

This is the first of what may end up being a series of posts regarding the development of my closed-source game Hellish Bricks.

The game is heavily inspired on Devil Daggers (2016), but intends to focus on low-poly and minimalist graphics instead of retro ones.

Hellish Bricks is written using modern C++ and requires at least OpenGL 3.3 and just about 128 MiB of free memory for now.

I intend to write better and more visually appealing shaders in the future and start working on some fancier graphical effects, such as particles and shadows.

Gameplay Video

Playable Demo

You need at least Windows Vista to run the game and you may also need to download and install the Microsoft Visual C++ Redistributable for Visual Studio 2017 before being able to execute the game.

New versions of Wine are also able to run the abovementioned executable.

In the future I may release Linux, BSD and maybe even macOS demo builds.