Optimization

This lecture, CS107 Lecture 16, focuses on program optimization: how to make code faster and more efficient, and which optimizations the GCC compiler can perform on our behalf.

What is Optimization?

Optimization is the process of making a program faster or more efficient in terms of time and space. While understanding Big-O notation provides a foundation for efficiency, targeted optimizations can lead to significant performance gains. However, it’s important to focus optimization efforts where they are most needed.

A general approach to optimization is summarized as follows:

  1. If doing something seldom and only on small inputs, do whatever is simplest to code, understand, and debug.
  2. If doing something a lot, or on big inputs, make the primary algorithm’s Big-O cost reasonable.
  3. Let gcc do its magic from there.
  4. Optimize explicitly as a last resort.

GCC Optimization

The GCC compiler can perform various optimizations at different levels. The lecture compares two levels:

  • gcc -O0: Primarily a literal translation of C code.
  • gcc -O2: Enables nearly all reasonable optimizations.

Other, more aggressive or specialized optimization levels exist, such as -O3 (more aggressive, potentially trades size for speed), -Os (optimizes for size), and -Ofast (disregards standards compliance for maximum speed). A comprehensive list of GCC optimization flags is available in the GCC documentation.

Optimizations performed by GCC can target one or more of the following:

  • Static instruction count
  • Dynamic instruction count
  • Cycle count / execution time

Here are some specific GCC optimizations discussed:

Constant Folding

Constant folding is the process where the compiler pre-calculates constant expressions at compile-time.

Example C code:

int seconds = 60 * 60 * 24 * n_days;

The constant expression 60 * 60 * 24 can be calculated by the compiler, so the compiled code will directly use the result (86400) instead of performing the multiplications at runtime.

Example with a more complex expression:

int fold(int param) {
  char arr[5];
  int a = 0x107;
  int b = a * sizeof(arr);
  int c = sqrt(2.0);
  return a * param + (a + 0x15/c + strlen("Hello") * b - 0x37) / 4;
}

Before optimization (-O0), the assembly shows multiple instructions for calculating the expression:

0000000000400626 <fold>:
400626: 55                   push   %rbp
400627: 53                   push   %rbx
400628: 48 83 ec 08          sub    $0x8,%rsp
40062c: 89 fd                mov    %edi,%ebp
40062e: f2 0f 10 05 da 00 00 movsd  0xda(%rip),%xmm0        # 400710 <_IO_stdin_used+0x10>
400635: 00
400636: e8 d5 fe ff ff       callq  400510 <sqrt@plt>
40063b: f2 0f 2c c8          cvttsd2si %xmm0,%ecx
40063f: 69 ed 07 01 00 00    imul   $0x107,%ebp,%ebp
400645: b8 15 00 00 00       mov    $0x15,%eax
40064a: 99                   cltd
40064b: f7 f9                idiv   %ecx
40064d: 8d 98 07 01 00 00    lea    0x107(%rax),%ebx
400653: bf 04 07 40 00       mov    $0x400704,%edi          # "Hello"
400658: e8 93 fe ff ff       callq  4004f0 <strlen@plt>
40065d: 48 69 c0 23 05 00 00 imul   $0x523,%rax,%rax
400664: 48 63 db             movslq %ebx,%rbx
400667: 48 8d 44 18 c9       lea    -0x37(%rax,%rbx,1),%rax
40066c: 48 c1 e8 02          shr    $0x2,%rax
400670: 01 e8                add    %ebp,%eax
400672: 48 83 c4 08          add    $0x8,%rsp
400676: 5b                   pop    %rbx
400677: 5d                   pop    %rbp
400678: c3                   retq

After optimization (-O2), the assembly is significantly shorter, indicating many parts of the expression were folded into constants:

00000000004004f0 <fold>:
4004f0: 69 c7 07 01 00 00    imul   $0x107,%edi,%eax
4004f6: 05 a5 06 00 00       add    $0x6a5,%eax
4004fb: c3                   retq
4004fc: 0f 1f 40 00          nopl   0x0(%rax)

Common Sub-expression Elimination

This optimization prevents recalculating the same expression multiple times by computing it once and storing the result.

Example C code:

int a = (param2 + 0x107);
int b = param1 * (param2 + 0x107) + a;
return a * (param2 + 0x107) + b * (param2 + 0x107);

The expression (param2 + 0x107) is a common sub-expression. GCC calculates it once and reuses the result, as the assembly shows. This optimization can occur even at -O0.

00000000004004f0 <subexp>:
4004f0: 81 c6 07 01 00 00    add    $0x107,%esi
4004f6: 0f af fe             imul   %esi,%edi
4004f9: 8d 04 77             lea    (%rdi,%rsi,2),%eax
4004fc: 0f af c6             imul   %esi,%eax
4004ff: c3                   retq
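The same transformation can be applied by hand. This sketch (a paraphrase of what the compiler does, not the lecture's exact code) factors the shared term into a temporary:

```c
// Hand-applied common sub-expression elimination: compute the shared
// term (param2 + 0x107) once and reuse it everywhere.
int subexp(int param1, int param2) {
    int t = param2 + 0x107;   // common sub-expression, computed once
    int a = t;
    int b = param1 * t + a;
    return a * t + b * t;     // equivalently (a + b) * t
}
```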

Dead Code Elimination

Dead code elimination removes code that has no effect on the program’s output.

Examples of dead code:

  • An if statement with a condition that is always false: if (param1 < param2 && param1 > param2)
  • A for loop with an empty body: for (int i = 0; i < 1000; i++);
  • if/else statements where both branches perform the same operation.
  • if/else statements where the else branch is effectively unreachable or redundant.
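The patterns above can be combined into a function like the following (a hypothetical reconstruction consistent with the assembly below, not necessarily the lecture's exact source):

```c
// Hypothetical sketch combining the dead-code patterns above.
int dead_code(int param1, int param2) {
    // Always-false condition: a value cannot be both less than and
    // greater than another value.
    if (param1 < param2 && param1 > param2) {
        return 0;
    }
    // Loop with an empty body: no observable effect.
    for (int i = 0; i < 1000; i++);
    // Both branches compute the same result.
    if (param1 == param2) {
        return param1 + 1;
    } else {
        return param1 + 1;
    }
}
```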

Before optimization (-O0), the assembly includes instructions for the dead code:

00000000004004d6 <dead_code>:
4004d6: b8 00 00 00 00       mov    $0x0,%eax
4004db: eb 03                jmp    4004e0 <dead_code+0xa>
4004dd: 83 c0 01             add    $0x1,%eax
4004e0: 3d e7 03 00 00       cmp    $0x3e7,%eax
4004e5: 7e f6                jle    4004dd <dead_code+0x7>
4004e7: 39 f7                cmp    %esi,%edi
4004e9: 75 05                jne    4004f0 <dead_code+0x1a>
4004eb: 8d 47 01             lea    0x1(%rdi),%eax
4004ee: eb 03                jmp    4004f3 <dead_code+0x1d>
4004f0: 8d 47 01             lea    0x1(%rdi),%eax
4004f3: f3 c3                repz retq

After optimization (-O2), the dead code is removed in the assembly:

00000000004004f0 <dead_code>:
4004f0: 8d 47 01             lea    0x1(%rdi),%eax
4004f3: c3                   retq
4004f4: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4004fb: 00 00 00
4004fe: 66 90                xchg   %ax,%ax

Strength Reduction

Strength reduction replaces computationally expensive operations with equivalent, less expensive ones. Examples include changing division to multiplication (by the reciprocal), or multiplication and modulo operations to bit shifts and bitwise AND when dealing with powers of two.

Example C code:

int a = param2 * 32; // multiplication can be replaced by left shift
int b = a * 7;
int c = b / 3;
int d = param2 % 2; // modulo by 2 can be replaced by bitwise AND with 1
for (int i = 0; i <= param2; i++) {
  c += param1[i] + 0x107 * i; // multiplication in loop might be optimized
}
return c + d;
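For the power-of-two cases, the reduced forms can be written out explicitly. A caveat worth knowing: the shift/AND versions match C semantics only for non-negative operands, so for signed values the compiler emits a small fixup to preserve C's rounding and sign rules. The helper names below are illustrative, not from the lecture:

```c
// Hand-written strength-reduced forms of two operations above.
int times32(int x) {
    return x << 5;            // x * 32: multiply becomes left shift
}

unsigned mod2(unsigned x) {
    return x & 1;             // x % 2: modulo becomes bitwise AND
}
```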

Code Motion

Code motion moves computations out of loops if their results do not change during the loop iterations.

Example C code:

for (int i = 0; i < n; i++) {
  sum += arr[i] + foo * (bar + 3);
}

The expression foo * (bar + 3) can be calculated once before the loop starts, as its value does not depend on the loop variable i.
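Hoisting that computation by hand looks like the sketch below (the function wrapper and names are assumed from the fragment above):

```c
// Loop-invariant code motion applied by hand: foo * (bar + 3) does not
// depend on i, so it is computed once before the loop.
int sum_with_offset(const int *arr, int n, int foo, int bar) {
    int hoisted = foo * (bar + 3);  // computed once, outside the loop
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i] + hoisted;
    }
    return sum;
}
```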

Tail Recursion

GCC can sometimes optimize tail-recursive functions by converting them into iterative loops, avoiding the overhead of recursive function calls.

Example C code:

long factorial(int n) {
  if (n <= 1) return 1;
  else return n * factorial(n - 1); // This is not tail recursion
}

A tail-recursive version would have the recursive call as the very last operation in the function.
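One standard way to get there is to carry the partial product in an accumulator parameter, so the recursive call is the last thing the function does (a common rewrite, though not shown in the lecture excerpt above):

```c
// Tail-recursive factorial: the recursive call is the final operation,
// with the running product threaded through an accumulator.  GCC can
// compile this into a loop, avoiding a new stack frame per call.
long factorial_tail(int n, long acc) {
    if (n <= 1) return acc;
    return factorial_tail(n - 1, acc * n);  // tail call
}
```

Calling factorial_tail(n, 1) computes the same result as the recursive factorial above.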

Loop Unrolling

Loop unrolling reduces loop overhead by performing multiple iterations of the loop body within a single iteration of the compiled loop.

Example C code:

for (int i = 0; i <= n - 4; i += 4) {
  sum += arr[i];
  sum += arr[i + 1];
  sum += arr[i + 2];
  sum += arr[i + 3];
} // after the loop, handle any leftover elements (n % 4 of them)

This manually unrolled loop processes 4 elements in each iteration, reducing the number of loop condition checks and jumps. Compilers can often perform this optimization automatically.
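A complete version, including the leftover handling the comment mentions, could look like this (the function wrapper is assumed for illustration):

```c
// Manually unrolled sum: four elements per iteration, then a cleanup
// loop for the 0-3 elements that remain when n is not a multiple of 4.
int sum_unrolled(const int *arr, int n) {
    int sum = 0;
    int i;
    for (i = 0; i <= n - 4; i += 4) {  // main unrolled loop
        sum += arr[i];
        sum += arr[i + 1];
        sum += arr[i + 2];
        sum += arr[i + 3];
    }
    for (; i < n; i++) {               // leftover elements
        sum += arr[i];
    }
    return sum;
}
```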

Limitations of GCC Optimization

While powerful, GCC cannot optimize everything because it doesn’t have the same level of understanding of the code’s intent as the programmer.

Example:

int char_sum(char *s) {
  int sum = 0;
  for (size_t i = 0; i < strlen(s); i++) { // Bottleneck: strlen called in each iteration
    sum += s[i];
  }
  return sum;
}

The bottleneck here is calling strlen(s) in every loop iteration. A programmer knows that the length of the string s does not change within the loop, and can manually optimize this by calculating the length once before the loop. While GCC might apply code motion to move strlen out of the loop in some cases, it cannot always guarantee this due to potential side effects or aliasing.
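The manual fix is a one-line change: hoist the strlen call above the loop (shown here as a sketch with an assumed function name):

```c
#include <string.h>

// Manually hoisted version of char_sum: strlen runs once, because the
// programmer knows s does not change inside the loop.
int char_sum_fast(const char *s) {
    int sum = 0;
    size_t len = strlen(s);            // computed once, before the loop
    for (size_t i = 0; i < len; i++) {
        sum += s[i];
    }
    return sum;
}
```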

Another example:

void lower1(char *s) {
  for (size_t i = 0; i < strlen(s); i++) { // Bottleneck: strlen called in each iteration
    if (s[i] >= 'A' && s[i] <= 'Z') {
      s[i] -= ('A' - 'a');
    }
  }
}

Again, strlen(s) is called repeatedly. GCC cannot move strlen out of this loop because the string s is being modified within the loop. Although the modification doesn’t change the string’s length, GCC cannot assume this. A programmer, knowing this, can manually optimize by calculating the length once before the loop.
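The hand-optimized version caches the length up front, exploiting the fact (known to the programmer but not provable by GCC) that the in-place edits never change the string's length:

```c
#include <string.h>

// Manually optimized lowercasing: strlen is hoisted out of the loop.
void lower2(char *s) {
    size_t len = strlen(s);            // computed once, before the loop
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');       // shift uppercase to lowercase
        }
    }
}
```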

Caching

Program performance is not solely limited by processor speed; memory access can be a significant bottleneck. Computer systems use a hierarchy of memory levels, ranging from very fast (registers) to very slow (disk). Data that is used more frequently is moved to faster memory levels.

Caching relies on locality:

  • Temporal locality: Data that has been accessed recently is likely to be accessed again soon (co-located in time).
  • Spatial locality: Data located near recently accessed data is also likely to be accessed (co-located in space).
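A classic illustration of spatial locality (an example in the spirit of the lecture, not taken from it) is traversal order over a 2-D array: row-major order visits consecutive addresses, while column-major order jumps a full row's worth of elements between accesses. Both compute the same sum, but the row-major version makes much better use of the cache:

```c
#define ROWS 64
#define COLS 64

// Row-major traversal: consecutive memory addresses, good spatial
// locality.
int sum_row_major(int m[ROWS][COLS]) {
    int sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

// Column-major traversal: each access strides COLS elements ahead,
// poor spatial locality.
int sum_col_major(int m[ROWS][COLS]) {
    int sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}
```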

A high cache hit rate is crucial for performance. Even a small percentage of cache misses can significantly impact the total memory access time.