Optimization
This lecture (CS107 Lecture 16) covers program optimization: how to improve code efficiency and speed, and which optimizations the GCC compiler can perform.
What is Optimization?
Optimization is the process of making a program faster or more efficient in terms of time and space. While understanding Big-O notation provides a foundation for efficiency, targeted optimizations can lead to significant performance gains. However, it’s important to focus optimization efforts where they are most needed.
A general approach to optimization is summarized as follows:
- If doing something seldom and only on small inputs, do whatever is simplest to code, understand, and debug.
- If doing something a lot, or on big inputs, make the primary algorithm’s Big-O cost reasonable.
- Let gcc do its magic from there.
- Optimize explicitly as a last resort.
GCC Optimization
The GCC compiler can perform various optimizations at different levels. The lecture compares two levels:
- gcc -O0: Primarily a literal translation of the C code.
- gcc -O2: Enables nearly all reasonable optimizations.
Other, more aggressive or specialized optimization levels exist, such as -O3 (more aggressive, potentially trades size for speed), -Os (optimizes for size), and -Ofast (disregards standards compliance for maximum speed). A comprehensive list of GCC optimization flags is available in the GCC documentation.
Optimizations performed by GCC can target one or more of the following:
- Static instruction count
- Dynamic instruction count
- Cycle count / execution time
Here are some specific GCC optimizations discussed:
Constant Folding
Constant folding is the process where the compiler pre-calculates constant expressions at compile-time.
Example C code:
int seconds = 60 * 60 * 24 * n_days;
The constant expression 60 * 60 * 24 can be calculated by the compiler, so the compiled code will directly use the result (86400) instead of performing the multiplications at runtime.
Example with a more complex expression:
#include <math.h>
#include <string.h>

int fold(int param) {
    char arr[5];
    int a = 0x107;
    int b = a * sizeof(arr);
    int c = sqrt(2.0);
    return a * param + (a + 0x15/c + strlen("Hello") * b - 0x37) / 4;
}
Before optimization (-O0), the assembly shows multiple instructions for calculating the expression:
0000000000400626 <fold>:
400626: 55 push %rbp
400627: 53 push %rbx
400628: 48 83 ec 08 sub $0x8,%rsp
40062c: 89 fd mov %edi,%ebp
40062e: f2 0f 10 05 da 00 00 movsd 0xda(%rip),%xmm0 # 400710 <_IO_stdin_used+0x10>
400635: 00
400636: e8 d5 fe ff ff callq 400510 <sqrt@plt>
40063b: f2 0f 2c c8 cvttsd2si %xmm0,%ecx
40063f: 69 ed 07 01 00 00 imul $0x107,%ebp,%ebp
400645: b8 15 00 00 00 mov $0x15,%eax
40064a: 99 cltd
40064b: f7 f9 idiv %ecx
40064d: 8d 98 07 01 00 00 lea 0x107(%rax),%ebx
400653: bf 04 07 40 00 mov $0x400704,%edi # "Hello"
400658: e8 93 fe ff ff callq 4004f0 <strlen@plt>
40065d: 48 69 c0 23 05 00 00 imul $0x523,%rax,%rax
400664: 48 63 db movslq %ebx,%rbx
400667: 48 8d 44 18 c9 lea -0x37(%rax,%rbx,1),%rax
40066c: 48 c1 e8 02 shr $0x2,%rax
400670: 01 e8 add %ebp,%eax
400672: 48 83 c4 08 add $0x8,%rsp
400676: 5b pop %rbx
400677: 5d pop %rbp
400678: c3 retq
After optimization (-O2), the assembly is significantly shorter, indicating many parts of the expression were folded into constants:
00000000004004f0 <fold>:
4004f0: 69 c7 07 01 00 00 imul $0x107,%edi,%eax
4004f6: 05 a5 06 00 00 add $0x6a5,%eax
4004fb: c3 retq
4004fc: 0f 1f 40 00 nopl 0x0(%rax)
Common Sub-expression Elimination
This optimization prevents recalculating the same expression multiple times by computing it once and storing the result.
Example C code:
int a = (param2 + 0x107);
int b = param1 * (param2 + 0x107) + a;
return a * (param2 + 0x107) + b * (param2 + 0x107);
The expression (param2 + 0x107) is a common sub-expression. GCC will calculate it once and reuse the result. The assembly shows the optimization taking place. This optimization can even occur at -O0.
00000000004004f0 <subexp>:
4004f0: 81 c6 07 01 00 00 add $0x107,%esi
4004f6: 0f af fe imul %esi,%edi
4004f9: 8d 04 77 lea (%rdi,%rsi,2),%eax
4004fc: 0f af c6 imul %esi,%eax
4004ff: c3 retq
Dead Code Elimination
Dead code elimination removes code that has no effect on the program’s output.
Examples of dead code:
- An if statement with a condition that is always false: if (param1 < param2 && param1 > param2)
- An empty for loop: for (int i = 0; i < 1000; i++);
- if/else statements where both branches perform the same operation.
- if/else statements where the else branch is effectively unreachable or redundant.
Before optimization (-O0), the assembly includes instructions for the dead code:
00000000004004d6 <dead_code>:
4004d6: b8 00 00 00 00 mov $0x0,%eax
4004db: eb 03 jmp 4004e0 <dead_code+0xa>
4004dd: 83 c0 01 add $0x1,%eax
4004e0: 3d e7 03 00 00 cmp $0x3e7,%eax
4004e5: 7e f6 jle 4004dd <dead_code+0x7>
4004e7: 39 f7 cmp %esi,%edi
4004e9: 75 05 jne 4004f0 <dead_code+0x1a>
4004eb: 8d 47 01 lea 0x1(%rdi),%eax
4004ee: eb 03 jmp 4004f3 <dead_code+0x1d>
4004f0: 8d 47 01 lea 0x1(%rdi),%eax
4004f3: f3 c3 repz retq
After optimization (-O2), the dead code is removed in the assembly:
00000000004004f0 <dead_code>:
4004f0: 8d 47 01 lea 0x1(%rdi),%eax
4004f3: c3 retq
4004f4: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4004fb: 00 00 00
4004fe: 66 90 xchg %ax,%ax
Strength Reduction
Strength reduction replaces computationally expensive operations with equivalent, less expensive ones. Examples include changing division to multiplication (by the reciprocal), or multiplication and modulo operations to bit shifts and bitwise AND when dealing with powers of two.
Example C code:
int a = param2 * 32; // multiplication can be replaced by left shift
int b = a * 7;
int c = b / 3;
int d = param2 % 2; // modulo by 2 can be replaced by bitwise AND with 1
for (int i = 0; i <= param2; i++) {
    c += param1[i] + 0x107 * i; // multiplication in loop might be optimized
}
return c + d;
Code Motion
Code motion moves computations out of loops if their results do not change during the loop iterations.
Example C code:
for (int i = 0; i < n; i++) {
    sum += arr[i] + foo * (bar + 3);
}
The expression foo * (bar + 3) can be calculated once before the loop starts, as its value does not depend on the loop variable i.
Tail Recursion
GCC can sometimes optimize tail-recursive functions by converting them into iterative loops, avoiding the overhead of recursive function calls.
Example C code:
long factorial(int n) {
    if (n <= 1) return 1;
    else return n * factorial(n - 1); // not tail recursion: the multiply runs after the call returns
}
A tail-recursive version would have the recursive call as the very last operation in the function.
Loop Unrolling
Loop unrolling reduces loop overhead by performing multiple iterations of the loop body within a single iteration of the compiled loop.
Example C code:
for (int i = 0; i <= n - 4; i += 4) {
    sum += arr[i];
    sum += arr[i + 1];
    sum += arr[i + 2];
    sum += arr[i + 3];
} // after the loop, handle any leftovers
This manually unrolled loop processes 4 elements in each iteration, reducing the number of loop condition checks and jumps. Compilers can often perform this optimization automatically.
Limitations of GCC Optimization
While powerful, GCC cannot optimize everything because it doesn’t have the same level of understanding of the code’s intent as the programmer.
Example:
int char_sum(char *s) {
    int sum = 0;
    for (size_t i = 0; i < strlen(s); i++) { // bottleneck: strlen called on every iteration
        sum += s[i];
    }
    return sum;
}
The bottleneck here is calling strlen(s) in every loop iteration. A programmer knows that the length of the string s does not change within the loop, and can manually optimize this by calculating the length once before the loop. While GCC might apply code motion to move strlen out of the loop in some cases, it cannot always guarantee this due to potential side effects or aliasing.
Another example:
void lower1(char *s) {
    for (size_t i = 0; i < strlen(s); i++) { // bottleneck: strlen called on every iteration
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
Again, strlen(s) is called repeatedly. GCC cannot move strlen out of this loop because the string s is being modified within the loop. Although the modification doesn’t change the string’s length, GCC cannot assume this. A programmer, knowing this, can manually optimize by calculating the length once before the loop.
Caching
Program performance is not solely limited by processor speed; memory access can be a significant bottleneck. Computer systems use a hierarchy of memory levels, ranging from very fast (registers) to very slow (disk). Data that is used more frequently is moved to faster memory levels.
Caching relies on locality:
- Temporal locality: Data that has been accessed recently is likely to be accessed again soon (co-located in time).
- Spatial locality: Data located near recently accessed data is also likely to be accessed (co-located in space).
A high cache hit rate is crucial for performance. Even a small percentage of cache misses can significantly impact the total memory access time.