Graham King

Solvitas perambulum

CPU Performance Optimization notes

Dumping some older raw notes on CPU performance optimization from my notes file, so I can find them in the future. I think these came from a Rust conference talk.

Data-centric design

Keep data small, compact, and local.

Keep data in a single thread. Cross-core traffic is expensive and can’t use the L1 cache.

A cache line is 64 bytes; L1 cache size varies, but 32 KB is typical.

  • work on data in chunks that fit
  • with Hyper-Threading, each logical core only gets half of each cache

Accessing 1 byte or 64 aligned bytes is the same memory operation.

Group hot data together and make it small:

  • struct {a,b,c}: if a is accessed far more often than b or c, split it into struct {a} and struct {b,c}
  • instead of an array of structs ([]struct{a,b,c}), this usually leads to a struct of arrays (struct { []a, []b, []c }) with the index being the user id, so user1.a is users.a[1] and user1.b is users.b[1]. This is common in the video game industry.
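A minimal Rust sketch of the array-of-structs vs struct-of-arrays split (the field names are illustrative, not from any real codebase):

```rust
// Array of structs: each User interleaves the hot field `a` with
// the cold fields `b` and `c`, so a scan over `a` drags `b` and
// `c` through the cache too.
#[allow(dead_code)]
struct User { a: u32, b: u64, c: u64 }

// Struct of arrays: all the `a` values are contiguous, so a scan
// over `a` touches far fewer cache lines.
#[allow(dead_code)]
struct Users { a: Vec<u32>, b: Vec<u64>, c: Vec<u64> }

impl Users {
    fn sum_a(&self) -> u64 {
        // Walks one dense array: sequential and prefetch-friendly.
        self.a.iter().map(|&x| u64::from(x)).sum()
    }
}

fn main() {
    let users = Users { a: vec![1, 2, 3], b: vec![0; 3], c: vec![0; 3] };
    // "user1.a is users.a[1]":
    assert_eq!(users.a[1], 2);
    assert_eq!(users.sum_a(), 6);
}
```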

Pad all data to natural boundaries, usually 32- or 64-bit boundaries; unaligned values are much slower. The compiler probably does this already.

Align and pad critical heap data (especially synchronization primitives) on a 64-byte (cache line) boundary to avoid false sharing.
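In Rust this can be done with a repr(align) wrapper; a sketch (the counter names are invented):

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

// Pad the type out to a full 64-byte cache line so two of them
// can never share a line, avoiding false sharing.
#[allow(dead_code)]
#[repr(align(64))]
struct CacheLinePadded(AtomicU64);

// One thread bumping `hits` and another bumping `misses` no
// longer invalidate each other's cache line.
#[allow(dead_code)]
struct Counters {
    hits: CacheLinePadded,
    misses: CacheLinePadded,
}

fn main() {
    assert_eq!(align_of::<CacheLinePadded>(), 64);
    // The 8-byte atomic is padded up to the 64-byte alignment.
    assert_eq!(size_of::<CacheLinePadded>(), 64);
}
```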

Hardware pre-fetch (from memory and down the various cache levels):

  • keep data in the same 4 KB page; prefetch stops at page boundaries
  • access in constant strides (forwards or backwards). The prefetcher tries to detect loops/streams where you work through a set of data in a predictable fashion (walking an array, often a string)
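For example, both of these walks have a constant stride the prefetcher can lock onto (a sketch, not a benchmark):

```rust
// Constant-stride access patterns the hardware prefetcher detects.
fn sum_forward(data: &[u64]) -> u64 {
    data.iter().sum() // stride +1: the easiest stream to predict
}

fn sum_every_fourth(data: &[u64]) -> u64 {
    data.iter().step_by(4).sum() // stride +4: still a constant stride
}

fn main() {
    let data: Vec<u64> = (0..16).collect();
    assert_eq!(sum_forward(&data), 120);
    assert_eq!(sum_every_fourth(&data), 24); // 0 + 4 + 8 + 12
}
```

By contrast, chasing pointers through a linked list gives the prefetcher no pattern to detect.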

Data types

Don’t assume u8/u16/etc save time; most values get expanded to u64 anyway, and the conversions cost extra instructions:

  • 32-bit instructions are one byte shorter than 64-bit ones, so prefer u32.
  • Code has its own cache (the instruction cache), so fewer / smaller instructions are better, just like with data.
  • In my experiments u64 and u32 can be faster than u8 and u16. Try both, because smaller values should vectorize better.

Integer math is much faster than floating point; f32 and f64 perform similarly.

Powers of 2 are faster; default your numbers to powers of two whenever possible:

  • Faster multiplication (including array indexing) and division.
  • Memory alignment (an array of 2^x elements tends to fall on a power-of-2 boundary).
  • Easier vectorization.
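A sketch of the multiplication/division point: with power-of-two constants these reduce to shifts and masks. The compiler does this rewrite for constant divisors; choosing power-of-two sizes is what makes it possible:

```rust
// With power-of-two constants, modulo becomes a bit mask and
// multiplication becomes a shift.
const LEN: usize = 1024; // 2^10, e.g. a ring buffer capacity

fn wrap_index(i: usize) -> usize {
    i & (LEN - 1) // equivalent to i % LEN, but a single AND
}

fn times_eight(x: u64) -> u64 {
    x << 3 // equivalent to x * 8
}

fn main() {
    assert_eq!(wrap_index(1030), 1030 % LEN); // 6
    assert_eq!(times_eight(5), 40);
}
```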

Branching

Avoid unpredictable branches, where there isn’t one direction that’s taken nearly every time.

In a switch or big if/else block, if there are two common branches and many uncommon ones, split it into two statements so that each is predictable.
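A sketch of the split, with invented opcodes: the two common cases get their own near-always-taken checks, and the rare ones share the big match:

```rust
#[allow(dead_code)]
enum Op { Load, Store, Add, Sub, Mul, Div, Jump }

fn execute(op: &Op) -> &'static str {
    // Hot path: the two common cases each get a branch that is
    // almost always taken, so it predicts well...
    if matches!(op, Op::Load) { return "load"; }
    if matches!(op, Op::Store) { return "store"; }
    // ...and the rare cases share one less-predictable match.
    match op {
        Op::Add => "add",
        Op::Sub => "sub",
        Op::Mul => "mul",
        Op::Div => "div",
        Op::Jump => "jump",
        Op::Load | Op::Store => unreachable!(),
    }
}

fn main() {
    assert_eq!(execute(&Op::Load), "load");
    assert_eq!(execute(&Op::Div), "div");
}
```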

The conditional / ternary operator (x = cond ? a : b) can compile to a ‘conditional move’ instruction (the CMOV family), which avoids a branch. In Rust, write the assignment as ‘let x = if cond { a } else { b }’ to encourage a CMOV. The compiler decides, so it’s not guaranteed to be branch free.
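For instance, a clamp written as an if-expression with simple, side-effect-free arms gives the compiler its best shot at emitting CMOV (not guaranteed; check the assembly):

```rust
// Both arms are cheap and side-effect-free, so the compiler can
// compute both and select one with a conditional move.
fn clamp_max(x: i64, limit: i64) -> i64 {
    if x > limit { limit } else { x }
}

fn main() {
    assert_eq!(clamp_max(10, 7), 7);
    assert_eq!(clamp_max(3, 7), 3);
}
```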

Prefer stack allocated data:

  • No actual allocation; just increase the stack pointer by a fixed amount known at compile time.
  • Likely to be in L1 cache because variables are very close to each other in memory (same cache line, or prefetch block).
  • Avoids “false sharing” (sharing non-overlapping parts of a cache line w/ another thread) as stack is thread local.

If threads share data and access it a lot, copy it to the stack and work there, then copy it back when finished to commit only once. Otherwise the cache line keeps getting ‘dirty’ and being synchronized, causing lots of traffic to/from memory.
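A sketch of the pattern with a mutex-protected counter (the setup is invented): each thread accumulates into a stack local and commits once, instead of touching the shared cache line on every iteration:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Each thread works on a thread-local (stack) accumulator and
// commits to the shared value once, rather than locking — and
// dirtying the shared cache line — on every iteration.
fn parallel_sum(threads: u64, n: u64) -> u64 {
    let shared = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // Work on the stack copy...
                let mut local = 0u64;
                for i in 0..n {
                    local += i;
                }
                // ...and commit once at the end.
                *shared.lock().unwrap() += local;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *shared.lock().unwrap();
    total
}

fn main() {
    // 0..1000 sums to 499_500 per thread.
    assert_eq!(parallel_sum(4, 1000), 4 * 499_500);
}
```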

Try to make loop iterations independent of each other, so they can run in parallel / out-of-order.
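One common version of this is breaking a single dependency chain into several independent accumulators; a sketch:

```rust
// Four independent accumulators form four dependency chains, so
// the CPU can run the four adds in each iteration out of order.
fn sum_4way(data: &[u64]) -> u64 {
    let mut acc = [0u64; 4];
    let mut chunks = data.chunks_exact(4);
    for c in &mut chunks {
        acc[0] += c[0];
        acc[1] += c[1];
        acc[2] += c[2];
        acc[3] += c[3];
    }
    // Handle whatever doesn't fill a chunk of 4.
    let tail: u64 = chunks.remainder().iter().sum();
    acc.iter().sum::<u64>() + tail
}

fn main() {
    let data: Vec<u64> = (1..=10).collect();
    assert_eq!(sum_4way(&data), 55);
}
```

With a single accumulator, each add must wait for the previous one; here the chains only merge once, at the end.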

SIMD: Single Instruction Multiple Data, also called ‘vector’ instructions, performs a basic operation (e.g. add) on multiple values at once by packing them into a large register. It can be used directly in assembly or via ‘intrinsics’: https://doc.rust-lang.org/core/arch/x86_64/index.html (C++ has intrinsics too)
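Short of unsafe intrinsics, a portable sketch that the optimizer will typically lower to packed vector adds (this relies on autovectorization, not hand-written SIMD):

```rust
// Fixed-size lanes with a simple element-wise loop: the shape the
// autovectorizer looks for, typically compiled to packed adds.
fn add_lanes(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

fn main() {
    let a = [1, 2, 3, 4, 5, 6, 7, 8];
    let b = [10; 8];
    assert_eq!(add_lanes(a, b), [11, 12, 13, 14, 15, 16, 17, 18]);
}
```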

For specific algorithms the work has often already been done: look for a library / crate using SIMD instructions. The stdlib is often slow to catch up, and it has to support very old architectures.