Rust Performance 101 in 5 Minutes

June 16, 2021 performance rust

Summary

To determine if your Rust program is CPU bound, check CPU usage with `htop` and isolate the problematic part into a repeatable test. Run measurements using `time ./myprog` or `perf stat -e task-clock ./myprog`. Build in release mode with link-time optimization by adding `lto = true` and `codegen-units = 1` in your `Cargo.toml`, and compile for your target CPU using `RUSTFLAGS="-C target-cpu=native" cargo build --release`. Use `cargo flamegraph` to identify performance hotspots for optimization. If your bottleneck is in HashMap/HashSet, consider alternatives like arrays or faster hashing libraries such as `nohash-hasher`, `rustc-hash`, or `AHash`. For common tasks, look for faster libraries on crates.io.

Is your Rust program CPU bound? Here are the very first things you can do on Linux.

First, be sure the CPU really is the bottleneck. htop should tell you, as will that aircraft-taking-off fan noise.

Second, isolate the part you are concerned about into a program (or unit test or benchmark) that you can run repeatably.

Third, get some ballpark measurements by running one of these at least four times:

time ./myprog
perf stat -e task-clock ./myprog
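If you isolated the hot part into a function rather than a whole program, the same "run it a few times" idea can be sketched in Rust itself. This is a rough sketch, not a proper benchmark harness: `workload` and `time_runs` are hypothetical names made up for illustration, and it measures wall-clock time, whereas `perf stat -e task-clock` reports CPU time.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the code you isolated; replace with your own.
fn workload(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

/// Run `f` `runs` times and return the wall-clock duration of each run.
fn time_runs<F: FnMut()>(runs: usize, mut f: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect()
}

fn main() {
    // Four runs, to get a feel for run-to-run variation.
    let timings = time_runs(4, || {
        // black_box keeps the optimizer from deleting the unused result.
        black_box(workload(black_box(10_000_000)));
    });
    for (i, t) in timings.iter().enumerate() {
        println!("run {}: {:?}", i + 1, t);
    }
    let fastest = timings.iter().min().expect("at least one run");
    println!("fastest: {:?}", fastest);
}
```

Build it with `cargo build --release` before trusting any of the numbers; debug-build timings say very little about release performance.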

Build in release mode with link-time optimization

Add this to your Cargo.toml:

[profile.release]
lto = true
codegen-units = 1

then build it:

cargo build --release

This might be all you need. Release mode makes a huge difference. Then try with only one, or neither, of those two profile.release lines, in case they made things worse. Trust, but verify.

Compile for the target CPU

By default the Rust compiler will only use CPU instructions that even very old CPUs support, because it doesn’t know where you are going to run your program. If you are only going to run locally you can allow the compiler to use faster instructions:

RUSTFLAGS="-C target-cpu=native" cargo build --release

Here native is an alias for “this machine”.

If you are running on a different machine than you are building on, but you know which machine, target that CPU. Find valid CPU names like this:

rustc --target=x86_64-unknown-linux-gnu --print target-cpus

Aside: rustc prints the CPU micro-architecture names like “Nehalem” and “Skylake”. To find yours: gcc -march=native -Q --help=target | grep march.
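To check that the flag actually changed something, you can ask the compiler which CPU features it was allowed to assume. A small sketch, assuming an x86-64 build; the three feature names are just examples I picked, and `compiled_features` is a made-up helper:

```rust
/// A few x86-64 features the compiler was allowed to assume for this build.
/// The list of names checked here is illustrative, not exhaustive.
fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // cfg! is resolved at compile time, so this reflects the build flags,
    // not the machine the binary happens to be running on.
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "bmi2") {
        enabled.push("bmi2");
    }
    enabled
}

fn main() {
    println!("compile-time features: {:?}", compiled_features());
}
```

Run it once built normally and once built with the RUSTFLAGS line above. On a reasonably recent x86-64 machine the second list should be longer; if the two lists are identical, the flag may not be reaching the compiler.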

Find the hotspot with a flamegraph

Install cargo flamegraph, run it in your project, then open the result:

cargo install flamegraph
cargo flamegraph
chromium-browser flamegraph.svg # or however you view SVG files

This will show you where your program spent time. That’s the part to optimize. Can you avoid doing that thing altogether? Or do it in a different way? Optimizing at the language level will only get you so far; the biggest wins are usually at the software design level.

Use a faster HashMap

Often the bottleneck will be in HashMap / HashSet. Here are three things you can do:

Could you use an array instead? Even a quite large sparse Vec is often much faster than a HashMap.
Are your hash keys numbers? Try nohash-hasher.
Otherwise try rustc-hash or ahash; both should be a fair bit faster than the standard library’s HashMap/HashSet with its default hasher.

nohash-hasher, rustc-hash and ahash are all almost drop-in replacements, requiring just a few character changes.
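To make the first two options concrete, here is a standard-library-only sketch. `IdentityHasher` and `IdentityMap` are names invented for this example; the hasher illustrates the general idea of skipping the hash for integer keys and is not the actual code of nohash-hasher. In a real project you would use one of the crates above instead of rolling your own.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Illustrative pass-through hasher for u64 keys: the key is its own hash.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        // Only u64 keys are supported in this sketch.
        unimplemented!("IdentityHasher only supports u64 keys");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

/// A HashMap that uses the pass-through hasher instead of the default one.
type IdentityMap<V> = HashMap<u64, V, BuildHasherDefault<IdentityHasher>>;

/// Option 1: a sparse Vec indexed directly by a small integer key.
fn sparse_vec_lookup() -> Option<&'static str> {
    let mut table: Vec<Option<&'static str>> = vec![None; 1024];
    table[42] = Some("answer");
    table[42]
}

/// Option 2: keep the HashMap API, swap only the hasher.
fn identity_map_lookup() -> Option<&'static str> {
    let mut map: IdentityMap<&'static str> = HashMap::default();
    map.insert(42, "answer");
    map.get(&42).copied()
}

fn main() {
    println!("sparse Vec: {:?}", sparse_vec_lookup());
    println!("identity-hashed map: {:?}", identity_map_lookup());
}
```

Note that swapping the hasher only changes the map's type and how it is constructed; the insert and get calls are untouched, which is why these crates are close to drop-in. Measure before and after, since the winner depends on your keys and access pattern.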

Use a faster library

If your bottleneck is in something relatively common (e.g. JSON parsing) there is often a faster library on crates.io. Take a look!

I think our five minutes is up. Happy tuning!

Appendix: Beyond 5 minutes

Read the Rust performance book.
Use the amazing perf, via perf one-liners.
Cancel your summer plans. Read Brendan Gregg’s Systems Performance. Eight hundred pages later, you will know kung-fu.

Graham King