Rust atomics on x86: How and why
Summary
To understand Rust’s atomic memory `Ordering`s I looked at the assembly they compile to on my x86 Tiger Lake machine. I expect this will be almost identical on any vaguely recent x86, and completely different on ARM.
Load and store
- load at any `Ordering` compiles to `mov`. It is free.
- store:
  - at `Ordering::Relaxed` or `Release` (with a corresponding load at `Acquire`) is also a `mov`. Also free.
  - at `Ordering::SeqCst` (with a load also at `SeqCst`) compiles to `xchg` (ref).
These seem to hold for all sizes. I checked u64, i8 and bool.
Try it yourself on Compiler Explorer. Search for “nop”: four nops mark the load, three mark the store.
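If you want the orderings side by side, a minimal sketch like this (function names are mine) makes the claim visible: everything below compiles to a plain `mov` except the `SeqCst` store, which becomes an `xchg`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub fn load_relaxed(a: &AtomicU64) -> u64 {
    a.load(Ordering::Relaxed) // mov
}

pub fn load_seqcst(a: &AtomicU64) -> u64 {
    a.load(Ordering::SeqCst) // still just mov
}

pub fn store_release(a: &AtomicU64, v: u64) {
    a.store(v, Ordering::Release); // mov
}

pub fn store_seqcst(a: &AtomicU64, v: u64) {
    a.store(v, Ordering::SeqCst); // xchg: the only one that isn't free
}
```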
Read-modify-write
Most read-modify-write operations, at any `Ordering`, compile to `lock cmpxchg` in a loop:
- `fetch_max`
- `fetch_update`
- `fetch_and`, `fetch_xor`, etc., if you use the return value.
If you call `compare_exchange` or `compare_exchange_weak` directly (which compile identically on x86!) the compiler doesn’t output a loop, because you are expected to write it yourself. Both are just a single `lock cmpxchg`.
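As a sketch, here is what that hand-written loop looks like: a `fetch_max` built from `compare_exchange_weak` (the particular orderings are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// A hand-rolled fetch_max: retry until no other core changed the
/// value between our load and our compare-exchange.
pub fn fetch_max(a: &AtomicU64, new: u64) -> u64 {
    let mut current = a.load(Ordering::Relaxed);
    loop {
        match a.compare_exchange_weak(
            current,
            current.max(new),
            Ordering::SeqCst,  // ordering on success
            Ordering::Relaxed, // ordering on the failed load
        ) {
            Ok(previous) => return previous,
            Err(actual) => current = actual, // raced: retry with the fresh value
        }
    }
}
```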
Operations with a more specific (and obvious) instruction compile to that with a `lock` prefix, also at any Ordering, and we don’t need a loop:
- `fetch_add` -> either `lock xadd` if you use the return value, or `lock add` if you don’t.
- `fetch_and` if you don’t use the return value -> `lock and`
- `fetch_xor` if you don’t use the return value -> `lock xor`, and so on.
Try it yourself on Compiler Explorer. Search for “nop”.
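Whether you use the return value is all that decides between the two encodings; a minimal sketch (names mine):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub fn bump(counter: &AtomicU64) {
    // Result discarded: compiles to `lock add`.
    counter.fetch_add(1, Ordering::Relaxed);
}

pub fn bump_and_get(counter: &AtomicU64) -> u64 {
    // Result used: compiles to `lock xadd`, which returns the old value.
    counter.fetch_add(1, Ordering::Relaxed)
}
```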
C++ standards committee, you owe me a weekend
For any kind of read-modify-write operation it doesn’t matter what ordering you choose. There is no difference between `compare_exchange` and `compare_exchange_weak`. In fact the only time the ordering makes any difference at all is a store at `SeqCst` (when we add a full memory barrier).
All those CppCon videos I watched, all those Stack Exchange answers I read and it comes out the same. All those thought leaders telling us how hard atomics are to get right. Take these broken wings my friend and learn to fly again.
Caveat: Compiler re-ordering
The compiler is allowed to re-order your instructions if a single thread would not notice the difference, and it will do so to try to get better CPU utilization. If you have multiple threads and two pieces of shared data A and B, and A must be visible by the time B changes, this might not be obvious to the compiler. The compiler can freely re-order A to happen after B.
`Ordering` is also an instruction to the compiler. A store with `Ordering::Relaxed` and `Ordering::Release` might both compile to a simple `mov` instruction, but that Release might still have made a difference by preventing a compiler re-ordering. To prevent such re-ordering explicitly you can add a compiler fence.
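A sketch of that pattern, with two pieces of shared data as above (the statics and values are mine). `compiler_fence` emits no instruction; it only pins down the compiler:

```rust
use std::sync::atomic::{compiler_fence, AtomicU64, Ordering};

static A: AtomicU64 = AtomicU64::new(0); // the data
static B: AtomicU64 = AtomicU64::new(0); // the "A is ready" signal

pub fn publish() {
    A.store(42, Ordering::Relaxed);
    // Stop the compiler moving the store to A below the store to B.
    // On x86 the CPU already keeps stores in order (total store
    // ordering), so no fence instruction is emitted.
    compiler_fence(Ordering::Release);
    B.store(1, Ordering::Relaxed);
}
```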
Tools
To play with it locally, because trust but verify, copy the code from Compiler Explorer linked above, or use this as a starting point.
```rust
use std::sync::Arc;
use std::sync::atomic::{Ordering, AtomicU64};
use std::env::args;
use std::arch::asm;
use std::thread;
use std::time::Duration;

fn main() {
    let m = Arc::new(AtomicU64::new(0u64));
    // A runtime value, so the optimizer cannot precompute the stores.
    let num_args = args().len() as u64;
    let m2 = m.clone();
    thread::spawn(move || {
        for i in 0..10000 {
            // Three nops mark the store in the generated assembly.
            unsafe { asm!("nop", "nop", "nop"); }
            m2.store(num_args + i, Ordering::SeqCst);
        }
    });
    thread::sleep(Duration::from_millis(1));
    let mut out = 0;
    for _ in 0..10000 {
        // Four nops mark the load.
        unsafe { asm!("nop", "nop", "nop", "nop"); }
        out += m.load(Ordering::SeqCst);
    }
    println!("Total: {}", out);
}
```
Target your local CPU and output the assembly:

```sh
export RUSTFLAGS=-Ctarget-cpu=native
cargo rustc -r -- --emit asm -C "llvm-args=-x86-asm-syntax=intel"
```

The output will be in `target/release/deps/<something>.s`. Search for those `nop`s to find the interesting instructions.
Bonus, for curiosity:
- Find your CPU architecture: `gcc -march=native -Q --help=target | grep march`
- List Rust/LLVM target architecture options: `rustc --print=target-cpus`
Attempting an explanation
Atomics are confusing partly because they help us do two different things:
- Ensure instructions are not interrupted part way through. This is what I think of intuitively as ‘atomic’.
- Control when written values become visible to other cores. This is not intuitively a part of ‘atomic’ operations, but it is what the Acquire, Release and SeqCst orderings were designed for.
Loads being simple `mov` instructions (as if there were no atomic) are relatively easy to explain. Our concern here is tearing: a value being loaded in two operations (e.g. the first byte of your u16 in one memory load, the second byte afterwards). We don’t want the value to change half-way through the load! This could happen if our value spanned two cache lines. x86 guarantees that aligned primitive values (which hence fit in a single 64-byte cache line) are loaded atomically, and our compilers usually guarantee that values are size-aligned (Rust). Hence a `mov` is atomic.
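A quick way to convince yourself of the alignment part (a sketch; these particular asserts are mine):

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

fn main() {
    // Size-aligned: an 8-byte value on an 8-byte boundary can never
    // straddle a 64-byte cache line boundary.
    assert_eq!(size_of::<AtomicU64>(), 8);
    assert_eq!(align_of::<AtomicU64>(), 8);
}
```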
Store at `Ordering::Relaxed` is similarly easy to explain as a `mov`. It is already atomic, and Relaxed guarantees nothing else.
Store at `Ordering::Release` (with a matching load at `Acquire`) being a simple `mov` I understand less. I believe it is because x86’s memory model is total store ordering, which I shall learn about soon no doubt. This should help explain it.
Store at `Ordering::SeqCst` uses the `xchg` instruction, which implies a `lock` prefix. SeqCst (“Sequentially Consistent”) needs to guarantee that whatever happened before in source order really does happen before, and whatever happens after in source order also stays there. This requires a “full memory barrier”, and that’s exactly what the `lock` instruction prefix provides.

```asm
xchg qword ptr [rdi + 16], rax
```
`fetch_add`, `compare_exchange`, and others are explained by a combination of what instructions x86 makes available (`lock`, `mfence`, `sfence`, `lfence`) and what the compiler chooses (`lock`). `lock` is how we lock a memory location, and it happens to also be a full memory barrier. `fetch_add` etc. require that we read the value from memory, change it, and then write it back, atomically. We’re going to need some kind of lock to achieve this so that other cores don’t change that memory location (cache line, really). The x86 instruction prefix that allows us to do this is `lock`. For an `Ordering::Relaxed` increment (`fetch_add`) we will see something like this:

```asm
lock add qword ptr [rdi + 16], rax
```

and for a `compare_exchange` this:

```asm
lock cmpxchg qword ptr [rdi + 16], rcx
```
That prefix also provides sequential consistency (a full memory barrier), so we get the exact same instruction at `Ordering::SeqCst`.
Conclusion
The Rust / C++ memory model is significantly more complicated than necessary for x86 work. My conclusion from that is to use it as communication with the programmer, not with the machine. The compiler will do the right thing, and it will probably be free, or no more expensive than a weaker ordering. Here are some guidelines:
- Not shared? No atomics (Rust is very good at enforcing this).
- A counter or otherwise self-contained value? `Relaxed`.
- A signal that some other data has changed? `Release` where you set it, `Acquire` where you read it (see the sketch after this list).
- Anything else, or not sure? `SeqCst`. In C++ this is the default ordering. In Go it is the only ordering.
More importantly, don’t do what I did and overthink it. In my experiments on x86, apart from store at `SeqCst`, it makes no difference what you choose.
Lock-free programming is a lie. “Lock-free programming” nearly always means using atomics instead of mutexes. The x86 prefix for an atomic instruction is literally `lock`. How is that lock-free?