Graham King

Solvitas perambulum

Rust atomics on x86: How and why

Summary
I examined Rust's atomic memory ordering on my x86 Tiger Lake machine and found that load operations compile to a free `mov` instruction, while store operations at `Ordering::Relaxed` or `Release` also compile to `mov`, but `Ordering::SeqCst` compiles to `xchg`. For read-modify-write operations, most compile to `lock cmpxchg` in a loop, except for specific operations like `fetch_add`, which use `lock xadd`. I learned that the choice of ordering generally does not affect read-modify-write operations, and the only significant difference occurs with `store` at `SeqCst`, which adds a memory barrier. The compiler can reorder instructions, potentially causing issues across threads, so using appropriate orderings is crucial in multi-threading contexts. Ultimately, my findings suggest to simplify atomic handling by prioritizing the appropriate orderings for different scenarios: no atomics for non-shared data, `Relaxed` for counters, `Release` and `Acquire` for signaling changes, and `SeqCst` otherwise, while being mindful not to overcomplicate the process.

To understand Rust’s atomic memory Ordering I looked at what assembly they compile to on my x86 tigerlake machine. I expect this will be almost identical on any vaguely recent x86, and completely different on ARM.

Load and store

  • load at any Ordering compiles to mov. It is free.
  • store:
    • at Ordering::Relaxed or Release (with a corresponding load at Acquire) is also a mov. Also free.
    • at Ordering::SeqCst (with a load also at SeqCst) compiles to xchg (ref).

These seem to hold for all sizes. I checked u64, i8 and bool.

Try it yourself on Compiler Explorer. Search for “nop”: four is the load, three is the store.

Read-modify-write

Most read-modify-write operations, at at any Ordering compile to lock cmpxchg in a loop:

If you call compare_exchange or compare_exchange_weak directly (which are identical!) the compiler doesn’t output a loop because you are expected to do it. Both are just a single lock cmpxchg.

Operations with a more specific (and obvious) instruction compile to that with a lock prefix, also at any Ordering, and we don’t need a loop:

  • fetch_add -> either lock xadd if you use the return value, or lock add if you don’t.
  • fetch_and if you don’t use the return value -> lock and
  • fetch_xor if you don’t use the return value -> lock xor, and so on.

Try it yourself on Compiler Explorer. Search for “nop”.

C++ standards committee, you owe me a weekend

For any kind of read-modify-write operation it doesn’t matter what ordering you choose. There is no difference between compare_exchange and compare_exchange_weak. In fact the only time the ordering makes any different at all is store at SeqCst (when we add a full memory barrier).

All those CppCon videos I watched, all those Stack Exchange answers I read and it comes out the same. All those thought leaders telling us how hard atomics are to get right. Take these broken wings my friend and learn to fly again.

Caveat: Compiler re-ordering

The compiler is allowed to re-order your instructions if a single thread would not notice the difference, and it will do so to try to get better CPU utilization. If you have multiple threads and two pieces of shared data A and B, and A must be visible by the time B changes, this might not be obvious to the compiler. The compiler can freely re-order A to happen after B.

Ordering is also an instruction to the compiler. A store with Ordering::Relaxed and Ordering::Release might both compile to a simple mov instruction, but that Release might still have made a difference by preventing a compiler re-ordering. To prevent this you can add a compiler fence.

Tools

To play with it locally, because trust but verify, copy the code from Compiler Explorer linked above, or use this as a starting point.

use std::sync::Arc;
use std::sync::atomic::{Ordering, AtomicU64};
use std::env::args;
use std::arch::asm;
use std::thread;
use std::time::Duration;

fn main() {
    let m = Arc::new(AtomicU64::new(0u64));
    let num_args = args().len() as u64;

    let m2 = m.clone();
    thread::spawn(move || {
        for i in 0..10000 {
            unsafe { asm!("nop", "nop", "nop"); }
            m2.store(num_args + i, Ordering::SeqCst);
        }
    });

    thread::sleep(Duration::from_millis(1));
    let mut out = 0;
    for _ in 0..10000 {
        unsafe { asm!("nop", "nop", "nop", "nop"); }
        out += m.load(Ordering::SeqCst);
    }
    println!("Total: {}", out);
}

Target your local CPU and output the assembly:

export RUSTFLAGS=-Ctarget-cpu=native
cargo rustc -r -- --emit asm -C "llvm-args=-x86-asm-syntax=intel"

It will be in target/release/deps/<something>.s. Search for those nop to find the interesting instruction.

Bonus, for curiosity:

  • Find your CPU architecture: gcc -march=native -Q --help=target | grep march
  • List Rust/LLVM target architecture options: rustc --print=target-cpus

Attempting an explanation

Atomics are confusing partly because they help us do two different things:

  • Ensure instructions are not interrupted part way through. This is what I think of intuitively as ‘atomic’.
  • Control when written values become visible to other cores. This is not intuitively a part of ‘atomic’ operations, but it is what the Acquire, Release and SeqCst orderings were designed for.

Loads being simple mov instructions (as if there was no atomic) are relatively easy to explain. Our concern here is tearing, which is a value being loaded in two operations (e.g. the first byte of your u16 in one memory load, the second byte afterwards). We don’t want the value to change half-way through the load! This could happen if our value spanned two cache lines. x86 guarantees that aligned primitive values (and hence contained in a single 64 byte cache line) are loaded atomically. Our compilers usually guarantee that values are size-aligned (Rust). Hence a mov is atomic.

Store at Ordering::Relaxed is similarly easy to explain as a mov. It is already atomic, and Relaxed guarantees nothing else.

Store at Ordering::Release (with a matching load Acquire) being a simple mov I understand less. I believe it is because x86’s memory model is total store ordering, which I shall learn about soon no doubt. This should help explain it.

Store at Ordering::SeqCst uses the xchg instruction which implies a lock prefix. SeqCst (“Sequentially Consistent”) needs to guarantee that whatever happened before in source order really does happen before, and whatever happens after in source order also stays there. This requires a “full memory barrier”, and that’s exactly what the lock instruction prefix provides.

xchg    qword ptr [rdi + 16], rax

fetch_add, compare_exchange, and others are explained by a combination of what instructions x86 makes available (lock, mfence, sfence, lfence), and what the compiler chooses (lock). lock is how we lock a memory location, and it happens to also be a full memory barrier. fetch_add etc require that we read the value from memory, change it, and then write it back, atomically. We’re going to need some kind of lock to achieve this so that other cores don’t change that memory location (cache line, really). The x86 instruction that allows us to do this is lock. For an Ordering::Relaxed increment (fetch_add) we will see something like this:

lock        add qword ptr [rdi + 16], rax

and for a compare_exchange this:

lock        cmpxchg qword ptr [rdi + 16], rcx

That prefix also provides sequential consistency (a full memory barrier), so we get the exact same instruction at Ordering::SeqCst.

Conclusion

The Rust / C++ memory model is significantly more complicated than necessary for x86 work. My conclusion from that is to use it as communication with the programmer not with the machine. The compiler will do the right thing, and it will probably be free, or no more expensive than a weaker ordering. Here are some guidelines:

  • Not shared? no atomics (Rust is very good at enforcing this).
  • A counter or otherwise self-contained value? Relaxed.
  • A signal that some other data has changed? Release where you set it, Acquire where you read it.
  • Anything else, or not sure? SeqCst. In C++ this is the default ordering. In Go it is the only ordering.

More importantly don’t do what I did and overthink it. In my experiments on x86, apart from store at Seqcst, it makes no difference what you choose.


Lock free programming is a lie. "Lock-free programming" nearly always means using atomics instead of mutexes. The x86 prefix for an atomic instruction is literally `lock`. How is that lock free?