Graham King

Solvitas perambulum

Underrust: What is the cost of Mutex, RwLock and AtomicPtr?

Summary
Uncontended Mutex and RwLock cost about the same as an L3 cache access for any of lock, unlock, read lock and write lock. AtomicPtr is free!

Resources that are read concurrently from many busy threads and modified rarely include config settings you can modify at runtime, or any kind of active inventory of dynamic resources (files, users, servers, plugins, etc).

In Rust we typically have to wrap the resource with something that makes it Sync. That’s usually a Mutex, RwLock, or a lower level atomic such as AtomicPtr.

But if the resource is only rarely written to, what are we paying to satisfy Rust’s safety rules with Sync?

Let’s look at the assembly output to find out, on my x86 tigerlake machine with Rust 1.87.

Constraints:

  • I am interested in the case of many busy concurrent readers, and one occasional writer.
  • On Linux. The Rust stdlib specializes all the sync primitives by target OS.
  • On x86. ARM has different cache coherency rules and usually needs more locking and synchronization. All the Rust code here will work unchanged, but the assembly will be different.

Internally Mutex and RwLock both rely on a Linux futex, wrapped as either an AtomicU8 or an AtomicU32. The u8 / u32 doesn’t make a practical difference because the compiler prefers 32-bit registers. See my earlier post “Underrust: Does it matter what type (u8, u32, …) I use?” for details. AtomicPtr and its siblings map pretty directly to assembly instructions.
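
As a rough sketch of what that futex wrapping means (a simplification for illustration, not the actual stdlib code), the uncontended fast path is a single compare-and-swap on that atomic - the lock cmpxchg we will see below:

use std::sync::atomic::{AtomicU32, Ordering};

// Simplified futex-style lock fast path, for illustration only.
struct RawLock {
    futex: AtomicU32, // 0 = unlocked, 1 = locked
}

impl RawLock {
    fn try_lock_fast(&self) -> bool {
        // Becomes a `lock cmpxchg`: if the futex is 0, atomically set it to 1.
        // On failure the real Mutex falls back to a futex wait (the slow path).
        self.futex
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    fn unlock_fast(&self) {
        // Becomes an `xchg`: set the futex back to 0.
        // The real Mutex also checks whether it needs to wake a waiter.
        self.futex.swap(0, Ordering::Release);
    }
}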

Mutex

One of my first reflexes as a Rust developer doing multi-threaded or async work is to wrap it, for all values of it, in an Arc<Mutex<X>>, so let’s start there.
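
For concreteness, this is roughly the shape of code I compile and disassemble in this section (the function and variable names here are mine, purely for illustration):

use std::sync::{Arc, Mutex};

// Lock, touch the data, drop the guard.
fn lock_it(counter: &Arc<Mutex<u64>>) -> u64 {
    let mut guard = counter.lock().unwrap(); // the `lock cmpxchg` below
    *guard += 1;
    *guard
} // guard dropped here: the `xchg` unlock below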

Lock

A mutex locks itself by changing its futex from 0 (unlocked) to 1 (locked).

mov    ecx,0x1               ; Load value 1 into ecx (desired lock state)
xor    eax,eax               ; Clear eax to 0 (expected current lock state)
lock cmpxchg DWORD PTR [rdi],ecx ; Atomic compare-and-swap: if [rdi]==0, set to 1, else load [rdi] into eax
jne    152c9                 ; If lock was already held (cmpxchg failed), jump to slow path

Unlock

Set it back to 0.

xor    ecx,ecx               ; Clear ecx for unlock operation
xchg   DWORD PTR [rdi],ecx   ; Atomically exchange lock with 0 (unlock), get old value in ecx

When one of the operands is memory ([rdi] here), xchg behaves as if the lock prefix were present, so this is the same as lock xchg.

Cost

The uncontended lock and unlock both do similar core operations: lock cmpxchg and lock xchg.

The lock prefix ensures that the instruction it is attached to is atomic. It does this by taking ownership of the cache line, meaning that no other processor can access it during the operation. This is important because even though cmpxchg is a single assembly instruction, the CPU needs multiple steps to execute it - read the current value from memory, compare it, etc.

The instruction tables (pdf) suggest both of these operations cost 20 to 30 CPU cycles. That’s in the ballpark of reading from the L3 cache, so quite reasonable.

RwLock

Read

Take the lock - add a reader, if possible, atomically:

mov    eax,DWORD PTR [rdi]   ; Load current lock state from [rdi] into eax (rdi = lock pointer)
cmp    eax,0x3ffffffd        ; Compare lock state with max reader count (1073741821)
ja     14255                 ; Jump to slow path if state > max (overflow or write lock held)
lea    ecx,[rax+0x1]         ; Calculate new state (current + 1 reader) into ecx
lock cmpxchg DWORD PTR [rdi],ecx ; Atomically compare [rdi] with eax, swap with ecx if equal
jne    14255                 ; Jump to slow path if CAS failed (state changed)

There are special cases to handle lock poisoning and to prevent writer starvation, which I won’t cover here. They don’t affect the performance of the fast path.

Release the lock – decrement reader count atomically:

mov    esi,0xffffffff        ; Load -1 into esi (will decrement reader count)
lock xadd DWORD PTR [rdi],esi ; Atomically add -1 to lock state, return old value in esi

The “magic values” such as 0x3FFF_FFFD and 0x3FFF_FFFF are explained here.
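
The reader-side function that produces this fast path is just a plain read() call; roughly (the name read_it is mine, mirroring the write_it that appears in the disassembly below):

use std::sync::RwLock;

// Take and release a read lock: the `lock cmpxchg` above is the acquire,
// the `lock xadd` with -1 is the release when the guard drops.
fn read_it(lock: &RwLock<u64>) -> u64 {
    let guard = lock.read().unwrap();
    *guard
} // reader count decremented here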

Write

Take the lock - store the special write-lock value, if possible, atomically:

mov    ecx,0x3fffffff	; The special write lock bit pattern
xor    eax,eax			; eax = 0
lock cmpxchg DWORD PTR [rdi],ecx ; Atomically compare [rdi] with eax, swap with ecx if equal
jne    15515 <rw_lock::write_it+0x75>	; Jump to wait path if CAS failed (read or write lock already held, or reader or writer waiting)

Release the lock – set it back to 0 atomically.

mov    esi,0xc0000001 ; Negative 0x3FFF_FFFF in 32-bit signed arithmetic
lock xadd DWORD PTR [rdi],esi ; Clear the write lock, loading current value

Loading the current value with the xadd (exchange-add) allows us to then check whether there are readers or writers waiting that we need to wake up.
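
For reference, the write_it in the jump label above is along these lines (only the name comes from the disassembly; the body is my sketch):

use std::sync::RwLock;

// Take and release the write lock: the `lock cmpxchg` stores 0x3FFF_FFFF,
// the `lock xadd` with -0x3FFF_FFFF clears it when the guard drops.
fn write_it(lock: &RwLock<u64>, value: u64) {
    let mut guard = lock.write().unwrap();
    *guard = value;
} // write lock released here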

Cost

The uncontended read and write both do the same core operation: lock cmpxchg. Acquiring the read lock involves an extra comparison and addition, so technically the read lock is more expensive to acquire than the write lock, although those operations are very fast, and dwarfed by the lock cmpxchg.

RwLock is very similar to Mutex. They both use 0 for unlocked. Mutex uses 1 for locked, whereas RwLock stores more information in the rest of the u32 (futex). Values 1 to 0x3FFF_FFFE are the number of read locks, and 0x3FFF_FFFF is the write lock. Both locks also use some upper bits to indicate whether there are threads waiting to acquire the lock.
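
Written out as constants, that layout looks something like this (a simplified picture; the real stdlib uses its own names and also reserves the upper bits for the waiting flags):

// Simplified view of the RwLock futex value, not the actual std constants.
const UNLOCKED: u32 = 0;
const MAX_READERS: u32 = 0x3FFF_FFFE; // values 1..=MAX_READERS = number of read locks held
const WRITE_LOCKED: u32 = 0x3FFF_FFFF; // a single writer holds the lock
// Upper bits (not shown) flag whether readers or writers are parked waiting.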

Mutex vs RwLock Cost Comparison

In the uncontended state both locks should have very similar cost.

The big difference is when there are multiple concurrent readers. The RwLock cost stays the same, each thread atomically incrementing the read lock count. The Mutex however will push every reader after the first one to the slow contended path. They get parked until the first thread unlocks and notifies one of them.

The conclusion I take from this is that when the idea of separate “readers” and “writers” makes sense for the locked resource I should default to RwLock not Mutex.
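
As a sketch of that many-readers / occasional-writer shape with RwLock (all names here are illustrative):

use std::sync::RwLock;
use std::thread;

fn share_config(config: &RwLock<String>) {
    thread::scope(|s| {
        // The occasional writer.
        s.spawn(|| {
            *config.write().unwrap() = String::from("new settings");
        });
        // Many busy readers; their read locks coexist, so they never push
        // each other onto the contended slow path.
        for _ in 0..8 {
            s.spawn(|| {
                let settings = config.read().unwrap();
                let _ = settings.len();
            });
        }
    });
}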

AtomicPtr

If the value being shared is an integer or bool type, Rust’s atomic types are the obvious choice.
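
For example (an illustrative sketch), a flag you flip at runtime needs nothing more than an AtomicBool:

use std::sync::atomic::{AtomicBool, Ordering};

static VERBOSE_LOGGING: AtomicBool = AtomicBool::new(false);

// Many busy readers, on any thread.
fn should_log() -> bool {
    VERBOSE_LOGGING.load(Ordering::Relaxed)
}

// The occasional writer.
fn set_logging(on: bool) {
    VERBOSE_LOGGING.store(on, Ordering::Relaxed);
}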

To share a richer type we can use AtomicPtr<T>, and swap out the object being pointed to. This is ideal for the “many reads / occasional writes” scenario here because as we’ll see the atomic part of the read will be zero-cost.

AtomicPtr Example

Here’s an example. Let’s pretend we need to know which user is on-call. We need a User:

struct User {
    name: String,
    permissions: u64,
}
impl User {
    fn make(i: usize) -> Box<Self> {
        Box::new(Self {
            name: format!("name_{i}"),
            permissions: i as u64,
        })
    }
}

Note how make returns a Box, because we will need a heap pointer next.

We store that pointer in an AtomicPtr:

    let on_call_ptr = AtomicPtr::new(Box::into_raw(User::make(0)));

Then we share it with the writer (which changes who is on-call currently) and the many readers:

    use std::thread;
    thread::scope(|s| {
        // Writer
        s.spawn(|| writer(&on_call_ptr));
        // Readers
        for thread_idx in 0..3 {
            let on_call_ptr = &on_call_ptr;
            s.spawn(move || reader(thread_idx, on_call_ptr));
        }
    });

Usually to manage the lifetime of the AtomicPtr we would put it in an Arc. Here the lifetime is known and limited, so we can tell Rust that with thread::scope and avoid the Arc.
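
If the threads are not scoped - say the readers run for the life of the program - the same pattern works with the AtomicPtr inside an Arc. A sketch, re-using the writer function from above:

use std::sync::Arc;
use std::sync::atomic::AtomicPtr;
use std::thread;

let on_call_ptr = Arc::new(AtomicPtr::new(Box::into_raw(User::make(0))));
let for_writer = Arc::clone(&on_call_ptr);
thread::spawn(move || writer(&for_writer)); // &Arc<AtomicPtr<User>> derefs to &AtomicPtr<User>
// ... clone the Arc again for each reader thread.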

The writer updates the value being pointed to, and de-allocates the previous value’s memory:

	let user_ptr = Box::into_raw(User::make(user_id));
	let old = on_call_ptr.swap(user_ptr, Ordering::Relaxed);
	unsafe { drop(Box::from_raw(old)) };

We atomically replace the user on call, and cleanup the old allocation.

We comply with from_raw’s safety requirements because swap ensures we are the sole owner of the old pointer, and it came from a Box originally so the layout is correct.

The readers read it:

	let on_call = unsafe { &*on_call_ptr.load(Ordering::Relaxed) };

The de-reference is safe here because we know it contains a valid User (not null, properly aligned, etc), which is never modified after being written (so no data race).
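
Put together, a reader might look like this (the signature matches the spawn call above; the body is a sketch):

use std::sync::atomic::{AtomicPtr, Ordering};

fn reader(thread_idx: usize, on_call_ptr: &AtomicPtr<User>) {
    for _ in 0..1_000 {
        // The Relaxed load is the single `mov` shown in the next section.
        let on_call = unsafe { &*on_call_ptr.load(Ordering::Relaxed) };
        println!("thread {thread_idx}: {} is on call", on_call.name);
    }
}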

The full code is here.

Writer assembly

The write is straightforward, using the same xchg instruction we saw in the Mutex unlock case.

call   1a070 <ptr_magic::User::make>
mov    r15,rax						 ; The new User is in rax, move it to r15
xchg   QWORD PTR [rbx],r15			 ; on_call_ptr.swap
									 ; on_call_ptr is at address in rbx

That’s the same cost as any of the uncontended Mutex or RwLock operations we looked at. Our writer is always uncontended; we only have one.

Reader assembly

This line in the reader, the key atomic load:

	let on_call = unsafe { &*on_call_ptr.load(Ordering::Relaxed) };

Compiles to this:

mov    rax,QWORD PTR [rbx]	; on_call_ptr is at address in rbx

Don’t you think that’s wonderful?

On x86 any memory ordering below SeqCst compiles to a simple mov instruction; the ordering is effectively ignored because we get it for free. I have a post with more detail and examples here: “Rust atomics on x86: How and why”.

The x86 cache coherency rules ensure that when one CPU core changes a value (the writer updates the on-call pointer), it will be visible to all other CPU cores (the readers) in about the same time as an L3 cache fetch, typically tens of nanoseconds. The compiler does not need to insert atomic instructions to achieve this.

This means we can also do zero-cost atomic writes. The xchg instruction earlier in the writer case is only necessary because we are swapping: we need the old pointer to de-allocate its memory. If we don’t mind leaking the memory (e.g. a short-running program with only a few writes), or the protected resource doesn’t need cleaning up, we could use a simple store like this:

	on_call_ptr.store(user_ptr, Ordering::Relaxed);

That also compiles to a simple mov instruction:

mov    QWORD PTR [rbx],rax	; on_call_ptr is at address in rbx
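
As an aside (not part of the measured code): SeqCst is the one ordering that is not free on x86. A SeqCst store typically compiles to an xchg (or, with older compilers, a mov followed by an mfence), while a SeqCst load stays a plain mov:

	on_call_ptr.store(user_ptr, Ordering::SeqCst); // roughly: xchg QWORD PTR [rbx],rax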

Since on x86 the compiler is basically going to ignore that we used an AtomicPtr and read and write directly to/from memory, we could do that too in Rust. I prefer not to, however. It involves more unsafe, and sometimes converting *mut T to u64 because raw pointers are not Send, all of which tends to upset code reviewers. While it makes me feel clever at the time I write it (the assembly comes out the same!), I find it annoying to maintain three months from now. So let’s not do that.

Conclusion

We already have nice things. That’s the conclusion.

  • Avoid having a contended Mutex. That is the biggest performance killer here.
  • Use a RwLock. That’s the safe, easy option for a many-readers / occasional-writer situation. It scales consistently to over a billion (0x3FFF_FFFE) active read locks, at a cost similar to an extra L3 cache access on every read.
  • If that is too slow, an AtomicPtr swap gets you atomicity at zero performance cost, but adds some development complexity. You have to remember to de-allocate the old memory, you must have a single writer, and must not otherwise access the pointed-to memory.

Remembering to de-allocate the memory seems obvious, but in Rust it’s not something we usually have to think about, hence it is easy to miss. And a single writer might become two when a different team adds a feature three months from now.

RwLock is safe from all this.