Underrust: What is the cost of Mutex, RwLock and AtomicPtr?
Summary
Resources that are read concurrently by many busy threads and modified rarely include config settings you can modify at runtime, or any kind of active inventory of dynamic resources (files, users, servers, plugins, etc).
In Rust we typically have to wrap the resource in something that makes it `Sync`. That’s usually a `Mutex`, `RwLock`, or a lower-level atomic such as `AtomicPtr`.
But if the resource is only rarely written to, what are we paying to satisfy Rust’s safety rules with `Sync`?
Let’s look at the assembly output to find out, on my x86 Tiger Lake machine with Rust 1.87.
Constraints:
- I am interested in the case of many busy concurrent readers, and one occasional writer.
- On Linux. The Rust stdlib specializes all the `sync` primitives by target OS.
- On x86. ARM has different cache coherency rules and usually needs more locking and synchronization. All the Rust code here will work unchanged, but the assembly will be different.
Internally `Mutex` and `RwLock` both rely on a Linux futex, wrapped as either an `AtomicU8` or `AtomicU32`. The `u8`/`u32` doesn’t make a practical difference because the compiler prefers 32-bit registers. See my earlier post “Underrust: Does it matter what type (u8, u32, …) I use?” for details. `AtomicPtr` and its siblings map pretty directly to assembly instructions.
Mutex
One of my first reflexes as a Rust developer doing multi-threaded or async work is to wrap it, for all values of it, in an `Arc<Mutex<X>>`, so let’s start there.
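For reference, here is a minimal sketch of that reflex; the counter and the thread count are illustrative, not from the post’s code:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state wrapped in Arc<Mutex<X>> so every thread can touch it.
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                // Each access takes the same exclusive lock,
                // even if it only wants to read.
                *counter.lock().unwrap() += 1;
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.lock().unwrap(), 4);
}
```

Every access, read or write, pays the lock and unlock cost we are about to measure.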
Lock
A mutex locks itself by changing its futex from 0 (unlocked) to 1 (locked).
mov ecx,0x1 ; Load value 1 into ecx (desired lock state)
xor eax,eax ; Clear eax to 0 (expected current lock state)
lock cmpxchg DWORD PTR [rdi],ecx ; Atomic compare-and-swap: if [rdi]==0, set to 1, else load [rdi] into eax
jne 152c9 ; If lock was already held (cmpxchg failed), jump to slow path
Unlock
Set it back to 0.
xor ecx,ecx ; Clear ecx for unlock operation
xchg DWORD PTR [rdi],ecx ; Atomically exchange lock with 0 (unlock), get old value in ecx
When one of the operands is memory (`[rdi]` here), `xchg` behaves as if the `lock` prefix were present, so this is the same as `lock xchg`.
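The two fast paths can be sketched with Rust atomics directly; this is an illustration of the futex protocol, not the stdlib’s actual code:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// 0 = unlocked, 1 = locked, as in the assembly above.
fn try_lock(futex: &AtomicU32) -> bool {
    // lock cmpxchg: succeed only if the futex is currently 0.
    futex
        .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn unlock(futex: &AtomicU32) -> u32 {
    // xchg: set back to 0, returning the old value so the caller
    // can tell whether any waiters need waking.
    futex.swap(0, Ordering::Release)
}

fn main() {
    let futex = AtomicU32::new(0);
    assert!(try_lock(&futex));  // acquired on the fast path
    assert!(!try_lock(&futex)); // already held: the real lock parks here
    assert_eq!(unlock(&futex), 1);
}
```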
Cost
The uncontended read and write both perform the same core operations: a `lock cmpxchg` to lock and a `lock xchg` to unlock.
The `lock` prefix ensures the instruction it modifies is atomic. It does this by taking ownership of the cache line, meaning that no other processor can access it during the operation. This is important because even though `cmpxchg` is a single assembly instruction, the CPU needs multiple steps to execute it: read the current value from memory, compare it, and so on.
The instruction tables (pdf) suggest both of these operations cost 20 to 30 CPU cycles. That’s in the ballpark of reading from the L3 cache, so quite reasonable.
RwLock
Read
Take the lock - add a reader, if possible, atomically:
mov eax,DWORD PTR [rdi] ; Load current lock state from [rdi] into eax (rdi = lock pointer)
cmp eax,0x3ffffffd ; Compare lock state with max reader count (1073741821)
ja 14255 ; Jump to slow path if state > max (overflow or write lock held)
lea ecx,[rax+0x1] ; Calculate new state (current + 1 reader) into ecx
lock cmpxchg DWORD PTR [rdi],ecx ; Atomically compare [rdi] with eax, swap with ecx if equal
jne 14255 ; Jump to slow path if CAS failed (state changed)
There are special cases to handle lock poisoning and to prevent writer starvation, which I won’t cover here. They don’t affect the performance of the fast path.
Release the lock – decrement reader count atomically:
mov esi,0xffffffff ; Load -1 into esi (will decrement reader count)
lock xadd DWORD PTR [rdi],esi ; Atomically add -1 to lock state, return old value in esi
The “magic values” such as 0x3FFF_FFFD are explained here.
Write
Take the lock - store the special write value, if possible, atomically:
mov ecx,0x3fffffff ; The special write lock bit pattern
xor eax,eax ; eax = 0
lock cmpxchg DWORD PTR [rdi],ecx ; Atomically compare [rdi] with eax, swap with ecx if equal
jne 15515 <rw_lock::write_it+0x75> ; Jump to wait path if CAS failed (read or write lock already held, or reader or writer waiting)
Release the lock – set it back to 0 atomically.
mov esi,0xc0000001 ; Negative 0x3FFF_FFFF in 32-bit signed arithmetic
lock xadd DWORD PTR [rdi],esi ; Clear the write lock, loading current value
Loading the current value with the `xadd` (exchange-and-add) allows us to then check whether there are readers or writers waiting that we need to wake up.
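The wrapping arithmetic checks out. A tiny sketch, with constant names of my own invention (the stdlib uses its own internal names):

```rust
// RwLock futex states seen in the assembly above.
const UNLOCKED: u32 = 0;
const WRITE_LOCKED: u32 = 0x3FFF_FFFF;

fn main() {
    // Releasing the write lock xadds 0xC000_0001, which is
    // -0x3FFF_FFFF in 32-bit wrapping arithmetic.
    assert_eq!(WRITE_LOCKED.wrapping_add(0xC000_0001), UNLOCKED);
}
```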
Cost
The uncontended read and write both do the same core operation: `lock cmpxchg`. Acquiring the read lock involves an extra comparison and addition, so technically the read lock is more expensive to acquire than the write lock, although those operations are very fast and dwarfed by the `lock cmpxchg`.
`RwLock` is very similar to `Mutex`. They both use 0 for unlocked. `Mutex` uses 1 for locked, whereas `RwLock` stores more information in the rest of the `u32` (futex). Values 1 to 0x3FFF_FFFE are the number of read locks, and 0x3FFF_FFFF is the write lock. Both locks also use some upper bits to indicate if there are threads waiting to acquire the lock.
Mutex vs RwLock Cost Comparison
In the uncontended state both locks should have very similar cost.
The big difference is when there are multiple concurrent readers. The `RwLock` cost stays the same, with each thread atomically incrementing the read-lock count. The `Mutex`, however, pushes every reader after the first onto the slow, contended path. They get parked until the first thread unlocks and notifies one of them.
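From the API side the difference looks like this; a minimal illustration, not a benchmark:

```rust
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(5u32);

    // Many read guards can be alive at once; each acquisition is just
    // an atomic increment of the reader count on the fast path.
    let r1 = lock.read().unwrap();
    let r2 = lock.read().unwrap();
    assert_eq!(*r1 + *r2, 10);
    drop((r1, r2));

    // The write lock is exclusive and waits for all readers to leave.
    *lock.write().unwrap() += 1;
    assert_eq!(*lock.read().unwrap(), 6);
}
```

With a `Mutex`, the second acquisition would block instead of proceeding.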
The conclusion I take from this is that when the idea of separate “readers” and “writers” makes sense for the locked resource, I should default to `RwLock`, not `Mutex`.
AtomicPtr
If the value being shared is an integer or bool type, Rust’s atomic types are the obvious choice.
To share a richer type we can use `AtomicPtr<T>` and swap out the object being pointed to. This is ideal for the “many reads / occasional writes” scenario here because, as we’ll see, the atomic part of the read is zero-cost.
AtomicPtr Example
Here’s an example. Let’s pretend we need to know which user is on-call. We need a User:
struct User {
    name: String,
    permissions: u64,
}
impl User {
    fn make(i: usize) -> Box<Self> {
        Box::new(Self {
            name: format!("name_{i}"),
            permissions: i as u64,
        })
    }
}
Note how `make` returns a `Box`, because we will need a heap pointer next.
We store that pointer in an `AtomicPtr`:
let on_call_ptr = AtomicPtr::new(Box::into_raw(User::make(0)));
Then we share it with the writer (which changes who is on-call currently) and the many readers:
use std::thread;
thread::scope(|s| {
    // Writer
    s.spawn(|| writer(&on_call_ptr));
    // Readers
    for thread_idx in 0..3 {
        let on_call_ptr = &on_call_ptr;
        s.spawn(move || reader(thread_idx, on_call_ptr));
    }
});
Usually, to manage the lifetime of the `AtomicPtr`, we would put it in an `Arc`. Here the lifetime is known and limited, so we can tell Rust that with `thread::scope` and avoid the `Arc`.
The writer updates the value being pointed to, and de-allocates the previous value’s memory:
let user_ptr = Box::into_raw(User::make(user_id));
let old = on_call_ptr.swap(user_ptr, Ordering::Relaxed);
unsafe { drop(Box::from_raw(old)) };
We atomically replace the user on call and clean up the old allocation.
We comply with `from_raw`’s safety requirements because the swap ensures we are the sole owner of the old pointer, and it came from a `Box` originally so the layout is correct.
The readers read it:
let on_call = unsafe { &*on_call_ptr.load(Ordering::Relaxed) };
The de-reference is safe here because we know it contains a valid `User` (so not null, misaligned, etc.), which is never modified after being written (so no data race).
The full code is here.
Writer assembly
The write is straightforward, using the same `xchg` instruction we saw in the `Mutex` unlock case.
call 1a070 <ptr_magic::User::make>
mov r15,rax ; The new User is in rax, move it to r15
xchg QWORD PTR [rbx],r15 ; on_call_ptr.swap
; on_call_ptr is at address in rbx
That’s the same cost as any of the uncontended `Mutex` or `RwLock` operations we looked at. Our writer is always uncontended; we only have one.
Reader assembly
This line in the reader, the key atomic load:
let on_call = unsafe { &*on_call_ptr.load(Ordering::Relaxed) };
Compiles to this:
mov rax,QWORD PTR [rbx] ; on_call_ptr is at address in rbx
Don’t you think that’s wonderful?
On x86 any memory ordering below `SeqCst` compiles to a simple `mov` instruction; it is effectively ignored because we get that for free. I have a post with more detail and examples here: “Rust atomics on x86: How and why”.
The x86 cache coherency rules ensure that when one CPU core changes a value (the writer updates the on-call pointer), it will be visible to all other CPU cores (the readers) in about the same time as an L3 cache fetch, typically tens of nanoseconds. The compiler does not need to insert atomic instructions to achieve this.
This means we can also do zero-cost atomic writes. The `xchg` instruction earlier in the writer case is only necessary because we are swapping: we need the old pointer to de-allocate its memory. If we don’t mind leaking the memory (e.g. a short-running program with only a few writes), or the protected resource doesn’t need cleaning up, we could use a simple store like this:
on_call_ptr.store(user_ptr, Ordering::Relaxed);
That also compiles to a simple `mov` instruction:
mov QWORD PTR [rbx],rax ; on_call_ptr is at address in rbx
Since on x86 the compiler is basically going to ignore that we used an `AtomicPtr` and read and write directly to/from memory, we could do that too in Rust. I prefer not to, however. It involves more `unsafe`, and sometimes converting `*mut T` to `u64` because raw pointers are not `Send`, all of which tends to upset code reviewers. While it makes me feel clever at the time I write it (the assembly comes out the same!), I find it annoying to maintain three months from now. So let’s not do that.
Conclusion
We already have nice things. That’s the conclusion.
- Avoid having a contended `Mutex`. That is the biggest performance killer here.
- Use a `RwLock`. That’s the safe, easy option for a many-readers / occasional-writer situation. It scales consistently to over a billion (0x3FFF_FFFE) active read locks, at a cost similar to an extra L3 cache access on every read.
- If that is too slow, an `AtomicPtr` swap gets you atomicity at zero performance cost, but adds some development complexity. You have to remember to de-allocate the old memory, you must have a single writer, and you must not otherwise access the pointed-to memory.
Remembering to de-allocate the memory seems obvious, but in Rust it’s not something we usually have to think about, hence it is easy to miss. And a single writer might become two when a different team adds a feature three months from now. `RwLock` is safe from all this.