Graham King

Solvitas perambulum

Underrust: What is the cost of sync::Once?

Part of the Underrust series.

Rust’s std::sync::Once runs initialization code exactly once, even when called concurrently from multiple threads. Here’s an example usage, and the test code we’ll analyze:

use std::sync::Once;

static START: Once = Once::new();
static mut VAL: i32 = 0;

// Usually called from somewhere we don't control
fn my_func() {
    START.call_once(|| unsafe {
        // one-time setup
        VAL = std::env::args().len() as i32;
    });
    // body of my_func
}

fn main() {
    for _ in 0..1_000_001 {
        my_func();
    }
    std::process::exit(unsafe { VAL });
}

In practice my_func would be an FFI callback called from multiple threads concurrently. We want some initialization on first call, but we don’t want to do it if we’re never called.
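For concreteness, here’s roughly the shape I mean, reusing the START static from the example above. This is only a sketch: the callback name and its event argument are hypothetical, standing in for whatever the C library actually passes us.

unsafe extern "C" fn my_callback(_event: i32) {
    START.call_once(|| {
        // one-time setup
    });
    // handle the event...
}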

After our closure runs on the first call, we still hit call_once a million more times, once per call to my_func. What is the cost of that? Let’s find out!

The source code for call_once is nice and short:

pub fn call_once<F>(&self, f: F)
where
    F: FnOnce(),
{
    // Fast path check
    if self.is_completed() {
        return;
    }

    let mut f = Some(f);
    self.call_inner(false, &mut |_| f.take().unwrap()());
}

Every call except the first exits on the fast path, so is_completed is the method of interest, and it’s a single line:

#[inline]
pub fn is_completed(&self) -> bool {
    self.state_and_queue.load(Ordering::Acquire).addr() == COMPLETE
}

That looks complex but it turns out not to be. It’s actually significantly simpler in assembly as we’ll see.

  • state_and_queue is an AtomicPtr<()>, which “has the same in-memory representation as a *mut T”. On a 64-bit target that makes it effectively a u64. It is cleverly used for multiple things, but all that matters to us are the bottom two bits, its state flags (all four states are shown below).
  • Atomic orderings cost nothing on x86: an Acquire load compiles to a plain mov, because the hardware memory model is already at least that strong. So we can ignore the ordering here.
  • COMPLETE is a constant with value 3.
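For reference, here are all four states that fit in those two bits. The values below are from the std source at the time of writing; they’re an internal detail and could change:

const INCOMPLETE: usize = 0x0; // call_once has not started yet
const POISONED: usize = 0x1;   // the initialization closure panicked
const RUNNING: usize = 0x2;    // a thread is running the closure now
const COMPLETE: usize = 0x3;   // done: the fast path we care about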

So what that line really does is compare a variable with the number 3. And sure enough, the assembly for is_completed is:

mov    rax,QWORD PTR [rdi]   # Load `state_and_queue` from memory
cmp    rax,0x3               # Compare it to 3
jne    7957 <std::sync::once::Once::call_once+0x17>   # Jump to the slow path if they differ

The is_completed function call is always inlined, so there’s no overhead there.

The call_once call is inlined if you call it a lot. In my experiments it got inlined once I looped more than 37 times; with fewer iterations LLVM unrolls the loop but doesn’t inline the call. The exact number will vary, but I expect the pattern to hold: if you call call_once a lot, which is when you’re most likely to worry about performance, there will be no function call overhead here either.

The jump branch (jne) is never taken after the first time. It will be easy for the branch predictor to get right, so we won’t have any extra delay there.

Looking up the latency of our three assembly instructions in Agner Fog’s tables for my Tiger Lake CPU, we get:

  • Three cycles for the mov from L1 cache. Intel says 5 cycles.
  • One cycle for cmp, and we can do four at once.
  • Less than one for jne. According to Cloudflare’s Branch Predictor post, conditional branches that are never taken (well, taken once and then never, here) are close to free. If the compiler instead uses a je (jump-if-equal, whose condition will be true nearly always), that’s about 1.5 cycles (also per the Cloudflare post).

Even in a very tight loop, state_and_queue never gets promoted to a register; it’s always a memory load. I’m guessing that’s because it’s an AtomicPtr: another thread could change it at any time, so the compiler must re-read it on every call.
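You can reproduce the effect with a minimal sketch: the compiler must re-load an atomic on every iteration, where it would happily keep a plain local in a register.

use std::sync::atomic::{AtomicUsize, Ordering};

static STATE: AtomicUsize = AtomicUsize::new(0);

// Each iteration re-loads STATE from memory: another thread may
// store to it at any moment, so it can't be cached in a register.
fn wait_for_complete() {
    while STATE.load(Ordering::Acquire) != 3 {
        std::hint::spin_loop();
    }
}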

External Memory Model

The cost of a sync::Once therefore reduces to the cost of fetching the value from L1 cache: 3 to 5 cycles. The compare (cmp) and not-taken jump (jne) are so cheap that we can ignore them.

Computing the cost of some code by counting only the I/O operations is called the External Memory Model. You count network, disk, and memory activity (in that order) and ignore other operations. It is usually a good way to approximate and compare the performance of a snippet of code: the fewer I/O operations it does, the faster it will run.
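As a toy illustration of counting only the memory traffic:

// Under the external memory model this loop costs about n memory
// reads; the n additions overlap with the loads and are roughly free.
fn sum(values: &[i64]) -> i64 {
    let mut total = 0;
    for v in values {
        total += v; // one load per element dominates
    }
    total
}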

By far the biggest delay would be if state_and_queue (asm [rdi]) is not in L1 cache. An L2 fetch is 13 cycles, L3 is around 50 cycles, and RAM is in the hundreds of cycles.

If your loop is big enough to push the sync::Once out of L1 cache, is_completed is unlikely to be your bottleneck.

Each 64-byte cache line of memory can’t be stored just anywhere in the L1 cache. On my Tiger Lake CPU each line can only go in one of eight places: the Tiger Lake L1 cache is “eight-way associative”. That means our sync::Once could possibly be evicted earlier than expected. Even in that case, at least seven other hot lines must be competing for the same slots, making the sync::Once load at most 1/8th of our perf issues.
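To make “one of eight places” concrete, here’s the arithmetic for a hypothetical 32 KiB, eight-way cache with 64-byte lines (substitute your own CPU’s geometry):

const LINE: usize = 64;                 // bytes per cache line
const WAYS: usize = 8;                  // eight-way set associative
const SIZE: usize = 32 * 1024;          // hypothetical 32 KiB cache
const SETS: usize = SIZE / LINE / WAYS; // = 64 sets

// The middle bits of an address pick its set; within that set there
// are only WAYS slots, so at most WAYS hot lines can coexist there.
fn cache_set(addr: usize) -> usize {
    (addr / LINE) % SETS
}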

The worst case is false sharing, where another piece of data shares the cache line and is written concurrently by another core. The line keeps getting invalidated and re-fetched, a 100+ cycle penalty on each access. Ouch.

Luckily our START is static and initialized to zero (Once::new() sets our value to 0) so it will be stored in the .bss segment. It will share a cache line with other .bss values and possibly with some of the .data segment. Those are static variables - things like the ThreadId counter or panic and teardown machinery. Writing to a mutable static variable also requires an unsafe block. Doing that in a tight loop is very rare. Our cache-line neighbours won’t be noisy, so we probably don’t need to worry about false sharing here.
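If you ever did need to rule out false sharing, the standard fix is to give the value a cache line to itself. A sketch with a hand-rolled alignment wrapper (crossbeam_utils::CachePadded does the same job):

use std::sync::Once;

// Align the Once to 64 bytes so no unrelated data shares its cache
// line. 64 is an assumption; some CPUs prefetch lines in 128-byte pairs.
#[repr(align(64))]
struct CacheLineAligned<T>(T);

static PADDED_START: CacheLineAligned<Once> = CacheLineAligned(Once::new());

fn init_once() {
    PADDED_START.0.call_once(|| {
        // one-time setup
    });
}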

CPU instructions also need to be loaded from memory and decoded, and you might wonder how much space that mov / cmp / jne takes in the instruction cache. Is it going to make us front-end bound? Well, my example program at the start of this post is front-end bound. That’s because it’s running is_completed in a very tight loop and not doing any real work. The mov/cmp/jne takes 13 bytes to encode, and my L1 instruction cache is 32 KiB. If my_func did any actual work I expect these extra 13 bytes would be insignificant.

Finally, the L1 fetch of our sync::Once can start before we hit that instruction, out of order. The CPU will do other things while it waits and so in practice you may not see any additional latency at all. Always measure.
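In that spirit, here’s a crude timing sketch. For trustworthy numbers use a real benchmark harness (criterion, for example) and check the generated assembly to make sure the loop wasn’t optimized away:

use std::sync::Once;
use std::time::Instant;

static START: Once = Once::new();

fn my_func() {
    START.call_once(|| { /* one-time setup */ });
}

fn main() {
    my_func(); // pay the one-time initialization up front
    let n = 1_000_000u32;
    let start = Instant::now();
    for _ in 0..n {
        my_func();
    }
    let ns = start.elapsed().as_nanos() as f64 / f64::from(n);
    println!("{ns:.2} ns per call");
}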

Conclusion

Is five CPU cycles expensive?

On one hand my CPU could do twenty additions in that time (if you set them up just right), but on the other a single division (or remainder; it’s the same instruction) takes 15 cycles. And fn main() { println!("hello"); } costs about a million cycles. Five cycles is very small indeed.

The cost of sync::Once is an L1 cache fetch, which is very small.

Me, you, the CPU, we’re all in this together. It’s not us against the world.