Return Value Optimization in Rust
Summary
Update Nov 2024: Return value optimization changed starting in Rust 1.71. See the last section for details.
In the program below, where in memory is St created, and where does it end up? Or, to ask it a different way, what does this program print?
struct St {
    a: u64,
    _b: u64,
    _c: u64,
}
fn main() {
    let v: u32 = 1;
    let x = sub1();
    println!("main {:p}, {:p}", &v, &x.a);
}
#[inline(never)]
fn sub1() -> St {
    let v: u32 = 1;
    let x = sub2();
    println!("sub1 {:p}, {:p}", &v, &x.a);
    x
}
#[inline(never)]
fn sub2() -> St {
    let v: u32 = 1;
    let x = St{ a: 42, _b: 0, _c: 0 };
    println!("sub2 {:p}, {:p}", &v, &x.a);
    x
}
Try it out on the Rust playground.
I assumed that St would be created on sub2’s stack. It would be copied from there into sub1’s stack, and then again into main’s stack, but that’s not what we see. The output looks like this:
sub2 0x7ffdac0fea04, 0x7ffdac0feb90
sub1 0x7ffdac0fead4, 0x7ffdac0feb90
main 0x7ffdac0feb8c, 0x7ffdac0feb90
The first address on each line is that function’s v variable, which is clearly going to be on the local stack. We print its address to see where that function’s stack frame is.
The curious and wonderful part is the second address, the address of St: it doesn’t move! It is created in its final resting place on main’s stack. Notice that it is higher than the address of main’s v, so it is in main’s stack frame (remember that the stack grows downwards).
That is return value optimization. C++ permits it for named locals like these and, since C++17, requires it when returning a temporary, so as far as I can tell Rust gets it from LLVM for free.
The process stack for the example above looks like this in memory (showing only the bottom four hex digits of each address):
main
0xeba0 St._c (u64: 8 bytes)
0xeb98 St._b (8 bytes)
0xeb90 St.a (8 bytes)
0xeb8c v (u32: 4 bytes)
.. lots of space for println!
sub1
0xead4 v
.. lots of space for println!
sub2
0xea04 v
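The 8-byte spacing of the St fields in the diagram can be checked from Rust itself. This is a minimal sketch, not part of the original program: the helper field_offsets is hypothetical, and #[repr(C)] is an assumption added here to pin the field order, which the default Rust layout does not promise.

```rust
use std::mem::size_of;

// Same fields as the article's St. #[repr(C)] is an assumption added for
// this sketch so that the field order is guaranteed.
#[repr(C)]
struct St {
    a: u64,
    _b: u64,
    _c: u64,
}

// Hypothetical helper: byte offset of each field from the start of the
// struct, measured from the addresses of a live value.
fn field_offsets(s: &St) -> [usize; 3] {
    let base = s as *const St as usize;
    [
        &s.a as *const u64 as usize - base,
        &s._b as *const u64 as usize - base,
        &s._c as *const u64 as usize - base,
    ]
}

fn main() {
    let x = St { a: 42, _b: 0, _c: 0 };
    // Three u64 fields: 24 bytes in total, one field every 8 bytes.
    println!("size {} bytes, offsets {:?}", size_of::<St>(), field_offsets(&x));
}
```

With repr(C) this reports a size of 24 bytes and offsets 0, 8 and 16, which matches the 0xeb90, 0xeb98, 0xeba0 steps in the diagram.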
If you look at the assembly output it’s clear what’s happening. The address to store St in is passed down into sub1 and sub2 as a function parameter.
main:
; rsp is the stack pointer, so [rsp + <num>] means something on local stack
sub rsp, 200 ; space on main's stack
mov dword ptr [rsp + 44], 1 ; the local variable 'v'
; rdi gets the address, on main's stack, where x (struct St) will go
lea rdi, [rsp + 48] ; lea = Load Effective Address
call playground::sub1
playground::sub1:
; .. stuff removed, but rdi doesn't change ..
call playground::sub2
playground::sub2:
sub rsp, 184 ; most of this space is for println!
mov rax, rdi ; rdi is the hidden return-slot pointer; the ABI also returns it in rax
mov qword ptr [rsp + 16], rax ; spilled to the stack, presumably to reload before ret
mov dword ptr [rsp + 52], 1 ; let v: u32 = 1
; rdi still has an address in main's stack, 0x7ffdac0feb90 in our earlier example
mov qword ptr [rdi], 42 ; x.a = 42
mov qword ptr [rdi + 8], 0 ; x._b = 0
mov qword ptr [rdi + 16], 0 ; x._c = 0
That’s Intel assembly syntax: the destination is on the left, the source on the right. The calling convention (System V x86-64, as used on Linux) puts the first function parameter in register rdi. Here none of the functions take a parameter, and yet rdi is being passed.
If you comment out the println! in sub2 and generate the assembly again (the playground can do this: select Intel assembly flavor under Config), sub2’s stack space goes down to exactly the 4 bytes needed for v (a u32). No space for St in there.
The compiler has effectively transformed our code to this:
fn main() {
    let v: u32 = 1;
    // space for x on main's stack
    let mut x = St{}; // !! Uninitialized, not valid Rust
    sub1(&mut x); // pass x's address so sub1/2 can fill it in
    println!("main {:p}, {:p}", &v, &x.a);
}
The reason this happens is because Rust loves you and you have a heart of gold.
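The same out-pointer shape can be written by hand in valid Rust using std::mem::MaybeUninit. This is only a sketch of the idea, not what the compiler literally emits; the names sub1_out and sub2_out are hypothetical and do not appear in the original program.

```rust
use std::mem::MaybeUninit;

struct St {
    a: u64,
    _b: u64,
    _c: u64,
}

// Hypothetical hand-written sub2: the caller supplies the slot and this
// function fills it in, instead of returning St by value.
#[inline(never)]
fn sub2_out(slot: &mut MaybeUninit<St>) {
    slot.write(St { a: 42, _b: 0, _c: 0 });
}

// Hypothetical hand-written sub1: it only forwards the caller's slot.
#[inline(never)]
fn sub1_out(slot: &mut MaybeUninit<St>) {
    sub2_out(slot);
}

fn main() {
    // Reserve uninitialized space for St in main's own stack frame.
    let mut x = MaybeUninit::<St>::uninit();
    sub1_out(&mut x);
    // SAFETY: sub1_out always writes a complete St into the slot.
    let x = unsafe { x.assume_init() };
    println!("x.a = {}", x.a);
}
```

Nobody should write ordinary code this way; the point is only that the slot lives in the caller and its address travels down the call chain, which is the job rdi does in the assembly.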
Update: Rust 1.71 and beyond
The example above stopped working as of Rust 1.71. Rust / LLVM still do return value optimization, but only if you don’t observe the address. If you run the example above now, each function prints a different address for St. Rust constructs the object on the local function’s stack, prints that address, then moves the value into the caller’s stack. This holds even if, instead of printing the address, you save it in an array passed in or in a global variable.
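Here is a sketch of the "array passed in" variant, assuming the same St as before. The seen parameter is a name introduced for this example. Whether the three recorded addresses match depends on the compiler version, so no expected output is shown.

```rust
struct St {
    a: u64,
    _b: u64,
    _c: u64,
}

#[inline(never)]
fn sub2(seen: &mut [usize; 3]) -> St {
    let x = St { a: 42, _b: 0, _c: 0 };
    // Record the address instead of printing it.
    seen[2] = &x.a as *const u64 as usize;
    x
}

#[inline(never)]
fn sub1(seen: &mut [usize; 3]) -> St {
    let x = sub2(seen);
    seen[1] = &x.a as *const u64 as usize;
    x
}

fn main() {
    let mut seen = [0usize; 3];
    let x = sub1(&mut seen);
    seen[0] = &x.a as *const u64 as usize;
    // Equal addresses mean St was built in place; different ones mean it moved.
    println!("main {:#x}, sub1 {:#x}, sub2 {:#x}", seen[0], seen[1], seen[2]);
    println!("x.a = {}", x.a);
}
```

Recording the address still counts as observing it, so on recent compilers expect the addresses to differ, as described above.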
So how do I know it still works? We have to look at the assembly.
If you remove the println! statements from sub1 and sub2, and don’t store or otherwise access the address of x, you get some very nice return value optimization. This is 1.83-nightly with the --release profile:
0000000000013b40 <rto::main>:
; stuff removed
lea rbx,[rsp+0x68] ; St will be constructed here
mov rdi,rbx
call 13be0 <rto::sub1>
; more stuff removed
0000000000013be0 <rto::sub1>: ; Nothing removed, this is the entire function!
jmp 13bf0 <rto::sub2> ; 💖
0000000000013bf0 <rto::sub2>:
mov QWORD PTR [rdi],0x2a ; x.a = 42
vxorps xmm0,xmm0,xmm0 ; Create 16 bytes of 0 (xmm0 is a 128 bit register)
vmovups XMMWORD PTR [rdi+0x8],xmm0 ; x._b = 0 and x._c = 0 in one move
ret ; Jump directly back to main
Those two functions are only there because I annotated them with #[inline(never)] so we can see what’s going on. sub1 is lovely: it doesn’t even call sub2 (which would push a return address on the stack, requiring a ret to come back). It jumps straight into sub2, which means that when sub2 does ret, it pops the return address that main’s call pushed and goes directly back there.
sub2 constructs St directly at the address in rdi, which main pointed at a slot in its own stack frame. Exactly the return value optimization we were seeing in earlier Rust versions.
- My colleague Geoffry told me what this optimization is called. Thank you.
- Pedro from Brazil informed me that it stopped working at some point, which led me to adding the last section. Thank you.
- I don't know for sure if it's LLVM doing this. I'm guessing based on C++ compilers doing the same thing, and the fact that it happens in both Rust's debug and release modes.