Graham King

Solvitas perambulum

Underrust: Multiple Return Values

Part of the Underrust series.

How does Rust return values, and does it make any difference to us programmers? The ABI defines both how to pass values to a function and how to return values. Let’s investigate.

One or two integers: rax and rdx

The most common case is a single integer return value (which includes pointers). That goes in rax. A second value, or the upper half of a value bigger than 64 bits, goes in rdx. And this is what we see.

Rust

fn ret(a: u32) -> (u32, u32) {
    (a, a)
}

Assembly

underrust::ret:
    ; param `a` is in edi
    1020:       mov    eax,edi ; First return in eax
    1022:       mov    edx,edi ; Second return in edx
    1024:       ret

If ret returned a u128 we would see the upper 64 bits in rdx and the lower in rax.
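
To try the u128 case yourself, here is a minimal sketch. The name `ret_wide` and the way the value is built are my own choices for illustration, not part of the original program, and `#[inline(never)]` is only there so the function stays visible in the disassembly. Assuming an optimized build on x86-64 Linux, the low half should show up in rax and the high half in rdx.

```rust
// Hypothetical example: build a 128-bit return value from a 64-bit input.
#[inline(never)]
pub fn ret_wide(a: u64) -> u128 {
    // Put `a` in both halves so each register's content is easy to spot.
    ((a as u128) << 64) | (a as u128)
}

fn main() {
    let v = ret_wide(7);
    // Split the result back into its two 64-bit halves.
    let low = v as u64;
    let high = (v >> 64) as u64;
    println!("low = {}, high = {}", low, high);
}
```

Disassembling the release binary (for example with objdump) is the way to confirm which registers are used on your compiler version.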

Returning a struct follows the same rules as returning individual values because at the assembly level those two things are identical. This will generate the exact same assembly as the function above.

Rust

struct Obj {
    x: u32,
    y: u32,
}
fn ret(a: u32) -> Obj {
    Obj { x: a, y: a }
}

Indeed, if you put both ret functions in the same program like this

  • fn ret1(u32) -> (u32, u32)
  • fn ret2(u32) -> Obj

then LLVM will output a single function and call it twice.
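
A minimal sketch of that experiment, for anyone who wants to reproduce it. The names `ret1`, `ret2` and `Obj` follow the text above; `#[inline(never)]` is my addition so the calls survive optimization. Whether the two functions really end up as one depends on the compiler version and optimization level, so treat the merging as something to check in the disassembly rather than a guarantee.

```rust
pub struct Obj {
    pub x: u32,
    pub y: u32,
}

// Returns a tuple of two u32 values.
#[inline(never)]
pub fn ret1(a: u32) -> (u32, u32) {
    (a, a)
}

// Returns a struct holding the same two u32 values.
#[inline(never)]
pub fn ret2(a: u32) -> Obj {
    Obj { x: a, y: a }
}

fn main() {
    let (p, q) = ret1(3);
    let o = ret2(3);
    println!("{} {} {} {}", p, q, o.x, o.y);
}
```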

Returning a function returns its address in rax as a function pointer. The caller then calls it via the register: call rax.
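
Here is a small sketch of returning a function. The names `double` and `pick` are hypothetical, chosen only for this illustration. In an optimized build the compiler may see through the pointer and call `double` directly, so the indirect call is easiest to observe when `pick` is kept out of line as below.

```rust
// Hypothetical helper that gets handed back to the caller.
fn double(a: u32) -> u32 {
    a * 2
}

// Returns a plain function pointer, i.e. the address of `double`.
#[inline(never)]
pub fn pick() -> fn(u32) -> u32 {
    double
}

fn main() {
    let f = pick();
    // Calling through `f` uses the address that `pick` returned.
    println!("{}", f(21));
}
```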

One or two floats: xmm0 and xmm1

Returning floating point values uses xmm0 and xmm1.

Rust

fn ret(a: u32) -> (f64, f64) {
    let ff = a as f64;
    (ff, ff)
}

Assembly

underrust::ret:
    1040:       vcvtusi2sd xmm0,xmm0,edi ; xmm0 = `a`
    1046:       vmovapd xmm1,xmm0        ; xmm1 = xmm0
    104a:       ret

That unpronounceable vcvtusi2sd converts our u32 param to an f64 return value. It is using a 128-bit register because that’s the ABI. vmovapd is simply mov for SSE/AVX registers.

Three or more: The caller’s stack

Beyond two values we use the caller’s stack. The first parameter register (rdi) now carries the address to write the return values to, and a moves to the second parameter register (rsi).

Rust

fn ret(a: u32) -> (u32, u32, u32) {
    (a, a, a)
}

Assembly

underrust::ret:
    1030:       mov    DWORD PTR [rdi],esi
    1032:       mov    DWORD PTR [rdi+0x4],esi
    1035:       mov    DWORD PTR [rdi+0x8],esi
    1038:       ret

In the general case it goes on like this forever, more values on the stack.

If you return an object (struct) it’s the same, as we saw earlier. The struct is returned as its component values on the stack. From the assembler’s point of view there is no such thing as a struct.

If you have a chain of function calls and a return value that gets passed straight back up you will see return value optimization. If your call chain goes a -> b -> c, and b returns the output of c, then instead of c writing the values into b’s stack, and then b copying them to a’s stack, c will write them directly to a’s stack, eliding a copy.
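
The a -> b -> c chain can be sketched as below. The function bodies are my own minimal stand-ins, and `#[inline(never)]` is added so the three calls stay separate; without it the optimizer would most likely inline everything and there would be nothing to see. Whether the intermediate copy is actually elided is up to the compiler version and optimization level.

```rust
// Innermost function: produces the three values.
#[inline(never)]
pub fn c(v: u32) -> (u32, u32, u32) {
    (v, v + 1, v + 2)
}

// Middle function: passes c's output straight back up, untouched.
#[inline(never)]
pub fn b(v: u32) -> (u32, u32, u32) {
    c(v)
}

// Outermost function: the one that finally consumes the values.
#[inline(never)]
pub fn a(v: u32) -> u32 {
    let (x, y, z) = b(v);
    x + y + z
}

fn main() {
    println!("{}", a(1));
}
```

In the disassembly, the thing to look for is `b` forwarding the address it received to `c` instead of reserving its own space and copying afterwards.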

Aside: An elegant optimization

In the example I’m using I am returning the same value multiple times. SIMD instructions are really good at working with the same value multiple times. Hence when we go to four return values LLVM does something really elegant. It packs the values into a larger AVX register and does a single write to the stack. It’s an unusual case but it’s pleasing to look at, so here it is.

Rust

fn ret(a: u32) -> (u32, u32, u32, u32) {
    (a, a, a, a)
}

Assembly

underrust::ret:
    ; fill xmm0 with four copies of `a`
    1030:       vpbroadcastd xmm0,esi
    ; copy xmm0 (four `a`'s) to return location
    1036:       vmovdqu XMMWORD PTR [rdi],xmm0
    103a:       ret

The caller then picks the 32-bit words off like this:

Assembly

lea    rdi,[rsp+0x8]           ; where to write the return values
call   1030 underrust::ret
mov    esi,DWORD PTR [rsp+0xc] ; second return value
add    esi,DWORD PTR [rsp+0x8] ; first return value (0xc - 8 = 4)
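
The Rust source of that caller is not shown, so the following is only a guess at what could produce such loads: a hypothetical `caller` that uses the first two of the four values and ignores the rest. `#[inline(never)]` is my addition to keep the call to `ret` in place.

```rust
#[inline(never)]
pub fn ret(a: u32) -> (u32, u32, u32, u32) {
    (a, a, a, a)
}

// Hypothetical caller: adds the first two return values together.
#[inline(never)]
pub fn caller(a: u32) -> u32 {
    let (first, second, _, _) = ret(a);
    first + second
}

fn main() {
    println!("{}", caller(5));
}
```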

Conclusions

Here’s what I learnt:

  • If you return one or two primitive values they will be in registers, making them effectively zero cost.
  • If you return anything beyond that you will use stack memory. That’s most likely L1 cache, 3-5 cycles per read/write, so still very fast but no longer free.
  • If your function is inlined none of this matters.
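
To see the effect of inlining for yourself, one option is to compare two otherwise identical functions that differ only in their inline attribute. This is a sketch with hypothetical names; note that `#[inline(always)]` is a strong hint to the compiler, not an absolute promise.

```rust
// Hypothetical pair of functions that differ only in their inline hint.
#[inline(always)]
fn ret_inlined(a: u32) -> (u32, u32, u32) {
    (a, a, a)
}

#[inline(never)]
fn ret_called(a: u32) -> (u32, u32, u32) {
    (a, a, a)
}

// If the callee is inlined, the compiler is free to keep everything in registers.
pub fn sum_inlined(a: u32) -> u32 {
    let (x, y, z) = ret_inlined(a);
    x + y + z
}

// With a real call, the tuple comes back by whatever route the ABI picks.
pub fn sum_called(a: u32) -> u32 {
    let (x, y, z) = ret_called(a);
    x + y + z
}

fn main() {
    println!("{} {}", sum_inlined(2), sum_called(2));
}
```

Both functions compute the same result; the difference, if any, is only visible in the generated assembly.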

And compiler optimizations are endlessly fascinating.

Thanks for reading. Have some energy: Neon Hearts.