Graham King

Solvitas perambulum

Underrust: Does it matter what type (u8, u32, ...) I use?

Updated 2022-10-18

Part of the Underrust series.

Yes. Yes it does.

Smaller types use less memory, so you can fit more of them into an L1 cache line. All other things being equal, that will make your program run faster. I expect this is obvious to you, but I checked anyway.

Smaller types use less memory

Say you have a Person struct with some natural values like age and num_children. In Rust we often default to usize (meaning u64 on a 64-bit CPU) because it makes indexing less noisy. But here it is the wrong choice.

If you’re following along using the template in the intro post, use the stdlib template variation for this first snippet (for Box). The rest of the code uses the normal stripped down template.

Rust

struct Person {
    age: usize,
    num_children: usize,
}

let mut p = Box::new(Person { age: 30, num_children: 2 });
// trick to prevent compiler optimizing it all away
let ret = &p as *const _ as usize;
p.num_children = ret;
Assembly

mov    edi,0x10    ; allocate 16 bytes (0x10, usize * 2)
mov    esi,0x8     ; aligned on 8 byte boundary
call   QWORD PTR [rip+0x40a17]    ; call std::alloc::alloc
test   rax,rax     ; rax is address of allocation. check it's not null
; [not showing address is null case]

; p.age = 30 (0x1e)
; p.age is at address stored in rax
mov    QWORD PTR [rax],0x1e

; ret = &p
mov    QWORD PTR [rsp],rax
mov    rsi,rsp

; p.num_children = ret
; p.num_children is 8 bytes after p.age, as expected
mov    QWORD PTR [rax+0x8],rsi

Our initial write of num_children = 2 is never read so the assembly doesn’t include it at all. Rust warns us about this (“field age is never read”).
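
As an aside, if you are reproducing snippets like this yourself, std::hint::black_box (stable since Rust 1.66) is another way to keep the optimizer from discarding values; the pointer trick above works on older compilers too. A minimal sketch:

Rust

// Sketch: black_box forces the optimizer to assume the value is observed,
// so the stores to age and num_children cannot be eliminated.
use std::hint::black_box;

let p = Box::new(Person { age: 30, num_children: 2 });
black_box(&p);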

If we change Person to { age: u8, num_children: u8 } we will save a lot of space. Now the assembly looks like this:

Assembly

; allocate 2 bytes (u8 * 2) aligned on single byte boundary
mov    edi,0x2
mov    esi,0x1

; call std::alloc::alloc, check not null
call   QWORD PTR [rip+0x40a17]
test   rax,rax
; [not showing address is null case]

; p.age = 30
mov    BYTE PTR [rax],0x1e

; ret = &p
mov    QWORD PTR [rsp],rax
mov    rsi,rsp

; p.num_children = ret
; note how it's now only 1 byte further than p.age
mov    BYTE PTR [rax+0x1],sil

We can store 32 instances of the u8 version in a 64-byte cache line, but only 4 instances of the usize version. That means a lot less waiting on memory loads. Depending on the operation the u8 version may also benefit from vectorization (which I’m hoping to cover later in this series).
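
You don’t need the assembly to check the sizes; std::mem::size_of reports them directly. A quick sketch (PersonBig and PersonSmall are just names for the two variants above):

Rust

// Sketch: the two layouts side by side.
struct PersonBig {
    age: usize,
    num_children: usize,
}

struct PersonSmall {
    age: u8,
    num_children: u8,
}

assert_eq!(std::mem::size_of::<PersonBig>(), 16);  // 4 per 64-byte cache line
assert_eq!(std::mem::size_of::<PersonSmall>(), 2); // 32 per 64-byte cache line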

On the stack it’s the same

Let’s repeat the experiment above with local variables, i.e. on the stack. Two usize local variables take 16 bytes of stack space (assuming a 64-bit CPU, which is almost certainly the case for your machine and your servers):

Rust

let age: usize = 30;
let num_children: usize = 2;
Assembly

mov    QWORD PTR [rsp+0x18],0x1e ; age = 30
mov    QWORD PTR [rsp+0x20],0x2  ; num_children = 2. 0x20 - 0x18 = 8 bytes

The important part is 0x20 - 0x18 = 8 bytes. Each usize takes the expected 8 bytes of stack space.

Two u8 local variables will only take two bytes on the stack:

Rust

let age: u8 = 30;
let num_children: u8 = 2;
Assembly

mov    BYTE PTR [rsp+0x1a],0x1e
mov    BYTE PTR [rsp+0x1b],0x2 ; 0x1b - 0x1a = 1

The important part is 0x1b - 0x1a = 1 byte per u8.

The stack must be aligned to a 16-byte boundary when calling a function. If you are not in a leaf function (one that calls no further functions), that alignment padding means you can sometimes use a larger type without increasing stack space. Due to function inlining it is usually not obvious which functions end up as leaves, so the mostly-accurate way to remember it is: yes, smaller types use less memory.

Signed and unsigned types are the same

It probably matters to your application which you use, but it doesn’t matter to the CPU. You are not giving away any space or performance by choosing one over the other.

These two functions generate identical assembly:

Rust

fn sum1(b1: u32, b2: u32) -> u32 {
    (b1 + b2) * 2
}
Rust

fn sum2(b1: i32, b2: i32) -> i32 {
    (b1 + b2) * 2
}

In fact the compiler notices that they are the same and only outputs a single function (using the name of the first function it encounters):

Assembly

underrust::sum1:
	lea    eax,[rdi+rsi*1] ; b1 (rdi) + b2 (rsi)
	add    eax,eax         ; * 2
	ret

I think this ability for the CPU to treat signed and unsigned numbers the same is the main reason why negative numbers are encoded as two’s complement.
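
A small sketch of why that works, if you want to see it in Rust rather than assembly: a signed value and its unsigned reinterpretation share one bit pattern, so one wrapping add serves both:

Rust

// Sketch: -1 and u32::MAX are the same bits, and the same addition
// gives the right answer under either interpretation.
let signed: i32 = -1;
let unsigned = signed as u32;

assert_eq!(unsigned, u32::MAX);          // 0xFFFF_FFFF
assert_eq!(signed.wrapping_add(1), 0);   // -1 + 1 = 0
assert_eq!(unsigned.wrapping_add(1), 0); // 0xFFFF_FFFF + 1 wraps to 0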

Smaller types have (almost) no performance penalty

8-bit and 16-bit registers (e.g. al, ax) are considered partial registers. They do not zero their upper bytes. For example mov ax, 0x1 only changes the bottom two bytes of rax, leaving the top six bytes containing whatever was there before. That dependency on the register’s previous value leads to partial register stalls.

This is why 32-bit registers are usually the optimal choice, and that is what we see in Rust/LLVM’s output. If you choose u8 or u16, Rust/LLVM will try to use 32-bit registers instead.

Here, for example, it completely ignores your request for u16 and sticks with u32s:

Rust

fn sum(b1: u16, b2: u16) -> u16 {
    (b1 + b2) * 2
}
Assembly

underrust::sum:
	lea    eax,[rdi+rsi*1]
	add    eax,eax
	ret

Those annoying as usize are usually free

If you already have a u32, converting it to a u64 (usize) is free. The upper bytes are already zero: when you wrote to eax, the upper half of rax was zeroed, because 32-bit registers are not partial. The compiler doesn’t emit any code for my_u32 as u64.

If you need to convert your u8 or u16 to a larger type, for example to index an array, in Rust you will write my_u8 as usize. The compiler needs to ensure the upper bytes are zero, so it issues a movzx (move with zero extend) instruction. That costs a single CPU cycle, and it is the only case I have seen so far where using smaller types would slow you down.
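
For example, indexing a slice with a u8 (lookup and counts are made-up names) is where you would see that single movzx:

Rust

// Sketch: the `as usize` cast is where the compiler emits a
// movzx (move with zero extend) to widen i before computing the address.
fn lookup(counts: &[u64], i: u8) -> u64 {
    counts[i as usize]
}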

Booleans are bytes

A bool variable is stored as a byte, 0 for false and 1 for true. x86 memory is “byte addressable”, meaning the smallest unit in memory we can point at is a byte.

Rust

let b1 = true;
let b2 = false;
let b3 = true;
let b4 = false;
Assembly

mov    BYTE PTR [rsp],0x1
mov    BYTE PTR [rsp+0x1],0x0
mov    BYTE PTR [rsp+0x2],0x1
mov    BYTE PTR [rsp+0x3],0x0

Casting to a u8 is a no-op: let my_byte = my_bool as u8; generates no code.

Even with 8 bools in a row they do not become a bitfield. That would be slower (load + extract, versus load only).
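
If you did want the space saving and were happy to pay for the extract, you would pack the bits yourself. A hand-rolled sketch:

Rust

// Sketch: eight flags packed into one byte by hand.
// Reading one back is a load plus a mask, the extra "extract" step.
let mut flags: u8 = 0;
flags |= 0b0000_0001; // b1 = true
flags |= 0b0000_0100; // b3 = true

let b3 = flags & 0b0000_0100 != 0;
assert!(b3);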

The stack must remain 16-byte aligned, so even if you have a single bool (hence a single byte), the function will reserve 16 bytes on the stack by doing sub rsp,0x10. This has no impact on your code (unless you overflow the stack, which is unlikely).

The optimal registers are 32-bit (see “Smaller types” above) so it should be no surprise that even booleans use 32-bit registers:

Rust

#[inline(never)]
fn either(b1: bool, b2: bool) -> bool {
    b1 | b2
}
Assembly

underrust::either:
	mov    eax,edi  ; edi is b1
	or     eax,esi  ; esi is b2
	ret             ; return value is in eax

The advice

What does all this roaming around the dark tunnels of the underrust get us? Do I have any concrete advice? Yes, yes I do.

Use the smallest type possible unless you have a strong reason not to. The smallest type typically best models your domain; a person’s age range is more u8 than u64. And it will give you the best chance at good performance.

If you’re not sure, start with a u32 or i32. That’s what the compiler is going to use anyway.