Underrust: Does it matter what type (u8, u32, ...) I use?
Summary
Part of the Underrust series.
Yes. Yes it does.
Smaller types use less memory so you can fit more of them into an L1 cache line. All other things being equal that will make your program run faster. I expect this is obvious to you, but I checked anyway.
Smaller types use less memory
Say you have a Person
struct with some natural values like age
and num_children
. In Rust we often default to usize
(meaning u64, we’re on a 64-bit CPU) because that makes indexing less noisy. But here it is the wrong choice.
If you’re following along using the template in the intro post, use the stdlib template variation for this first snippet (for Box
). The rest of the code uses the normal stripped down template.
Rust
struct Person {
age: usize,
num_children: usize,
}
let mut p = Box::new(Person { age: 30, num_children: 2 });
// trick to prevent compiler optimizing it all away
let ret = &p as *const _ as usize;
p.num_children = ret;
Assembly
mov edi,0x10 ; allocate 16 bytes (0x10, usize * 2)
mov esi,0x8 ; aligned on 8 byte boundary
call QWORD PTR [rip+0x40a17] ; call std::alloc::alloc
test rax,rax ; rax is address of allocation. check it's not null
; [not showing address is null case]
; p.age = 30 (0x1e)
; p.age is at address stored in rax
mov QWORD PTR [rax],0x1e
; ret = &p
mov QWORD PTR [rsp],rax
mov rsi,rsp
; p.num_children = ret
; p.num_children is 8 bytes after p.age, as expected
mov QWORD PTR [rax+0x8],rsi
Our initial write of num_children = 2
is never read so the assembly doesn’t include it at all. Rust warns us about this (“field age is never read”).
If we change Person
to { age: u8, num_children: u8 }
we will save a lot of space. Now the assembly looks like this:
Assembly
; allocate 2 bytes (u8 * 2) aligned on single byte boundary
mov edi,0x2
mov esi,0x1
; call std::alloc::alloc, check not null
call QWORD PTR [rip+0x40a17] # 46c28 <_GLOBAL_OFFSET_TABLE_+0x1a8>
test rax,rax
[skip not-equal code]
; p.age = 30
mov BYTE PTR [rax],0x1e
; ret = &p
mov QWORD PTR [rsp],rax
mov rsi,rsp
; p.num_children = ret
; note how it's now only 1 byte further than p.age
mov BYTE PTR [rax+0x1],sil
We can store 32 instances of the u8
version in a 64-byte cache line, but only 4 instances of the usize
version. That means a lot less waiting on memory loads. Depending on the operation the u8
version may also benefit from vectorization (which I’m hoping to cover later in this series).
On the stack it’s the same
Let’s repeat the experiment above with local variables, so on the stack. Two usize
local variables take 16 bytes on the stack (assuming a 64-bit CPU, which is almost certainly your case and your servers case):
Rust
|
Assembly
|
The important part is 0x20 - 0x18 = 8 bytes. Each usize
takes the expected 8 bytes of stack space.
Two u8
local variables will only take two bytes on the stack:
Rust
|
Assembly
|
The important part is 0x1b - 0x1a = 1 byte per u8
.
The stack must be aligned to a 16 byte boundary when calling a function. If you are not in a leaf function (a function that doesn’t call any further functions) you sometimes can use a larger type without increasing stack space. Due to function inlining it is usually not obvious which functions are leaves. The mostly-accurate way to remember it is that yes smaller types use less memory.
Signed and Unsigned types are the same
It probably matters to your application which you use, but it doesn’t matter to the CPU. You are not giving away any space or performance by choosing one over the other.
These two functions generate identical assembly:
Rust
|
Rust
|
In fact the compiler notices that they are the same and only outputs a single function (using the name of the first function it encounters):
Assembly
underrust::sum1:
lea eax,[rdi+rsi*1] ; b1 (rdi) + b2 (rsi)
add eax,eax ; * 2
ret
I think this ability for the CPU to treat signed and unsigned numbers the same is the main reason why negative numbers are encoded as two’s complement.
Smaller types have (almost) no performance penalty
8-bit and 16-bit registers (e.g. al
, ax
) are considered partial registers. They do not zero their upper bytes. For example mov ax, 0x1
only changes the bottom two bytes of rax
, leaving the top six bytes containing whatever was there before. That dependency on the register’s previous value leads to partial register stalls.
This is why 32-bit registers are usually the optimal choice, and that is what we see in Rust/LLVM’s output. If you choose u8
or u16
, Rust/LLVM will try to use 32-bit registers instead.
Here for example it will completely ignore your request for u16
and stick with u32
’s:
Rust
|
Assembly
|
Those annoying as usize
are usually free
If you already have a u32
, converting it to a u64
(usize
) is free. The upper bytes are already zeroed because when you wrote to eax
the upper half of rax
was zeroed. 32-bit registers are not partial. The compiler doesn’t emit any code for my_u32 as u64
If you need to convert your u8
or u16
to a larger types, for example to index an array, in Rust you will write my_u8 as usize
. The compiler will need to ensure the upper bytes are zero so it will issue a movzx
(move with zero extend) instruction. That costs a single CPU cycle, and is the only case I have seen so far where using smaller types would slow you down.
Booleans are bytes
A bool
variable is stored as a byte, 0
for false and 1
for true. x86 memory is “byte addressable”, meaning the smallest unit in memory we can point at is a byte.
Rust
|
Assembly
|
Casting to a u8 is a no-op: let my_byte = my_bool as u8;
generates no code.
Even with 8 bytes in a row they do not become a bitfield. That would be slower (load + extract, versus load only).
The stack must remaining 16-byte aligned, so even if you have a single bool (hence a single byte), the function will reserve 16 bytes on the stack by doing sub rsp,0x10
. This has no impact on your code (unless you overflow the stack, which is unlikely).
The optimal registers are 32-bit (see “Smaller types” above) so it should be no surprise that even booleans use 32-bit registers:
Rust
|
Assembly
|
The advice
What does all this roaming around the dark tunnels of the underrust get us? Do I have any concrete advice? Yes, yes I do.
Use the smallest type possible unless you have a strong reason not to. The smallest type typically best models your domain; a person’s age range is more u8
than u64
. And it will give you the best chance at good performance.
If you’re not sure start with a u32
or i32
. That’s what the compiler is going to do anyway.