Graham King

Solvitas perambulum

Rust HashMap notes

Rust’s HashMap has a very efficient SIMD-friendly design based on Google’s SwissTable. There is a fantastic CppCon talk about it’s design: Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step. That video is necessary to make sense of what follows.

I spent some time learning how the Rust implementation works, with the intention to write a deep-dive post about it. I’m ready to admit that’s not going to happen, so here are my raw research notes.

Edited down to the core pieces Rust’s HashMap looks like this:

std::collections::HashMap<K, V> {
     base: hashbrown::HashMap<K, V> {

         // Factory for hasher. Default to RandomState who's build_hasher(k1, k2)
         // returns a SipHasher (SipHash 1-3) initialized with two random values.
         hash_builder: impl BuildHasher

         // Key/Value become a tuple, that's what we store in the map's backing memory
         // T = (K, V)
         table: RawTable<T> {
             // non generic part
             table: RawTableInner {

                 bucket_mask: usize, // num buckets - 1 (??)

                 ctrl: *const T, // the memory - the exciting part!

                 growth_left: usize, // inserts left until we have to grow
                 items: usize, // len()
             }
         }
     }
 }

The implementation is also available as a separate crate: hashbrown.

A traditional compsci-101 hashtable is an array of linked lists. The key is hashed to an index in the array. That’s it’s bucket. Each bucket contains a linked-list. The element is appended to the end. That tends to result in elements spread out in memory which performs poorly on modern processors. That is a chaining table.

hashbrown is instead a probing table. The elements are contiguous in memory. The values are stored as an array, ending with ctrl bytes. Most of the probing happens in ctrl bytes, so in L1 cache. The probing uses SSE instructions (128 bit SIMD) to compare 16 bytes at once.

The number of buckets is always a power of 2. HashMap is made up of contiguous groups: 16 control bytes and 16 values is a Group.

Metadata at start of allocation (ctrl):

  • a byte per entry, first bit says if it’s deleted,
  • then (if 0) 7 bits of hash code, bottom 7 bits of hash
  • if top bit 1 is non-value: empty, deleted or sentinel (for table scan)

Hash is u64:

  • top 57 bits are h1, position in array to start looking at. We start looking in ctrl bytes, often that’s enough.
  • bottom 7 bits are h2

Find (probing):

  • hash the value, treat as h1 (top 57 bits) and h2 (bottom 7 bits)
  • h1 % size gives us an $index into ctrl bytes. For mod we use & lower bits because num buckets is always a power of 2, so for 4 buckets we keep only the 2 bits, for 8 bucket the lower 3 bits, etc.
  • jump to that index compare bottom 7 bits - h2 - of next 16 bytes at once, noting all the matches (usually 1 maybe 2)
  • initial index plus offset of byte that matches gives us the slot position (index into regular data part of group)
  • compare the full key in that slot with our key (because hash collisions are possible). This is almost always true so we can tell branch predictor to predict branch taken. If we found it that was only 2 memory accesses.

Growth:

  • allocates on first insert (not on ‘new’)
  • grows by allocating a new table and inserting all elements from previous table into it (hash+memcopy)
  • always one spare bucket at end - a bucket is just entry in array [V] where V is value type
  • always a power of 2 (but holds that -1)
  • initial allocation is space for 4 entries (3 elements plus the spare)
  • next allocation is space for 8 entries
  • then next power of 2 above or equal to cap*8 / 7. close enough to “grows by doubling”. 9-14 -> 16, 15->28 -> 32, etc

Memory layout:

  • contiguous block
  • data bytes ((size_of + size_of) * NumBuckets), then 16 ctrl bytes, then NumBuckets ctrl bytes? (why the duplication)
  • data and ctrl bytes are mirrors of each other. we use positive offset for ctrl bytes, negative offset for data
  [optional padding (so that we align 16?)] d3, d2, d1, d0, c0, c1, c2, c3
                                                            ^
                                                            |- RawTableInner.ctrl

I hope that was useful to you.