The inner implementation. A story (for normal people).
Note: I don’t consider myself an advanced programmer. So don’t be scared. You won’t find some difficult and weird concepts here. I will try to make it as simple as possible. It is just you and me trying to make sense of smart people’s code.
Another note: Go maps are an implementation of hash tables, they are the same thing. So don’t be confused.
📜 This article is based on Go 1.15
📜 The Go code contains a lot of small details that were added over time. Don’t torture yourself for not understanding it right away. What you see is the result of 10+ years of evolution.
Thanks very much to Keith Randall for helping me understand this.
Keith is one of the Go team, and he is the original implementor of Go maps.
Hash tables, or maps, are one of the handiest common data structures that comes with most programming languages. But did you ever tried to implement one?
Go’s source code has a lot of interesting implementations, even in tasks that you may think it is standard and there are no more tweaks to add to make it better, like sorting.
So as a way to push my knowledge a little bit further, I thought it is a good idea to take a look at a part like Go maps and see how the Go team has done it. At the end of the day, programming languages are one of the most interesting projects for any programmer, and they include the meat of computer science.
So let’s Go…
Wait, here is something to cheer you up before we dive in 🐱
⏱ If you know what hash tables are, just skip this part.
What is a Hash Table?
First I think I should tell you a little bit about Hash Tables.
So as you know, our programs are loaded in memory for the processor to run them. So what is the memory?
It is just a really boring series of boxes, that’s it, nothing fancy there.
But when we code, we like to have some handy data structures to work with, but those data structures usually are more complicated than a series of boxes.
So what we do?
We do what programmers always do, we build a layer of abstraction over them to let us forget about the boxes, and just see it as we want to see it.
And here comes the old, not very much used now, term Abstract Data Structure (ADT). A Hash Table is an ADT, which simply means that it looks to you, the programmer, as a hash table, but it is represented in the memory as something else.
So what kind of ADT a hash table is?
A Hash table is a data structure that represents data in the form of key-value pairs.
So we want to represent our data with keys, while each key holds a certain value. Kind of like this:
And we want to be able to do (at least) the following operations:
- Add a new key-value pair
- Get the value of a given key
- Delete a key-value pair
So how could we achieve that?
Since we will use an array as the underlying data structure, we would like to map the hash table keys to array indexes, then store the values into those indexes.
This mapping operation is called hashing (hence the name hash table), which includes many crazy methods invented by many crazy computer scientists to convert a string (could be something else) into a practical number to be used as an array index (or as something else in another case).
So the data we presented earlier would be represented internally like this:
So when we add a key-value pair, we take the key, pass it to the hash function, then take the resulting number, use it as an index, and store the value at that index. And when we want to get a value for a given key, we do the same, except we return the value at the end, not storing any. You got the idea.
So what are the complications that arise from this design?
We have three main issues to deal with (in the Go implementation we have more than those issues, like generic values — comes later in the article):
- What if the hash function produced the same number for two different strings? (which is called a collision)
- What if the numbers produced were too far from each other (2, and 1000 for example)? we would have too much wasted space in the underlying array.
- What if we filled all the array and wanted to add more key-value pairs?
Now you see the core of the problem. 🤯
The first issue, collisions, is handled by using a linked list (this method is called chaining, there are other methods like open addressing). So we don’t save the value at its index right away. Instead, we save a linked list and append the values to it. So if two different keys collide with each other (produced the same hash result), they will get appended to the same linked list. And when we want to get any one of them, we will hash the key, get the index, then search the linked list for the right value.
The second issue is handled in Go by not using the result of the hash function right away. Instead, we do some bit manipulation to get a number of the lowest bits (the ones from the right) and use that as an index.
The third issue is handled by growing the underlying table, then rehashing all keys. That is called growing in Go, which is not covered in this article.
📺 There is also an awesome introduction to Hash Tables from MIT here.
So how does Go implements Hash Tables? Let’s dive in.
How does Go implement Hash Tables?
So here is the interesting part. The main code we will explore is in this file.
First, The Generics Fail
Generics is a way to write code that deals with more than one type.
And that is exactly what we want to do with maps, but Go doesn’t have generics, so how could Go implement maps then?
We fake it, as Keith Randall said. So how we fake it?
With some help from the compiler and the runtime, our heroes 🦸.
When you type any maps code, the compiler converts it to a call to the runtime, similar to this:
- maptype: is a type descriptor (struct that holds many type descriptors to be accurate, one for the key, one for the value, and some other things), which is just a meta-data about your key and value types.
- hmap: the map header (we will talk more about it later)
- k: a pointer to the key you provided
So why we need maptype? I will quote Dave Cheney here:
Why do we need a
*maptypeif we already have a
*maptypeis the special sauce that makes the generic
*hmapwork for (almost) any combination of key and value types. There is a
maptypevalue for each unique map declaration in your program. There will be one that describes maps from
http.Headers, and so on.
Rather than having, as C++ has, a complete map implementation for each unique map declaration, the Go compiler creates a
maptypeduring compilation and uses that value when calling into the runtime’s map functions.
So here you go, the compiler creates
maptype for your specific map, and pass that to the runtime calls. 👌
When you create a map, the runtime creates a bunch of stuff for you and returns a pointer to what the language calls map header. The map header is just a struct, nothing fancy about it. Here is how it looks like:
At creation, the important fields we have to fill are the following:
- B: log_2 of the number of buckets
- hash0: a random seed. It is how the Go runtime avoids hash collisions.
- buckets: the underlying array of buckets (actually a pointer to it)
- extra: overflow buckets
What is B?
As I said it is log_2 of the number of buckets. But why we need that here?
Remember, we will take a string, hash it, then use that hash to determine the index in the underlying array.
But the hash usually won’t be in the range of the size of the array (remember, the second issue above). If the length of the underlying array is 8, and the hash function produced the number 1378 for a certain string, we have a problem. We can’t just expand the array and add the value there, we will waste too much space.
So we need a way to make sure that the end result of the hashing process is in the range of the underlying array. And here comes the magic of log_2.
Let’s take a look at this table first
Can you see it? the number of bits we need to cut from the hash to be in the range of the underlying array equals
ceil(log2(N)) (or just
log2(N) if we restricted the length of the array to be an exponential of 2, which what the Go code does).
Here is how the code does it:
x & (1 << n -1) is a bit manipulation way to get the lowest n bits (the ones from the right) from x. You can use another way if you like.
In our case, the result will be 0010, which is 2 in decimal. And that is our index. 🥳🎉
Play with it here.
As we said, when you create a map, Go calculates B, and allocates memory for the underlying buckets array.
But the size of the array elements will differ between maps. And here comes
maptype again. As it has an internal representation for the bucket, it uses it to create an array of the right size.
So you can think of the data as the following:
It is important to keep that representation in mind because as you look into the code and see the definition of bmap, you won’t find fields for the keys and values. But the code actually allocates memory for them, and assigns value to them using some fake pointer arithmetic.
Let’s look at that line I referenced above, it is used on assignment to get a pointer to our element in an empty slot in our bucket:
- add: an internal function (which may be exported later) which does pointer arithmetic
- b: a pointer to bmap
- dataOffset: the size of the bmap struct
- bucketCnt: equals 8, which is the length of our bucket
- keysize: the size of the type of the keys of the map
- i: our index inside the loop
- elemsize: the size of the type of the elements of the map
This basically says, take that pointer and add to it the size of the bucket struct, and the size of all our 8 keys, and the size of the preceding elements, then return it. And that would be the place of our current element.
To understand assignment, you have to know that Go doesn’t use linked lists for chaining, as usual, instead, it uses just another array.
If you use linked lists, you would get the index of the bucket containing the linked list, get the linked list, loop over it to check if the value already exists, if it does exist then you update it, if it doesn’t you just append it to the list.
Go does something a little bit different. As I said it uses just an array of size 8, and uses the highest bits (the ones from the left) to decide where to put the value inside the array. But not in the way it uses the lowest bits to decide the index of the bucket.
This is important to understand. If the index of the bucket is 3, and the highest bits equal 5 in decimal, Go doesn’t do the following:
Instead, Go uses any empty slot in the array and puts the value there.
On assignment, it loops over the array checking if an array element equals the highest bits, if none exists, it just adds the value to an empty slot (the last one in the loop actually), if it found a match then it just updates it.
I didn’t mention a key detail about hash tables and left it to the end of the article, which is the load factor.
The load factor is the average number of key-value pairs per bucket. It is used as a measure of how full the hash table is allowed to get before its capacity is automatically increased. And it defines the balance between access time and space.
What does all of that mean?
The Load Factor is a typical trade-off between time and space, that arise whenever you design a hash table. It is a constant that the language chooses and uses it to decide if it should grow the table or not. So the hash table is not only grown when it becames full, but if the total number of key-value pairs over the number of buckets perceeds the load factor we trigger growing.
You can use a larger load factor to save space, but at more elements per bucket and therefore worse access time. You can reduce the load factor to improve access time, but the hash table will be larger.
🔮 I may write another article later explaining more details like iteration, overflowing, race detection, deletion, and growing.