Inside python dict — an explorable explanation

JavaScript code is loading...

def simple_search(simple_list, key):

 idx = 0 
   ~ Start from the beginning of the list

 while idx < len(simple_list): 
   ~ [Try #6] 5 < 8, so some elements have not been processed yet, and the number may be there

 if simple_list[idx] == key: 
   ~ 25 == 25 — the wanted number is found

 return True 
   ~ so return True

(The visualization is interactive. The buttons allow you to step through the code. Notice here that the time slider is draggable - feel free to rewind the time or move it forward. Also, feel free to mess with the input and the original list - the visualization will update automatically)

What's not so great about linear scans? If we have a million distinct numbers, in the worst case scenario, we may need to scan the whole list. But scanning over a few elements is no big deal. We need to have some order and predictability to make the search fast. We need to have some idea of where the searched element is located.

A Python dict implementation is basically a scan of a list (but a pretty weird scan). We'll build the actual algorithm and data structure inside Python dictionary step by step, starting with the code above, which is intentionally verbose.

Chapter 1: searching efficiently in a list

A Python dict is a collection of key-value pairs. And, the most important part of it is handling keys. Keys need to be organized in such a way that efficient searching, inserting and deleting is possible.

In this chapter, to keep things simple, we won't have any values, and "keys" will just be plain integers. So, the simplified problem is to check if a number is present in a list, but we have to do this fast. We'll tackle the real problem in the following chapters. But for now, bear with me.

Accessing a single element by index is very fast. Accessing only a few elements would be fast too. We don't want to doing a linear scan over the whole list every time we look up a number, so we need to organize our data in a clever way.

Here's how.

Let's begin by creating a new list of slots. Each slot will either hold a number from the original list or be empty (empty slots will hold None). We'll use the number itself to compute an index of a slot. The simplest way to do this is to take the remainder of number divided by len(the_list): number % len(the_list) and put our number in slot with this index. To check if the number is there we could compute the slot index again and see if it is empty.

Would this approach work, however? Not entirely. For example, 50 will get the same slot index (3) as 2, and it will be overwritten. Situations like these are called collisions.

def build_not_quite_what_we_want(original_list):

 new_list = [None] * len(original_list) 
   ~ Create a new list of 8 empty slots

 for number in original_list: 
   ~ [8/8] The number to insert is 4

 idx = number % len(new_list) 
   ~ Compute the slot index: 4 == 4 % 8

 new_list[idx] = number 
   ~ Collision of 4 with 44 in slot 4 - the number is overwritten

 return new_list 
   ~ Return created list with some numbers missing: 50, 1, 25, 44

def build_insert_all(original_list): 

 new_list = [None] * (2 * len(original_list))
   ~ Create a new list of 16 empty slots

 for number in original_list: 
   ~ [8/8] The number to insert is 4

 idx = number % len(new_list) 
   ~ Compute the slot index: 4 == 4 % 16

 while new_list[idx] is not None: 
   ~ After 1 collision, an empty slot (at 5) is found: the collision is successfully resolved

 idx = (idx + 1) % len(new_list) 

 new_list[idx] = number 
   ~ Put 4 in slot 5

 return new_list 
   ~ Return created list with all original numbers present

def has_number(new_list, number): 

 idx = number % len(new_list) 
   ~ Compute the slot index: 2 == 2 % 16

 while new_list[idx] is not None: 
   ~ [Try #2] Slot 3 is occupied, so check it

 if new_list[idx] == number: 
   ~ The number is found: 2 == 2

 return True 
   ~ Now simply return True

 idx = (idx + 1) % len(new_list)

Calculating an index based on the value of the number and resolving collisions by linear probing is an incredibly powerful idea. What we've just implemented is a simple hash table (more about the term in the next chapter). Python dict uses a hash table internally, albeit a more complex variant.

We still haven't discussed adding more elements (what happens if a table overflows?), removing elements (removing an element without a trace would cause a hole to appear, wouldn't this cause the search algorithm to stop prematurely in many cases?), and perhaps most importantly, handling objects other than integers - strings, tuples, floats. We'll do this in the next chapters.

Collision resolution via separate chaining

There is a different method of collision resolution, called separate chaining. It is also a good strategy which is commonly used. But that's not how Python resolves collision in dicts, so it is beyond the scope of this explanation.

A couple of the notes about the explanation

First, this explanation discusses dict as it is implemented in CPython — the "default" and most common implementation of the Python language (if you are not sure what implementation you use, it is almost certainly CPython). Some other implementations are PyPy, Jython and IronPython. The way dicts works in each of these implementations may be similar to CPython (in the case of PyPy) or very different from CPython (in the case of Jython).

Second, even though dict in CPython is implemented in C, this explanation uses Python for code snippets. The goal of this page is to help you understand the algorithms and the underlying data structures, not the minutiae of the C code (these details are interesting too - they are just are beyond the scope of this explanation).