An exploration of threads, processes, and coroutines in Python, with interesting examples that illuminate the differences between each.
As a data scientist who is spending more time on software engineering, I was recently forced to confront an ugly gap in my knowledge of Python: concurrency. To be honest, I never completely understood how the terms async, threads, pools and coroutines were different and how these mechanisms could work together. Every time I tried to learn about the subject, the examples were a bit too abstract for me, and I had a hard time internalizing how everything worked.
Because of restrictions with this YouTube video, I couldn’t embed the video in this article, so you will have to open it in a different window.
This talk is incredibly intimidating at first. Not only is it coded live from scratch, but it also jumps immediately into socket programming, something that I had never encountered as a data scientist. However, if you go through it slowly and understand all the components (as we do in this blog post) it turns out to be the best educational material on Python concurrency I have ever encountered. This blog post documents what I learned along the way so others can benefit, too.
Before getting started, David sets up the following infrastructure that is used to demonstrate concurrency.
A CPU-bound task: Fibonacci
To demonstrate concurrency, it is useful to create a task that can saturate your CPU (such as mathematical operations) for a noticeable period of time. David uses a function that computes a Fibonacci number.
This function takes much longer for large inputs than for small ones, which allows us to profile different workloads.
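The function itself isn’t reproduced here; a sketch, assuming the classic exponential-time recursion (with the 1, 1, 2, 3, 5, … convention):

```python
def fib(n):
    # Deliberately naive recursion: the runtime grows exponentially with n,
    # so a large input keeps the CPU busy for a noticeable amount of time.
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)
```

On a typical machine fib(10) returns instantly, while something like fib(35) takes several seconds.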
A Simple Web Server
A web server is one of the best ways to illustrate different types of concurrency. However, to really demonstrate how things work, it is useful to use something sufficiently low level that you can see how all the pieces fit together. For this, David sets up a web server using socket programming. If you aren’t familiar with socket programming, I’ll explain the important bits below, but feel free to dive deeper with this tutorial later if you like.
To begin with, David starts with the below code (I’ve highlighted the most interesting bits):
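The original listing isn’t reproduced here, so below is a sketch reconstructed from the explanation that follows. The line numbers cited in the bullets refer to David’s listing; the corresponding statements are marked with comments, and the names fib_server and fib_handler follow the talk:

```python
from socket import *

def fib(n):
    # The CPU-bound task from earlier (in the talk it lives in its own module)
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

def fib_server(address):
    sock = socket(AF_INET, SOCK_STREAM)            # "lines 6-9": socket boilerplate
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    while True:
        client, addr = sock.accept()               # "line 11": blocks until a client connects
        print('Connection', addr)
        fib_handler(client)                        # "line 13": hand the connection to the handler

def fib_handler(client):
    while True:
        req = client.recv(100)                     # "line 17": blocks until data arrives
        if not req:
            break
        result = fib(int(req))
        client.send(str(result).encode('ascii') + b'\n')  # "line 21": send the response back
```

To try it, call fib_server(('', 25000)) in one terminal session and connect from another with a client such as nc localhost 25000.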
Here is an explanation of this code:
- Lines 6-9 are socket programming boilerplate. It’s ok to just take this for granted as a reasonable way to set up a socket server. This also matches the tutorial I linked to above.
- Line 11 waits for an incoming connection from a client. Once a connection is made, the server can begin receiving data from a client. The code will stop execution on this line until a connection is made.
- Line 13: Once a connection is established, the client object is passed to a function which can handle data sent by the client.
- Line 17: waits for data to be sent by the client. The code will stop execution on this line until data is received from the client.
- Line 21: The server sends a response back to the client. The code could stop execution on this line if the send buffers are full, but unlikely in this toy example.
Testing the non-concurrent code
In the above example, the server will only be able to accept a connection from a single client, because the call to fib_handler will never return (it runs in an infinite loop unless a kill signal is received). This means that sock.accept() can only be called once.
You can test this out by first running the server in one terminal and then connecting a client from another (for example with netcat).
You can type numbers in as David does in his video and verify that Fibonacci numbers are returned. However, if you try to connect with another client at the same time from a different terminal session, you will notice that the second client just hangs and doesn’t get anything back from the server. This is because the server is only able to accept a single connection. Next, we explore how to tackle this issue.
We can solve this issue with threads. You can add threads to the handler so that more connections can be accepted with the following change:
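A sketch of the threaded version, reusing the server from before; the only change is that the accept loop now spawns a thread per client instead of calling the handler directly:

```python
from socket import *
from threading import Thread

def fib(n):
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

def fib_handler(client):
    while True:
        req = client.recv(100)
        if not req:
            break
        client.send(str(fib(int(req))).encode('ascii') + b'\n')

def fib_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    while True:
        client, addr = sock.accept()
        # The change: each client gets its own handler thread, so the loop
        # immediately returns to sock.accept() and can serve more clients.
        Thread(target=fib_handler, args=(client,), daemon=True).start()
```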
You can verify that this works by connecting two separate clients to the server from two separate terminal windows.
By executing fib_handler in a thread, the main while loop in fib_server will continue, allowing sock.accept() to receive additional clients. If you haven’t encountered threads before, this tutorial provides a good introduction to the topic.
Thread performance & the GIL
When code stops execution and waits for an external event to occur (like a connection to be made, or data to be sent), this is often referred to as blocking.
One important utility of threads is that a blocked task releases control of the CPU while it waits, so the CPU can do other work. However, the Python interpreter can only run on one thread at a time due to the Global Interpreter Lock (GIL). Because Python can only run a single thread at any given time, any CPU-bound work in threads must take turns running one after the other.
Therefore, you have to think carefully about what kind of tasks you execute in threads with Python. If you try to execute CPU bound tasks, these tasks will slow each other down. David demonstrates this with the below script that sends requests to our threaded server:
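The script isn’t reproduced here; below is a sketch in the same spirit (the function name, address, and payload are my own choices), which times round trips of an expensive fib(30) request:

```python
import time
from socket import create_connection

def benchmark(address, payload=b'30', n=10):
    # Send the same request n times and record the seconds per round trip.
    # A payload of b'30' makes the server compute fib(30) for every request.
    sock = create_connection(address)
    timings = []
    for _ in range(n):
        start = time.time()
        sock.send(payload)
        sock.recv(100)
        timings.append(time.time() - start)
    sock.close()
    return timings
```

Running this loop in several terminals at once against the threaded server is what produces the slowdown described below.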
If you run several instances of this script (after starting the server first), you will see the execution times for each script increase linearly as you add more of them running in parallel. For this particular task, adding threads does not make anything faster. But why? Because the Fibonacci task is CPU bound, the threads compete with each other for resources.
Python threads work by interleaving the execution of different tasks on your CPU. Only one thread runs at a time, and threads take turns executing in small bits until all of them are done. The details of how this interleaving is carried out are handled by the GIL and your operating system, so you need not worry about them (with one exception mentioned below). Interleaving a bunch of CPU-bound tasks will not speed up their total runtime. However, if your tasks involve lots of non-CPU time, such as waiting for network connections or disk I/O, threading may result in a considerable speedup. A canonical way of simulating a non-CPU-bound task in Python is to use the standard library function time.sleep.
To check my understanding of threads and performance, I ran the below experiment and swapped the task between time.sleep(2) and fib(20) and back again:
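A sketch of such an experiment (the thread count is my own choice):

```python
import time
from threading import Thread

def fib(n):
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

def task():
    time.sleep(2)   # non-CPU-bound; swap in fib(20) to make it CPU-bound

n_threads = 8
threads = [Thread(target=task) for _ in range(n_threads)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f'{n_threads} threads finished in {elapsed:.2f} seconds')
```

With time.sleep(2) the elapsed time stays near 2 seconds no matter how many threads you add; with fib(20) it grows as threads are added.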
As expected, increasing the number of threads while running time.sleep(2) did not increase the program’s overall execution time (the program runs in roughly 2 seconds). On the other hand, replacing time.sleep(2) with fib(20) causes the running time to increase as more threads are added. This is because fib(20) is a CPU-bound task, so interleaving it doesn’t really help much. You should try running the same thing to see for yourself.
You will often hear that Python is not good at parallelism and that it can only use one CPU core at a time. This is likely referring to the aforementioned issues with threads and the GIL. Because only one thread can run at a time, thread-based tasks can only use one CPU core at a time (a single thread cannot run across multiple CPUs). Outside of Python, threads are a popular choice for parallelizing CPU-bound tasks because you are able to run a separate thread per CPU core simultaneously. With Python, however, you must look for other ways to accomplish parallelism for CPU-bound tasks.
Another interesting but lesser-known aspect that David discusses is the relationship between the following two types of tasks:
- things that take much longer to compute on the CPU, like fib(30), demonstrated with perf1.py.
- things that compute relatively fast on the CPU, like fib(1), demonstrated with perf2.py.
The Python GIL will prioritize the first type of task at the expense of the second if they are made to compete for resources in threads. You can optionally follow along with a demonstration of this here. This is interesting because this is the opposite of how typical operating systems prioritize threads (by favoring shorter running tasks) and is something unique to the implementation of the Python GIL. More importantly, this behavior has a very practical consequence: if you are running a web-server where most tasks are fairly quick, an expensive cpu-bound task can grind everything to a halt.
Threads are not just about making things faster
It is tempting to think of Python threads as a tool to make things run faster, but that’s not the only use case. Recall that the socket server used threads to allow multiple connections at once without any speedup. David illustrates another way to use threads with his code used to measure the runtime of short-running tasks:
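The code isn’t reproduced here; below is a sketch of the pattern (the names, port, and payload are reconstructions): a monitor thread blocks on sleep(1) and wakes once per second to report a counter, while the main thread sends requests as fast as it can:

```python
import time
from threading import Thread
from socket import create_connection

n = 0  # requests completed since the monitor last reported

def monitor():
    # Blocking sleep keeps this thread idle; it wakes once per second,
    # prints the counter, and resets it.
    global n
    while True:
        time.sleep(1)
        print(n, 'reqs/sec')
        n = 0

def run(address, duration=10):
    # Send cheap fib(1) requests as fast as possible for `duration` seconds.
    global n
    Thread(target=monitor, daemon=True).start()
    sock = create_connection(address)
    total = 0
    deadline = time.time() + duration
    while time.time() < deadline:
        sock.send(b'1')
        sock.recv(100)
        n += 1
        total += 1
    sock.close()
    return total
```

Calling run(('localhost', 25000)) while the threaded server is up prints a requests-per-second figure once a second.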
In this case, David uses a single thread with a blocking call to sleep(1) to make sure that monitor only prints once per second, while allowing the rest of the program to send requests hundreds of times per second. In other words, this is a clever use of threads and blocking that allows part of a program to run at a desired time interval while the rest of the program runs as usual.
These different angles of looking at threads allowed me to understand them more holistically. Threads are not only about making certain things run faster or run in parallel; they also let you control how your program is executed.
How threads work
A thread is always contained in a process, and each process contains one or more threads. Threads in the same process can share memory which means they can easily communicate and write to common data structures. Threads are useful in the following two scenarios:
- When there are lots of non-cpu bound tasks (disk I/O, network calls, etc.).
- Outside of Python, if you want to parallelize a CPU bound task by splitting up the task across individual threads running on separate CPU cores.
A process can span across multiple CPU cores, however a single thread can only utilize one CPU core.
Generally speaking, only one thread can run CPU-bound work on a single core at any given time. If multiple threads share a CPU core, your operating system will interleave them. There are some exceptions to this rule: for example, a single CPU core can run multiple threads concurrently using SMT (hyper-threading), or compute over data in parallel using SIMD instructions, which are popular in scientific computing libraries.
On the other hand, processes offer isolation which is helpful when you have different users or different programs that should not be sharing information. Since we cannot run more than a single thread at a time in Python, a common workaround is to spawn several Python processes. This is discussed more below.
Chapter 2 of this book discusses what processes and threads are in greater detail from an operating system perspective.
Processes For CPU Bound Tasks
One way to solve the problem with the GIL and cpu-bound tasks competing for resources is to use processes instead of threads. Processes are different from threads in the following respects:
- Python threads share a memory space, whereas each process has a separate memory space. This is an important consideration if you need to share variables or data between tasks.
- Processes have significant overhead compared to threads because data and program state has to be replicated across each process.
- Unlike Python threads, processes are not constrained to run on a single CPU, so you can execute cpu-bound tasks in parallel on different cores.
David uses Python processes in his server example by using a process pool. The relevant change is shown below:
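A sketch of the change using concurrent.futures (the pool size is arbitrary, and the talk’s exact code may differ): only the handler changes, shipping each fib call to a worker process:

```python
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

pool = ProcessPoolExecutor(4)   # 4 worker processes, an arbitrary choice

def fib_handler(client):
    while True:
        req = client.recv(100)
        if not req:
            break
        future = pool.submit(fib, int(req))  # compute fib in a separate process
        result = future.result()             # this thread blocks while waiting,
                                             # but the work runs on another core
        client.send(str(result).encode('ascii') + b'\n')
```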
If you then start this version of the server and run the profiler perf2.py, we can make the following observations:
- The requests/sec are lower than in the thread-based version, because there is more overhead required to execute tasks in a pool.
- However, if you also run perf1.py, it will not materially interfere with the first task (from perf2.py), as the two no longer compete for resources on the same CPU.
- The above example involves a CPU-bound task (computing the Fibonacci number). However, if we simulated a non-CPU-bound task instead, such as time.sleep(), using processes instead of threads would actually be detrimental to overall performance. A concrete example of this is provided in the section below.
This is a realistic example that allows you to gain more intuition about how threads and processes work. This tutorial contains more examples of Python processes and threads.
A note for data scientists: processes vs. threads
I’ve found that many data scientists (formerly including myself) blindly apply processes and completely ignore threads. I understand why: processes are a kind of least common denominator where you can achieve some kind of parallelism regardless of whether your task is CPU bound or not. However, this approach is often very suboptimal and prevents full utilization of your compute resources. Some examples to clarify where threads or processes might be more appropriate:
If you are downloading lots of files from the internet, consider using threads. This is because most of your time is spent on network I/O, not on the CPU. For example, this article demonstrates a 50% speedup when using threads compared to processes for downloading files.
If you are transforming or cleaning a large dataset, this work is mostly CPU bound so using processes makes sense. The only part of this that isn’t CPU-bound is reading and writing the data to disk.
If you just want to load a bunch of files into memory or write a bunch of files to disk, without really doing any transformations, consider using threads as the work is mostly disk I/O and not CPU bound.
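To make the I/O case concrete, here is a small self-contained demo where the downloads are simulated with time.sleep standing in for network latency (the URLs are made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    # Simulated network I/O: the GIL is released while this thread sleeps.
    time.sleep(0.5)
    return f'contents of {url}'

urls = [f'https://example.com/file{i}.csv' for i in range(8)]  # made-up files

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fake_download, urls))
elapsed = time.time() - start
print(f'fetched {len(results)} files in {elapsed:.2f}s')  # near 0.5s, not 8 * 0.5s
```

The eight simulated downloads overlap, so the whole batch finishes in roughly the time of a single one; a process pool would add serialization and startup overhead for no benefit here.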
Keep in mind that threads can be more memory efficient than processes because of differences in the way they work. So using lots of processes when you don’t need them can lead to memory bloat.
Most importantly, try to avoid having to think about processes and threads where you can: use scientific computing libraries like numpy and write vectorized operations wherever possible. It is always worth being aware of the concurrency tools available in the library or framework you are using (especially numerical computing and other data science libraries) and considering them when appropriate.
Recall that Python can only operate on one thread at a time, and the operating system automatically decides when to interrupt each thread to allow the threads to take turns running. This is called pre-emptive multitasking, since the operating system, not you, determines when your thread makes the switch. When you don’t care about how tasks are interleaved, threads are great because you don’t have to worry about how they are scheduled.
However, there is a third type of concurrency paradigm in Python that allows you to control how this switching occurs: asynchronous programming. This is also called cooperative multitasking, which means each task must announce when it wants to switch. One way to achieve cooperative multitasking is to create a coroutine.
One way to create coroutines in Python is by using the yield statement. David provides some intuition on how you can achieve multitasking with yield in the following code:
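The code isn’t reproduced here; a reconstruction of the idea (the labels and counts are my own) uses generator functions plus a deque as a round-robin task queue:

```python
from collections import deque

def countdown(label, n):
    while n > 0:
        print(label, n)
        yield               # pause here and hand control back to the scheduler
        n -= 1

# Round-robin scheduler: run each task until its next yield, then rotate.
tasks = deque([countdown('A', 3), countdown('B', 3), countdown('C', 3)])
while tasks:
    task = tasks.popleft()
    try:
        next(task)          # resume the task up to its next yield
        tasks.append(task)  # not finished yet: put it back in the queue
    except StopIteration:
        pass                # the generator returned: the task is done
```

The output interleaves the three tasks: A 3, B 3, C 3, A 2, B 2, C 2, A 1, B 1, C 1.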
When you run this code, you can see from the output that the three countdown tasks are interleaved.
This clever use of yield allows you to pause execution of a task and move on to a different task, a bit like threading, except that you, not the operating system, control how compute is interleaved. This is the key intuition for understanding the rest of the talk, which goes on to push this example further.
One of the most popular ways to accomplish async programming is by using the various utilities in the built-in asyncio module, which uses similar yield statements under the hood. I didn’t end up diving deeply into the asyncio module or this particular flavor of programming as my goal was to understand the concept so that I wouldn’t be lost when encountering this in the wild.
There is no silver bullet with regards to choosing the correct type of concurrency in Python. You have to consider how much of your task is CPU bound vs non-CPU bound (and if it is feasible to break up the task appropriately) to determine whether tweaking your code will make a material difference.
Most importantly, I recommend only reaching for these tools when you need them rather than trying to prematurely optimize your code. Always start with the simplest code, without any concurrency, and build incrementally from there. If you do add concurrency, make sure you can justify it through a measurable difference in performance or functionality. I’ve sometimes found that my code was slow in places I didn’t expect and that concurrency wasn’t the tool I needed at all!
Profiling your code is beyond the scope of this blog post; however, I hope this post demystified the confusing jungle of Python concurrency terminology so that you can more quickly navigate these topics in the future.
Not all programs that run in Python using threads are limited to a single CPU. It is possible to escape the constraints of the GIL by carefully writing C code that has a Python interface. This is what popular scientific computing libraries such as NumPy and SciPy do to achieve parallelism.
In David’s code, deque from the collections module was introduced, which is a very handy data structure not only for async programming but also for threads, because its append and popleft operations are thread-safe, meaning you don’t have to worry about race conditions. Similarly, the queue module provides other types of thread-safe queues.
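For example, a producer thread and a consumer thread can share a deque without any explicit locking (a small demo of mine, not from the talk):

```python
from collections import deque
from threading import Thread

shared = deque()
received = []

def producer():
    for i in range(1000):
        shared.append(i)          # deque.append is atomic: no lock needed

def consumer():
    while len(received) < 1000:
        if shared:
            received.append(shared.popleft())  # popleft is atomic as well

p, c = Thread(target=producer), Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print('received', len(received), 'items, in order:', received == list(range(1000)))
```

With a single producer and a single consumer, the items arrive in the order they were appended.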
The following is terminology associated with Python concurrency that is often confused and that we didn’t touch on in this blog post:
- Concurrency: this means writing programs that do more than one thing at a time. It does not imply parallelization. If two parts of a program take turns executing until they are both complete, that is concurrency even if they run no faster than they would separately.
- Multiplexing: this means sharing a single resource among multiple tasks, for example handling many client connections over one server loop.
- Mutex: (stands for Mutual Exclusion) this is used to enforce exclusive access to a resource across threads to avoid race conditions.