Shrinking your Python application’s Docker image: an overview


You’ve finished building the initial Docker image for your Python application and you push it to the registry, which takes a while because your image is 2GB. Your image is clearly too large, so your next step is to make it smaller.

In this article you’ll find an overview of the many techniques you can use to shrink your image, organized approximately by the logical order of packaging. The focus is on Python, though many of these techniques are more generic. Techniques are broken down by category, with each category suggesting follow-up articles covering the details:

  1. Base image.
  2. Docker layers and their impact on image size.
  3. System packages (apt/dnf) and Python packages.
  4. Avoid copying unnecessary files.
  5. Additional tools, tips, and techniques.

Before you begin: should image size be your top priority?

You only have limited time to work on any given task, so it’s important to prioritize and work on the most important tasks first. And when it comes to Dockerizing your application, image size probably isn’t the most important thing to work on.

If you haven’t thought about security, debuggability, or reproducibility for your images, it’s best to focus on those first, and put off optimizations like image size until later. For an overview of my recommended process for Dockerizing your package, consider reading my free Introduction to Dockerizing for Production mini-ebook.

Base image

The starting point for your image is typically a base image of some sort, and there are several to choose from.

Note that base image choice also needs to be traded off against other criteria: access to system packages, Python performance, build time (particularly relevant to Alpine), and compatibility (again, relevant to Alpine).

A different approach is to choose whichever image is most convenient, and then use the docker-slim tool to remove all files your application doesn’t touch. The tool works by runtime instrumentation, so you have to ensure that every file your application might open is actually opened while it is being instrumented.

Docker layers and their impact on image size

Docker’s image format is composed of layers, much like Git commits. You can see the size of each layer using the docker history command. And as with Git commits, once you’ve added files to a layer, they stay there: a later layer can hide them, but it can’t remove them from the image.

For example, let’s say you have to download a 100MB file to build your image. The following Dockerfile will still have that 100MB file lying around, because each RUN command adds a layer:

RUN wget https://example.com/largefile.tar.gz
RUN tar xvfz largefile.tar.gz
RUN largefile/install.sh
# BAD, This will not shrink your image:
RUN rm -rf largefile.tar.gz largefile/

Instead, you can combine these commands into a single layer, and then the temporary files won’t end up in the image:

# GOOD, temporary files deleted before RUN ends:
RUN wget https://example.com/largefile.tar.gz && \
 tar xvfz largefile.tar.gz && \
 largefile/install.sh && \
 rm -rf largefile.tar.gz largefile/

Note: Outside the very specific topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.

To ensure you’re following all the best practices you need for secure, correct, fast Dockerfiles, check out the Python on Docker Production Handbook.

When combining doesn’t work

Combining layers doesn’t work across COPY and RUN:

COPY installer.sh .
# BAD, installer.sh is still in previous layer:
RUN ./installer.sh && rm -f installer.sh

So what can you do?

One approach is the docker-squash tool, which lets you combine multiple layers into one.

A more standard approach, using built-in Docker functionality, is to use multi-stage builds. In a multi-stage build you have one image where you build everything, and then another image which just has the final artifacts you need to run your code.
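
As a rough illustration of a multi-stage build, here is a minimal sketch. The python:3.12-slim base image, the requirements.txt file, the /venv path, and the mycode package are all assumptions made for the example, not details from a real project:

```dockerfile
# Build stage: anything needed only at build time lives here.
# python:3.12-slim is an assumed base image, used only for illustration.
FROM python:3.12-slim AS build
RUN python -m venv /venv
# requirements.txt is a hypothetical file listing your dependencies.
COPY requirements.txt .
RUN /venv/bin/pip install --no-cache-dir -r requirements.txt

# Runtime stage: starts from a clean base, so none of the build stage's
# layers end up in the final image unless copied in explicitly.
FROM python:3.12-slim
WORKDIR /app
# Copy only the finished virtualenv out of the build stage.
COPY --from=build /venv /venv
# mycode/ is a hypothetical package directory with a __main__.py.
COPY mycode/ ./mycode/
ENV PATH="/venv/bin:$PATH"
CMD ["python", "-m", "mycode"]
```

Because only the last stage becomes the final image, files like the copied-in requirements.txt or any build-only tools stay behind in the build stage. Using the same base image in both stages keeps the virtualenv’s interpreter path valid after copying.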

System packages and Python packages

When it comes to installing both system packages and Python packages, we’d like to:

  1. Not store index files (“this Debian repository has these packages available”).
  2. Not store the downloaded packages once they’re installed.
  3. Avoid installing unnecessary files, like documentation.

See my article on installing system packages for details on doing this on RPM-based and Debian-based systems.
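
For a Debian-based image, the three goals above might look something like the following sketch. The base image and the libpq5 package are placeholders chosen for illustration; substitute whatever your application actually needs:

```dockerfile
# A Debian-based base image is assumed here.
FROM python:3.12-slim
# Update the index, install, and delete the index all in one RUN,
# so neither the index files nor the downloaded .deb files are
# stored in any layer.
# --no-install-recommends skips optional "recommended" packages.
# libpq5 is just a placeholder package name.
RUN apt-get update && \
    apt-get install -y --no-install-recommends libpq5 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```

The cleanup has to happen in the same RUN as the install; a separate RUN would leave the files in the earlier layer.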

For Python packages:

  • You can pass the --no-cache-dir option to pip install to avoid keeping copies of downloaded files.
  • Other packaging tools for Python should have similar options.
  • You can alternatively use BuildKit caching with pip or other tools, which also helps speed up your builds during development.
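
The two pip approaches can be sketched as follows. The base image and requirements.txt are assumptions for illustration, and /root/.cache/pip assumes pip runs as root on Linux with its default cache location:

```dockerfile
# syntax=docker/dockerfile:1
# python:3.12-slim and requirements.txt are assumed for illustration.
FROM python:3.12-slim
COPY requirements.txt .

# Option 1: tell pip not to keep its cache of downloaded files.
RUN pip install --no-cache-dir -r requirements.txt

# Option 2 (use instead of option 1; requires BuildKit): keep pip's
# cache in a cache mount, which is reused across builds but is not
# stored in the image's layers.
# RUN --mount=type=cache,target=/root/.cache/pip \
#     pip install -r requirements.txt
```

With option 2 you leave out --no-cache-dir, since the point is for pip to reuse the cached downloads on the next build.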

Avoid copying in unnecessary files

When you COPY files into your image, they will make your image bigger. If you need those files, that’s fine. If you don’t, it’s a waste of space.

  • You can use the .dockerignore file to list files you don’t want copied in.
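  • You can explicitly list which files to COPY in, which also helps avoid leaking secrets. For example, “COPY mycode/ setup.py /app/” instead of “COPY . /app”. Note the trailing slash: with more than one source, the destination must be a directory ending in /.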
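
One detail worth knowing when listing files explicitly: COPY with a directory as the source copies the directory’s contents, not the directory itself. A minimal sketch that keeps the package layout intact, assuming the same hypothetical mycode/ and setup.py names and an assumed base image:

```dockerfile
# python:3.12-slim is an assumed base image, used only for illustration.
FROM python:3.12-slim
WORKDIR /app
# Copy only what the application needs, rather than "COPY . /app".
COPY setup.py ./
# Name the destination directory explicitly, so the files end up in
# /app/mycode/ instead of being spilled directly into /app/.
COPY mycode/ ./mycode/
```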

Other ideas:

A final reminder

As mentioned above, image size is probably the last thing you should work on; only start working on it once you’ve ensured your image is ready for production usage in other ways, starting with security. But when the time comes, you can make significant improvements by using the techniques above.
