Our unit of value at Quilt Data is data packages: like code packages, but for data. We wanted data packages to look and feel as much like code packages as possible — right down to how you import them.
import wasn’t an immediate fit for data packages, however; the default import logic is intended for code modules, which point to a
.py file somewhere on disk, whilst our data packages are objects, and are backed by JSON files.
Luckily, imports are extremely hackable!
We were able to get the
import behavior we wanted by building our own finder:loader pair. In this blog post we’ll learn how this feature works — and how you can use it yourself to enable Python imports on almost anything.
Here’s how we used this feature in T4:
# code dependencies
import pandas as pd
import numpy as np
# data dependencies
from t4.data.aleksey import fashion_mnist
from t4.data.quilt import open_images
# your code
To get this code snippet working, we had to “teach” Python how to construct data packages (e.g.
aleksey/fashion_mnist) out of JSON files.
Before we can hack
import, we first have to know a little bit about how it works.
The first time that you try to
import foo, Python will scan a pre-selected group of paths for an importable thing with the name
foo. You can see the list of paths that it rummages through using
sys.path. For example, on my machine:
In : import sys; print(sys.path)
The first path in
sys.path is usually
'', which is an alias for “the current directory”. The next handful of paths are various mount points for Python code packages. Python scans these paths in order, and returns with the first matching name it can find.
So for instance, if you were to create and
pip install a new package named
os, running
import os would still give you the stdlib
os module. That’s because stdlib
os is in the
python3.6 directory, whereas your version of
os is in
site-packages — which is further down the list.
sys.path is the easiest way to make your own code importable. If you want to make
something.py importable, all you have to do is
append the path to the directory containing that file to your
sys.path.
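As a quick sketch (the temporary directory and the VALUE attribute here are invented for the demo):

```python
import pathlib
import sys
import tempfile

# Write a throwaway module to a temporary directory (hypothetical example).
module_dir = tempfile.mkdtemp()
pathlib.Path(module_dir, "something.py").write_text("VALUE = 42\n")

# Appending the containing directory to sys.path makes the file importable.
sys.path.append(module_dir)

import something
print(something.VALUE)  # 42
```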
Another important feature of Python imports is module caching. Every time you try to import a name, Python first checks the module cache to see if it has already imported it. If it hasn’t, it tries to import that name, then (if successful) adds it to the module cache. If it has, it simply hands back the cached module.
This is why you can’t import a package, change the code, then import it again and see your changes.
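The cache lives in sys.modules, and you can poke at it directly:

```python
import sys

import json                    # first import: loads and caches the module
assert "json" in sys.modules   # the cache is just a dict in sys.modules

import json as json_again      # second import: a pure cache hit
assert json_again is sys.modules["json"]

# Evicting the cache entry is what forces a genuine re-import next time
# (importlib.reload does this kind of bookkeeping for you).
del sys.modules["json"]
import json                    # loaded from scratch again
```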
Here’s a feature of Python imports that’s much less well known:
In : import sys; print(sys.meta_path)
[<class '_frozen_importlib.BuiltinImporter'>,
 <class '_frozen_importlib.FrozenImporter'>,
 <class '_frozen_importlib_external.PathFinder'>,
 <six._SixMetaPathImporter at 0x10c14ef60>]
The objects in this list are what are known as finders. Every time you import a name, Python presents that name to each of these finders (in precedence order, just like in
sys.path). Finders determine whether they can import a particular name; if so, they return a loader object that then actually, you know, loads it.
So here’s an updated picture of how Python imports work. Python goes through the list of finders in your
sys.meta_path, asking each one if it knows how to import a name. The first finder to say it can wins the rights to the import. If no finder knows what to do with a name, Python gives up and raises a ModuleNotFoundError.
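You can mimic that loop yourself by calling find_spec, the standard finder entry point, on each finder in turn (the helper name here is made up):

```python
import sys

def find_loader_for(name):
    # Walk sys.meta_path in order, just like the import system does.
    for finder in sys.meta_path:
        spec = finder.find_spec(name, None)
        if spec is not None:
            # The spec bundles the module name with its loader.
            return type(finder).__name__, spec.loader
    return None  # no finder claimed the name: ModuleNotFoundError territory

print(find_loader_for("json"))
```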
So what is
sys.path? It’s actually just the list of directories that
PathFinder will search through in an attempt to find a code module with the given name. Most Python imports are handled by
PathFinder; only a handful of modules compiled directly into the interpreter in C, like
sys, are handled by the
BuiltinImporter further up the list.
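You can confirm this division of labor directly:

```python
from importlib.machinery import BuiltinImporter, PathFinder

# sys is compiled into the interpreter, so BuiltinImporter claims it...
assert BuiltinImporter.find_spec("sys") is not None

# ...while PathFinder, which only searches the sys.path directories,
# has never heard of it.
assert PathFinder.find_spec("sys") is None

# An ordinary on-disk stdlib module is the reverse case.
print(PathFinder.find_spec("json").origin)
```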
sys.meta_path is our “in”! In order to implement a new type of module in Python, we need to create a new finder:loader pair for our new module type and append it to the
sys.meta_path list.
I’ll demonstrate how finders and loaders work using the importer code in T4.
The code that follows has only been verified to work in Python 3.6+.
First, the finder object:
DataPackageFinder implements a required
find_spec method. It is the job of
find_spec to determine whether or not it can import a name. If it can’t, it should return
None. If it can, it should return a module specification parameterized with two things: the module name, and the module loader.
In our case this was easy: we just matched any names starting with
t4.data. Other use cases may require more complicated name matching.
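A rough sketch of a finder along these lines (not the real T4 code — in the real importer the spec carries the data package loader, for which None is a stand-in here):

```python
import importlib.util

class DataPackageFinder:
    """Sketch of a meta path finder for the t4.data namespace."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname == "t4.data" or fullname.startswith("t4.data."):
            # Return a module specification naming the module and its
            # loader (placeholder here); is_package=True gives it a __path__.
            return importlib.util.spec_from_loader(
                fullname, loader=None, is_package=True
            )
        return None  # any other name is not ours: let other finders try

finder = DataPackageFinder()
assert finder.find_spec("t4.data.aleksey") is not None
assert finder.find_spec("numpy") is None
```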
The module loader is more complicated, and requires a bit more study.
DataPackageImporter implements two required methods. The first is
create_module. If this method returns
None, the default module creator will be used; this is probably what you want, unless you’re doing something super weird.
The second required method is
exec_module. It takes a constructed module object as input: a bare-bones representation of an importable module whose only special characteristic is a
__name__ attribute, which has been set to the value of the
fullname parameter from
find_spec.
In Python, modules are really objects, and objects are really dictionaries. So to extend this module, we hang new “stuff” directly on the module
__dict__, e.g. via
module.__dict__['foo'] = 'bar'.
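You can see this by building a module entirely by hand (the fake_data name is invented for the demo):

```python
import sys
import types

# A module is just an object; construct one with no file behind it.
module = types.ModuleType("fake_data")
module.__dict__["foo"] = "bar"

# Registering it in the module cache makes it importable by name.
sys.modules["fake_data"] = module

import fake_data
print(fake_data.foo)  # bar
```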
Module names with many parts are executed in a top-down manner. So to
import foo.bar.baz, we first
import foo, then
import foo.bar, and only then
import foo.bar.baz. The only restriction on what you can return on import is that you have to return a module object. The module objects you return as you go down the list of module namespaces don’t have to be at all related!
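A finder that claims nothing but logs every request makes this top-down order visible (here we evict xml from the cache first so the whole chain re-runs):

```python
import sys

asked = []

class TracingFinder:
    # Record every name the import system asks about; claim none of them,
    # so the normal finders still do the actual importing.
    def find_spec(self, fullname, path=None, target=None):
        asked.append(fullname)
        return None

# Evict any cached xml modules so the import below starts from scratch.
for name in list(sys.modules):
    if name == "xml" or name.startswith("xml."):
        del sys.modules[name]

sys.meta_path.insert(0, TracingFinder())
import xml.dom.minidom
sys.meta_path.pop(0)

# The dotted name was resolved top-down: xml, then xml.dom, then
# xml.dom.minidom (interleaved with their own dependencies).
print([n for n in asked if n.startswith("xml")])
```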
In our case, since we wanted to be able to import using a
from t4.data.namespace import packagename pattern, we need logic for both the
t4.data and
t4.data.namespace names. For
t4.data we return an empty module, and for
t4.data.namespace we return a module object with the relevant objects keyed by
packagename hung on the module
__dict__.
However, since Python requires that you answer imports with a
module object, but we want to return a
t4.Package object, we can’t and don’t implement logic for
import t4.data.namespace.packagename. Instead we tell users to run
from t4.data.foo import bar. Running this code imports the
t4.data.foo module object, then plucks the
bar object out of the module
__dict__.
All of this logic is implemented in the
DataPackageImporter above. Scroll up to the code sample again and see if you now understand what it’s doing!
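If you want something to experiment with, here is a self-contained toy version of the whole scheme. Everything T4-specific is faked: a hard-coded dictionary stands in for the real registry and its list_packages() call, plain strings stand in for t4.Package objects, and the finder also claims the bare t4 name so the sketch runs without t4 installed.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical stand-in for the real T4 registry / list_packages().
REGISTRY = {"aleksey": {"fashion_mnist": "<the fashion-mnist package>"}}

class DataPackageLoader(importlib.abc.Loader):
    def create_module(self, spec):
        return None  # default module creation is what we want

    def exec_module(self, module):
        module.__path__ = []  # "virtual" module: nothing on disk
        parts = module.__name__.split(".")
        if len(parts) == 3:  # t4.data.<namespace>
            # We can't see which package the user asked for, so hang
            # every package in the namespace on the module __dict__.
            for name, package in REGISTRY[parts[2]].items():
                module.__dict__[name] = package

class DataPackageFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path=None, target=None):
        # The real finder only claims t4.data.*; claiming bare "t4"
        # keeps this sketch self-contained.
        if fullname == "t4" or fullname.startswith("t4."):
            return importlib.util.spec_from_loader(
                fullname, DataPackageLoader(), is_package=True
            )
        return None

sys.meta_path.append(DataPackageFinder())

from t4.data.aleksey import fashion_mnist
print(fashion_mnist)  # <the fashion-mnist package>
```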
There are just a couple more details to keep in mind:
- It’s not possible to “look ahead” and check which specific sub-module a user asked for. Because of this, we have to include every possible user request in the module we return (this is why we have to iterate over
list_packages() in the code above). This is a conscious design decision: we’re worse off if the user only imports one object, since we had to do all the additional work of finding all the other things; but better off if they import many objects, since now every object except for the first is just plucked from the module cache.
- The
module must have its
__path__ attribute set. This is true even if the module is “virtual”, e.g. it doesn’t actually exist on-disk. You can get around this by setting
__path__ to an empty list
(module paths are lists for…reasons).
Finally, there’s one more thing we need to do — hang the finder on
sys.meta_path:
import sys
from imports import DataPackageFinder
sys.meta_path.append(DataPackageFinder())
That’s it! We’re now ready to import our data packages. Code like
from t4.data.foo import bar will “just work”.
Hopefully from reading this blog post you’ve learned a lot about how Python
import works, and about how you can extend it yourself.
If you’re not satisfied with what you’ve seen here, and want to dig even deeper into the (super hairy) world of Python imports, I highly recommend watching David Beazley’s classic PyCon talk, “Packages and modules: live and let die!”. Warning though — it’s three hours long. 🙂
If you want to see just how far you can go with
import hacks check out “How to use loader and finder objects in Python”, which talks through how a group of researchers hacked
import so they could interface with model files written in the Clojure (!) programming language.
Interested in data packages? Help us build them in the Quilt T4 repo.