In a previous post I introduced the MNIST dataset and the problem of classifying handwritten digits. In this post I’ll be using the code I wrote in that post to port a simple neural network implementation to rust. My goal is to explore performance and ergonomics for data science workflows in rust.

## The Python Implementation

Chapter 1 of the book describes a very simple single-layer Neural Network that can classify handwritten digits from the MNIST dataset using a learning algorithm based on stochastic gradient descent. This sounds complicated — and it kind of is, this stuff was state-of-the-art in the mid 1980s — but really it all comes down to about 150 lines of heavily commented Python code.

I’m going to assume that you already know the content of that chapter so stop here and go read that if you want to brush up on neural network basics. Or don’t and just pay attention to the code, it’s not super important to understand the details of exactly why the code works the way it does to see the differences between the Python approach and the Rust approach.

The fundamental data container in this code is a `Network`

class that represents
a neural network with a user-controllable number of layers and number of neurons
per layer. The data for the `Network`

class are represented internally as lists
of 2D NumPy arrays. Each layer of the network is represented as a 2D array of
weights and 1D array of biases, contained in attributes of the `Network`

class
named `biases`

and `weights`

. These are both lists of 2D arrays. The biases
are column vectors but are still stored as 2D arrays by making use of a dummy
dimension.
The initializer for the `Network`

class looks like this:

```
class Network(object): def __init__(self, sizes): """The list ``sizes`` contains the number of neurons in the
respective layers of the network. For example, if the list
was [2, 3, 1] then it would be a three-layer network, with the
first layer containing 2 neurons, the second layer 3 neurons,
and the third layer 1 neuron. The biases and weights for the
network are initialized randomly, using a Gaussian
distribution with mean 0, and variance 1. Note that the first
layer is assumed to be an input layer, and by convention we
won't set any biases for those neurons, since biases are only
ever used in computing the outputs from later layers.""" self.num_layers = len(sizes) self.sizes = sizes self.biases = [np.random.randn(y, 1) for y in sizes[1:]] self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
```

In this simple implementation the weights and biases are initialized by drawing from the standard normal distribution — a normal distribution with a mean of zero, standard deviation of 1. We can also see how the biases are explicitly initialized as column vectors.

The `Network`

class exposes two methods that users would call directly. First,
the `evaluate`

method, which asks the network to try to identify the digits in a
set of test images and then scores the result based on the *a priori* known
correct answer. Second, the `SGD`

method runs a stochastic gradient descent
learning procedure by iterating over a set of images, breaking up the full set
of images into small mini-batches, updating the network’s state based on each
mini-batch of images and a user-specifiable learning rate, `eta`

, and then
re-running the training procedure for a new randomly selected set of
mini-batches for a user-specifiable number of *epochs*. The core of the
algorithm, where each mini-batch and the state of the neural network gets
updated, looks like this:

```
def update_mini_batch(self, mini_batch, eta): """Update the network's weights and biases by applying
gradient descent using backpropagation to a single mini batch.
The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
is the learning rate.""" nabla_b = [np.zeros(b.shape) for b in self.biases] nabla_w = [np.zeros(w.shape) for w in self.weights] for x, y in mini_batch: delta_nabla_b, delta_nabla_w = self.backprop(x, y) nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)] nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)] self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)] self.biases = [b-(eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)]
```

For each training image in the mini-batch, we accumulate estimates for the
gradient of the cost function via backpropagation (implemented in the `backprop`

function). Once we exhaust the mini-batch, we adjust the weights and biases
according to the estimated gradients. The update includes `len(mini_batch)`

in
the denominator because we want the average gradient over all the estimates in
the mini-batch. We can also control how fast the weights and biases get updated
by adjusting the learning rate, `eta`

, which globally modulates how big the
updates from each mini-batch can be.

The `backprop`

function calculates the cost gradient for the neural network by
starting with the expected output of the network given the input image and then
working backward through the network to propagate the error in the network
through the layers. This requires a substantial amount of data munging, and its
where I spent most of my time porting the code to rust but I think it’s a little
too long to dive into in depth here, take a look at chapter
2 of the book if you want
more detail.

## The Rust Implementation

The first step here was to figure out how to load the data. That ended up being
fiddly enough that I decided to break that off into its own
post. With that
sorted I then had to figure out how to represent the Python `Network`

class in
rust. I ended up deciding to use a struct:

```
use ndarray::Array2; #[derive(Debug)]
struct Network { num_layers: usize, sizes: Vec<usize>, biases: Vec<Array2<f64>>, weights: Vec<Array2<f64>>,
}
```

The struct gets initialized with the number of neurons in each layer in much the same way as the Python implementation:

```
use rand::distributions::StandardNormal;
use ndarray::{Array, Array2};
use ndarray_rand::RandomExt; impl Network { fn new(sizes: &[usize]) -> Network { let num_layers = sizes.len(); let mut biases: Vec<Array2<f64>> = Vec::new(); let mut weights: Vec<Array2<f64>> = Vec::new(); for i in 1..num_layers { biases.push(Array::random((sizes[i], 1), StandardNormal)); weights.push(Array::random((sizes[i], sizes[i - 1]), StandardNormal)); } Network { num_layers: num_layers, sizes: sizes.to_owned(), biases: biases, weights: weights, } } }
```

One difference is that in Python we used
`numpy.random.randn`

to initialize the biases and weights while in rust we use the
`ndarray::Array::random`

function which accepts a
`rand::distribution::Distribution`

as a parameter, allowing the choice of an
arbitrary distribution. In this case we used the
`rand::distributions::StandardNormal`

distribution. It’s worth noting that this
uses an interface defined in three different crates, two of which — `ndarray`

itself and `ndarray-rand`

— are maintained by the `ndarray`

authors, and
another — `rand`

— maintained by a different set of developers.

### The merits of monolithic packages

In principle it’s nice that random number generation is not isolated inside the
`ndarray`

codebase and if new random number distributions or capabilities are
added to `rand`

, `ndarray`

and all other crates in the rust ecosystem that need
random numbers can benefit equally. On the other hand it does add some cognitive
overhead to need to refer between the documentation for the various crates
instead of having a single centralized place to look. In my particular case I
also got a little unlucky and happened to do this project right after `rand`

made a release that changed its public API. This led to an incompatibility
between `ndarray-rand`

, which depended on version 0.6 of `rand`

, and my project which
declared a dependency on version 0.7.

I’d heard that `cargo`

and rust’s build system handle this sort of problem
really well but at least in this case I was presented with a confusing error
message about how the random number distribution I was passing in didn’t satisfy
the `Distribution`

trait. While this is true — it satisfied the `Distribution`

trait from `rand 0.7`

but not the one from `rand 0.6`

that `ndarray-rand`

expected — it is extremely confusing because the version numbers of the
various crates don’t show up in the error message. I ended up reporting this as
an issue. I discovered
there that these confusing error messages from crates with incompatible APIs is
a long-standing issue for the
rust language. Hopefully in the future rust can grow more helpful error
messages.

In the end this separation of concerns caused a lot of friction for me as a new
user. In Python I could have simply done `import numpy`

and be done. I do think
that NumPy probably went a bit too far in the direction of being completely
monolithic — it was originally written at a time when packaging and
distributing Python code with C extensions was much harder than it is today —
I do think that going too far in the other extreme can make a language or
ecosystem of tools harder to learn.

### Types and ownership

The next bit I’ll show in detail is the rust version of `update_mini_batch`

:

```
impl Network { fn update_mini_batch( &mut self, training_data: &[MnistImage], mini_batch_indices: &[usize], eta: f64, ) { let mut nabla_b: Vec<Array2<f64>> = zero_vec_like(&self.biases); let mut nabla_w: Vec<Array2<f64>> = zero_vec_like(&self.weights); for i in mini_batch_indices { let (delta_nabla_b, delta_nabla_w) = self.backprop(&training_data[*i]); for (nb, dnb) in nabla_b.iter_mut().zip(delta_nabla_b.iter()) { *nb += dnb; } for (nw, dnw) in nabla_w.iter_mut().zip(delta_nabla_w.iter()) { *nw += dnw; } } let nbatch = mini_batch_indices.len() as f64; for (w, nw) in self.weights.iter_mut().zip(nabla_w.iter()) { *w -= &nw.mapv(|x| x * eta / nbatch); } for (b, nb) in self.biases.iter_mut().zip(nabla_b.iter()) { *b -= &nb.mapv(|x| x * eta / nbatch); } }
}
```

The function makes use of two short helper functions I defined that makes this a little less verbose:

```
fn to_tuple(inp: &[usize]) -> (usize, usize) { match inp { [a, b] => (*a, *b), _ => panic!(), }
} fn zero_vec_like(inp: &[Array2<f64>]) -> Vec<Array2<f64>> { inp.iter() .map(|x| Array2::zeros(to_tuple(x.shape()))) .collect()
}
```

Comparing with the Python implementation the interface for calling
`update_mini_batch`

is a little different. Rather than passing in a list of
objects directly, instead of I pass in a reference to the full set of training
data and a slice of indices to consider within that full set. This ended up
being a little easier to reason about without triggering the borrow checker.

Creating `nabla_b`

and `nabla_w`

in `zero_vec_like`

is very similar to the list
comprehension we used in Python. There is one wrinkle that caused me some
frustration which is that if I try to create a zero-filled array with
`Array2::zeros`

and pass it a slice or `Vec`

for the shape, I get back an `ArrayD`

instance. To get an `Array2`

— that is explicitly a 2D array and not a generic
D-dimensional array — I need to pass a tuple to `Array::zeros`

. However, since
`ndarray::shape`

returns a slice, I need to convert the slice to a tuple
manually using the `to_tuple`

function. This sort of thing can be glossed over
in Python but in rust the difference between a tuple and slice can be very
important, as in this API.

The code to estimate the updates for the weights and biases via backpropagation has a very similar structure to the python implementation. We train each example image in the mini-batch and obtain estimates for the gradient of the quadratic cost as a function of the biases and weights:

```
let (delta_nabla_b, delta_nabla_w) = self.backprop(&training_data[*i]);
```

and then accumulate these estimates:

```
for (nb, dnb) in nabla_b.iter_mut().zip(delta_nabla_b.iter()) { *nb += dnb;
}
for (nw, dnw) in nabla_w.iter_mut().zip(delta_nabla_w.iter()) { *nw += dnw;
}
```

Once we’ve finished processing the mini-batch, we update the weights and biases, modulated by the learning rate:

```
let nbatch = mini_batch_indices.len() as f64;
for (w, nw) in self.weights.iter_mut().zip(nabla_w.iter()) { *w -= &nw.mapv(|x| x * eta / nbatch);
}
for (b, nb) in self.biases.iter_mut().zip(nabla_b.iter()) { *b -= &nb.mapv(|x| x * eta / nbatch);
}
```

This example illustrates how the ergonomics of working with array data is very
different in Rust compared with Python. First, rather than multiplying the array
by the float `eta / nbatch`

, we instead use `Array::mapv`

and define a closure
in-line to map in a vectorized manner over the full array. This sort of thing
would not be very fast in Python because function calls are very slow. In rust
it doesn’t make much difference. We also need to borrow the return value of
`mapv`

with `&`

when we subtract, lest we consume the array data while we
iterate over it. Needing to think carefully about whether functions consume data
or take references makes it much more conceptually demanding to write code like
this in Rust than in Python. On the other hand I do have much higher confidence
that my code is correct when it compiles. I’m not sure whether the fact that
this code was so demanding for me to write is due to Rust really being harder to
write or the disparity between my experience in Rust and Python.

## Rewrite it in rust and everything will be better

At this point I was left with something that was faster than the unoptimized
Python version I had started with. However, instead of a 10x or better speedup
that one might expect moving from a dynamic, interpreted language like Python to
a compiled performance-oriented language like rust, I only observed about a 2x
improvement. To understand why I decided to measure the performance of the rust
code. Luckily there is a very nice project that makes it easy to generate flame
graphs for rust projects:
flamegraph. This adds a
`flamegraph`

subcommand to `cargo`

, so one needs only to do `cargo flamegraph`

in a crate, it will run the code, and then write a flamegraph `svg`

file one can
inspect with a web browser.

If you’ve never looked at a flamegraph before the idea is that the proportion of
a program’s runtime that occurs in a routine is proportional to the width of the
bar for that routine. The main function is at the bottom of the graph and
functions called by main are stacked on top. This gives you a simple view into
what functions take up the most time in a program - anything that is very “wide”
in the graph is where most of the time is spent and any stack of functions that
is very tall and wide is spending a lot of time in code very deep in a call
stack. Looking at the flamegraph above we can see that about half of the time is
spent in functions with names like `dgemm_kernel_HASWELL`

— these are
functions in the OpenBLAS linear alebra library. The rest of the time is spent
doing addition between arrays in `update_mini_batch and allocating arrays —
all other parts of my program make a negligible contribution to the runtime.

If we made an analogous flamegraph for the python code, we would see a
similar pattern — most time is spent doing linear algebra (in the places
where `np.dot`

is called inside the backpropagation routine). So since most of
the time in either Rust or Python is spent inside a numerical linear algebra
library, we can never hope for a 10x speedup.

In fact it’s worse than that. One of the exercises in the book is to rewrite the Python code to use vectorized matrix multiplication. In this approach the backpropagation for all of the samples in each mini-batch happens in a single set of vectorized matrix multiplication operations. This requires the ability to matrix multiplication between 3D and 2D arrays. Since each matrix multiplication operation happens using a larger amount of data than the non-vectorized case, OpenBLAS is able to more efficiently occupy CPU caches and registers, ultimately better using the available CPU resources on my laptop. The rewritten Python version ends up faster than the Rust version, again by a factor of two or so.

In principle it’s possible to apply the same optimization to the Rust code,
however the `ndarray`

crate does not yet
support matrix
multiplication for dimensionalities higher than 2. It might also be possible to use
thread parallelization on the mini-batch updates using a library like
rayon. I tried this on my laptop and did
not see any speedups but might have on a beefier machine with more CPU
threads. I could also have tried using a different low-level linear algebra
implementation, for example there are rust bindings for
tensorflow and
torch, however at that point I feel
like I might as well be using the Python bindings for those libraries.

## Is rust suitable for data science workflows?

Right now I have to say that the answer is “not yet”. I’ll definitely reach for rust in the future when I need to write optimized low-level code with minimal dependencies. However using it as a full replacement for python or C++ will require a more stabilized and well-developed ecosystem of packages.