In my previous article on deep rectifier networks I focused on their algebraic structure and showed that the latent space forms a Hilbert space, which means that these networks function by decorrelating input signals. Before continuing I should clarify that by deep rectifier networks I am referring to networks of arbitrary architecture with ReLU activations. Now, while the algebraic perspective is insightful, it doesn’t clearly explain why deeper networks might be more expressive. If the latent space of a network with $N$ nodes has $2^N$ possible linear feature maps, why would it matter whether it has five layers with $N/5$ nodes per layer or one hidden layer with $N$ nodes? Why would depth make a difference?
A colleague suggested that I take a look at ‘Tropical Geometry of Deep Neural Networks’ [2], which supposedly explains why deeper networks are exponentially more expressive, but as I have little familiarity with tropical geometry I decided to see how much of the question I could answer myself before reading that publication. In fact, by following my own reasoning it occurred to me that a geometric interpretation of affine transformations is crucial, and that it may be useful to identify deep rectifier networks with a particular kind of geometric decision tree.
In summary, here’s a short breakdown of my attempt to answer this question:

A deep rectifier network may be identified with an $N$-level decision tree ($N$ being the total number of nodes) for solving geometric problems in a sequential manner, with lower layers of the network forming reusable subexpressions.

Each subexpression is an affine transformation, and all geometric transformations in Euclidean space are affine. It follows that deep rectifier networks may be viewed as geometric programs where each subexpression is a geometric operation.

At each layer of the network the candidate expressions correspond to subsets of the $n_i$ distinct nodes at that layer, i.e. to elements of that layer’s power set. Hence the importance of network width for versatile geometric reasoning.

Furthermore, each subexpression at layer $i$ may be reused at most $2^{N-\sum_{k=1}^{i} n_k}$ times by subexpressions at higher levels of the decision tree. Montufar [1] gives a similar argument, but he identifies expressions with ‘folds’, which is incorrect in my opinion.

The importance of the last statement is clear when we think in terms of geometric transformations. Deep rectifier networks permit an exponentially large number of possible sequences of geometric transformations, with the length of each sequence proportional to network depth, and therefore the complexity of the set of possible transformations of the latent space grows with depth.
In summary, depth gives us complexity and width gives us versatility. Now, to better understand these ideas I propose applying deep rectifier networks to the problem of self-organising origami, which involves designing fault-tolerant structures by folding matter in a self-organised manner, something nature does amazingly well [3]. In fact, I see the application to self-organising origami as a proof of principle for this explanation.
Deep Rectifier Networks as $N$-level decision trees:
Let’s suppose that our deep rectifier network has $L$ layers, $n_i$ nodes at each layer $i$, and $N = \sum_{k=1}^{L} n_k$ nodes in total. Then I claim that a deep rectifier network may be identified with an $N$-level decision tree for solving geometric problems in a sequential manner. Let me explain:

At the $i$-th layer there are $2^{n_i}$ possible affine transformations (one per subset of active nodes) that may be applied to the output of the $(i-1)$-th layer.

It follows that the combinatorial versatility at layer $i$ is proportional to $2^{n_i}$, so network width is important.

Within the context of computational geometry (where algorithms and data structures are defined in terms of geometric objects), these affine transformations are subexpressions that may be reused by higher levels of the decision tree.
The last insight is useful because it becomes immediately clear that each subexpression at layer $i$ may be reused at most:
\begin{equation} 2^{N-\sum_{k=1}^{i} n_k} \end{equation}
times by subexpressions at higher levels of the decision tree. Now we see that the tree analogy is actually quite good: the number of geometric sequences grows exponentially with network depth, so by the time we have reached the end of the network we are in a real jungle with an astronomical number of options.
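As a sanity check on this bound, here is a small Python sketch (with hypothetical layer widths, chosen only for illustration) that computes the reuse bound $2^{N-\sum_{k=1}^{i} n_k}$ for each layer of a toy network:

```python
# Sketch: reuse bounds for a toy rectifier network with hypothetical
# layer widths n_k. A subexpression at layer i may be reused by any
# subset of the N - (n_1 + ... + n_i) nodes above it.
n = [4, 3, 2]               # n_k: nodes at layers 1, 2, 3
N = sum(n)                  # N: total number of nodes

bounds = []
nodes_so_far = 0
for n_i in n:
    nodes_so_far += n_i
    bounds.append(2 ** (N - nodes_so_far))

print(bounds)               # reuse bound for layers 1, 2, 3
```

Note how the bound shrinks as we ascend the tree: the deepest subexpressions are the most reusable.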
But why would the composition of affine transformations lead to greater geometric reasoning ability? Why not simply use a single-layer network with $N$ nodes?
All geometric transformations in Euclidean space are affine transformations:
Let’s make the reasonable assumption that each layer has $n$ nodes. How do sequences of affine transformations transform subspaces of $\mathbb{R}^n$? To answer this question we shall first handle the case $n = 3$, i.e. three-dimensional space.
Geometric transformations:
There are four elementary geometric transformations in $\mathbb{R}^3$, which I shall describe briefly below. The reader may check that these are all invertible transforms, and therefore geometric transforms are elements of the affine group on $\mathbb{R}^3$, $\mathrm{Aff}(3)$.
Translation:
Given a vector $X \in \mathbb{R}^3$, the identity matrix $I$ and a translation vector $b$, we may define translations as follows:
\begin{equation} X \mapsto IX + b \end{equation}
Shear:
This is simply the identity matrix with one of its six zero off-diagonal elements replaced by a non-zero real number.
Rotation matrices:
There are three elementary rotation matrices, one for each coordinate axis, and one interesting thing about rotations is that they are not commutative.
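For reference (these are the standard definitions, not specific to this article), the elementary rotation about the $z$-axis by angle $\theta$ is:

\begin{equation} R_z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix} \end{equation}

with $R_x(\theta)$ and $R_y(\theta)$ defined analogously on the other two coordinate pairs. Non-commutativity may be checked directly: $R_x(\theta) R_y(\theta) \neq R_y(\theta) R_x(\theta)$ for generic $\theta$.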
Scaling:
This is similar to the identity matrix except that one or more of the diagonal entries has an absolute value which is neither zero nor one.
Now, the power of these elementary geometric transforms is due to the fact that any geometric transform is a composition of these elementary transforms.
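As an illustration of such a composition (with arbitrary, purely illustrative parameter values), here is a short numpy sketch building one elementary transform of each kind and composing them into a single affine map:

```python
import numpy as np

# Sketch (illustrative parameter values): one elementary transform of
# each kind in R^3, composed into a single affine map X -> AX + b.
theta = np.pi / 4
I = np.eye(3)

shear = I.copy()
shear[0, 1] = 2.0                        # one zero entry made non-zero
scale = np.diag([2.0, 1.0, 0.5])         # diagonal entries off {0, 1}
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
b = np.array([1.0, -1.0, 0.0])           # translation vector

A = rot_z @ scale @ shear                # composed linear part
y = A @ np.array([1.0, 2.0, 3.0]) + b    # the composed affine transform

# Each factor is invertible, so the composition is too.
det_A = np.linalg.det(A)
print(det_A)
```

Since each factor has non-zero determinant, the composed map stays inside the affine group, exactly as the argument above requires.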
Geometric transformations in higher-dimensional Euclidean spaces:
Let’s suppose that $n > 3$, so that we have to interpret the affine transformation at layer $i$ assuming that the latent space is $\mathbb{R}^n$. Let’s note that there are $\binom{n}{3}$ axis-aligned orthogonal projections of $\mathbb{R}^n$ onto $\mathbb{R}^3$. It follows that there is a direct interpretation of affine transforms on $\mathbb{R}^n$. These are deformations of at most three-dimensional subspaces of the $n$-dimensional latent space, where each geometric transform is followed by a feature-dependent translation which is a linear combination of the other values of the feature vector.
There is, however, the natural question of whether this interpretation is exactly correct. How many of the column-wise projections of the layer’s affine transform are invertible? These are the ones that may be interpreted as geometric transformations.
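One way to probe this question numerically (my own sketch, under my reading of the question, with a Cauchy-type matrix standing in for a trained weight matrix) is to check the determinant of each $3 \times 3$ coordinate projection:

```python
import numpy as np
from itertools import combinations

# Sketch: take an n x n weight matrix, project it onto each
# 3-dimensional coordinate subspace, and count how many of the
# resulting 3 x 3 blocks are invertible. The Cauchy-type matrix
# below stands in for a trained layer; every such block of it is
# known to be non-singular.
n = 5
W = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

subsets = list(combinations(range(n), 3))
invertible = sum(
    1 for idx in subsets
    if abs(np.linalg.det(W[np.ix_(idx, idx)])) > 1e-12
)
print(invertible, "of", len(subsets), "projections are invertible")
```

For a generic weight matrix every projection is invertible, since a vanishing determinant is a measure-zero event.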
A kinematic view:
Let’s consider a finite set of transforms $\{x_1, x_2, \dots, x_n\}$. How can we interpret the sequence of geometric (and hence invertible) transformations:
\begin{equation} T = x_n \circ x_{n-1} \circ \dots \circ x_1 \end{equation}
Well, these may be understood as a motion through space with discrete time increments. For all the above reasons, network depth makes a difference in terms of the affordable complexity of the geometric expressions. With one affine transform you have a line, with two a curve, and with a large number you have arbitrarily complex trajectories, i.e. deformations of space.
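To visualise this kinematic reading, here is a numpy sketch (illustrative values only) that repeatedly applies a single rotation-plus-translation and records the resulting trajectory:

```python
import numpy as np

# Sketch (illustrative values): a fixed rotation-plus-translation
# applied repeatedly, read as motion with discrete time increments.
theta = np.pi / 6                     # 30-degree rotation about z
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([0.0, 0.0, 0.1])         # small step along z each tick

x = np.array([1.0, 0.0, 0.0])
trajectory = [x]
for _ in range(12):
    x = R @ x + b                     # one discrete time step
    trajectory.append(x)

final = trajectory[-1]
print(final)
```

Twelve 30-degree steps bring the $xy$-coordinates back to their starting point while the translations accumulate along $z$: a simple helix, already beyond what any single affine transform can trace.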
Applying deep and wide rectifier networks to self-organised origami:
Origami, as the reader probably knows, is the art of creating new shapes via a sequence of elementary folds, which correspond to rotations followed by translations and vice versa. What may surprise the reader, however, is the universality of origami.
It may be demonstrated that every 3D shape can be approximated by a sequence of origami folds applied to a sufficiently large piece of paper. Similarly, the universality of rectifier networks implies that a sufficiently deep and wide rectifier network may be used to approximate any finite number of distinct sequences of geometric transformations.
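One way to make the fold/ReLU connection concrete (my own illustration, not taken from [3]): a single fold of the plane along a crease is exactly what a pair of ReLU units computes, since $\mathrm{relu}(u) + \mathrm{relu}(-u) = |u|$. A minimal numpy sketch, assuming a vertical crease at $x = c$:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fold(points, c):
    # Fold the plane along the vertical crease x = c: points to the
    # left of the crease are reflected onto the right, since
    # c + |x - c| = c + relu(x - c) + relu(c - x).
    x, y = points[:, 0], points[:, 1]
    folded_x = c + relu(x - c) + relu(c - x)
    return np.stack([folded_x, y], axis=1)

pts = np.array([[-2.0, 1.0], [0.5, 0.0], [3.0, -1.0]])
folded = fold(pts, 1.0)
print(folded)  # each x maps to c + |x - c|; y is unchanged
```

Stacking layers of such units composes folds, which is precisely the sequential structure that origami design requires.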
This raises the question: can we use deep rectifier networks in combination with a planning algorithm to explore the space of possible origami shapes?
The challenge of self-organised origami:
L. Mahadevan in ‘Self-Organized Origami’ [3] introduces the challenge of self-organised origami in the following manner:
‘The controlled folding and unfolding of maps, space structures, wings, leaves, petals, and other foldable laminae is potentially complicated by the independence of individual folds; as their number increases, there is a combinatorial explosion in the number of folded possibilities’
This combinatorial explosion is what makes the search problem in neural network parameter space potentially very hard. But the potential benefit of self-organised origami is that we will discover structures that are fault-tolerant in principle, and it may prove to be a very useful tool for modelling morphogenesis. Structures that are commonly observed in nature must be robust by necessity.
Here are some concrete examples of origami-inspired designs already used by engineers:
- NASA’s Starshade project: a sunflower-shaped object that is designed to protect a space telescope. More information in this wonderful interview.
- James Webb Space Telescope: the primary mirror is made from segments that can be folded inwards. More information in this BBC article.
The primary reason why space technology makes heavy use of origami structures is the rather small diameter of rockets (~5 m) relative to the large structures that must be sent into space. Having said this, I think we have barely scratched the surface of the potential of self-organised origami, from the perspective of both science and engineering.
How exactly do we go about this?:
I would hesitate to answer this question because I think this problem isn’t straightforward at all; I will only say that deep rectifier networks may be one of the building blocks. At a high level this is a generative modelling problem, but the devil is in the details.
The short answer is to try everything that appears to be reasonable. That’s exactly what I plan to do.
References:
1. Montufar, G. et al. On the Number of Linear Regions of Deep Neural Networks. 2014.
2. Zhang, L., Naitzat, G. & Lim, L.-H. Tropical Geometry of Deep Neural Networks. 2018.
3. Mahadevan, L. & Rica, S. Self-Organized Origami. 2005.