Deep Rectifier Networks as geometric decision trees

In my previous article on deep rectifier networks I focused on their algebraic structure and showed that the latent space formed a Hilbert space which means that they function by de-correlating input signals . Before continuing I must clarify that by deep rectifier networks I am referring to networks of arbitrary architecture with relu activations. Now, while the algebraic perspective is insightful it doesn’t clearly explain why deeper networks might be more expressive. If the latent space of a network with nodes has possible linear feature maps why would it matter whether it has five layers with nodes per layer or one hidden layer with nodes? Why would depth make a difference?

A colleague suggested that I take a look ‘Tropical Geometry of Deep Neural Networks’ [2] which supposedly explains why deeper networks are exponentially more expressive but as I have little familiarity with tropical geometry I decided to try and figure out whether I could answer as much of the question myself before reading that publication. In fact, by following my own reasoning it occurred to me that a geometric interpretation of affine transformations was crucial and that it may be useful to identify deep rectifier networks with a particular kind of geometric decision tree.

In summary, here’s a short breakdown of my attempt to answer this question:

1. A deep rectifier network may be identified with an level decision tree for solving geometric problems in a sequential manner with lower layers of the network forming re-usable sub-expressions.

2. Each expression is an affine transformation and all geometric transformations in Euclidean space are affine. It follows that deep rectifier networks may be viewed as geometric programs where each sub-expression is a geometric operation.

3. At each layer of the network the candidate expressions are a subset of the power set of distinct nodes at that layer. Hence the importance of network width for versatile geometric reasoning.

4. Furthermore, each sub-expression at layer may be re-used at most times by sub-expressions at higher levels of the decision tree. Montufar [1] gives a similar argument but he identifies expressions with ‘folds’ which is incorrect in my opinion.

5. The importance of the last statement is clear when we think in terms of geometric transformations. Deep rectifier networks permit an exponential number of possible sequences of geometric transformations of length and therefore the complexity of the set of possible transformations of the latent space is proportional to network depth.

In summary depth gives us complexity and width gives us versatility. Now, to better understand these ideas I propose applying deep rectifier networks to the problem of self-organising origami which involves designing fault-tolerant structures by folding matter in a self-organised manner, something nature does amazingly well [3]. In fact, I see the application of self-organising origami as a proof-of-principle that my explanation must be correct.

## Deep Rectifier Networks as N-level decision trees:

Let’s suppose that our deep rectifier network has layers and nodes at each layer . Then I claim that a deep rectifier network may be identified with an level decision tree for solving geometric problems in a sequential manner. Let me explain:

1. At the ith layer there are possible affine transformations that may be applied to the output of the (i-1)th layer.

2. It follows that the combinatorial versatility at layer is proportional to so network width is important.

3. Within the context of computational geometry(where algorithms and data structures are defined in terms of geometric objects), these affine transformations are sub-expressions may be re-used by higher levels of the decision tree.

The last insight is useful because it becomes immediately clear that each sub-expression at layer may be re-used at most:

$$2^{N-\sum_{k=1}^i n_k}$$

times by sub-expressions at higher levels of the decision tree. Now, we see that the tree analogy is actually quite good because the number of geometric sequences grows exponentially with network depth so by the time we have reached the end of the network we are in a real jungle with a astronomical number of options.

But, why would the composition of affine transformations lead to greater geometric reasoning ability? Why not simply use a single layer network with nodes?

## All geometric transformations in Euclidean space are affine transformations:

Let’s make the reasonable assumption that each layer has nodes. How do sequences of affine transformations transform subspaces of ? To answer this question we shall first handle the case of i.e. three dimensional space.

## Geometric transformations:

There are four elementary geometric transformations in which I shall describe briefly below. The reader may check that these are all invertible transforms and therefore geometric transforms are elements of the affine group on , .

### Translation:

Given a vector , the identity matrix and a translation vector we may define translations as follows:

$$X \mapsto IX + b$$

## Shear:

This is simply the identity matrix with one of the six null elements replaced with a non-zero real number.

## Rotation matrices:

For these I actually have to type-set matrices in mathjax which I haven’t figured out how to do. In any case, there are three types of rotation matrices as you can guess and one interesting thing about rotations is that they aren’t commutative.

## Scaling:

This is similar to the identity matrix except that one or more of the diagonal entries has an absolute value which is neither zero nor one.

Now, the power of these elementary geometric transforms is due to the fact that any geometric transform is a composition of these elementary transforms.

## Geometric transformations in higher-dimensional Euclidean spaces:

Let’s suppose so that we have to interpret the affine transformation at layer assuming that . Let’s note that there are orthogonal projections of onto . It follows that there is a direct interpretation of affine transforms on . These are deformations of at most three dimensional subspaces of the dimensional latent space where each geometric transform is followed by a feature-dependent translation which is a linear combination of the other values of the feature vector.

There is however the natural question of whether this interpretation is exactly correct. How many of the column-wise projections of the layer affine transform are invertible? These are the ones that may be interpreted as geometric transformations.

## A kinematic view:

Let’s consider a finite set of transforms . How can we interpret the sequence of geometric(and hence invertible) transformations:

$$T = x_n \circ x_{n-1} \circ … \circ x_1$$

Well, these may be understood as a motion through space with discrete time increments. For all the above reasons network depth makes a difference in terms of the affordable complexity of the geometric expressions. With one affine transform you have a line, with two a curve and with a large number you have arbitrarily complex trajectories i.e. deformations of space.

## Applying deep and wide rectifier networks to self-organised origami:

Origami, as the reader probably knows, is the art of creating new shapes via a sequence of elementary folds which correspond to rotations followed by translations and vice-versa. What may surprise the reader however is the Universality of Origami.

It may be demonstrated that every 3D shape can be approximated by a sequence of origami folds applied to a sufficiently large piece of paper. Similarly, the universality of rectifier networks implies that a sufficiently deep and wide rectifier network may be used to approximate any finite number of distinct sequences of geometric transformations.

This begs the question: can we use deep rectifier networks in combination with a planning algorithm to explore the space of possible origami shapes?

## The challenge of self-organised origami:

L. Mahadevan in ‘self-organised origami’ [3] introduces the challenge of self-organised origami in the following manner:

The controlled folding and unfolding of maps, space structures, wings, leaves, petals, and other foldable laminae is potentially complicated by the independence of individual folds; as their number increases, there is a combinatorial explosion in the number of folded possibilities’

This combinatorial explosion is what makes the search problem in neural network parameter space potentially really hard. But, the potential benefits of self-organised origami is that we will discover structures that are fault-tolerant in principle and it may prove to be a very useful tool for modeling morphogenesis. Structures that are commonly observed in nature must be robust by necessity.

Here are a concrete examples of origami-inspired designs already used by engineers:

1. NASA’s star shade project: a sunflower shaped object that is designed to protect a space telescope. More information in this wonderful interview.
2. James Webb space telescope: The primary mirror is made from segments that can be folded inwards. More information in this BBC article.

The primary reason why space technology makes heavy use of origami structures is due to the rather small diameter of rockets ~5 m relative to the large structures that must be sent into space. Having said this, I think we have barely scratched the potential of self-organised origami from both the perspective of science and engineering.