Convolutional Networks have been hugely successful in the field of deep learning and they are the primary reason why deep learning is so popular right now! They have been very successful, but they have drawbacks in their basic architecture, causing them to not work very well for some tasks.
CNN’s detect features in images and learn how to recognize objects with this information. Layers near the start detecting really simple features like edges and layers that are deeper can detect more complex features like eyes, noses, or an entire face. It then uses all of these features which it has learned, to make a final prediction. Herein lies the flaws of this system — there is no spatial information that is used anywhere in a CNN and the pooling function that is used to connect layers, is really really inefficient.
The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster. Source.
In the process of max pooling, lots of important information is lost because only the most active neurons are chosen to be moved to the next layer. This operation is the reason that valuable spatial information gets lost between the layers. To solve this issue, Hinton proposed that we use a process called “routing-by-agreement”. This means that lower level features (fingers, eyes, mouth) will only get sent to a higher level layer that matches its contents. If the features it contains resemble that of an eye or a mouth, it will get to a “face” or if it contains fingers and a palm, it will get send to “hand”. This complete solution that encodes spatial info into features while also using dynamic routing(routing by agreement) was presented by one of the most influential people in the field of deep learning, Geoffrey Hinton, at NIPS 2017; Capsule Networks.
When we construct objects through rendering in computer graphics, we have to specify and provide some sort of geometric information which tells the computer where to draw the object, the scale of this object, its angle, along with other spatial information. This information is all represented as an object on the screen. But what if we could extract this information just by looking at an object in an image? This is the process which capsule networks are based on, inverse rendering.
Let’s take a look at capsules and how they go about solving the problem of providing spatial information.
When we look at some of the logic that’s behind CNN’s, we begin to notice where it’s architecture fails. Take a look at this picture.
It doesn’t look quite right for a face, even though it has all the necessary components to make up a face. We know that this is not how faces are supposed to look, but because CNN’s only look for features in images, and don’t pay attention to their pose, it’s hard for them to notice a difference between that face and a real face.
How capsule networks solve this problem is by implementing groups of neurons that encode spatial information as well as the probability of an object being present. The length of a capsule vector is the probability of the feature existing in the image and the direction of the vector would represent its pose information.
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. — Source.
In computer graphics applications such as design and rendering, objects are often created by giving some sort of parameter which it will render from. However, in capsules networks, it’s the opposite, where the network learns how to inversely render an image; looking at an image and trying to predict what the instantiation parameters for it are.
It learns how to predict this by trying to reproduce the object it thinks it detected and comparing it to the labelled example from the training data. By doing this it gets better and better at predicting the instantiation parameters. The Dynamic Routing Between Capsulespaper by Geoffrey Hinton proposed the use of two loss functions as opposed to just one.
The main idea behind this is to create equivariance between capsules. This means moving a feature around in an image will also change its vector representation in the capsules, but not the probability of it existing. After lower level capsules detect features, this information is sent up towards higher level capsules that have a good fit with it.
As seen in this picture, all of the pose parameters of the features are used to determine the final result.
As you may already know, a traditional neuron in a neural net performs the following scalar operations:
Weighting of inputs
Sum of weighted inputs
These operations are slightly changed within capsules and are performed as follows:
Matrix multiplication of input vectors with weight matrices. This encodes really important spatial relationships between low-level features and high-level features within the image.
Weighting input vectors. These weights decide which higher level capsule the current capsule will send it’s output to. This is done through a process of dynamic routing, which I’ll talk more about soon.
Sum of weighted input vectors. (Nothing special about this)
Nonlinearity using “squash” function. This function takes a vector and “squashes” it to have a maximum length of 1, and a minimum length of 0 while retaining its direction.
In this process of routing, lower level capsules send its input to higher level capsules that “agree” with its input. For each higher capsule that can be routed to, the lower capsule computes a prediction vector by multiplying its own output by a weight matrix. If the prediction vector has a large scalar product with the output of a possible higher capsule, there is top-down feedback which has the effect of increasing the coupling coefficient for that high-level capsules and decreasing it for others.
The Encoder takes the image input and learns how to represent it as a 16-dimensional vector which contains all the information needed to essentially render the image.
Conv Layer — Detects features that are later analyzed by the capsules. As proposed in the paper, contains 256 kernels of size 9x9x1.
Primary(Lower) Capsule Layer — This layer is the lower level capsule layer which I described previously. It contains 32 different capsules and each capsule applies eighth 9x9x256 convolutional kernels to the output of the previous convolutional layer and produces a 4D vector output.
Digit(Higher) Capsule Layer — This layer is the higher level capsule layer which the Primary Capsules would route to(using dynamic routing). This layer outputs 16D vectors that contain all the instantiation parameters required for rebuilding the object.
The decoder takes the 16D vector from the Digit Capsule and learns how to decode the instantiation parameters given into an image of the object it is detecting. The decoder is used with a Euclidean distance loss function to determine how similar the reconstructed feature is compared to the actual feature that it is being trained from. This makes sure that the Capsules only keep information that will benefit in recognizing digits inside its vectors. The decoder is a really simple feed-forward neural net that is described below.
Fully Connected (Dense) Layer 1
Fully Connected (Dense) Layer 2
Fully Connected (Dense) Layer 3 — Final Output with 10 classes
While CapsNet has achieved state of the art performance on simple datasets such as MNIST, it struggles on more complex data that might be found on datasets such as CIFAR-10 or Imagenet. This is because of the excess amount of information that is found in images throw off the capsules.
Capsule nets are still in a research and development phase and not reliable enough to be used in commercial tasks as there are very few proven results with them. However, the concept is sound and more progress in this area could lead to the standardization of Capsule Nets for deep learning image recognition.
If you enjoyed my article or learned something new, make sure to: