Is it possible to implement object detection models with real-time performance without a GPU?
faced is a proof of concept that it is possible to build a custom object detection model for a single class object (in this case, faces) running in real time on a CPU.
There are many scenarios where a single class object detection is needed. This means that we want to detect the location of all objects that belong to a specific class in an image. For example, we could be detecting faces for a face identification system or people for pedestrian tracking.
What is more, most of the time we would like to run these models in real time. If a feed provides images at a rate of x frames per second, the model needs to process each image in less than 1/x seconds. Then, we can process images as soon as they become available.
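The real-time condition above comes down to simple arithmetic. Here is a minimal sketch of it; the function names and the example numbers are hypothetical and only serve as illustration.

```python
def per_frame_budget_ms(feed_fps: float) -> float:
    """Maximum time available to process one frame, in milliseconds."""
    if feed_fps <= 0:
        raise ValueError("feed_fps must be positive")
    return 1000.0 / feed_fps


def keeps_up(inference_ms: float, feed_fps: float) -> bool:
    """True if the model finishes a frame before the next one arrives."""
    return inference_ms < per_frame_budget_ms(feed_fps)


# Illustrative numbers only: a 25 FPS feed leaves 40 ms per frame.
print(per_frame_budget_ms(25))  # 40.0
print(keeps_up(30.0, 25))       # True
print(keeps_up(50.0, 25))       # False
```

A model that misses this budget has to skip frames or fall behind the feed.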
The most accessible and widely used solution nowadays for this task (and many others in computer vision) is to perform transfer learning on previously trained models (in general, standard models trained on huge datasets, like those found in TensorFlow Hub or in the TF Object Detection API).
There are plenty of trained object detection architectures (e.g. FasterRCNN, SSD or YOLO) that achieve impressive accuracy within real-time performance running on GPUs.
GPUs are expensive but necessary in the training phase. However, at inference time, having a dedicated GPU just to achieve real-time performance is often not viable. All of the general object detection models (such as those mentioned above) fail to run in real time without a GPU.
Then, how can we revisit the object detection problem for single class objects to achieve real-time performance but on CPU?
All of the above-mentioned architectures were designed to detect multiple object classes (they are trained on the COCO or PASCAL VOC datasets). In order to classify each bounding box into its appropriate class, these architectures require a massive amount of feature extraction. This translates to a huge number of learnable parameters, filters and layers. In other words, these networks are big.
If we define a simpler task (rather than multiple-class bounding box classification), we can expect the network to need fewer features to perform it. Detecting a face in an image is simpler than detecting cars, people, traffic signs and dogs (all within the same model). The number of features a Deep Learning model needs in order to recognize faces (or any single class of object) will be smaller than the number needed to detect tens of classes at the same time. The first task simply requires less information than the second.
Single class object detection models need fewer learnable features. Fewer features mean fewer parameters, so the network is smaller. Smaller networks run faster because they require fewer computations.
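To make the link between filters and parameters concrete, here is a small sketch using the standard parameter count of a 2D convolutional layer (kernel height x kernel width x input channels x output channels, plus one bias per output channel). The 3x3 kernel size and the channel counts are illustrative assumptions, not values taken from faced or YOLO.

```python
def conv2d_params(kernel_size: int, in_channels: int, out_channels: int,
                  bias: bool = True) -> int:
    """Number of learnable parameters in a square-kernel 2D convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical layers: a narrow one and a wide one, both with 3x3 kernels.
narrow = conv2d_params(3, 128, 192)
wide = conv2d_params(3, 512, 1024)

print(narrow)  # 221376
print(wide)    # 4719616
```

The count grows with the product of input and output channels, so reducing the number of filters per layer shrinks the network quickly.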
Then, the question is: how small can we go to achieve real-time performance on a CPU while keeping accuracy?
faced’s main concept: build the smallest possible network that can (hopefully) run in real time on a CPU while keeping accuracy.
faced is an ensemble of 2 neural networks, both implemented using TensorFlow.
faced’s main architecture is heavily based on YOLO’s architecture. Basically, it’s a Fully Convolutional Network (FCN) that runs a 288x288 input image through a series of convolutional and pooling layers (no other layer types are involved).
Convolutional layers are in charge of extracting space-aware features. Pooling layers increase the receptive field of subsequent convolutional layers.
The architecture’s output is a 9x9 grid (versus a 13x13 grid in YOLO). Each grid cell is in charge of predicting whether a face is inside that cell (versus YOLO, where each cell can detect up to 5 different objects).
Each grid cell has 5 associated values. The first one is the probability p of that cell containing the center of a face. The other 4 values are the (x_center, y_center, width, height) of the detected face (relative to the cell).
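As a rough illustration of how one cell's 5 values could be turned into a box in input-image pixels, consider the sketch below. The exact encoding used by faced is not spelled out here, so this assumes that x_center and y_center are offsets from the cell's top-left corner measured in cell units, and that width and height are also measured in cell units. The function name `decode_cell` is hypothetical.

```python
GRID_SIZE = 9                        # output grid is 9x9
INPUT_SIZE = 288                     # network input is 288x288
CELL_SIZE = INPUT_SIZE / GRID_SIZE   # 32 pixels per cell


def decode_cell(row, col, x_center, y_center, width, height):
    """Convert one cell's prediction to (x_min, y_min, x_max, y_max) in pixels.

    Assumption: all four values are expressed in units of one cell.
    """
    cx = (col + x_center) * CELL_SIZE
    cy = (row + y_center) * CELL_SIZE
    w = width * CELL_SIZE
    h = height * CELL_SIZE
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A face centered in the middle cell, two cells wide and two cells tall.
print(decode_cell(4, 4, 0.5, 0.5, 2.0, 2.0))  # (112.0, 112.0, 176.0, 176.0)
```

If the network encodes width and height relative to the whole image instead, only the two lines computing `w` and `h` would change.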
The exact architecture is defined as follows:
- 2x [8 filter convolutional layer on 288x288 image]
- Max pooling (288x288 to 144x144 feature map)
- 2x [16 filter convolutional layer on 144x144 feature map]
- Max pooling (144x144 to 72x72 feature map)
- 2x [32 filter convolutional layer on 72x72 feature map]
- Max pooling (72x72 to 36x36 feature map)
- 2x [64 filter convolutional layer on 36x36 feature map]
- Max pooling (36x36 to 18x18 feature map)
- 2x [128 filter convolutional layer on 18x18 feature map]
- Max pooling (18x18 to 9x9 feature map)
- 4x [192 filter convolutional layer on 9x9 feature map]
- 5 filter convolutional layer on 9x9 feature map for the final grid
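The spatial sizes in the list above follow from halving the feature map at each of the five max pooling steps. The sketch below traces that; it assumes the convolutional layers preserve the spatial size, which is what the list implies.

```python
from typing import List


def trace_feature_map(input_size: int, num_poolings: int, pool: int = 2) -> List[int]:
    """Return the feature map side length after each pooling step."""
    sizes = [input_size]
    for _ in range(num_poolings):
        if sizes[-1] % pool != 0:
            raise ValueError("feature map size is not divisible by the pool size")
        sizes.append(sizes[-1] // pool)
    return sizes


print(trace_feature_map(288, 5))  # [288, 144, 72, 36, 18, 9]
```

This is also why a 288x288 input yields a 9x9 grid: 288 / 2^5 = 9.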
All activation function are
faced has 6,993,517 parameters. YOLOv2 has 51,000,657 parameters. Its size is about 13.7% of YOLO’s size!
The (x_center, y_center, width, height) outputs of the main network were not as accurate as expected. Hence, a small CNN was implemented that takes as input a small image containing a face (cropped using the main architecture’s outputs) and outputs a regression on the ground truth bounding box of the face.
Its only task is to complement and improve the output coordinates of the main architecture.
The specific architecture of this network is not relevant.
Both networks were trained on the WIDER FACE dataset.
“WIDER FACE dataset is a face detection benchmark dataset […]. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion as depicted in the sample images.”
Training was done on an Nvidia Titan XP GPU and took ~20 hours. Batch normalization was used to help convergence, and dropout (at a 40% rate) was used as a regularization method to avoid overfitting.
At inference time, faced first resizes the image to 288x288 so it can be fed into the network. The image then goes through the FCN, which produces the 9x9 grid output described above.
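Because detection happens on the resized 288x288 image, any box found there has to be scaled back to be drawn on the original image. The sketch below shows one way to do that, assuming the resize is a plain stretch that does not preserve aspect ratio. Both that assumption and the helper name `to_original_coords` are illustrative and not taken from the faced code.

```python
INPUT_SIZE = 288  # side length of the network input


def to_original_coords(box, original_width, original_height):
    """Scale (x_min, y_min, x_max, y_max) from 288x288 space to the original image."""
    x_min, y_min, x_max, y_max = box
    scale_x = original_width / INPUT_SIZE
    scale_y = original_height / INPUT_SIZE
    return (x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y)


# A box in network coordinates mapped onto a 1152x576 original image.
print(to_original_coords((112, 112, 176, 176), 1152, 576))
# (448.0, 224.0, 704.0, 352.0)
```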
Each cell has a probability p of containing the center of a face. Cells are filtered by a configurable threshold t (i.e. only cells with p > t are kept). For the cells that are kept, the face is located using the cell’s (x_center, y_center, width, height).
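The filtering step can be sketched in a few lines. The grid layout used here (a 9x9 nested list of 5-value tuples) and the function name are assumptions made for the example; the real network output would be a tensor.

```python
from typing import List, Tuple

# One cell: (p, x_center, y_center, width, height)
Cell = Tuple[float, float, float, float, float]


def keep_confident_cells(grid: List[List[Cell]],
                         threshold: float) -> List[Tuple[int, int, Cell]]:
    """Return (row, col, cell) for every cell whose probability p exceeds the threshold."""
    kept = []
    for row, cells in enumerate(grid):
        for col, cell in enumerate(cells):
            if cell[0] > threshold:
                kept.append((row, col, cell))
    return kept


# Toy 9x9 grid: one confident cell and one weak cell, everything else empty.
empty = (0.0, 0.0, 0.0, 0.0, 0.0)
grid = [[empty for _ in range(9)] for _ in range(9)]
grid[2][3] = (0.9, 0.5, 0.5, 1.0, 1.0)
grid[4][4] = (0.4, 0.5, 0.5, 1.0, 1.0)

print(keep_confident_cells(grid, threshold=0.5))
# [(2, 3, (0.9, 0.5, 0.5, 1.0, 1.0))]
```

Raising the threshold trades missed faces for fewer false positives, which is why it is left configurable.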
There are some cases where multiple cells compete for the same face. Suppose that a face center is located at the exact point where 4 cells meet. Those 4 cells could all have a high p (probability of containing a face center inside the cell). If we kept all of those cells and projected the face coordinates of each one, we would see the same face with 4 similar bounding boxes around it. This problem is fixed through a technique called non-max suppression. The result is shown in the following image:
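For readers unfamiliar with non-max suppression, here is a minimal sketch of the common greedy variant based on intersection over union (IoU). The variant and IoU threshold used by faced are not stated here, so the 0.3 default and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.3):
    """Greedy NMS over a list of (p, box) pairs.

    Boxes are visited from highest to lowest p; a box is kept only if it
    does not overlap an already kept box by more than iou_threshold.
    """
    kept = []
    for candidate in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(candidate[1], k[1]) <= iou_threshold for k in kept):
            kept.append(candidate)
    return kept


# Four near-identical boxes around one face, plus one separate face.
detections = [
    (0.90, (100, 100, 160, 160)),
    (0.80, (102, 98, 162, 158)),
    (0.70, (98, 102, 158, 162)),
    (0.85, (101, 101, 161, 161)),
    (0.60, (200, 200, 240, 240)),
]
print(non_max_suppression(detections))
# [(0.9, (100, 100, 160, 160)), (0.6, (200, 200, 240, 240))]
```

In the 4-cell scenario above, only the highest-probability box survives for each face.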
faced is able to achieve the following speed on inference:
Pretty good considering that YOLOv2 cannot achieve even 1 FPS on a 2015 MBP with an i5.
Let’s see some results!
Now let’s see a comparison between faced and Haar Cascades, a traditional computer vision approach that does not use Deep Learning. Both methods run at similar speeds.
faced shows significantly higher accuracy.
faced is a really simple program that can be used either embedded in Python code or as a command line program.
Go to the GitHub repo for further instructions:
Liked the project? Leave a ⭐ on the project’s repo!
faced is a proof of concept that you don’t always need to rely on general-purpose trained models in scenarios where these models are overkill for your problem and performance is a concern. Don’t underestimate the power of spending time designing custom neural network architectures that are specific to your problem. These specific networks can be a much better solution than the general ones.