Which face detection API is the best? Let’s have a look into Amazon Rekognition, Google Cloud Vision API, IBM Watson Visual Recognition and Microsoft Face API.
Did you ever had the need for face detection?
Maybe to improve image cropping, ensure that a profile picture really contains a face or maybe to simply find images from your dataset containing people (well, faces in this case).
Which face detection SaaS vendor would be the best for your project? Let’s have a deeper look into the differences in success rates, pricing and speed.
In this blog post I'll be analyzing the face detection API's of:
How does face detection work anyway?
Before we dive into our analysis of the different solutions, let’s understand how face detection works today in the first place.
The Viola–Jones Face Detection
It’s the year 2001. Wikipedia is being launched by Jimmy Wales and Larry Sanger, the Netherlands becomes the first country in the world to make same-sex marriage legal and the world witnesses one of the most tragic terror attacks ever.
At the same time two bright minds, Paul Viola and Michael Jone, come together to start a revolution in computer vision.
Until 2001, face detection was something which didn’t work very precise nor very fast. That was, until the Viola-Jones Face Detection Framework was proposed which not only had a high success rate in detecting faces but could do it also in real time.
While face and object recognition challenges existed since the 90’s, they surely boomed even more after the Viola–Jones paper was released.
Deep Convolutional Neural Networks
One of such challenges is the ImageNet Large Scale Visual Recognition Challenge which exists since 2010. While in the first two years the top teams were working mostly with a combination of Fisher Vectors and Support vector machines, 2012 changed everything.
The team of the University of Toronto (consisting of Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) used for the first time a deep convolutional neural network for object detection. They scored first place with an error rate of 15.4% while the second placed team had a 26.2% error rate!
A year later, in 2013, every team in the top 5 was using a deep convolutional neural network.
So, how does such a network work?
An easy-to-understand video was published by Google earlier this year:
What do Amazon, Google, IBM and Microsoft use today?
Since then, not much changed. Today’s vendors still use Deep Convolutional Neural Networks, probably combined with other Deep Learning techniques though.
Obviously, they don’t publish how their visual recognition techniques exactly work. The information I found was:
While they all sound very similar, there are some differences in the results.
Before we test them, let’s have a look at the pricing models first though!
Amazon, Google and Microsoft have a similar pricing model, meaning that with increasing usage the price per detection drops.
With IBM however, you always pay the same price per API call after your free tier usage volume is exhausted.
Microsoft provides you the best free tier, allowing you to process 30'000 images per month for free.
If you need more detections though, you need to use their standard tier where you start paying from the first image on.
That being said, let’s calculate the costs for three different profile types.
- Profile A: Small startup/business processing 1’000 images per month
- Profile B: Digital vendor with lots of images, processing 100’000 images per month
- Profile C: Data center processing 10’000’000 images per month
|Profile A||$1.00 USD||Free||Free||Free|
|Profile B||$100.00 USD||$148.50 USD||$396.00 USD||$100.00 USD|
|Profile C||$8’200.00 USD||$10’498.50 USD||$39’996.00 USD||7’200.00 USD|
Looking at the numbers, for small customers there’s not much of a difference in pricing. While Amazon charges you starting from the first image, having 1’000 images processed still only costs one Dollar. However, if you don’t want to pay anything, then Google, IBM or Microsoft will be your vendor to go.
Note: Amazon offers a free tier on which you can process 5’000 images per month for the first 12 months for free! However, after this 12 month trial, you’ll have to start paying starting with the first image.
Large API usage
If you really need to process millions of images, it's important to compare how every vendor scales.
Here's a list of the minimum price you pay for the API usage after a certain amount of images.
- IBM constantly charges you $4.00 USD per 1’000 images (no scaling)
- Google scales down to $0.60 USD (per 1’000 images) after the 5’000’000th image
- Amazon scales down to $0.40 USD (per 1’000 images) after the 100’000’000th image
- Microsoft scales down to $0.40 USD (per 1’000 images) after the 100’000’000th image
So, comparing prices, Microsoft (and Amazon) seem to be the winner.
But can they also score in success rate, speed and integration? Let’s find out!
Hands on! Let’s try out the different API’s
Enough theory and numbers, let’s dive into coding! You can find all code used here in my GitHub repository.
Setting up our image dataset
First things first. Before we scan images for faces, let’s set up our image dataset.
For this blog post I’ve downloaded 33 images from pexels.com, many thanks to the contributors/photographers of the images and also to Pexels!
The images have been committed to the GitHub repository, so you don't need to search for any images if you simply want to start playing with the API's.
Writing a basic test framework
Framework might be the wrong word as my custom code only consists of two classes. However, these two classes help me to easily analyze image (meta-) data and have as few code as possibly in the different implementations.
Comparing the vendors SDK’s
As I’m most familiar with PHP, I've decided to stick to PHP for this test. I still want to point out what SDK’s each vendor provides (as of today):
Note: Microsoft doesn't actually provide any SDK's, they do offer code examples for the technologies listed above though.
If you’ve read the lists carefully, you might have noticed that IBM does not only offer the least amount of SDK’s but also no SDK for PHP.
However, that wasn’t a big issue for me as they provide cURL examples which helped me to easily write 37 lines of code for a (very basic) IBM Visual Recognition client class.
Integrating the vendors API’s
Getting the SDK's is easy. Even easier with Composer. However, I did notice some things that could be improved to make a developer’s life easier.
I've started with the Amazon Rekognition API. Going through their documentation, I really felt a bit lost at the beginning. Not only did I miss some basic examples (or wasn’t able to find them?), but also I had the feeling that I have to click a few times until I was able to find what I was looking for. In one case I even gave up and simply got the information by directly inspecting their SDK source code.
On the other hand, it could just be me? Let me know if Amazon Rekognition was easy (or difficult) for you to integrate!
Note: While Google and IBM return the bounding boxes coordinates, Amazon returns the coordinates as ratio of the overall image width/height. I have no idea why that is, but it's not a big deal. You can write a helper function to get the coordinates from the ratio, just as I did.
Next came Google. In comparison with Amazon, they do provide examples, which helped me a lot! Or maybe I was just already in the “investing different SDK’s” mindset.
Whatever the case may be, integrating the SDK felt a lot simpler and also I had to spend less clicks to retrieve information I was looking for.
As stated before, IBM doesn’t (yet?) provide a SDK for PHP. However, with the provided cURL examples, I had a custom client set up in no time. There’s not much that you can do wrong if a cURL example is provided to you!
Looking at Microsoft's code example for PHP (which uses Pear's HTTP_Request2 package), I ended up writing my own client for Microsoft's Face API.
I guess I'm simply a cURL person.
Before we compare the different face detection API’s, let's scan the images first by ourselves! How many faces would a human be able to detect?
If you already had a look on my dataset, you might have seen a few images containing tricky faces. What do I mean by "tricky"? Well, when you e.g. only see a small part of a face and/or the face is in an uncommon angle.
Time for a little experiment
I went over all images and wrote down how many faces I thought I've detected. I would use this number to calculate a vendor's sucess rate for an image and see if it was able to detect as many faces as I did. However, setting the expected number of faces detected solely by me seemed a bit too biased to me. I needed more opinions.
This is when I kindly asked three coworkers to go through my images and tell me how many faces they would detect.
The only task I gave them was "Tell me how many faces, and not heads, you're able to detect". I didn't define any rules, I wanted to give them any imaginable freedom for doing this task.