Comparison of the best NSFW Image Moderation APIs 2018

By Aditya Ananthram

A comprehensive benchmark of multiple Image content filtering API providers across different categories like Nudity, Pornography and Gore.

A human being can instinctively decide whether what they are seeing is inappropriate. This problem, however, is far from solved when it comes to building an all-seeing AI that can decide whether an image or video is inappropriate (Not Safe For Work). Many companies are now racing to be the best at applying automated techniques to identify whether a piece of media is safe to disseminate or should be purged from existence.

I wanted to understand for myself what the state of the art is in automated detection of universally considered NSFW content. I will be basing the comparison on the APIs' performance in the following categories:

  • Explicit Nudity
  • Suggestive Nudity
  • Porn/sexual act
  • Simulated/Animated porn
  • Gore/Violence

TL;DR: if you are just interested in finding out which is the best API out there, you can skip right to the overall comparison at the end of the post.

Dataset: For the evaluation I created a custom NSFW dataset with equal weightage given to each NSFW subcategory. The dataset consists of 120 images: 20 NSFW-positive images for each of the five categories outlined above and 20 SFW images. I decided against using the open-source YACVID 180-image dataset as it relies primarily on nudity as a measure of NSFW content.

Collecting NSFW images is a tedious, time-consuming and downright painful task, hence the low image count.

The dataset has been open-sourced and is available for download here. [WARNING: Contains Explicit Content]

Here is a sheet containing the raw predictions of the APIs on each image in the dataset.
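Turning that sheet of raw scores into per-category detection rates is straightforward. Below is a minimal sketch, assuming the sheet is exported as a CSV with an image name, a ground-truth category, and one NSFW confidence column per provider (all column names here are hypothetical, not the sheet's actual headers):

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical export of the predictions sheet: one row per image with the
# ground-truth category and each provider's NSFW confidence score (0 to 1).
SAMPLE = """image,category,provider_a,provider_b
Porn1,porn,0.98,0.40
Porn2,porn,0.91,0.77
SFW1,sfw,0.05,0.60
"""

def detection_rates(csv_text, providers, threshold=0.5):
    """Per-provider, per-category fraction of NSFW images scored above threshold."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in csv.DictReader(StringIO(csv_text)):
        if row["category"] == "sfw":
            continue  # detection rate is defined over NSFW images only
        for p in providers:
            totals[(p, row["category"])] += 1
            if float(row[p]) >= threshold:
                hits[(p, row["category"])] += 1
    return {key: hits[key] / totals[key] for key in totals}

rates = detection_rates(SAMPLE, ["provider_a", "provider_b"])
# provider_a catches both porn images; provider_b misses the 0.40-scored one
```

The same loop extends naturally to all five categories and twelve providers once the full sheet is loaded.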

Each of the classifiers is evaluated on the universally accepted metrics:

  • True Positive (TP): the classifier called something NSFW and it was actually NSFW
  • True Negative (TN): the classifier called something SFW and it was actually SFW
  • False Positive (FP): the classifier called something NSFW and it was actually SFW
  • False Negative (FN): the classifier called something SFW and it was actually NSFW
  • Accuracy: if the model makes a prediction, can you trust it?
  • Precision: if the model says an image is NSFW, how often is it right?
  • Recall: of all the NSFW images, how many does it identify?
  • F1 Score: the harmonic mean of Precision and Recall, often similar to Accuracy
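All four headline metrics fall out directly from the confusion-matrix counts. A minimal sketch (the example counts are illustrative only, not numbers from this benchmark):

```python
def metrics(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # when it says NSFW, how often is it right?
    recall = tp / (tp + fn)     # of all NSFW images, how many did it find?
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, accuracy, f1

# Illustrative counts: 90 of 100 NSFW images caught, 15 of 20 SFW passed.
p, r, a, f = metrics(tp=90, tn=15, fp=5, fn=10)
```

Note how a provider that flags everything as NSFW maximizes recall but pays for it in precision, which is exactly the behavior the F1 score penalizes.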

I first evaluated each of the APIs by category to see how they perform at detecting each of the different types of NSFW content.

The Google and Sightengine APIs really shine here, being the only ones able to detect all the pornographic images correctly. Nanonets and Algorithmia are a close second, correctly classifying 90% of all pornographic images. Microsoft and Imagga have the worst performance in this category.

Links to original images: Porn19, Porn7, Porn18, Porn14

The images that are easy to identify are explicitly pornographic. All the providers got the images above correct, and most predicted NSFW with very high confidence.

Links to original images: Porn6, Porn2, Porn10, Porn3

The images that were difficult to identify suffered from occlusion or blurring. In the worst case, 11 of 12 vendors got the image wrong. Performance on pornography has high variance depending on its intensity and how clearly visible the pornographic content is.

Most of the APIs performed remarkably well in this category, with many having a 100% detection rate. Even the lowest-performing APIs (Clarifai and Algorithmia) had a 90% detection rate here. The definition of what is considered nudity has always been subject to debate, and as is clear from the difficult-to-identify images, the APIs mostly fail in cases where one could argue the image is SFW.

Links to original images: Nudity9, Nudity8, Nudity18, Nudity4

The images that are easy to identify had clear, explicit nudity. These would be called NSFW by anybody without a difference of opinion. None of the providers made an error, and the average scores were all 0.99.


The images the providers got wrong were the ones subject to debate. This could simply be because each provider has a different sensitivity setting for nudity.
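That sensitivity difference is easy to reason about if you treat each provider's output as a raw confidence score and vary the cutoff yourself. A small sketch with made-up scores (none of these are real API outputs):

```python
# Hypothetical per-image NSFW confidence scores from a single provider.
scores = {"beach_photo": 0.55, "classical_art": 0.48, "explicit": 0.99}

def flagged(scores, threshold):
    """Names of images the provider would call NSFW at a given cutoff."""
    return sorted(name for name, s in scores.items() if s >= threshold)

strict = flagged(scores, 0.4)   # conservative cutoff flags all three images
lenient = flagged(scores, 0.8)  # permissive cutoff keeps only the clear case
```

Two providers can produce nearly identical scores and still disagree on every borderline image purely because of where their internal cutoffs sit.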

Google once again leads the pack with a 100% detection rate for this category. Sightengine and Nanonets perform better than the rest, with detection rates of 95% and 90% respectively. Suggestive nudity is almost as easy for a machine to identify as explicit nudity, but mistakes happen on images that look like normal SFW images yet have some aspects of nudity.

Links to original images: Suggestive13, Suggestive10, Suggestive2, Suggestive8

Once again none of the providers got the easy to identify images wrong. These images were all clearly NSFW.

Links to original images: Suggestive17, Suggestive12, Suggestive11, Suggestive5

On suggestive nudity the providers were split. As with explicit nudity, they all had different thresholds for what is tolerable. I personally am unsure whether these images should be considered SFW or not.

The APIs performed exceptionally well here: all but Imagga, which missed one image, detected 100% of the simulated porn examples accurately. It's interesting that almost all the providers perform very well, which suggests these algorithms find artificially generated images easier to identify than naturally occurring ones.

Links to original images: SimulatedPorn1, SimulatedPorn16, SimulatedPorn19, SimulatedPorn9

All the providers have perfect scores and high confidence scores.

Links to original images: SimulatedPorn15

The one image that Imagga got wrong could be construed as not porn if you didn't look long enough.

This was one of the most difficult categories, with an average detection rate across APIs of less than 50%. Clarifai and Sightengine outperform their competitors here, identifying 100% of the gore images.

Links to original images: Gore7, Gore9, Gore17, Gore18

The images on which all the providers scored with high confidence were medical images, probably because such images are easier to find. However, even for the best-performing images, 4 of 12 providers got them wrong.

Links to original images: Gore2, Gore3, Gore6, Gore10

There was no discernible pattern in the images that were difficult to predict, even though humans would find it very easy to identify any of these images as gory. This probably means the poor performance comes down to a lack of available training data.

Safe-for-work images are the ones that should not be flagged as NSFW. Collecting a good SFW dataset is itself difficult: the images should be close to NSFW to get a sense of how sensitive these providers are, and a lot of debate can go into whether all of them are truly SFW. Here Sightengine and Google are the worst performers, which partly explains their great performance across the other categories: they basically call anything and everything NSFW. Imagga does well here because it calls almost nothing NSFW. X-Moderator also does very well.

Links to original images: SFW15, SFW12, SFW6, SFW4

The easy-to-identify images had very little skin showing and would be very easy for a human to classify as SFW. Only one or two providers got these images wrong.

Links to original images: SFW17, SFW18, SFW10, SFW3

The difficult-to-identify SFW images all had a larger amount of skin showing or were anime (there is a strong bias towards treating anime as porn). Most of the providers flagged the images with a high amount of skin showing as NSFW, which begs the question of whether these are truly SFW.

Looking at the performance of the APIs across all the NSFW categories, as well as their ability to correctly identify safe-for-work (SFW) content, I found that Nanonets has the best F1 score and average accuracy and thus performs consistently well across all categories. Google, which does exceptionally well at detecting the NSFW categories, marks too many SFW pieces of content as NSFW and thus gets penalized in its F1 score.

This diagram gives us a sense of each provider's biases and how sensitive they are. Higher values for TP and TN are better; values for FP and FN should be as small as possible.
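For reference, the four counts in the diagram can be tallied from parallel lists of predicted and ground-truth labels, with True meaning NSFW. A minimal sketch:

```python
def confusion_counts(preds, truths):
    """Tally TP, TN, FP, FN from parallel lists of booleans (True = NSFW)."""
    tp = sum(p and t for p, t in zip(preds, truths))
    tn = sum(not p and not t for p, t in zip(preds, truths))
    fp = sum(p and not t for p, t in zip(preds, truths))
    fn = sum(not p and t for p, t in zip(preds, truths))
    return tp, tn, fp, fn

# Toy example: one of each outcome.
tp, tn, fp, fn = confusion_counts(
    preds=[True, True, False, False],
    truths=[True, False, True, False],
)
```

Running this per provider over the full 120-image dataset yields the bars shown in the diagram.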

I compared the top 5 providers by accuracy and F1 Score to showcase the differences in their performance. The larger the area of the radar chart the better.

Nanonets does not perform the best in any one category, but it is the most balanced overall, doing well in every category. Where it could improve is in identifying more images as SFW; it is oversensitive to any skin.

Google performs the best in most NSFW categories but the worst at detecting SFW. One point to note is that the images I used were found via Google, which means Google "should know" what they were anyway. This might be the reason for its really good performance in most categories.

Clarifai really shines at identifying Gore, doing better than most other APIs. It is again well balanced and does well in most categories, but it lags at identifying Suggestive Nudity and Porn.

Sightengine, like Google, has an almost perfect score at identifying NSFW content. However, it failed to correctly identify a single SFW image.

X-Moderator is another well-balanced API. Apart from Gore, it identifies most types of NSFW content well, and its 100% accuracy on SFW sets it apart from its competitors.

Another criterion for deciding which API to go with is pricing. Below is a comparison of each vendor's pricing. Most of the APIs have a free trial with limited usage. Yahoo is the only one that is completely free to use, but it is self-hosted and hence not included in this table.

Amazon, Microsoft, Nanonets and DeepAI all come in lowest at $1k a month for 1M API calls.

The subjective nature of NSFW content makes it difficult to declare any one API as the go-to API for content moderation.

A general social media application geared towards content distribution, wanting a balanced classifier, would prefer the Nanonets API, as shown by its classifier's highest F1 score.

An application targeted towards kids would definitely err on the side of caution and prefer to hide even marginally inappropriate content; it would therefore prefer the Google API, with its exemplary performance on all NSFW categories, at the risk of filtering out some appropriate content as well. The trade-off is losing a lot of SFW content that Google might declare NSFW.

One key thing I realized after spending a considerable amount of time on this problem is that what really counts as NSFW is very unclear. Every person has their own definition, and what you think is okay for your service to show your users depends heavily on what the service provides. Partial nudity might be okay in a dating app but not blood, while in a medical journal the reverse is true. The truly gray area is suggestive nudity, where it's impossible to get a single right answer.