Facebook today announced that it’s open-sourcing two algorithms capable of spotting identical and nearly identical photos and videos, which it says it actively uses to fight child exploitation, terrorist propaganda, and graphic violence on its platform. The company notes that it’s the first time it’s shared any media-matching technology — technology it hopes industry partners, smaller developers, and nonprofits will employ to more easily identify harmful content.
“When we identify a harmful piece of content … technology can help us find duplicates and prevent them from being shared,” wrote global head of safety Antigone Davis and VP of integrity Guy Rosen in a blog post timed to coincide with Facebook’s fourth annual Child Safety Hackathon. “For those who already use their own or other content matching technology, these technologies are another layer of defense … making the systems that much more powerful.”
Facebook says that the two algorithms in question, PDQ and TMK+PDQF, were designed to operate at “high scale” and were inspired by existing models and implementations, including pHash, Microsoft’s PhotoDNA, aHash, and dHash. The photo-matching PDQ was modeled after pHash (although it was designed from scratch), while the video-recognizing TMK+PDQF was developed jointly by the Facebook Artificial Intelligence Research team and academics from the University of Modena and Reggio Emilia in Italy.
Both efficiently store files as short digital hashes — unique identifiers — that help to determine whether two files are the same or similar, even without the original image or video. Facebook points out that these hashes can be easily shared among companies and nonprofits, as well as with industry partners, through the Global Internet Forum to Counter Terrorism (GIFCT), so they can also take down the same content if it’s uploaded to their services.
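To make the idea of comparing files by hash more concrete, here is a minimal Python sketch of one of the simpler techniques named above, dHash (difference hashing). It is not PDQ or TMK+PDQF and does not reflect Facebook's implementation. The function names, the tiny hand-made grids standing in for downscaled grayscale images, and the match threshold of 2 bits are all assumptions chosen for illustration.

```python
def dhash(pixels):
    """Difference hash of a grayscale image given as rows of brightness values.

    Each bit records whether a pixel is brighter than its right-hand
    neighbor, so the hash reflects gradients rather than exact values.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_match(hash_a, hash_b, threshold=2):
    """Treat two hashes as the same content if few bits differ.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return hamming_distance(hash_a, hash_b) <= threshold


# Hypothetical 4x5 grids standing in for downscaled grayscale images.
original = [
    [10, 20, 30, 40, 50],
    [50, 40, 30, 20, 10],
    [10, 50, 10, 50, 10],
    [90, 80, 85, 70, 75],
]

# A brightened copy with one pixel altered: a "nearly identical" image.
near_duplicate = [[p + 5 for p in row] for row in original]
near_duplicate[3][1] = 92

# A horizontally mirrored copy, used here as a clearly different image.
different = [list(reversed(row)) for row in original]

h_original = dhash(original)
h_near = dhash(near_duplicate)
h_different = dhash(different)

print(hamming_distance(h_original, h_near), is_match(h_original, h_near))
print(hamming_distance(h_original, h_different), is_match(h_original, h_different))
```

In this sketch only the 16-bit hashes are compared, which is why parties could, in principle, exchange hashes rather than the underlying photos. Production systems use much longer hashes and carefully tuned thresholds.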
“We designed these technologies based on our experience detecting abuse across billions of posts on Facebook,” wrote Davis and Rosen. “We hope that by contributing back to the community we’ll enable more companies to keep their services safe and empower non-profits that work in the space.”
Facebook’s contributions of PDQ and TMK+PDQF follow Microsoft’s release of the aforementioned PhotoDNA 10 years ago, an effort to fight child exploitation. More recently, Google launched Content Safety API, an AI platform designed to identify online child sexual abuse material and reduce human reviewers’ exposure to the content.
Facebook CEO Mark Zuckerberg often asserts that AI will substantially cut down on the amount of abuse perpetrated by millions of ill-meaning Facebook users. A concrete example of this in production is a “nearest neighbor” algorithm that’s 8.5 times faster at spotting illicit photos than the previous version, which complements a system that learns a deep graph embedding of all the nodes in Facebook’s Graph — the collection of data, stories, ads, and photos on the network — to find abusive accounts and pages that might be related to each other.
In Facebook’s Community Standards Enforcement Report published in May, the company reported that AI and machine learning helped cut down on abusive posts in six of the nine content categories. Concretely, Facebook said it proactively detected 96.8% of the content it took action on before a human spotted it (compared with 96.2% in Q4 2018), and for hate speech, it said it now identifies 65% of the more than four million hate speech posts removed from Facebook each quarter, up from 24% just over a year ago and 59% in Q4 2018.
Those and other algorithmic improvements contributed to a decrease in the overall amount of illicit content viewed on Facebook, according to the company. It estimated in the report that for every 10,000 times people viewed content on its network, only 11 to 14 views contained adult nudity and sexual activity, while 25 contained violence. With respect to terrorism, child nudity, and sexual exploitation, those numbers were far lower: Facebook said that in Q1 2019, for every 10,000 times people viewed content on the social network, fewer than three views contained content that violated each of those policies.