
We are familiar with how faces differ by age, gender, and skin tone. But, as prior studies have shown, these dimensions are not adequate for characterizing the full diversity of human faces. Dimensions such as facial symmetry, facial contrast, pose, and the length or width of facial attributes (eyes, nose, forehead, etc.) are also important. For facial recognition systems to perform as desired, and for their outcomes to become increasingly accurate, training data must be diverse and offer broad coverage. For example, training datasets must be large enough and varied enough that the technology learns all the ways in which faces differ and can accurately recognize those differences in a variety of situations. The images must reflect the distribution of facial features we see in the world.

To help accelerate the study of diversity and coverage of data for AI facial recognition systems, IBM Research has released a large and diverse dataset called Diversity in Faces (DiF), intended to advance the study of fairness and accuracy in facial recognition technology.

1 million images of human faces

The dataset contains 1 million images of human faces drawn from the publicly available YFCC-100M Creative Commons dataset.

Scientifically annotated facial features

The faces are annotated using 10 well-established and independent coding schemes from the scientific literature [1-10]. The coding schemes principally include objective measures of human faces, such as craniofacial features (e.g., head length, nose length, forehead height).

Advancing study of fairness and accuracy

Studying diversity in faces is complex. The dataset provides a jumping-off point for the global research community to further our collective knowledge.

Our initial analysis has shown that the DiF dataset provides a more balanced distribution and broader coverage of facial images than previous datasets. Furthermore, the insights obtained from the statistical analysis of the 10 initial coding schemes on the DiF dataset have furthered our own understanding of what is important for characterizing human faces and enabled us to continue important research into ways to improve facial recognition technology.
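To make the idea of a "balanced distribution" concrete, here is a minimal sketch of one common way to quantify balance for a single annotated feature: Shannon evenness, which is the entropy of the observed categories divided by the maximum possible entropy. This is an illustrative assumption, not necessarily the analysis performed on DiF; the function name `shannon_evenness` and the binned "nose length" values are hypothetical.

```python
import math
from collections import Counter


def shannon_evenness(labels):
    """Return Shannon evenness in [0, 1] for a list of category labels.

    1.0 means every observed category is equally frequent;
    values near 0 mean one category dominates.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        # With fewer than two categories, evenness is not meaningful; report 0.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical binned values of one craniofacial measure (e.g., nose length).
balanced = ["short", "medium", "long"] * 100
skewed = ["short"] * 280 + ["medium"] * 10 + ["long"] * 10

print(f"balanced: {shannon_evenness(balanced):.3f}")
print(f"skewed:   {shannon_evenness(skewed):.3f}")
```

Under this measure, a dataset whose images are spread evenly across the bins of a feature scores close to 1, while a dataset dominated by one bin scores much lower. Comparing such scores feature by feature is one simple way to compare the balance of two datasets.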

The dataset is available today to the global research community upon request. IBM is proud to make it available; our goal is to help further our collective research and contribute to creating AI systems that are fairer.

Step 2

Download and complete the questionnaire.

DiF Questionnaire (PDF)

Step 4

Further instructions will be provided by IBM Research via email once the application is approved.