

In total, once clustered and optimized, MF2 contains 4,753,320 faces and 672,057 identities. On average this is 7.07 photos per identity, with a minimum of 3 photos per identity and a maximum of 2,469. We expanded the tightly cropped version by re-downloading the clustered faces and saving a loosely cropped version. The tightly cropped dataset requires 159GB of space, while the loosely cropped dataset is split into 14 files, each requiring 65GB, for a total of 910GB.

To gather statistics on age and gender, we ran the WIKI-IMDB models for age and gender detection over the loosely cropped version of the data set. We found that females accounted for 41.1% of subjects while males accounted for 58.8%. The median gender variance within identities was 0. Within identities, the average age range was 16.1 years and the median was 12 years. The distributions can be found in the supplementary material.

A trade-off of this algorithm is that its parameters must strike a balance between noise and quantity of data. The VGG-Face work noted that, given the choice between a larger, more impure data set and a smaller, hand-cleaned data set, the larger one can actually give better performance. A strong reason for opting to remove most faces from the initial unlabeled corpus was detection error: we found that many images were actually non-faces. There were also many identities that did not appear more than once, and these would not be as useful for learning algorithms. In a visual inspection of 50 randomly sampled faces that the algorithm discarded, 14 were non-faces and 36 did not appear more than twice in their respective Flickr accounts. In a complete audit of the clustering algorithm, the reasons for discarding faces were as follows: 69% were faces in identities below the threshold of 3 photos per identity; 4% were faces removed from clusters as impurities; and 27% were faces in clusters that remained impure even after purification.
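The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only recomputes numbers that appear in the text; the variable names are chosen here for illustration.

```python
# Figures quoted in the text above.
num_faces = 4_753_320
num_identities = 672_057

# Average number of photos per identity.
avg_photos_per_identity = num_faces / num_identities

# Loosely cropped release: 14 files of 65GB each.
loose_total_gb = 14 * 65

# Audit of discarded faces: the three reported shares, in percent.
discard_shares = {
    "below the 3-photo identity threshold": 69,
    "removed from clusters as impurities": 4,
    "in clusters still impure after purification": 27,
}

print(f"average photos per identity: {avg_photos_per_identity:.2f}")
print(f"loosely cropped total: {loose_total_gb}GB")
print(f"discard shares sum to {sum(discard_shares.values())}%")
```

The reported gender shares (41.1% and 58.8%) sum to 99.9%, which is consistent with rounding.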

Data Collection

To create a data set that includes hundreds of thousands of identities, we utilize the massive collection of Creative Commons photographs released by Flickr. This set contains roughly 100M photos and over 550K individual Flickr accounts. Not all photographs in the data set contain faces. Following the MegaFace challenge, we sift through this massive collection and extract faces detected using DLIB's face detector. To optimize hard drive space for millions of faces, we saved only the crop plus 2% of the cropped area for further processing. After collecting and cleaning our final data set, we re-downloaded the final faces at a higher crop ratio (70%).

As the Flickr data is noisy and has sparse identities (with many examples of single photos per identity, while we are targeting multiple photos per identity), we processed the full 100M Flickr set to maximize the number of identities. We therefore employed a distributed queue system, RabbitMQ, to distribute face detection work across 60 compute nodes, each of which saves its detected faces locally. A second collection process aggregates the faces onto a single machine. To favor Flickr accounts with a higher likelihood of containing multiple faces of the same identity, we ignore all accounts with fewer than 30 photos. In total we obtained 40M unlabeled faces across 130,154 distinct Flickr accounts (representing all accounts with more than 30 face photos). The photo crops take over 1TB of storage. As the photos are taken with different camera settings, they range in size from low resolution (90x90px) to high resolution (800x800+px). In total, the distributed process of collecting and aggregating photos took 15 days.
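The detect-then-aggregate pattern described above can be sketched in a single process. This is only an illustration of the work-queue idea: it uses Python's standard `queue` and `threading` modules as a stand-in for RabbitMQ and the 60 compute nodes, and `detect_faces` is a hypothetical placeholder, not DLIB's detector.

```python
import queue
import threading


def detect_faces(photo_id):
    """Hypothetical placeholder for a face detector.

    Pretends that even-numbered photos contain exactly one face.
    """
    return [f"{photo_id}-face0"] if photo_id % 2 == 0 else []


def worker(tasks, results):
    """Pull photo ids off the task queue until a None sentinel arrives."""
    while True:
        photo_id = tasks.get()
        if photo_id is None:  # sentinel: no more work for this worker
            break
        results.put((photo_id, detect_faces(photo_id)))


def run_detection(photo_ids, num_workers=4):
    """Distribute detection over workers, then aggregate the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for photo_id in photo_ids:
        tasks.put(photo_id)
    for _ in threads:  # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()

    # Second stage: aggregate all detections in one place.
    faces = {}
    while not results.empty():
        photo_id, crops = results.get()
        if crops:
            faces[photo_id] = crops
    return faces


faces = run_detection(range(10), num_workers=3)
print(sorted(faces))
```

In a real deployment the queue would be a network service and the aggregation step would copy files between machines; the sketch keeps only the control flow.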

Data Annotation

Labeling million-scale data manually is challenging, and although such labels are useful for developing algorithms, there are almost no established approaches for producing them while controlling costs. Companies like Mobileye, Tesla, and Facebook hire thousands of human labelers, costing millions of dollars. Additionally, people make mistakes and get confused with face recognition tasks, resulting in a need to re-test and validate, which further adds to costs. We thus look to automated or semi-automated methods to improve the purity of the collected data.

There have been several approaches to automated cleaning of data. O. M. Parkhi et al. used near-duplicate removal to improve data quality. G. Levi et al. used age and gender consistency measures. T. L. Berg et al. and X. Zhang et al. included text from news captions describing celebrity names. H.-W. Ng et al. pose data cleaning as a quadratic programming problem with constraints enforcing the assumptions that noise makes up a relatively small portion of the collected data, that gender is uniform within an identity, that each identity consists mostly of photos of the same person, and that a single photo cannot contain the same person twice. All of these methods proved to be important for data cleaning given rough initial labels, e.g., the celebrity name. In our case, rough labels are not given. We do observe that face recognizers perform well at a small scale, and we leverage their embeddings to provide a measure of similarity that can then be used for labeling.
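As an illustration of how embeddings can provide a similarity measure, the sketch below compares two embedding vectors with cosine similarity. The text does not state which measure or threshold is used, so cosine similarity, the `same_identity` helper, and the 0.5 threshold are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_identity(a, b, threshold=0.5):
    """Hypothetical decision rule: treat two faces as the same identity
    when their similarity reaches the (assumed) threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy 3-dimensional embeddings; real face embeddings are much longer.
face_a = [0.9, 0.1, 0.0]
face_b = [0.8, 0.2, 0.1]
face_c = [0.0, 0.1, 0.9]

print(same_identity(face_a, face_b))
print(same_identity(face_a, face_c))
```

A similarity of this kind only ranks pairs of faces; turning it into identity labels still requires a grouping step, such as clustering faces within each Flickr account.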


Please use the following citation when referencing the dataset:

title={Level Playing Field For Million Scale Face Recognition},
author={Nech, Aaron and Kemelmacher-Shlizerman, Ira},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
Last updated: 2021-03-24 22:53:31