This is a synthetically generated dataset, in which word instances are placed in natural scene images, while taking into account the scene layout.
The dataset consists of approximately 800 thousand images with roughly 8 million synthetic word instances. Each text instance is annotated with its text-string and with word-level and character-level bounding-boxes.
Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman
Visual Geometry Group, University of Oxford, 2016
SynthText.zip (size = 42074172 KB, approximately 41 GB) contains 858,750 synthetic scene-image files (.jpg) split into 200 directories, with 7,266,866 word-instances and 28,971,487 characters.
Ground-truth annotations are contained in the file "gt.mat" (MATLAB format), which holds the following cell-arrays, each of size 1x858750:
imnames : names of the image files
wordBB : word-level bounding-boxes for each image, represented by tensors of size 2x4xNWORDS_i, where the first dimension is 2 for the x and y coordinates, the second dimension corresponds to the 4 corner points (clockwise, starting from the top-left), and the third dimension, of size NWORDS_i, corresponds to the number of words in the i-th image
charBB : character-level bounding-boxes for each image, represented by tensors of size 2x4xNCHARS_i (same format as wordBB above)
txt : text-strings contained in each image (char array).
Note: words which belong to the same "instance", i.e., those rendered in the same region with the same font, color, distortion etc., are grouped together; the instance boundaries are demarcated by the line-feed character (ASCII 10). A "word" is any contiguous substring of non-whitespace characters. A "character" is defined as any non-whitespace character.
For any questions or comments, contact Ankush Gupta at: removethisifyouarehuman-ankush@robots.ox.ac.uk
If you use this data, please cite:
@InProceedings{Gupta16,
  author    = "Ankush Gupta and Andrea Vedaldi and Andrew Zisserman",
  title     = "Synthetic Data for Text Localisation in Natural Images",
  booktitle = "IEEE Conference on Computer Vision and Pattern Recognition",
  year      = "2016",
}