

With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been greater. With the goal of systematically benchmarking and pushing the state of the art forward, the proposed competition builds on top of RRC-MLT-2017 with an additional end-to-end task, an additional language in the real-images dataset, a large-scale multi-lingual synthetic dataset to assist training, and a baseline end-to-end recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text:
(a) text detection,
(b) cropped word script classification,
(c) joint text detection and script classification,
(d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities.

Data Collection

We have created two datasets:

  1. The real MLT-2019 dataset, which contains 20,000 real natural-scene images with embedded text in 10 languages.
  2. The synthetic MLT-2019 dataset, which is provided as an assistive training set for Task 4 only. The synthetic dataset matches the scripts of the real one.

1. The MLT-2019 Dataset of Real Images

The images of the dataset are natural scene images with embedded text, such as street signs, street advertisement boards, shop names, passing vehicles and user photos from microblogs. The images were captured using different mobile phone cameras or were collected from freely available images on the Internet. The images mainly contain intentional – i.e. focused – scene text; however, some unintentional text may appear in some images. Such text – usually very small, blurry and/or occluded – is marked to be ignored in the evaluation. We have imposed conditions on the collection of our dataset related to the type (e.g. natural scenes), content (e.g. mostly focused text) and capture conditions of the images (e.g. no dark images). This is to ensure – to some extent – the homogeneity of the collected images, as they have been collected by different people and in different countries.

2. Synthetic Multi-Language in Natural Scene Dataset

State-of-the-art scene text systems employ deep learning techniques which require a tremendous amount of labelled data. Hence, we have provided an additional synthetic dataset to complement the real one for training purposes. We adapt the framework proposed by Gupta et al. to a multi-language setup. The framework generates realistic images by overlaying synthetic text over existing natural background images, and it accounts for 3D scene geometry.
Gupta et al. proposed the following approach for scene-text image synthesis:

  • Text in the real world usually appears in well-defined regions, which can be characterized by uniform color and texture. These regions are found by thresholding gPb-UCM contour hierarchies using an efficient graph-cut implementation. This gives us prospective segmented regions for rendering text.
  • A dense depth map of the segmented regions is then obtained, and planar facets are fitted to them using RANSAC. This way, normals to the prospective regions for text rendering are estimated.
  • Finally, the text is aligned to a prospective image region for rendering. This is achieved by warping the image region to a frontal-parallel view using the estimated region normals. A rectangle is then fitted to this region, and the text is aligned to the larger side of that rectangle.
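The plane-fitting step above can be sketched with a simple RANSAC loop. This is an illustrative sketch only, not the synthesis framework's actual code: it assumes we already have an (N, 3) array of 3D points for one segmented region (the depth-estimation model that produces those points is out of scope here).

```python
import numpy as np

def ransac_plane_normal(points, n_iters=200, threshold=0.01, seed=0):
    """Estimate the unit normal of the dominant plane in an (N, 3) point set.

    Minimal RANSAC: repeatedly sample 3 points, take the plane through
    them, and keep the candidate with the most inliers.
    """
    rng = np.random.default_rng(seed)
    best_normal, best_inliers = None, -1
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        # Normal of the plane spanned by the 3 sampled points
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # degenerate (collinear) sample, skip it
            continue
        normal /= norm
        # Point-to-plane distances for all points
        dists = np.abs((points - sample[0]) @ normal)
        inliers = int(np.sum(dists < threshold))
        if inliers > best_inliers:
            best_inliers, best_normal = inliers, normal
    return best_normal

# Example: noisy points on the z = 0 plane plus a few gross outliers
rng = np.random.default_rng(1)
pts = np.concatenate([
    np.column_stack([rng.random((100, 2)), 0.001 * rng.standard_normal(100)]),
    rng.random((5, 3)) + 2.0,
])
n = ransac_plane_normal(pts)  # should be close to (0, 0, ±1)
```

In the actual pipeline the recovered normal is what drives the frontal-parallel warp in the final step.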

Note that the pipeline as presented renders text character by character, which breaks the ligatures of Arabic, Bangla and Devanagari words. We have made appropriate changes to handle this issue.

Data Annotation

1. The MLT-2019 Dataset of Real Images

The text in the scene images of the dataset is annotated at word level. A GT-word is defined as a consecutive set of characters without spaces, i.e. words are separated by spaces, except in Chinese and Japanese where the text is labeled at line level. Each GT-word is labeled by a 4-corner bounding box, and is associated with a script class and a Unicode transcription of that GT-word. Some text regions in the images are not readable to the annotators due to low resolution and/or other distortions. Such regions are marked as "don't care" and ignored in the evaluation process.
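A ground-truth line as described above (4-corner box, script class, transcription, optional "don't care") can be parsed with a few lines of Python. The exact column layout below is an assumption based on the common ICDAR comma-separated convention, so verify it against the downloaded GT files before relying on it:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GTWord:
    corners: List[Tuple[int, int]]  # 4 corner points of the bounding box
    script: str                     # script class, e.g. "Latin"
    transcription: str              # Unicode transcription of the word
    ignore: bool                    # True for "don't care" regions

def parse_gt_line(line: str) -> GTWord:
    """Parse one GT line, assuming the layout
    x1,y1,x2,y2,x3,y3,x4,y4,script,transcription
    with "###" marking unreadable ("don't care") words."""
    parts = line.rstrip("\n").split(",", 9)  # transcription may contain commas
    coords = list(map(int, parts[:8]))
    corners = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    script, transcription = parts[8], parts[9]
    return GTWord(corners, script, transcription, ignore=(transcription == "###"))

word = parse_gt_line("10,20,110,20,110,60,10,60,Latin,hello")
```

Splitting with a maximum of 9 splits keeps commas inside the transcription intact, which matters for real scene text.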

2. Synthetic Multi-Language in Natural Scene Dataset

Annotations include word level and character level text bounding boxes along with the corresponding transcription and language class. The dataset has 277,000 images with thousands of images for each language.

Data Format

Download below the training dataset and the associated ground truth for each of the Tasks.

Task 1: Multi-script text detection

Training Set:

The training set is composed of 10,000 images:

  • TrainSetImagesTask1_Part1 (3.5G)
  • TrainSetImagesTask1_Part2 (3.3G)

The ground truth is composed of 10,000 text files (corresponding to the images) with word-level localization, script and transcription:

  • TrainSetGT (6.5M)

Note that this task only requires localization results (as indicated in results format in the tasks page), but the ground truth also provides the script id of each bounding box and the transcription. This extra information will be needed in Tasks 3 and 4.
Extra information about the training set (useful for researchers who focus on only one or a few of the languages rather than the full multi-lingual set): the 10,000 images are ordered so that each consecutive block of 1,000 images contains text mainly in one language (an image may of course also contain text from 1 or 2 other languages, all from the set of 10 languages):

00001 - 01000:  Arabic
01001 - 02000:  English
02001 - 03000:  French
03001 - 04000:  Chinese
04001 - 05000:  German
05001 - 06000:  Korean
06001 - 07000:  Japanese
07001 - 08000:  Italian
08001 - 09000:  Bangla
09001 - 10000:  Hindi
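The block structure above makes it easy to extract a per-language subset by filename index alone. A small helper (hypothetical, not part of the dataset tooling) could look like this:

```python
def main_language(image_id: int) -> str:
    """Map a training-image index (1..10000) to the main language of its
    1,000-image block, following the ordering listed above."""
    languages = ["Arabic", "English", "French", "Chinese", "German",
                 "Korean", "Japanese", "Italian", "Bangla", "Hindi"]
    if not 1 <= image_id <= 10000:
        raise ValueError("image_id must be in 1..10000")
    return languages[(image_id - 1) // 1000]

# e.g. select all Korean-majority training images
korean_ids = [i for i in range(1, 10001) if main_language(i) == "Korean"]
```

Remember that a block's images may still contain secondary text from other languages, so this selects by *main* language only.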

Test set:

Images (10,000 images):


Task 2: Cropped Word Script identification

Training Set:

  • Word_Images_Part1 (the ground truth of the word images [2 files] is in the same folder as the images)
  • Word_Images_Part2
  • Word_Images_Part3

Test set:

Cropped word images:


Task 3: Joint text detection and script identification

Training Set:

The same training set and ground truth as in Task 1 (see Task 1, above).

Test set: The same test set for Task 1.

Task 4: End-to-End text detection and recognition

Training Set:

It has two parts:

  1. Real dataset: The same training set and ground truth as in Task 1 (see Task 1, above).
  2. Synthetic dataset: We provide a synthetic dataset that matches the real dataset in terms of scripts, to help with the training for this task:
    • Images of the synthetic dataset:
      Arabic, Bangla, Chinese, Japanese, Korean, Latin, Hindi
    • GT of the synthetic dataset (same format as the real dataset):
      Arabic, Bangla, Chinese, Japanese, Korean, Latin, Hindi

Note that we provide a baseline method for this task: E2E-MLT. You can find the details of the method, and also the synthetic dataset, in: E2E-MLT – an Unconstrained End-to-End Method for Multi-Language Scene Text.
Test set: The same test set for Task 1.


@inproceedings{rrc-mlt-2019,
  title={ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019},
  author={Nibal Nayef and Yash Patel and Michal Busta and Pinaki Nath Chowdhury and Dimosthenis Karatzas and Wafa Khlif and Jiri Matas and Umapada Pal and Jean-Christophe Burie and Cheng-lin Liu and Jean-Marc Ogier},
}


Application: OCR / Text Detection
License: CC BY 4.0
Updated: 2021-03-24
Computer Vision Center (CVC)