HIT-OR3C
Created by Hello Dataset / Robert

Overview

HIT-OR3C is a dataset of handwritten Chinese characters. Both online (pen trajectory) and offline (image) data are available.

Data Collection

The characters were collected with a handwriting pad and were recorded and labelled automatically by the OR3C Toolkit, a handwriting document collection program. The collection software is also made available (the supplied version is in Chinese).

Data Annotation

A label is provided for each image. The labels of digits and letters are encoded in ASCII; the labels of Chinese characters are encoded in GB2312-80. Every folder contains a label file named “labels.txt”.
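As an illustration of the two label encodings, the sketch below decodes raw label bytes in Python. The byte layout of “labels.txt” is not described here, so the function only assumes that the bytes of a single label have already been isolated; `decode_label` is a hypothetical helper name, not part of the OR3C Toolkit.

```python
def decode_label(raw: bytes) -> str:
    """Decode the bytes of one label into text.

    Assumption: `raw` holds exactly one label. Python's "gb2312" codec
    decodes single-byte ASCII values unchanged, so one codec covers
    digits, letters, and two-byte Chinese characters.
    """
    return raw.decode("gb2312")


print(decode_label(b"7"))         # an ASCII digit label
print(decode_label(b"\xb0\xa1"))  # the first GB2312 level-1 character
```

Decoding with a single codec avoids having to branch on the subset a label belongs to.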

Data Format

The dataset is organised in 5 subsets: 4 subsets of characters [Digit (1-10), Letter (11-62), GB1 (63-3817), GB2 (3818-6825)], and 1 subset of documents.
The 4 subsets of characters contain 6,825 classes produced by 122 subjects and 832,650 samples in total. A single file per subject is provided for online data and a single file per subject for offline data (see below for the file format used). The different subsets are defined as index ranges within these files.
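Because the character subsets are defined purely as index ranges, a class index can be mapped back to its subset with a small lookup. The sketch below assumes the ranges listed above are 1-based and inclusive; `SUBSETS` and `subset_of` are illustrative names, not part of the dataset tooling.

```python
# (name, first class index, last class index), taken from the ranges above.
# Assumption: indices are 1-based and both ends are inclusive.
SUBSETS = [
    ("Digit", 1, 10),
    ("Letter", 11, 62),
    ("GB1", 63, 3817),
    ("GB2", 3818, 6825),
]


def subset_of(class_index: int) -> str:
    """Return the name of the character subset containing class_index."""
    for name, first, last in SUBSETS:
        if first <= class_index <= last:
            return name
    raise ValueError(f"class index {class_index} is outside 1-6825")


# Sizes implied by the ranges: 10 digits, 52 letters, 3755 GB1, 3008 GB2.
sizes = {name: last - first + 1 for name, first, last in SUBSETS}
print(sizes, sum(sizes.values()))
```

Under this reading the four ranges add up to the 6,825 classes stated above.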
The document corpus corresponds to 10 news articles that contain in total 77,168 samples drawn from 2,442 classes and produced by 20 subjects. The captured document data have been post-processed and split into individual characters; the characters were resized to 128 x 128 pixels and stored sequentially in a single image file and a single vector file, as in the first four subsets.
The dataset contains 909,818 images. The total size of the dataset is 15.5 GB (1,125 MB compressed).
The individual character images are 128 x 128 greyscale.
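The sample counts above can be cross-checked with a little arithmetic. The sketch below assumes that each of the 122 subjects wrote every class exactly once, and that each image is stored at one byte per pixel; neither is stated explicitly here, so the raw-size figure is only a rough estimate.

```python
CLASSES = 6825
SUBJECTS = 122
DOCUMENT_SAMPLES = 77_168

# Assumption: one sample per class per subject.
character_samples = CLASSES * SUBJECTS
total_images = character_samples + DOCUMENT_SAMPLES

# Assumption: 8-bit greyscale, i.e. one byte per pixel, no file headers.
raw_pixel_bytes = total_images * 128 * 128

print(character_samples)               # 832650
print(total_images)                    # 909818
print(round(raw_pixel_bytes / 1e9, 1)) # raw pixel data in GB (decimal)
```

Both totals match the stated figures. Under the one-byte-per-pixel assumption, the raw pixel estimate comes out somewhat below the stated 15.5 GB.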

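Dataset Information
Application scenario: OCR/Text Detection
Annotation type: Classification, Text
License: Unknown
Updated: 2021-03-24 22:50:20
Data Summary
Data format: Image
Data count: 0
File size: 1MB
Annotation count: 0
Copyright holder: International Association for Pattern Recognition Technical Committee Number 11
Annotated by: Unknown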