A dataset for the document understanding community.
199 fully annotated forms
31485 words
9707 semantic entities
5304 relations
All annotations are encoded in a JSON file. The example below shows the annotations for a single form image. Each entry of the JSON file is described in detail in the original paper.
{
    "form": [
        {
            "id": 0,
            "text": "Registration No.",
            "box": [94,169,191,186],
            "linking": [
                [0,1]
            ],
            "label": "question",
            "words": [
                {
                    "text": "Registration",
                    "box": [94,169,168,186]
                },
                {
                    "text": "No.",
                    "box": [170,169,191,183]
                }
            ]
        },
        {
            "id": 1,
            "text": "533",
            "box": [209,169,236,182],
            "label": "answer",
            "words": [
                {
                    "box": [209,169,236,182],
                    "text": "533"
                }
            ],
            "linking": [
                [0,1]
            ]
        }
    ]
}
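As a rough illustration of how such a file might be consumed, the sketch below loads a trimmed copy of the example above and resolves the "linking" entries into pairs of entity texts. It assumes that each link is a pair of entity ids, as in the example, where [0,1] joins the question with id 0 to the answer with id 1. The helper name linked_pairs is hypothetical and not part of the dataset. The original paper remains the authoritative description of each field.

```python
import json

# Trimmed copy of the example annotation above (word-level entries omitted).
SAMPLE = """
{"form": [
  {"id": 0, "text": "Registration No.", "box": [94, 169, 191, 186],
   "linking": [[0, 1]], "label": "question"},
  {"id": 1, "text": "533", "box": [209, 169, 236, 182],
   "linking": [[0, 1]], "label": "answer"}
]}
"""


def linked_pairs(annotation):
    """Return unique (first_text, second_text) pairs from the "linking" lists.

    Assumption: every link is a two-element list of entity ids that both
    appear in the same file, as in the example annotation.
    """
    by_id = {entity["id"]: entity for entity in annotation["form"]}
    seen = set()
    pairs = []
    for entity in annotation["form"]:
        for first_id, second_id in entity.get("linking", []):
            # The same link is stored on both linked entities; keep it once.
            if (first_id, second_id) in seen:
                continue
            seen.add((first_id, second_id))
            pairs.append((by_id[first_id]["text"], by_id[second_id]["text"]))
    return pairs


annotation = json.loads(SAMPLE)
print(linked_pairs(annotation))  # [('Registration No.', '533')]
```

For a real annotation file, json.load on the opened file would replace the inline SAMPLE string.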
If you use this dataset for your research, please cite our paper:
@inproceedings{jaume2019,
title = {FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents},
author = {Guillaume Jaume and Hazim Kemal Ekenel and Jean-Philippe Thiran},
booktitle = {Accepted to ICDAR-OST},
year = {2019}
}