PAWS-X
Created by Hello Dataset / Robert

Overview

This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki. Note: for multilingual experiments, please use the dev_2k.tsv file provided in the PAWS-X repo as the development set for every language, including English.

Data Format

All files are in TSV (tab-separated values) format with four columns:

Column Name    Data
id             An ID that matches the ID of the source pair in PAWS-Wiki
sentence1      The first sentence
sentence2      The second sentence
label          Label for each pair

The source text of each translation can be retrieved by looking up the ID in the corresponding file in PAWS-Wiki.
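
As an illustration of the four-column layout described above, the following is a minimal sketch of reading such a file with Python's standard csv module. It assumes each file begins with a header row naming the four columns and that the label is stored as an integer; neither detail is stated on this page. The `PawsPair` type, the `read_pairs` helper, and the two sample rows are hypothetical and exist only for this example.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class PawsPair(NamedTuple):
    """One row of a PAWS-X style TSV file (hypothetical helper type)."""
    id: str
    sentence1: str
    sentence2: str
    label: int


def read_pairs(handle: TextIO) -> Iterator[PawsPair]:
    """Yield one PawsPair per data row of a tab-separated file.

    Assumes the first row is a header: id, sentence1, sentence2, label.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield PawsPair(
            id=row["id"],
            sentence1=row["sentence1"],
            sentence2=row["sentence2"],
            label=int(row["label"]),  # assumption: labels are integers
        )


# Made-up sample rows standing in for a real file; not taken from the dataset.
SAMPLE = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tThe cat chased the dog.\tThe dog chased the cat.\t0\n"
    "2\tShe said \"hello\" to him.\tTo him she said \"hello\".\t1\n"
)

pairs = list(read_pairs(io.StringIO(SAMPLE)))
for pair in pairs:
    print(pair.id, pair.label, pair.sentence1)
```

For a file on disk, the same helper can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points at whichever split you downloaded. Because the `id` column matches PAWS-Wiki, keeping it as a string makes it straightforward to use as a lookup key for the source pair.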

The numbers of examples for each of the six languages are shown below:

Language    Train      Dev       Test
fr          49,401     1,992     1,985
es          49,401     1,962     1,999
de          49,401     1,932     1,967
zh          49,401     1,984     1,975
ja          49,401     1,980     1,946
ko          49,401     1,965     1,972
Total       296,406    11,815    11,844
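
The totals can be cross-checked against the figures quoted in the Overview. The short sketch below sums the per-language counts copied from the table above and adds the Dev and Test totals, which together should give the number of human-translated evaluation pairs. The `COUNTS` dictionary and variable names are introduced only for this check.

```python
# Per-language (train, dev, test) counts, copied from the table above.
COUNTS = {
    "fr": (49401, 1992, 1985),
    "es": (49401, 1962, 1999),
    "de": (49401, 1932, 1967),
    "zh": (49401, 1984, 1975),
    "ja": (49401, 1980, 1946),
    "ko": (49401, 1965, 1972),
}

# zip(*...) regroups the rows into three columns, one per split.
train_total, dev_total, test_total = (sum(col) for col in zip(*COUNTS.values()))

# Dev and Test together form the evaluation pairs.
evaluation_total = dev_total + test_total

print(train_total, dev_total, test_total, evaluation_total)
# prints: 296406 11815 11844 23659
```

The result agrees with the Overview: 296,406 training pairs and 23,659 evaluation pairs.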

Citation

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}

License

Custom

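Dataset Information
Application: NLP
Annotation type: Text
License: Custom
Updated: 2021-03-24 22:54:34
Data Summary
Data format: Text
Data count: 23.66k
File size: 29KB
Annotation count: 0
Copyright holder: Google Research
Annotated by: Unknown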