curationCorpus
Created by Hello Dataset / Robert

Overview

The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves. This repository provides a scraper to access them. If you're interested in commercial use or access to the wider catalogue of Curation data, including a larger set of over 150,000 professionally-written abstracts and a scalable, on-demand content abstraction API (driven by humans or AI), please get in touch. For our thoughts on how we hope this release will help the NLP community, see our post introducing the dataset.

Instructions

  • Clone this repository (or just copy the code from web_scraper.py)
  • Download the URLs, headlines, and summaries (curation-corpus-base.csv; the wget command below fetches it)
  • Run web_scraper.py with three command-line arguments: the path to the CSV file without article text, the path to a new CSV file that will contain the article text, and a batch size that determines how many URLs are scraped at a time. Larger batch sizes run faster but may drop more articles due to timeouts. I recommend ~50 on a 2015 MacBook Pro.
git clone https://github.com/CurationCorp/curation-corpus.git
cd curation-corpus
wget https://curation-datasets.s3-eu-west-1.amazonaws.com/curation-corpus-base.csv
python web_scraper.py curation-corpus-base.csv curation-corpus-base-with-articles.csv 50
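
To illustrate what the batch size argument controls, here is a minimal sketch of splitting a list of URLs into batches. It is not taken from web_scraper.py; the helper name iter_batches and the example.com URLs are hypothetical, and the real scraper may organise its requests differently.

```python
from typing import Iterator, List, Sequence


def iter_batches(urls: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `batch_size` items.

    Hypothetical helper for illustration; not part of web_scraper.py.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])


if __name__ == "__main__":
    # Placeholder URLs, not taken from the dataset.
    example_urls = [f"https://example.com/article/{i}" for i in range(120)]
    for batch in iter_batches(example_urls, 50):
        # With 120 URLs and a batch size of 50, the batches hold 50, 50 and 20 URLs.
        print(len(batch))
```

A larger batch size means fewer, bigger groups of URLs requested together, which is why it can be faster but is also more likely to run into timeouts.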

Some URLs will return messy results due to content changing over time, paywalls, etc. We've tried to remove the worst offenders from this release, though there is probably still scope for improving the scraper.
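
One possible way to handle messy results after scraping is to drop rows whose article text is missing or very short. The sketch below is only an assumption about how that could look: the column name article_content and the 200-character threshold are hypothetical, so check the header of your output CSV and pick a threshold that suits your data.

```python
from typing import Dict, Iterable, List


def drop_messy_rows(
    rows: Iterable[Dict[str, str]],
    text_field: str = "article_content",  # assumed column name, not confirmed by the source
    min_chars: int = 200,                 # arbitrary threshold chosen for illustration
) -> List[Dict[str, str]]:
    """Keep only rows whose text field has at least `min_chars` non-blank characters."""
    kept = []
    for row in rows:
        # Treat a missing or None value as empty text.
        text = (row.get(text_field) or "").strip()
        if len(text) >= min_chars:
            kept.append(row)
    return kept


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/a", "article_content": "x" * 500},
        {"url": "https://example.com/b", "article_content": ""},
        {"url": "https://example.com/c", "article_content": "Subscribe to read"},
    ]
    # Only the first row is long enough to be kept.
    print([row["url"] for row in drop_messy_rows(sample)])
```

The rows can come from csv.DictReader applied to the scraped output file, which yields one dictionary per row keyed by the CSV header.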

Citation

@misc{curationcorpusbase:2020,
  title={Curation Corpus Base},
  author={Curation},
  year={2020}
}

License

CC BY 4.0

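Dataset information
Application: NLP
Annotation type: Text
License: CC BY 4.0
Last updated: 2021-03-24 22:52:33
Data summary
Data format: Text
Number of items: 40k
Annotated items: 0
File size: 123KB
Copyright holder: Henry Dashwood
Annotator: Unknown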