Commonly Used NLP Datasets

Original article: https://machinelearningmastery.com/datasets-natural-language-processing/

This post groups commonly used datasets by seven common NLP tasks, bookmarked here for reference:

  • Text Classification
  • Language Modeling
  • Image Captioning
  • Machine Translation
  • Question Answering
  • Speech Recognition
  • Document Summarization

Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.
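
As a rough illustration of what labeling a document means in practice, the sketch below trains a tiny Naive Bayes classifier on four invented sentences. Naive Bayes is only one of many possible approaches and is not prescribed by the original article; the function names, the `TRAIN` data, and the "spam"/"ham" labels are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count label frequencies and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        # Log prior of the label.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

model = train_naive_bayes(TRAIN)
print(classify("claim your free prize", model))   # spam
print(classify("team meeting on monday", model))  # ham
```

With a real dataset such as the ones listed in this section, the same two functions would be fed (text, label) pairs read from disk instead of the hard-coded list.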

Below are some good beginner text classification datasets.

  1. Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents that appeared on Reuters in 1987 indexed by categories. Also see RCV1, RCV2 and TRC2.
  2. [IMDB Movie Review Sentiment Classification (Stanford)](http://ai.stanford.edu/~amaas/data/sentiment/). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.
  3. News Group Movie Review Sentiment Classification (Cornell). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.

For more, see the post: Datasets for single-label text categorization.

Language Modeling

Language modeling involves developing a statistical model for predicting the next word in a sentence, or the next letter in a word, given whatever has come before. It is a precursor to tasks like speech recognition and machine translation.
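
To make "predicting the next word given what has come before" concrete, here is a minimal sketch of a bigram model, which conditions only on the single preceding word. This is a deliberately simplified stand-in rather than the method of any dataset listed in this section; the three-sentence `CORPUS` and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            following[previous][current] += 1
    return following


def next_word_probabilities(model, previous):
    """Return P(next word | previous word) as relative frequencies."""
    counts = model.get(previous, Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def predict_next(model, previous):
    """Return the most frequent follower of `previous`, or None if unseen."""
    counts = model.get(previous)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
CORPUS = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

model = train_bigram_model(CORPUS)
print(predict_next(model, "the"))             # cat
print(next_word_probabilities(model, "sat"))  # {'on': 1.0}
```

A larger corpus would replace the hard-coded sentences, and a practical model would also need smoothing for word pairs it has never seen.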

Below are some good beginner language modeling datasets.

  1. Project Gutenberg, a large collection of free books that can be retrieved in plain text for a variety of languages.
  2. There are more formal corpora that are well studied, for example the Brown University Standard Corpus of Present-Day American English (a large sample of English words) and the Google 1 Billion Word Corpus.

Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

  1. Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  2. Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
  3. Flickr 30K. A collection of 30 thousand described images taken from flickr.com.

For more, see the post: Exploring Image Captioning Datasets, 2016.

Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

  1. Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and French.
  2. European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs for a suite of European languages.

There are many standard datasets used for the annual machine translation challenges; see: Statistical Machine Translation.

Question Answering

Question answering is a task in which a sentence or passage of text is provided, questions are asked about it, and the answers must be drawn from that text.

Below are some good beginner question answering datasets.

  1. Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia articles.
  2. Deepmind Question Answering Corpus. Question answering about news articles from the Daily Mail.
  3. Amazon question/answer data. Question answering about Amazon products.

For more, see the post: Datasets: How can I get corpus of a question-answering website like Quora or Yahoo Answers or Stack Overflow for analyzing answer quality?

Speech Recognition

Speech recognition is the task of transforming audio of a spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

  1. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its wide use. Spoken American English and associated transcription.
  2. VoxForge. Project to build an open source database for speech recognition.
  3. LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.

Document Summarization

Document summarization is the task of creating a short meaningful description of a larger document.
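
One simple way to approximate a "short meaningful description" is extractive summarization: score each sentence by how frequent its words are in the whole document and keep the top-scoring ones. The sketch below illustrates that idea only. It is not how the datasets in this section were produced, and the stopword list, the sample `TEXT`, and the function names are assumptions introduced for the example.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on", "was", "for"}


def tokenize(text):
    """Lowercase the text and keep only non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    frequency = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average document-level frequency of the sentence's words.
        return sum(frequency[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


# Sample document, invented for illustration only.
TEXT = ("The court heard the contract appeal on Monday. "
        "The court rejected the appeal. "
        "Lunch was served at noon.")

print(summarize(TEXT))  # The court rejected the appeal.
```

Datasets that pair documents with human-written summaries, like those listed in this section, make it possible to measure how far such a baseline falls short.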

Below are some good beginner document summarization datasets.

  1. Legal Case Reports Data Set. A collection of 4 thousand legal cases and their summaries.
  2. TIPSTER Text Summarization Evaluation Conference Corpus. A collection of nearly 200 documents and their summaries.
  3. The AQUAINT Corpus of English News Text. Not free, but widely used. A corpus of news articles.

For more, see:

  • Document Understanding Conference (DUC) Tasks
  • Where can I find good data sets for text summarization?

Further Reading

This section provides additional lists of datasets if you are looking to go deeper.

  1. Text Datasets Used in Research on Wikipedia
  2. Datasets: What are the major text corpora used by computational linguists and natural language processing researchers?
  3. Stanford Statistical Natural Language Processing Corpora
  4. Alphabetical list of NLP Datasets
  5. NLTK Corpora
  6. Open Data for Deep Learning on DL4J
  7. NLP datasets
