Original article: https://machinelearningmastery.com/datasets-natural-language-processing/
Commonly used datasets, grouped by seven common NLP tasks. Bookmarked for reference.
Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.
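As a rough illustration of what "labeling documents" means in practice, here is a minimal sketch of a spam/ham classifier using multinomial Naive Bayes with add-one smoothing. The technique, the function names (`train_naive_bayes`, `classify`), and the four toy training messages are assumptions made for this example and do not come from the original article or from any of the datasets it lists.

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """Collect per-label word counts, label counts, and the vocabulary.

    examples: list of (text, label) pairs (toy data, made up for this sketch).
    """
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Return the label with the highest log-probability for the text."""
    tokens = text.lower().split()
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: fraction of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training_data = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(training_data)
spam_guess = classify("free prize money", *model)
ham_guess = classify("monday team meeting", *model)
```

With real data, the toy list would be replaced by labeled documents, and whitespace splitting would likely give way to a proper tokenizer.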
Below are some good beginner text classification datasets.
For more, see the post: Datasets for single-label text categorization.
Language modeling involves developing a statistical model that predicts the next word in a sentence, or the next letter in a word, given whatever has come before.
It is a precursor task for tasks like speech recognition and machine translation.
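To make "predicting the next word given what has come before" concrete, here is a minimal sketch of a bigram model that estimates the probability of the next word from the single preceding word by relative frequency. The three-sentence corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions for illustration only. Practical language models use far more context and some form of smoothing.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers so sentence starts and ends are modeled too.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Estimate P(next | prev) by relative frequency; empty dict if prev is unseen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: c / total for word, c in followers.items()}

# Toy corpus, made up for this sketch.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
bigram_counts = train_bigram_model(corpus)
probs_after_the = next_word_probs(bigram_counts, "the")
most_likely = max(probs_after_the, key=probs_after_the.get)
```

In this toy corpus "the" occurs six times as a preceding word and is followed by "cat" twice, so "cat" is the most likely next word with an estimated probability of 2/6.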
Below are some good beginner language modeling datasets.
Image captioning is the task of generating a textual description for a given image.
Below are some good beginner image captioning datasets.
Exploring Image Captioning Datasets, 2016
Machine translation is the task of translating text from one language to another.
Below are some good beginner machine translation datasets.
Statistical Machine Translation
Question answering is the task of answering questions posed about a provided sentence or sample of text.
Below are some good beginner question answering datasets.
Speech recognition is the task of transforming audio of a spoken language into human-readable text.
Below are some good beginner speech recognition datasets.
Document summarization is the task of creating a short meaningful description of a larger document.
Below are some good beginner document summarization datasets.
Document Understanding Conference (DUC) Tasks
Where can I find good data sets for text summarization?
This section provides additional lists of datasets if you are looking to go deeper.