List of datasets for machine learning research

Face recognition

In computer vision, face images are used extensively to develop face recognition and face detection systems, along with many other applications that rely on images of faces.

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Face Recognition Technology (FERET)

11,338 images of 1,199 individuals in different positions and at different times.

None.

11,338

Images

Classification, face recognition

2003

[6][7]

United States Department of Defense

CMU Pose, Illumination, and Expression (PIE)

41,368 color images of 68 people in 13 different poses.

Images labeled with expressions.

41,368

Images, text

Classification, face recognition

2000

[8][9]

R. Gross et al.

SCFace

Color images of faces at various angles.

Location of facial features extracted. Coordinates of features given.

4,160

Images, text

Classification, face recognition

2011

[10][11]

M. Grgic et al.

YouTube Faces DB

Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames.

Identity of those appearing in videos and descriptors.

3,425 videos

Video, text

Video classification, face recognition

2011

[12][13]

L. Wolf et al.

300 Videos in the Wild

114 videos annotated for facial landmark tracking. The 68 landmark mark-up is applied to every frame.

None

114 videos, 218,000 frames.

Video, annotation file.

Facial landmark tracking.

2015

[14]

Shen, Jie et al.

Grammatical Facial Expressions Dataset

Grammatical Facial Expressions from Brazilian Sign Language.

Microsoft Kinect features extracted.

27,965

Text

Facial gesture recognition

2014

[15]

F. Freitas et al.

CMU Face Images Dataset

Images of faces. Each person is photographed multiple times to capture different expressions.

Labels and features.

640

Images, Text

Face recognition

1999

[16][17]

T. Mitchell

Yale Face Database

Faces of 15 individuals in 11 different expressions.

Labels of expressions.

165

Images

Face recognition

1997

[18][19]

J. Yang et al.

Cohn-Kanade AU-Coded Expression Database

Large database of images with labels for expressions.

Tracking of certain facial features.

500+ sequences

Images, text

Facial expression analysis

2000

[20][21]

T. Kanade et al.

FaceScrub

Images of public figures collected from image search results.

Name and gender (male/female) annotation.

107,818

Images, text

Face recognition

2014

[22][23]

H. Ng et al.

BioID Face Database

Images of faces with eye positions marked.

Manually set eye positions.

1521

Images, text

Face recognition

2001

[24][25]

BioID

Skin Segmentation Dataset

Randomly sampled color values from face images.

B, G, R values extracted.

245,057

Text

Segmentation, classification

2012

[26][27]

R. Bhatt.

Bosphorus

3D Face image database.

34 action units and 6 expressions labeled; 24 facial landmarks labeled.

4652

Images, text

Face recognition, classification

2008

[28][29]

A. Savran et al.

UOY 3D-Face

Neutral face and 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised.

Labeling.

5250

Images, text

Face recognition, classification

2004

[30][31]

University of York

CASIA

Expressions: Anger, smile, laugh, surprise, closed eyes.

None.

4624

Images, text

Face recognition, classification

2007

[32][33]

Institute of Automation, Chinese Academy of Sciences

CASIA

Expressions: anger, disgust, fear, happiness, sadness, surprise.

None.

480

Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second

Face recognition, classification

2011

[34]

Zhao, G. et al.

BU-3DFE

Neutral face and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted.

None.

2500

Images, text

Facial expression recognition, classification

2006

[35]

Binghamton University

Face Recognition Grand Challenge Dataset

Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data.

None.

4007

Images, text

Face recognition, classification

2004

[36][37]

National Institute of Standards and Technology

GavabDB

Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images.

None.

549

Images, text

Face recognition, classification

2008

[38][39]

King Juan Carlos University

3D-RMA

Up to 100 subjects, expressions mostly neutral. Several poses as well.

None.

9971

Images, text

Face recognition, classification

2004

[40][41]

Royal Military Academy (Belgium)

Action recognition

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Human Motion DataBase (HMDB51)

51 action categories, each containing at least 101 clips, extracted from a range of sources.

None.

6,766 video clips

video clips

Action classification

2011

[42]

H. Kuehne et al.

TV Human Interaction Dataset

Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss, and none.

None.

300 video clips

video clips

Action prediction

2013

[43]

Patron-Perez, A. et al.

UT Interaction

People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch) sometimes with multiple groups in the same video clip.

None.

120 video clips

video clips

Action prediction

2009

[44]

Ryoo, M. S. et al.

UT Kinect

10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, and clap hands) in an office setting.

None.

200 video clips with depth information at 15 frames per second

video clips with depth information

Action classification

2012

[45]

Xia, L. et al.

SBU Interact

Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting.

None.

Around 300 interactions

video clips with depth information

Action classification

2012

[46]

Yun, K. et al.

Berkeley Multimodal Human Action Database (MHAD)

Recordings of a single person performing 12 actions

MoCap pre-processing

660 action samples

8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones

Action classification

2013

[47]

Ofli, F. et al.

UCF 101 Dataset

Self-described as “a dataset of 101 human actions classes from videos in the wild.” The dataset is large, with over 27 hours of video.

Actions classified and labeled.

13,000

Video, images, text

Classification, action detection

2012

[48][49]

K. Soomro et al.

THUMOS Dataset

Large video dataset for action classification.

Actions classified and labeled.

45M frames of video

Video, images, text

Classification, action detection

2013

[50][51]

Y. Jiang et al.

ActivityNet

Large video dataset for activity recognition and detection.

Actions classified and labeled.

10,024

Video, images, text

Classification, action detection

2015

[52]

Heilbron et al.

MSP-AVATAR

Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns.

Actions classified and labeled.

74 sessions

Motion-captured video, audio

Classification, action detection

2015

[53]

Sadoughi, N. et al.

LILiR Twotalk Corpus

Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding.

Actions classified and labeled.

527

Video

Action detection

2011

[54]

Sheerman-Chase et al.

Object detection & recognition

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

DAVIS: Densely Annotated VIdeo Segmentation

150 video sequences containing 10459 frames with a total of 376 objects annotated.

Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high quality segmentation annotation.

10,459

Frames annotated

Video object segmentation

2017

[55]

Pont-Tuset, J. et al.

T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects

30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object.

6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground truth poses.

49,000

RGB-D images, 3D object models

6D object pose estimation, object detection

2017

[56]

T. Hodan et al.

Berkeley 3-D Object Dataset

849 images taken in 75 different scenes. About 50 different object classes are labeled.

Object bounding boxes and labeling.

849

labeled images, text

Object recognition

2014

[57][58]

A. Janoch et al.

Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500)

500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300.

Each image segmented by five different subjects on average.

500

Segmented images

Contour detection and hierarchical image segmentation

2011

[59]

University of California, Berkeley

Microsoft Common Objects in Context (COCO)

Complex everyday scenes of common objects in their natural context.

Object highlighting, labeling, and classification into 91 object types.

2,500,000

Labeled images, text

Object recognition

2015

[60][61]

T. Lin et al.

SUN Database

Very large scene and object recognition database.

Places and objects are labeled. Objects are segmented.

131,067

Images, text

Object recognition, scene recognition

2014

[62][63]

J. Xiao et al.

ImageNet

Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge

Labeled objects, bounding boxes, descriptive words, SIFT features

14,197,122

Images, text

Object recognition, scene recognition

2014

[64][65]

J. Deng et al.

TV News Channel Commercial Detection Dataset

TV commercials and news broadcasts.

Audio and video features extracted from still images.

129,685

Text

Clustering, classification

2015

[66][67]

P. Guha et al.

Statlog (Image Segmentation) Dataset

The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel.

Many features calculated.

2310

Text

Classification

1990

[68]

University of Massachusetts

Caltech 101

Pictures of objects.

Detailed object outlines marked.

9146

Images

Classification, object recognition.

2003

[69][70]

F. Li et al.

Caltech-256

Large dataset of images for object classification.

Images categorized and hand-sorted.

30,607

Images, Text

Classification, object detection

2007

[71][72]

G. Griffin et al.

SIFT10M Dataset

SIFT features of Caltech-256 dataset.

Extensive SIFT feature extraction.

11,164,866

Text

Classification, object detection

2016

[73]

X. Fu et al.

LabelMe

Annotated pictures of scenes.

Objects outlined.

187,240

Images, text

Classification, object detection

2005

[74]

MIT Computer Science and Artificial Intelligence Laboratory

Cityscapes Dataset

Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included.

Pixel-level segmentation and labeling

25,000

Images, text

Classification, object detection

2016

[75]

Daimler AG et al.

PASCAL VOC Dataset

Large number of images for classification tasks.

Labeling, bounding box included

500,000

Images, text

Classification, object detection

2010

[76][77]

M. Everingham et al.

CIFAR-10 Dataset

Many small, low-resolution images of 10 classes of objects.

Classes labelled, training set splits created.

60,000

Images

Classification

2009

[65][78]

A. Krizhevsky et al.

CIFAR-100 Dataset

Like CIFAR-10, above, but 100 classes of objects are given.

Classes labelled, training set splits created.

60,000

Images

Classification

2009

[65][78]

A. Krizhevsky et al.

German Traffic Sign Detection Benchmark Dataset

Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries.

Signs manually labeled

900

Images

Classification

2013

[79][80]

S. Houben et al.

KITTI Vision Benchmark Dataset

Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners.

Many benchmarks extracted from data.

>100 GB of data

Images, text

Classification, object detection

2012

[81][82]

A. Geiger et al.

Handwriting and character recognition

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Artificial Characters Dataset

Artificially generated data describing the structure of 10 capital English letters.

Coordinates of lines drawn given as integers. Various other features.

6000

Text

Handwriting recognition, classification

1992

[83]

H. Guvenir et al.

Letter Dataset

Upper case printed letters.

17 features are extracted from all images.

20,000

Text

OCR, classification

1991

[84][85]

D. Slate et al.

Character Trajectories Dataset

Labeled samples of pen tip trajectories for people writing simple characters.

3-dimensional pen tip velocity trajectory matrix for each sample

2858

Text

Handwriting recognition, classification

2008

[86][87]

B. Williams

Chars74K Dataset

Character recognition in natural images of symbols used in both English and Kannada

74,107

Character recognition, handwriting recognition, OCR, classification

2009

[88]

T. de Campos

UJI Pen Characters Dataset

Isolated handwritten characters

Coordinates of pen position as characters were written given.

11,640

Text

Handwriting recognition, classification

2009

[89][90]

F. Prat et al.

Gisette Dataset

Handwriting samples from the often-confused 4 and 9 characters.

Features extracted from images, split into train/test, handwriting images size-normalized.

13,500

Images, text

Handwriting recognition, classification

2003

[91]

Yann LeCun et al.

MNIST Database

Database of handwritten digits.

Hand-labeled.

60,000

Images, text

Classification

1998

[92][93]

National Institute of Standards and Technology

Optical Recognition of Handwritten Digits Dataset

Normalized bitmaps of handwritten data.

Size normalized and mapped to bitmaps.

5620

Images, text

Handwriting recognition, classification

1998

[94]

E. Alpaydin et al.

Pen-Based Recognition of Handwritten Digits Dataset

Handwritten digits on electronic pen-tablet.

Feature vectors extracted to be uniformly spaced.

10,992

Images, text

Handwriting recognition, classification

1998

[95][96]

E. Alpaydin et al.

Semeion Handwritten Digit Dataset

Handwritten digits from 80 people.

All handwritten digits have been normalized for size and mapped to the same grid.

1593

Images, text

Handwriting recognition, classification

2008

[97]

T. Srl

HASYv2

Handwritten mathematical symbols

All symbols are centered and of size 32px x 32px.

168233

Images, text

Classification

2017

[98]

Martin Thoma

Aerial images

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Aerial Image Segmentation Dataset

80 high-resolution aerial images with spatial resolution ranging from 0.3 m to 1.0 m.

Images manually segmented.

80

Images

Aerial classification, object detection

2013

[99][100]

J. Yuan et al.

KIT AIS Data Set

Multiple labeled training and evaluation datasets of aerial images of crowds.

Images manually labeled to show paths of individuals through crowds.

~ 150

Images with paths

People tracking, aerial tracking

2012

[101][102]

M. Butenuth et al.

Wilt Dataset

Remote sensing data of diseased trees and other land cover.

Various features extracted.

4899

Images

Classification, aerial object detection

2014

[103][104]

B. Johnson

Forest Type Mapping Dataset

Satellite imagery of forests in Japan.

Image wavelength bands extracted.

326

Text

Classification

2015

[105][106]

B. Johnson

Overhead Imagery Research Data Set

Annotated overhead imagery. Images with multiple objects.

Over 30 annotations and over 60 statistics that describe the target within the context of the image.

1000

Images, text

Classification

2009

[107][108]

F. Tanner et al.

Other images

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

MPII Cooking Activities Dataset

Videos and images of various cooking activities.

Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling.

881,755 frames

Labeled video, images, text

Classification

2012

[109][110]

M. Rohrbach et al.

Stanford Dogs Dataset

Images of 120 breeds of dogs from around the world.

Train/test splits and ImageNet annotations provided.

20,580

Images, text

Fine-grain classification

2011

[111][112]

A. Khosla et al.

The Oxford-IIIT Pet Dataset

37 categories of pets with roughly 200 images of each.

Breed labeled, tight bounding box, foreground-background segmentation.

~ 7,400

Images, text

Classification, object detection

2012

[112][113]

O. Parkhi et al.

Corel Image Features Data Set

Database of images with features extracted.

Many features, including color histogram, co-occurrence texture, and color moments.

68,040

Text

Classification, object detection

1999

[114][115]

M. Ortega-Bindenberger et al.

Online Video Characteristics and Transcoding Time Dataset

Transcoding times for a variety of videos and video properties.

Video features given.

168,286

Text

Regression

2015

[116]

T. Deneke et al.

Microsoft Sequential Image Narrative Dataset (SIND)

Dataset for sequential vision-to-language

Descriptive caption and storytelling given for each photo, and photos are arranged in sequences

81,743

Images, text

Visual storytelling

2016

[117]

Microsoft Research

Caltech-UCSD Birds-200-2011 Dataset

Large dataset of images of birds.

Part locations for birds, bounding boxes, 312 binary attributes given

11,788

Images, text

Classification

2011

[118][119]

C. Wah et al.

YouTube-8M

Large and diverse labeled video dataset

YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities

8 million

Video, text

Video classification

2016

[120][121]

S. Abu-El-Haija et al.

YFCC100M

Large and diverse labeled image and video dataset

Flickr videos and images with associated descriptions, titles, tags, and other metadata (such as EXIF and geotags)

100 million

Video, Image, Text

Video and Image classification

2016

[122][123]

B. Thomee et al.

Discrete LIRIS-ACCEDE

Short videos annotated for valence and arousal.

Valence and arousal labels.

9800

Video

Video emotion elicitation detection

2015

[124]

Y. Baveye et al.

Continuous LIRIS-ACCEDE

Long videos annotated for valence and arousal while also collecting Galvanic Skin Response.

Valence and arousal labels.

30

Video

Video emotion elicitation detection

2015

[125]

Y. Baveye et al.

MediaEval LIRIS-ACCEDE

Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films.

Violence, valence and arousal labels.

10900

Video

Video emotion elicitation detection

2015

[126]

Y. Baveye et al.

Text data

Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Amazon reviews

US product reviews from Amazon.com.

None.

~ 82M

Text

Classification, sentiment analysis

2015

[127]

McAuley et al.

OpinRank Review Dataset

Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively.

None.

42,230 / ~259,000 respectively

Text

Sentiment analysis, clustering

2011

[128][129]

K. Ganesan et al.

MovieLens

22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.

None.

~ 22M

Text

Regression, clustering, classification

2016

[130]

GroupLens Research

Yahoo! Music User Ratings of Musical Artists

Over 10M ratings of artists by Yahoo users.

None described.

~ 10M

Text

Clustering, regression

2004

[131][132]

Yahoo!

Car Evaluation Data Set

Car properties and their overall acceptability.

Six categorical features given.

1728

Text

Classification

1997

[133][134]

M. Bohanec

YouTube Comedy Slam Preference Dataset

User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.

Video metadata given.

1,138,562

Text

Classification

2012

[135][136]

Google

Skytrax User Reviews Dataset

User reviews of airlines, airports, seats, and lounges from Skytrax.

Ratings are fine-grain and include many aspects of airport experience.

41396

Text

Classification, regression

2015

[137]

Q. Nguyen

Teaching Assistant Evaluation Dataset

Teaching assistant reviews.

Features of each instance such as class, class size, and instructor are given.

151

Text

Classification

1997

[138][139]

W. Loh et al.

News articles

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

NYSK Dataset

English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.

Filtered and presented in XML format.

10,421

XML, text

Sentiment analysis, topic extraction

2013

[140]

Dermouche, M. et al.

The Reuters Corpus Volume 1

Large corpus of Reuters news stories in English.

Fine-grain categorization and topic codes.

810,000

Text

Classification, clustering, summarization

2002

[141]

Reuters

The Reuters Corpus Volume 2

Large corpus of Reuters news stories in multiple languages.

Fine-grain categorization and topic codes.

487,000

Text

Classification, clustering, summarization

2005

[142]

Reuters

Thomson Reuters Text Research Collection

Large corpus of news stories.

Details not described.

1,800,370

Text

Classification, clustering, summarization

2009

[143]

T. Rose et al.

Saudi Newspapers Corpus

31,030 Arabic newspaper articles.

Metadata extracted.

31,030

JSON

Summarization, clustering

2015

[144]

M. Alhagri

Messages

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Enron Email Dataset

Emails from employees at Enron organized into folders.

Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com.

~ 500,000

Text

Network analysis, sentiment analysis

2004 (2015)

[145][146]

Klimt, B. and Y. Yang

Ling-Spam Dataset

Corpus containing both legitimate and spam emails.

Four versions of the corpus, differing in whether a lemmatiser and a stop-list were enabled.

Text

Classification

2000

[147][148]

Androutsopoulos, J. et al.

SMS Spam Collection Dataset

Collected SMS spam messages.

None.

5574

Text

Classification

2011

[149][150]

T. Almeida et al.

Twenty Newsgroups Dataset

Messages from 20 different newsgroups.

None.

20,000

Text

Natural language processing

1999

[151]

T. Mitchell et al.

Spambase Dataset

Spam emails.

Many text features extracted.

4601

Text

Spam detection, classification

1999

[152]

M. Hopkins et al.
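
The Ling-Spam entry above describes four corpus versions that differ in whether a lemmatiser and a stop-list were applied. The following is a minimal sketch of how two on/off preprocessing switches yield four variants of the same message. The stop-list contents, the `toy_lemmatise` helper, and the function names are hypothetical illustrations, not the tools used by the corpus authors.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical, tiny stop-list for illustration; a real one is much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is"}


def toy_lemmatise(token: str) -> str:
    """Crude stand-in for a lemmatiser: strips a trailing plural 's' only."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def preprocess(text: str, lemmatise: bool, stop_list: bool) -> List[str]:
    """Tokenise on whitespace, then apply the two optional steps."""
    tokens = text.lower().split()
    if lemmatise:
        tokens = [toy_lemmatise(t) for t in tokens]
    if stop_list:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def all_variants(text: str) -> Dict[Tuple[bool, bool], List[str]]:
    """Return the four (lemmatise, stop_list) variants of one message."""
    return {
        (lem, stop): preprocess(text, lem, stop)
        for lem, stop in product((False, True), repeat=2)
    }
```

Calling `all_variants("the offers is free")` returns a dictionary with four keys, one per combination of the two switches, which mirrors how a single raw corpus can be published in four preprocessed forms.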

Twitter and tweets

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Sentiment140

Tweet data from 2009 including original text, time stamp, user and sentiment.

Classified using distant supervision from presence of emoticon in tweet.

1,578,627

Tweets, comma-separated values

Sentiment analysis

2009

[153][154]

A. Go et al.

ASU Twitter Dataset

Twitter network data, not actual tweets. Shows connections between a large number of users.

None.

11,316,811 users, 85,331,846 connections

Text

Clustering, graph analysis

2009

[155][156]

R. Zafarani et al.

SNAP Social Circles: Twitter Database

Large Twitter network data.

Node features, circles, and ego networks.

1,768,149

Text

Clustering, graph analysis

2012

[157][158]

J. McAuley et al.

Twitter Dataset for Arabic Sentiment Analysis

Arabic tweets.

Samples hand-labeled as positive or negative.

2000

Text

Classification

2014

[159][160]

N. Abdulla

Buzz in Social Media Dataset

Data from Twitter and Tom’s Hardware. This dataset focuses on specific buzz topics being discussed on those sites.

Data is windowed so that the user can attempt to predict the events leading up to social media buzz.

140,000

Text

Regression, Classification

2013

[161][162]

F. Kawala et al.

Paraphrase and Semantic Similarity in Twitter (PIT)

This dataset focuses on whether or not two tweets carry (almost) the same meaning or information. Manually labeled.

Tokenization, part-of-speech and named entity tagging

18,762

Text

Regression, Classification

2015

[163][164]

Xu et al.
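
The Sentiment140 entry above says its labels come from distant supervision based on the presence of an emoticon in the tweet. The sketch below shows the general idea under stated assumptions: the emoticon lists and function names are hypothetical, and the dataset creators' exact rules are not given here.

```python
from typing import Optional

# Hypothetical emoticon lists, chosen for illustration only.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(", "=(")


def label_by_emoticon(tweet: str) -> Optional[str]:
    """Return 'positive' or 'negative', or None if the signal is absent or mixed."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        # Either no emoticon at all or both kinds: no usable noisy label.
        return None
    return "positive" if has_pos else "negative"


def strip_emoticons(tweet: str) -> str:
    """Remove the emoticons so a classifier cannot simply memorise them."""
    for e in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        tweet = tweet.replace(e, "")
    return " ".join(tweet.split())
```

The labels produced this way are noisy by construction, which is the trade-off of distant supervision: no manual annotation is needed, but an emoticon is only a proxy for the sentiment of the text.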

Other text

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Legal Case Reports

Federal Court of Australia cases from 2006–2009.

None.

4,000

Text

Summarization, citation analysis

2012

[165][166]

F. Galgani et al.

Blogger Authorship Corpus

Blog entries of 19,320 people from blogger.com.

Blogger self-provided gender, age, industry, and astrological sign.

681,288

Text

Sentiment analysis, summarization, classification

2006

[167][168]

J. Schler et al.

Social Structure of Facebook Networks

Large dataset of the social structure of Facebook.

None.

100 colleges covered

Text

Network analysis, clustering

2012

[169][170]

A. Traud et al.

Dataset for the Machine Comprehension of Text

Stories and associated questions for testing comprehension of text.

None.

660

Text

Natural language processing, machine comprehension

2013

[171][172]

M. Richardson et al.

The Penn Treebank Project

Naturally occurring text annotated for linguistic structure.

Text is parsed into syntactic trees.

~ 1M words

Text

Natural language processing, summarization

1995

[173][174]

M. Marcus et al.

DEXTER Dataset

Task given is to determine, from features given, which articles are about corporate acquisitions.

Features extracted include word stems. Distractor features included.

2600

Text

Classification

2008

[175]

Reuters

Google Books N-grams

N-grams from a very large corpus of books

None.

2.2 TB of text

Text

Classification, clustering, regression

2011

[176][177]

Google

Personae Corpus

Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.

In addition to normal texts, syntactically annotated texts are given.

145

Text

Classification, regression

2008

[178][179]

K. Luyckx et al.

CNAE-9 Dataset

Categorization task for free text descriptions of Brazilian companies.

Word frequency has been extracted.

1080

Text

Classification

2012

[180][181]

P. Ciarelli et al.

Sentiment Labeled Sentences Dataset

3000 sentiment labeled sentences.

Sentiment of each sentence has been hand labeled as positive or negative.

3000

Text

Classification, sentiment analysis

2015

[182][183]

D. Kotzias

BlogFeedback Dataset

Dataset to predict the number of comments a post will receive based on features of that post.

Many features of each post extracted.

60,021

Text

Regression

2014

[184][185]

K. Buza

Stanford Natural Language Inference (SNLI) Corpus

Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs.

Entailment class labels, syntactic parsing by the Stanford PCFG parser

570,000

Text

Natural language inference/recognizing textual entailment

2015

[186]

S. Bowman et al.
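
Several entries above, such as Google Books N-grams and the word frequencies in CNAE-9, rest on counting short token sequences. As a minimal sketch, assuming simple lowercase whitespace tokenisation (an assumption; the datasets' actual tokenisers are not described here), n-gram counts can be computed as follows. The function names are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(text: str, n: int) -> Counter:
    """Count n-grams in text after lowercasing and splitting on whitespace."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))
```

With `n=1` the same function gives plain word frequencies, so a bag-of-words representation is just the unigram case.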

Sound data

Datasets of sounds and sound features.

Speech

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Zero Resource Speech Challenge 2015

Spontaneous speech (English), Read speech (Xitsonga).

Raw WAV

English: 5 h, 12 speakers; Xitsonga: 2 h 30 min, 24 speakers

Sound

Unsupervised discovery of speech features/subword units/word units

2015

[187][188] www.zerospeech.com/2015

Versteegh et al.

Parkinson Speech Dataset

Multiple recordings of people with and without Parkinson’s Disease.

Voice features extracted; disease scored by a physician using the Unified Parkinson’s Disease Rating Scale

1,040

Text

Classification, regression

2013

[189][190]

B. E. Sakar et al.

Spoken Arabic Digits

Spoken Arabic digits from 44 male and 44 female speakers.

Time-series of mel-frequency cepstrum coefficients.

8,800

Text

Classification

2010

[191][192]

M. Bedda et al.

ISOLET Dataset

Spoken letter names.

Features extracted from sounds.

7797

Text

Classification

1994

[193][194]

R. Cole et al.

Japanese Vowels Dataset

Nine male speakers uttered two Japanese vowels successively.

12-degree linear prediction analysis was applied to each utterance to obtain a discrete-time series with 12 cepstrum coefficients.

640

Text

Classification

1999

[195][196]

M. Kudo et al.

Parkinson’s Telemonitoring Dataset

Multiple recordings of people with and without Parkinson’s Disease.

Sound features extracted.

5875

Text

Classification

2009

[197][198]

A. Tsanas et al.

TIMIT

Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences.

Speech is lexically and phonemically transcribed.

6300

Text

Speech recognition, classification.

1986

[199][200]

J. Garofolo et al.

Arabic Speech Corpus

A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level

Speech is orthographically and phonetically transcribed with stress marks.

~1900

Text, WAV

Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education.

2016

[201]

N. Halabi
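
The Japanese Vowels entry above mentions deriving 12 cepstrum coefficients from a 12-degree linear prediction analysis. A common way to do this is the standard LPC-to-cepstrum recursion, sketched below. The sketch assumes an all-pole model H(z) = 1 / (1 - sum_k a_k z^-k) with unit gain; this sign convention and the function name are assumptions, and the dataset authors' exact procedure is not stated here.

```python
from typing import List, Sequence


def lpc_to_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Convert predictor coefficients a_1..a_p into cepstrum c_1..c_n_ceps.

    Uses c_n = a_n + sum_{k=1}^{n-1} (k / n) * c_k * a_{n-k},
    with a_m taken as 0 for m > p.
    """
    p = len(a)
    c: List[float] = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a 12-degree analysis, `lpc_to_cepstrum(coeffs, 12)` with 12 predictor coefficients returns one 12-value frame; stacking the frames of an utterance gives the kind of time series the entry describes.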

Music

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Geographical Original of Music Data Set

Audio features of music samples from different locations.

Audio features extracted using MARSYAS software.

1,059

Text

Geographical classification, clustering

2014

[202][203]

F. Zhou et al.

Million Song Dataset

Audio features from one million different songs.

Audio features extracted.

1M

Text

Classification, clustering

2011

[204][205]

T. Bertin-Mahieux et al.

Free Music Archive

Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text.

Raw audio and audio features.

106,574

Text, MP3

Classification, recommendation

2017

[206]

M. Defferrard et al.

Bach Choral Harmony Dataset

Bach chorale chords.

Audio features extracted.

5665

Text

Classification

2014

[207][208]

D. Radicioni et al.

Other sounds

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

UrbanSound

Labeled sound recordings of sounds like air conditioners, car horns and children playing.

Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file.

1,059

Sound (WAV)

Classification

2014

[209][210]

J. Salamon et al.

Signal data

Datasets containing electric signal information that requires some form of signal processing for further analysis.

Electrical

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Witty Worm Dataset

Dataset detailing the spread of the Witty worm and the infected computers.

Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers.

55,909 IP addresses

Text

Classification

2004

[211][212]

Center for Applied Internet Data Analysis

Cuff-Less Blood Pressure Estimation Dataset

Cleaned vital signals from human patients which can be used to estimate blood pressure.

125 Hz vital signs have been cleaned.

12,000

Text

Classification, regression

2015

[213][214]

M. Kachuee et al.

Gas Sensor Array Drift Dataset

Measurements from 16 chemical sensors utilized in simulations for drift compensation.

Extensive number of features given.

13,910

Text

Classification

2012

[215][216]

A. Vergara

Servo Dataset

Data covering the nonlinear relationships observed in a servo-amplifier circuit.

Levels of various components as a function of other components are given.

167

Text

Regression

1993

[217][218]

K. Ullrich

UJIIndoorLoc-Mag Dataset

Indoor localization database to test indoor positioning systems. Data is magnetic field based.

Train and test splits given.

40,000

Text

Classification, regression, clustering

2015

[219][220]

D. Rambla et al.

Sensorless Drive Diagnosis Dataset

Electrical signals from motors with defective components.

Statistical features extracted.

58,508

Text

Classification

2015

[221][222]

M. Bator

Motion-tracking

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Wearable Computing: Classification of Body Postures and Movements (PUC-Rio)

People performing five standard actions while wearing motion trackers.

None.

165,632

Text

Classification

2013

[223][224]

Pontifical Catholic University of Rio de Janeiro

Gesture Phase Segmentation Dataset

Features extracted from video of people doing various gestures.

Features extracted aim at studying gesture phase segmentation.

9900

Text

Classification, clustering

2014

[225][226]

R. Madeo et al.

Vicon Physical Action Data Set

10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker.

Many parameters recorded by 3D tracker.

3000

Text

Classification

2011

[227][228]

T. Theodoridis

Daily and Sports Activities Dataset

Motion sensor data for 19 daily and sports activities.

Many sensors given, no preprocessing done on signals.

9120

Text

Classification

2013

[229][230]

B. Barshan et al.

Human Activity Recognition Using Smartphones Dataset

Gyroscope and accelerometer data from people wearing smartphones and performing normal actions.

Actions performed are labeled, all signals preprocessed for noise.

10,299

Text

Classification

2012

[231][232]

J. Reyes-Ortiz et al.

Australian Sign Language Signs

Australian sign language signs captured by motion-tracking gloves.

None.

2565

Text

Classification

2002

[233][234]

M. Kadous

Weight Lifting Exercises monitored with Inertial Measurement Units

Five variations of the biceps curl exercise monitored with IMUs.

Some statistics calculated from raw data.

39,242

Text

Classification

2013

[235][236]

W. Ugulino et al.

sEMG for Basic Hand movements Dataset

Two databases of surface electromyographic signals of 6 hand movements.

None.

3000

Text

Classification

2014

[237][238]

C. Sapsanis et al.

REALDISP Activity Recognition Dataset

Data for evaluating techniques that deal with the effects of sensor displacement in wearable activity recognition.

None.

1419

Text

Classification

2014

[238][239]

O. Banos et al.

Heterogeneity Activity Recognition Dataset

Data from multiple different smart devices for humans performing various activities.

None.

43,930,257

Text

Classification, clustering

2015

[240][241]

A. Stisen et al.

Indoor User Movement Prediction from RSS Data

Temporal wireless network data that can be used to track the movement of people in an office.

None.

13,197

Text

Classification

2016

[242][243]

D. Bacciu

PAMAP2 Physical Activity Monitoring Dataset

18 different types of physical activities performed by 9 subjects wearing 3 IMUs.

None.

3,850,505

Text

Classification

2012

[244]

A. Reiss

OPPORTUNITY Activity Recognition Dataset

Data from wearable, object, and ambient sensors, devised to benchmark human activity recognition algorithms.

None.

2551

Text

Classification

2012

[245][246]

D. Roggen et al.

Real World Activity Recognition Dataset

Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors.

None.

3,150,000 (per sensor)

Text

Classification

2016

[247]

T. Sztyler et al.

Other signals

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Wine Dataset

Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.

13 properties of each wine are given

178

Text

Classification, regression

1991

[248][249]

M. Forina et al.

Combined Cycle Power Plant Data Set

Data from various sensors within a power plant running for 6 years.

None

9568

Text

Regression

2014

[250][251]

P. Tufekci et al.

Physical data

Datasets from physical systems.

High-energy physics

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

HIGGS Dataset

Monte Carlo simulations of particle accelerator collisions.

28 features of each collision are given.

11M

Text

Classification

2014

[252][253][254]

D. Whiteson

HEPMASS Dataset

Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise.

28 features of each collision are given.

10,500,000

Text

Classification

2016

[253][254][255]

D. Whiteson

Systems

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Yacht Hydrodynamics Dataset

Yacht performance based on dimensions.

Six features are given for each yacht.

308

Text

Regression

2013

[256][257]

R. Lopez

Robot Execution Failures Dataset

5 data sets that center around robotic failure to execute common tasks.

Integer valued features such as torque and other sensor measurements.

463

Text

Classification

1999

[258]

L. Seabra et al.

Pittsburgh Bridges Dataset

Design description is given in terms of several properties of various bridges.

Various bridge features are given.

108

Text

Classification

1990

[259][260]

Y. Reich et al.

Automobile Dataset

Data about automobiles, their insurance risk, and their normalized losses.

Car features extracted.

205

Text

Regression

1987

[261][262]

J. Schlimmer et al.

Auto MPG Dataset

MPG data for cars.

Eight features of each car given.

398

Text

Regression

1993

[263]

Carnegie Mellon University

Energy Efficiency Dataset

Heating and cooling requirements given as a function of building parameters.

Building parameters given.

768

Text

Classification, regression

2012

[264][265]

A. Xifara et al.

Airfoil Self-Noise Dataset

A series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections.

Data about frequency, angle of attack, etc., are given.

1503

Text

Regression

2014

[266]

R. Lopez

Challenger USA Space Shuttle O-Ring Dataset

Attempt to predict O-ring problems given past Challenger data.

Several features of each flight, such as launch temperature, are given.

23

Text

Regression

1993

[267][268]

D. Draper et al.

Statlog (Shuttle) Dataset

NASA space shuttle datasets.

Nine features given.

58,000

Text

Classification

2002

[269]

NASA

Astronomy

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Volcanoes on Venus – JARtool experiment Dataset

Venus images returned by the Magellan spacecraft.

Images are labeled by humans.

not given

Images

Classification

1991

[270][271]

M. Burl

MAGIC Gamma Telescope Dataset

Monte Carlo generated high-energy gamma particle events.

Numerous features extracted from the simulations.

19,020

Text

Classification

2007

[271][272]

R. Bock

Solar Flare Dataset

Measurements of the number of certain types of solar flare events occurring in a 24-hour period.

Many solar flare-specific features are given.

1389

Text

Regression, classification

1989

[273]

G. Bradshaw

Earth science

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Volcanoes of the World

Volcanic eruption data for all known volcanic events on earth.

Details such as region, subregion, tectonic setting, dominant rock type are given.

1535

Text

Regression, classification

2013

[274]

E. Venzke et al.

Seismic-bumps Dataset

Seismic activities from a coal mine.

Seismic activity was classified as hazardous or not.

2584

Text

Classification

2013

[275][276]

M. Sikora et al.

Other physical

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Concrete Compressive Strength Dataset

Dataset of concrete properties and compressive strength.

Nine features are given for each sample.

1030

Text

Regression

2007

[277][278]

I. Yeh

Concrete Slump Test Dataset

Concrete slump flow given in terms of properties.

Features of concrete given such as fly ash, water, etc.

103

Text

Regression

2009

[279][280]

I. Yeh

Musk Dataset

Predict if a molecule, given the features, will be a musk or a non-musk.

168 features given for each molecule.

6598

Text

Classification

1994

[281]

Arris Pharmaceutical Corp.

Steel Plates Faults Dataset

Steel plates of 7 different types.

27 features given for each sample.

1941

Text

Classification

2010

[282]

Semeion Research Center

Biological data

Datasets from biological systems.

Human

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

EEG Database

Study to examine EEG correlates of genetic predisposition to alcoholism.

Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second.

122

Text

Classification

1999

[283][284]

H. Begleiter

P300 Interface Dataset

Data from nine subjects collected using P300-based brain-computer interface for disabled subjects.

Split into four sessions for each subject. MATLAB code given.

1,224

Text

Classification

2008

[285][286]

U. Hoffman et al.

Heart Disease Data Set

Attributes of patients with and without heart disease.

75 attributes given for each patient with some missing values.

303

Text

Classification

1988

[287][288]

A. Janosi et al.

Breast Cancer Wisconsin (Diagnostic) Dataset

Dataset of features of breast masses. A physician's diagnosis is given for each.

10 features for each sample are given.

569

Text

Classification

1995

[289][290]

W. Wolberg et al.

National Survey on Drug Use and Health

Large scale survey on health and drug use in the United States.

None.

55,268

Text

Classification, regression

2012

[291]

United States Department of Health and Human Services

Lung Cancer Dataset

Lung cancer dataset without attribute definitions

56 features are given for each case

32

Text

Classification

1992

[292][293]

Z. Hong et al.

Arrhythmia Dataset

Data for a group of patients, of which some have cardiac arrhythmia.

276 features for each instance.

452

Text

Classification

1998

[294][295]

H. Altay et al.

Diabetes 130-US hospitals for years 1999–2008 Dataset

Ten years (1999–2008) of readmission data across 130 US hospitals for patients with diabetes.

Many features of each readmission are given.

100,000

Text

Classification, clustering

2014

[296][297]

J. Clore et al.

Diabetic Retinopathy Debrecen Dataset

Features extracted from images of eyes with and without diabetic retinopathy.

Features extracted and conditions diagnosed.

1151

Text

Classification

2014

[298][299]

B. Antal et al.

Liver Disorders Dataset

Data for people with liver disorders.

Seven biological features given for each patient.

345

Text

Classification

1990

[300][301]

Bupa Medical Research Ltd.

Thyroid Disease Dataset

10 databases of thyroid disease patient data.

None.

7200

Text

Classification

1987

[302][303]

R. Quinlan

Mesothelioma Dataset

Mesothelioma patient data.

Large number of features, including asbestos exposure, are given.

324

Text

Classification

2016

[304][305]

A. Tanrikulu et al.

KEGG Metabolic Reaction Network (Undirected) Dataset

Network of metabolic pathways. A reaction network and a relation network are given.

Detailed features for each network node and pathway are given.

65,554

Text

Classification, clustering, regression

2011

[306]

M. Naeem et al.

Animal

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Abalone Dataset

Physical measurements of abalone. Weather patterns and location are also given.

None.

4177

Text

Regression

1995

[307]

Marine Research Laboratories – Taroona

Zoo Dataset

Artificial dataset covering 7 classes of animals.

Animals are classed into 7 categories and features are given for each.

101

Text

Classification

1990

[308]

R. Forsyth

Demospongiae Dataset

Data about marine sponges.

503 sponges in the Demosponge class are described by various features.

503

Text

Classification

2010

[309]

E. Armengol et al.

Splice-junction Gene Sequences Dataset

Primate splice-junction gene sequences (DNA) with associated imperfect domain theory.

None.

3190

Text

Classification

1992

[293]

G. Towell et al.

Mice Protein Expression Dataset

Expression levels of 77 proteins measured in the cerebral cortex of mice.

None.

1080

Text

Classification, Clustering

2015

[310][311]

C. Higuera et al.

Plant

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Forest Fires Dataset

Forest fires and their properties.

13 features of each fire are extracted.

517

Text

Regression

2008

[312][313]

P. Cortez et al.

Iris Dataset

Three types of iris plants are described by 4 different attributes.

None.

150

Text

Classification

1936

[314][315]

R. Fisher

Plant Species Leaves Dataset

Sixteen leaf samples for each of one hundred plant species.

Shape descriptor, fine-scale margin, and texture histograms are given.

1600

Text

Classification

2012

[316][317]

J. Cope et al.

Mushroom Dataset

Mushroom attributes and classification.

Many properties of each mushroom are given.

8124

Text

Classification

1987

[318]

J. Schlimmer

Soybean Dataset

Database of diseased soybean plants.

35 features for each plant are given. Plants are classified into 19 categories.

307

Text

Classification

1988

[319]

R. Michalski et al.

Seeds Dataset

Measurements of geometrical properties of kernels belonging to three different varieties of wheat.

None.

210

Text

Classification, clustering

2012

[320][321]

Charytanowicz et al.

Covertype Dataset

Data for predicting forest cover type strictly from cartographic variables.

Many geographical features given.

581,012

Text

Classification

1998

[322][323]

J. Blackard et al.

Abscisic Acid Signaling Network Dataset

Data for a plant signaling network. Goal is to determine set of rules that governs the network.

None.

300

Text

Causal-discovery

2008

[324]

J. Jenkins et al.

Folio Dataset

20 photos of leaves for each of 32 species.

None.

637

Images, text

Classification, clustering

2015

[325][326]

T. Munisami et al.

Oxford Flower Dataset

17 category dataset of flowers.

Train/test splits and labeled images.

1360

Images, text

Classification

2006

[113][327]

M-E Nilsback et al.

Microbe

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Ecoli Dataset

Protein localization sites.

Various features of the protein localization sites are given.

336

Text

Classification

1996

[328][329]

K. Nakai et al.

MicroMass Dataset

Identification of microorganisms from mass-spectrometry data.

Various mass spectrometer features.

931

Text

Classification

2013

[330][331]

P. Mahe et al.

Yeast Dataset

Prediction of cellular localization sites of proteins.

Eight features given per instance.

1484

Text

Classification

1996

[332][333]

K. Nakai et al.

Drug Discovery

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Tox21 Dataset

Prediction of outcome of biological assays.

Chemical descriptors of molecules are given.

12707

Text

Classification

2016

[334]

A. Mayr et al.

Anomaly data

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Numenta Anomaly Benchmark (NAB)

Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.

?

50+ files

Comma separated values

Anomaly detection

2016 (continually updated)

[335]

Numenta
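To illustrate what working with ordered, timestamped, single-valued metrics can look like, the Python sketch below reads a small CSV and flags values that lie far from the mean. The `timestamp,value` layout, the sample numbers, the `flag_anomalies` helper, and the z-score threshold are all assumptions made for this example; a global z-score is only a simple baseline and is not the benchmark's own scoring method.

```python
import csv
import io
import statistics

# Hypothetical sample in an assumed "timestamp,value" layout.
SAMPLE = """timestamp,value
2016-01-01 00:00:00,10.0
2016-01-01 00:05:00,11.0
2016-01-01 00:10:00,9.0
2016-01-01 00:15:00,10.0
2016-01-01 00:20:00,50.0
2016-01-01 00:25:00,10.0
"""


def flag_anomalies(rows, threshold=2.0):
    """Return timestamps whose value is more than `threshold` population
    standard deviations away from the mean of all values."""
    values = [float(row["value"]) for row in rows]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # A constant series has no outliers under this rule.
    return [
        row["timestamp"]
        for row, value in zip(rows, values)
        if abs(value - mean) / stdev > threshold
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
anomalies = flag_anomalies(rows)
print(anomalies)
```

Because the mean and standard deviation are computed over the whole series, this baseline cannot run in a streaming setting; a rolling window would be one way to adapt it.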

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. Typically used for regression analysis or classification but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
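The rows-of-observations, columns-of-attributes layout described above can be sketched in a few lines of Python. The table contents, the column names (`rooms`, `age`, `price`), and the helper `split_features_target` are hypothetical and exist only to show how one attribute is typically set aside as the target for regression or classification.

```python
import csv
import io

# Hypothetical table: each row is one observation, each column one attribute.
TABLE = """rooms,age,price
3,40,210.0
4,15,305.5
2,60,150.0
"""


def split_features_target(text, target):
    """Parse CSV text and separate the target column from the feature columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    feature_names = [name for name in rows[0] if name != target]
    features = [[float(row[name]) for name in feature_names] for row in rows]
    targets = [float(row[target]) for row in rows]
    return feature_names, features, targets


feature_names, X, y = split_features_target(TABLE, target="price")
print(feature_names, X, y)
```

The resulting `X` and `y` lists have one entry per observation, which is the shape most regression and classification routines expect as input.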

Financial

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Dow Jones Index

Weekly data of stocks from the first and second quarters of 2011.

Calculated values such as percentage change and lags are included.

750

Comma separated values

Classification, regression, time series

2014

[336][337]

M. Brown et al.

Statlog (Australian Credit Approval)

Credit card applications either accepted or rejected and attributes about the application.

Attribute names are removed as well as identifying information. Factors have been relabeled.

690

Comma separated values

Classification

1987

[338][339]

R. Quinlan

eBay auction data

Auction data for various eBay.com items across auctions of various lengths.

Contains all bids, bidderID, bid times, and opening prices.

~ 550

Text

Regression, classification

2012

[340][341]

G. Shmueli et al.

Statlog (German Credit Data)

Binary credit classification into “good” or “bad” with many features

Various financial features of each person are given.

1,000

Text

Classification

1994

[342]

H. Hofmann

Bank Marketing Dataset

Data from a large marketing campaign carried out by a large bank.

Many attributes of the clients contacted are given. Whether the client subscribed to the bank's term deposit is also given.

45,211

Text

Classification

2012

[343][344]

S. Moro et al.

Istanbul Stock Exchange Dataset

Several stock indexes tracked for almost two years.

None.

536

Text

Classification, regression

2013

[345][346]

O. Akbilgic

Default of Credit Card Clients

Credit default data for Taiwanese creditors.

Various features about each account are given.

30,000

Text

Classification

2016

[347][348]

I. Yeh

Weather

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Cloud DataSet

Data about 1024 different clouds.

Image features extracted.

1024

Text

Classification, clustering

1989

[349]

P. Collard

El Nino Dataset

Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.

12 weather attributes are measured at each buoy.

178080

Text

Regression

1999

[350]

Pacific Marine Environmental Laboratory

Greenhouse Gas Observing Network Dataset

Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather.

None.

2921

Text

Regression

2015

[351]

D. Lucas

Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory

Continuous air samples in Hawaii, USA. 44 years of records.

None.

44 years

Text

Regression

2001

[352]

Mauna Loa Observatory

Ionosphere Dataset

Radar data from the ionosphere. Task is to classify into good and bad radar returns.

Many radar features given.

351

Text

Classification

1989

[303][353]

Johns Hopkins University

Ozone Level Detection Dataset

Two ground ozone level datasets.

Many features given, including weather conditions at time of measurement.

2536

Text

Classification

2008

[354][355]

K. Zhang et al.

Census

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Adult Dataset

Census data from 1994 containing demographic features of adults and their income.

Cleaned and anonymized.

48,842

Comma separated values

Classification

1996

[356]

United States Census Bureau

Census-Income (KDD)

Weighted census data from the 1994 and 1995 Current Population Surveys.

Split into training and test sets.

299,285

Comma separated values

Classification

2000

[357][358]

United States Census Bureau

IPUMS Census Database

Census data from the Los Angeles and Long Beach areas.

None

256,932

Text

Classification, regression

1999

[359]

IPUMS

US Census Data 1990

Partial data from 1990 US census.

Results randomized and useful attributes selected.

2,458,285

Text

Classification, regression

1990

[360]

United States Census Bureau

Transit

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Bike Sharing Dataset

Hourly and daily count of rental bikes in a large city.

Many features, including weather, length of trip, etc., are given.

17,389

Text

Regression

2013

[361][362]

H. Fanaee-T

New York City Taxi Trip Data

Trip data for yellow and green taxis in New York City.

Gives pick up and drop off locations, fares, and other details of trips.

6 years

Text

Classification, clustering

2015

[363]

New York City Taxi and Limousine Commission

Taxi Service Trajectory ECML PKDD

Trajectories of all taxis in a large city.

Many features given, including start and stop points.

1,710,671

Text

Clustering, causal-discovery

2015

[364][365]

M. Ferreira et al.

Internet

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Webpages from Common Crawl 2012

Large collection of webpages and how they are connected via hyperlinks

None.

3.5B

Text

Clustering, classification

2013

[366]

V. Granville

Internet Advertisements Dataset

Dataset for predicting if a given image is an advertisement or not.

Features encode geometry of ads and phrases occurring in the URL.

3279

Text

Classification

1998

[367][368]

N. Kushmerick

Internet Usage Dataset

General demographics of internet users.

None.

10,104

Text

Classification, clustering

1999

[369]

D. Cook

URL Dataset

120 days of URL data from a large conference.

Many features of each URL are given.

2,396,130

Text

Classification

2009

[370][371]

J. Ma

Phishing Websites Dataset

Dataset of phishing websites.

Many features of each site are given.

2456

Text

Classification

2015

[372]

R. Mustafa et al.

Online Retail Dataset

Online transactions for a UK online retailer.

Details of each transaction given.

541,909

Text

Classification, clustering

2015

[373]

D. Chen

Freebase Simple Topic Dump

Freebase is an online effort to structure all human knowledge.

Topics from Freebase have been extracted.

large

Text

Classification, clustering

2011

[374][375]

Freebase

Farm Ads Dataset

The text of farm ads from websites. Binary approval or disapproval by content owners is given.

SVMlight sparse vectors of text words in ads calculated.

4143

Text

Classification

2011

[376][377]

C. Mesterharm et al.

Games

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Poker Hand Dataset

5 card hands from a standard 52 card deck.

Attributes of each hand are given, including the Poker hands formed by the cards it contains.

1,025,010

Text

Regression, classification

2007

[378]

R. Cattral

Connect-4 Dataset

Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced.

None.

67,557

Text

Classification

1995

[379]

J. Tromp

Chess (King-Rook vs. King) Dataset

Endgame Database for White King and Rook against Black King.

None.

28,056

Text

Classification

1994

[380][381]

M. Bain et al.

Chess (King-Rook vs. King-Pawn) Dataset

King+Rook versus King+Pawn on a7.

None.

3196

Text

Classification

1989

[382]

R. Holte

Tic-Tac-Toe Endgame Dataset

Binary classification for win conditions in tic-tac-toe.

None.

958

Text

Classification

1991

[383]

D. Aha
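The Tic-Tac-Toe Endgame entry above frames the task as binary classification of win conditions. As a rough sketch of how such a label could be derived from a board, the Python snippet below checks whether one player holds a complete line. The flat nine-cell list, the cell symbols `"x"`, `"o"`, and `"b"` (blank), and the function name `x_wins` are assumptions for illustration rather than a description of the dataset's file format.

```python
# Index layout of the assumed flat board:
# 0 1 2
# 3 4 5
# 6 7 8
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_wins(board):
    """Return True if 'x' occupies all three cells of any winning line."""
    return any(all(board[i] == "x" for i in line) for line in WIN_LINES)


# Two hypothetical end-of-game boards.
top_row_win = ["x", "x", "x", "o", "o", "b", "b", "b", "b"]
drawn_game = ["x", "o", "x", "o", "o", "x", "x", "x", "o"]
print(x_wins(top_row_win), x_wins(drawn_game))
```

A learned classifier would have to recover this rule from labeled examples alone, which is what makes the small dataset a convenient test case.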

Other multivariate

Dataset Name

Brief description

Preprocessing

Instances

Format

Default Task

Created (updated)

Reference

Creator

Housing Data Set

Median home values of Boston with associated home and neighborhood attributes.

None.

506

Text

Regression

1993

[384]

D. Harrison et al.

The Getty Vocabularies

Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials.

None.

large

Text

Classification

2015

[385]

Getty Center

Yahoo! Front Page Today Module User Click Log

User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page.

Conjoint analysis with a bilinear model.

45,811,883 user visits

Text

Regression, clustering

2009

[386][387]

Chu et al.

British Oceanographic Data Centre

Biological, chemical, physical and geophysical data for oceans. 22K variables tracked.

Various.

22K variables, many instances

Text

Regression, clustering

2015

[388]

British Oceanographic Data Centre

Congressional Voting Records Dataset

Voting data for all USA representatives on 16 issues.

Beyond the raw voting data, various other features are provided.

435

Text

Classification

1987

[389]

J. Schlimmer

Entree Chicago Recommendation Dataset

Record of user interactions with Entree Chicago recommendation system.

Each user's usage of the app is recorded in detail.

50,672

Text

Regression, recommendation

2000

[390]

R. Burke

Insurance Company Benchmark (COIL 2000)

Information on customers of an insurance company.

Many features of each customer and the services they use.

9,000

Text

Regression, classification

2000

[391][392]

P. van der Putten

Nursery Dataset

Data from applicants to nursery schools.

Data about applicant’s family and various other factors included.

12,960

Text

Classification

1997

[393][394]

V. Rajkovic et al.

University Dataset

Data describing attributes of a large number of universities.

None.

285

Text

Clustering, classification

1988

[395]

S. Sounders et al.

Blood Transfusion Service Center Dataset

Data from a blood transfusion service center. Gives data on donors' return rate, frequency, etc.

None.

748

Text

Classification

2008

[396][397]

I. Yeh

Record Linkage Comparison Patterns Dataset

Large dataset of records. Task is to link relevant records together.

Blocking procedure applied to select only certain record pairs.

5,749,132

Text

Classification

2011

[398][399]

University of Mainz

Nomao Dataset

Nomao collects data about places from many different sources. Task is to detect items that describe the same place.

Duplicates labeled.

34,465

Text

Classification

2012

[400][401]

Nomao Labs

Movie Dataset

Data for 10,000 movies.

Several features for each movie are given.

10,000

Text

Clustering, classification

1999

[402]

G. Wiederhold

Open University Learning Analytics Dataset

Information about students and their interactions with a virtual learning environment.

None.

~ 30,000

Text

Classification, clustering, regression

2015

[403][404]

J. Kuzilek et al.


