Common Machine Learning Datasets

MNIST

A dataset of 70,000 28x28 grayscale images of handwritten digits (60,000 for training, 10,000 for testing), widely used to train and test image classification models.

CIFAR-10

A dataset consisting of 60,000 32x32 color images in 10 different classes, used for computer vision tasks such as object recognition.
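
As a rough illustration of the figures on this card, the sketch below works out how many images each class holds and how large the raw pixel data is. It assumes the classes are perfectly balanced and that each pixel is stored as three one-byte RGB channels; the function and constant names are made up for this example.

```python
# Figures from the card: 60,000 color images, 32x32 pixels, 10 classes.
NUM_IMAGES = 60_000
HEIGHT, WIDTH = 32, 32
NUM_CLASSES = 10
CHANNELS = 3  # assumption: RGB, one byte per channel


def images_per_class(num_images: int, num_classes: int) -> int:
    """Images in each class, assuming the classes are perfectly balanced."""
    return num_images // num_classes


def raw_size_bytes(num_images: int, height: int, width: int, channels: int) -> int:
    """Size of the uncompressed pixel data, one byte per channel value."""
    return num_images * height * width * channels


per_class = images_per_class(NUM_IMAGES, NUM_CLASSES)
total_bytes = raw_size_bytes(NUM_IMAGES, HEIGHT, WIDTH, CHANNELS)

print(per_class)    # 6000
print(total_bytes)  # 184320000
```

At under 200 MB of raw pixels, the whole dataset fits comfortably in memory on most machines.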

LibriSpeech

A dataset of roughly 1,000 hours of read English speech derived from audiobooks, used for training and evaluating speech recognition systems.

Yelp Review Dataset

A dataset consisting of user reviews for businesses across 11 metropolitan areas on Yelp, useful for sentiment analysis and recommendation systems.

MS COCO

Microsoft COCO, the full name of the COCO (Common Objects in Context) dataset: a large-scale dataset for computer vision tasks such as object detection, segmentation, and captioning. It is the same dataset as COCO, not a separate or more frequently updated variant.

Google Open Images

A dataset of millions of annotated images spanning a broad range of categories, useful for training machine learning models for large-scale visual recognition.

Sentiment140

A dataset containing 1.6 million tweets labeled with sentiment, created for sentiment analysis of social media text.

ImageNet

A large image database built for visual object recognition research, with more than 14 million images in over 20,000 categories.

IMDb

A dataset containing 50,000 movie reviews for natural language processing or sentiment analysis, divided evenly into positive and negative reviews.
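
One consequence of the even split is that a classifier which always predicts the same label scores exactly 50% accuracy, so that is the floor any real model has to beat. The sketch below shows this on synthetic labels with the proportions stated on the card; it does not load the actual reviews, and the function name is made up for this example.

```python
from collections import Counter

# Synthetic labels with the card's proportions: 50,000 reviews, split evenly.
labels = ["pos"] * 25_000 + ["neg"] * 25_000


def majority_baseline_accuracy(labels: list[str]) -> float:
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.5
```

On an imbalanced dataset the same function returns a higher number, which is why accuracy alone is easier to interpret on a balanced set like this one.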

UCI Machine Learning Repository

A collection of databases, domain theories, and data generators widely used by the machine learning community for empirical analysis of machine learning algorithms.

COCO (Common Objects in Context)

A large-scale dataset for object detection, segmentation, and captioning, containing over 200,000 labeled images across 80 categories.

LFW (Labeled Faces in the Wild)

A database designed for studying the problem of unconstrained face recognition with more than 13,000 images of faces collected from the web.

20 Newsgroups

A collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups, suitable for text classification and clustering.

Boston Housing

A dataset of 506 census tracts in the Boston area, each described by 13 features such as crime rate and average number of rooms, used for regression analysis to predict median home values.

Stanford Dogs Dataset

A dataset of over 20,000 images covering 120 dog breeds from around the world, used for fine-grained image classification.
