Discover published sets by community

Explore tens of thousands of sets crafted by our community.

Categories

Computer Science

Data Mining

Association Rule Learning

Anomaly Detection Methods

Big Data Technologies

Clustering Algorithms

Classification Algorithms

Data Dimensionality Reduction Techniques

Data Mining Project Lifecycle

Data Mining Success Metrics

Data Mining Techniques

Data Mining and Customer Relationship Management (CRM)

Data Mining for Fraud Detection

Data Mining in Finance

Data Quality Issues

Data Mining in Retail

Data Mining in Healthcare

Data Mining Research Topics

Data Mining Tools and Software

Data Mining Skill Set

Data Mining Case Studies

Data Preprocessing Techniques

Data Mining Challenges

Ensuring Data Privacy in Data Mining

Ethical Aspects of Data Mining

Frequent Pattern Mining Algorithms

Machine Learning vs Data Mining

Mining Complex Data Types

Multimedia Data Mining

Predictive Analytics Fundamentals

Time Series Analysis in Data Mining

Web Mining Methods

Social Media Data Mining

Spatial Data Mining

Text Mining Techniques

Big Data Technologies

Flashcards

0/10

Still learning

MapReduce

A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce job usually splits the input data into independent chunks which are processed in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Commonly used for data mining tasks such as large-scale graph processing and text processing.

Kafka

A distributed streaming platform that is used to build real-time streaming data pipelines and applications. Kafka is capable of handling trillions of events a day, enabling businesses to process and analyze streaming data. Kafka is useful in data mining for real-time data feeds, log aggregation, and operational metrics.

Pig

A high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig enables people to focus more on analyzing bulk data sets and spend less time writing MapReduce programs. It is particularly good for performing data mining tasks where the data is being transformed and preprocessed.

HBase

An open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is designed to scale to billions of rows x millions of columns, atop commodity hardware. HBase is used in data mining for real-time read/write access to big data.

Hive

A data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop, including text files, HBase, and Amazon S3. It is used in data mining to perform queries and analysis of large datasets stored in Hadoop's HDFS.

Spark

An open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning, which makes it a powerful tool for data mining.

Flink

An open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. It excels at processing unbounded and bounded data sets, making it a great choice for data mining applications that require real-time stream processing and stateful computations.

Storm

A distributed real-time computation system for processing large streams of data. Storm is designed for use with any programming language and is often used in real-time analytics, online machine learning, continuous computation, and more. Its use in data mining is particularly valuable for streaming data and real-time analytics.

Hadoop

An open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is commonly used in data mining for handling massive quantities of data.

Mahout

A library for scalable machine learning and data mining that is built on top of Hadoop. Using the Mahout library, data scientists can perform machine learning tasks such as clustering, classification, and collaborative filtering (recommendation) on large-scale datasets.

Know

Still learning

Click to flip

Know