Big Data Technologies
MapReduce
A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce job usually splits the input data into independent chunks, which map tasks process in a completely parallel manner. The framework then sorts and groups the map outputs by key, and the grouped outputs become the input to the reduce tasks. Commonly used for data mining tasks such as large-scale graph processing and text processing.
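The map, sort, and reduce steps described on this card can be illustrated with a single-process word count. This is only a minimal sketch of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) and the sample chunks are hypothetical, and the chunks are processed in a plain loop rather than in parallel on a cluster.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_pairs):
    """Shuffle/sort: group all values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle/sort, and reduce over a list of independent chunks."""
    mapped = []
    # On a real cluster each chunk would go to a separate map task;
    # here they are processed one after another (illustrative assumption).
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


# Hypothetical input: two independent chunks of text.
chunks = ["big data big cluster", "data cluster data"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases could in principle be spread across many machines, which is the property the model relies on.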
Kafka
A distributed streaming platform that is used to build real-time streaming data pipelines and applications. Kafka is capable of handling trillions of events a day, enabling businesses to process and analyze streaming data. Kafka is useful in data mining for real-time data feeds, log aggregation, and operational metrics.
Pig
A high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig enables people to focus more on analyzing bulk data sets and spend less time writing MapReduce programs. It is particularly good for performing data mining tasks where the data is being transformed and preprocessed.
HBase
An open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is designed to scale to billions of rows by millions of columns on commodity hardware. HBase is used in data mining for real-time read/write access to big data.
Hive
A data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop, including text files, HBase, and Amazon S3. It is used in data mining to perform queries and analysis of large datasets stored in Hadoop's HDFS.
Spark
An open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning, which makes it a powerful tool for data mining.
Flink
An open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. It excels at processing unbounded and bounded data sets, making it a great choice for data mining applications that require real-time stream processing and stateful computations.
Storm
A distributed real-time computation system for processing large streams of data. Storm is designed for use with any programming language and is often used in real-time analytics, online machine learning, continuous computation, and more. Its use in data mining is particularly valuable for streaming data and real-time analytics.
Hadoop
An open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is commonly used in data mining for handling massive quantities of data.
Mahout
A library for scalable machine learning and data mining that was originally built on top of Hadoop. Using the Mahout library, data scientists can perform machine learning tasks such as clustering, classification, and collaborative filtering (recommendation) on large-scale datasets.
© Hypatia.Tech. 2024 All rights reserved.