Data Preprocessing Techniques
Dimensionality Reduction
Dimensionality Reduction techniques reduce the number of variables under consideration and can be divided into feature selection and feature extraction. Common methods include PCA and t-SNE (the latter mainly for visualization); reducing dimensionality can improve model performance and computational efficiency.
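A minimal PCA sketch, assuming scikit-learn and NumPy are available (the library choice is illustrative, not part of the card):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)               # 100 samples, 10 features
pca = PCA(n_components=2)                 # keep the 2 strongest directions
X_reduced = pca.fit_transform(X)          # shape: (100, 2)
print(pca.explained_variance_ratio_)      # variance captured per component
```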
Data Cleaning
Data Cleaning involves removing errors and inconsistencies in data to improve its quality, often by dealing with missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
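A minimal cleaning sketch with pandas (an assumed tool; the toy DataFrame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 25, np.nan, 300, 41],
                   "city": ["NY", "NY", "LA", "LA", None]})
df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # fill missing ages
df = df[df["age"].between(0, 120)]                # drop an implausible outlier
df["city"] = df["city"].fillna("unknown")         # resolve missing categories
```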
Feature Extraction
Feature Extraction involves transforming raw data into a set of features that can be effectively used by a learning algorithm. Examples include PCA for dimensionality reduction or extracting words from text.
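A minimal sketch of the text example, assuming scikit-learn's CountVectorizer as one way to extract word features:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data cleaning removes errors",
        "feature extraction builds usable features"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the extracted word features
```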
Missing Value Imputation
Missing Value Imputation is the process of replacing missing data with substituted values. Techniques include substituting the mean, median, or mode, or estimating missing values with algorithms such as k-NN or predictive models.
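A minimal sketch of mean and k-NN imputation, assuming scikit-learn:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)  # column means
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)        # k-NN estimate
```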
Standardization
Standardization rescales data to have a mean of 0 and a standard deviation of 1, and is also known as Z-score normalization. It is useful for methods that assume features are centered around zero with equal variance.
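A minimal sketch, assuming scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_std = StandardScaler().fit_transform(X)  # Z-scores: mean 0, std 1
print(X_std.mean(), X_std.std())           # ~0.0 and 1.0
```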
Feature Encoding
Feature Encoding transforms categorical variables into a numerical format. Common techniques include one-hot encoding and label encoding, which make the data usable by mathematical models that require numerical input.
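A minimal sketch of both techniques, assuming scikit-learn >= 1.2 (for the sparse_output parameter):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
one_hot = OneHotEncoder(sparse_output=False).fit_transform(colors)
labels = LabelEncoder().fit_transform(colors.ravel())  # blue=0, green=1, red=2
```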
Data Transformation
Data Transformation involves converting data into a suitable format or structure for analysis. This can include aggregating data, combining features, or transforming variable types.
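A minimal sketch of aggregation, feature combination, and type conversion with pandas (the toy data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B"],
                   "units": [3, 5, 2],
                   "price": [10.0, 10.0, 12.5]})
df["revenue"] = df["units"] * df["price"]       # combine two features
df["store"] = df["store"].astype("category")    # transform the variable type
totals = df.groupby("store", observed=True)["revenue"].sum()  # aggregate
```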
Noise Filtering
Noise Filtering is the process of removing irregularities and fluctuations in data that may hinder the data analysis process. Methods include smoothing techniques like moving averages or more complex signal processing filters.
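A minimal moving-average sketch, assuming NumPy and pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = pd.Series(np.sin(np.linspace(0, 6, 100))  # clean signal
                   + rng.normal(0, 0.3, 100))      # plus noise
smoothed = signal.rolling(window=5, center=True).mean()  # moving average
```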
Binarization
Binarization is the process of turning data features into binary (0/1) values based on a threshold. Commonly used when we need to transform probabilistic values into a crisp, boolean decision.
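A minimal thresholding sketch, assuming scikit-learn's Binarizer:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

probs = np.array([[0.1], [0.65], [0.4], [0.9]])
decisions = Binarizer(threshold=0.5).fit_transform(probs)  # 0/1 per value
```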
Discretization
Discretization converts continuous features into discrete values by creating a set of intervals or 'bins' to categorize or group the continuous values. It is common when preparing data for algorithms that expect categorical input.
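A minimal binning sketch, assuming scikit-learn's KBinsDiscretizer:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[3.0], [17.0], [25.0], [42.0], [68.0]])
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = binner.fit_transform(ages)  # each age mapped to a bin index 0-2
```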
Outlier Detection
Outlier Detection identifies data points that are significantly different from the majority of the data, potentially indicating variability in measurement or experimental errors. Techniques range from the interquartile range (IQR) rule to clustering-based methods.
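A minimal sketch of the IQR rule with NumPy (the sample values are hypothetical):

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])      # 95 looks suspicious
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)                             # -> [95]
```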
Normalization
Normalization adjusts the scale of features to a standard range, often between 0 and 1, to allow fair comparison between features measured on different scales. It is used to prepare data for algorithms that are sensitive to feature scale, such as k-NN or gradient-based methods.
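A minimal min-max sketch, assuming scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [9.0]])
X_norm = MinMaxScaler().fit_transform(X)  # rescaled to [0, 1]
print(X_norm.ravel())                     # -> [0.0, 0.5, 1.0]
```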