Data Preprocessing Methods
Data Transformation
Purpose: To convert data into a suitable format or structure for analysis. Method: May include normalization, aggregation, generalization, or feature construction.
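Aggregation, one of the transformations named above, can be sketched in a few lines. The record layout and the field names `store` and `amount` are illustrative assumptions, not part of the original card.

```python
from collections import defaultdict

def aggregate_sum(records, key, value):
    """Sum the `value` field of each record, grouped by the `key` field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)

# Hypothetical sales records, one row per transaction.
sales = [
    {"store": "a", "amount": 10.0},
    {"store": "b", "amount": 7.0},
    {"store": "a", "amount": 5.0},
]
totals = aggregate_sum(sales, key="store", value="amount")
```

Here three transaction rows collapse into one total per store, which is often the granularity an analysis actually needs.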
Text Tokenization
Purpose: To break down text into smaller units called tokens. Method: Natural Language Processing (NLP) techniques split strings into words, phrases, symbols, or other meaningful elements.
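A minimal sketch of word-level tokenization using a regular expression. The pattern is one simple choice for English text and is an assumption of this example; production NLP tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Lowercase the text and return word tokens, keeping simple contractions whole."""
    # Runs of letters/digits, optionally followed by an apostrophe suffix ("don't").
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Don't split URLs, they said.")
# tokens == ["don't", "split", "urls", "they", "said"]
```

Punctuation is dropped here; a tokenizer that must keep symbols as tokens would need a broader pattern.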
Normalization
Purpose: To scale numerical data to a standard range without distorting differences in the ranges of values. Method: Commonly uses min-max scaling, which subtracts the minimum and divides by the range, bringing values into the interval 0 to 1 (x_{normalized} = \frac{x - \min(X)}{\max(X) - \min(X)}). Subtracting the mean and dividing by the standard deviation is standardization, which does not bound values to 0 to 1.
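A minimal sketch of min-max scaling, one common way to bring values into the range 0 to 1. Mapping a constant column to all zeros is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: no spread to rescale (assumed convention: all zeros).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 50])
# scaled == [0.0, 0.25, 0.5, 1.0]
```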
Data Cleaning
Purpose: To correct or remove incorrect, corrupt, or irrelevant records. Method: Can involve filling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
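Two of the cleaning steps above, filling missing values and removing outliers, might look like the sketch below. Using the median as the fill value and the 1.5 x IQR rule for outliers are assumptions of this example; other choices are equally common.

```python
from statistics import median, quantiles

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

raw = [12, None, 14, 13, 15, 200]
cleaned = drop_outliers(fill_missing(raw))
# cleaned == [12, 14, 14, 13, 15]
```

The missing entry is filled with the median (14), and the value 200 falls outside the quartile fences and is removed.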
Handling Imbalanced Data
Purpose: To address the issue where classes in the data are not represented equally. Method: Techniques include resampling the dataset (oversampling the minority class or undersampling the majority class), generating synthetic samples with methods like SMOTE, or adjusting classification algorithms, for example through class weights.
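A minimal sketch of the simplest resampling approach, random oversampling, which duplicates minority-class rows until every class matches the largest one. This is not SMOTE: it copies existing rows rather than synthesizing new ones. The function name and the `seed` parameter are illustrative.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Original rows belonging to this class, used as the sampling pool.
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["neg", "neg", "neg", "neg", "pos"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into evaluation data.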
Standardization
Purpose: To scale data to have a mean of zero and a standard deviation of one. Method: Computing the z-score for each feature, where z = \frac{x - \text{mean}(X)}{\text{std}(X)}.
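The z-score computation can be sketched with the standard library. Using the population standard deviation (`pstdev`) is an assumption of this example; some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / std, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant feature: every value equals the mean (assumed convention: all zeros).
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
# mean is 5 and std is 2, so z == [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```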
Dimensionality Reduction
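Purpose: To reduce the number of input variables in the model. Method: Techniques include Principal Component Analysis (PCA), which projects the data onto fewer components while retaining as much of the variance as possible, and t-SNE, which preserves local neighborhood structure and is used mainly for visualization.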
Feature Encoding
Purpose: To convert categorical data into numerical format. Method: Common methods include one-hot encoding, label encoding, and the use of binary indicators.
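Label encoding and one-hot encoding can be sketched as follows. Ordering the categories alphabetically is an assumption of this example; any fixed, repeatable ordering works.

```python
def label_encode(values):
    """Map each category to an integer code; returns (codes, sorted categories)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1 at its category position."""
    codes, categories = label_encode(values)
    vectors = [[1 if code == i else 0 for i in range(len(categories))] for code in codes]
    return vectors, categories

colors = ["red", "green", "blue", "green"]
codes, categories = label_encode(colors)
one_hot, _ = one_hot_encode(colors)
# categories == ["blue", "green", "red"]
# codes == [2, 1, 0, 1]
# one_hot == [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label codes imply an order between categories, so one-hot vectors are usually the safer choice for nominal data.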
Discretization
Purpose: To transform continuous features into discrete values. Method: Techniques such as binning or k-means clustering can be applied to create categorical bins from continuous features.
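A minimal sketch of equal-width binning, the simplest form of discretization. The function name and the choice to place the maximum value in the last bin are assumptions of this example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    # The maximum would land on index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 40, 62, 78]
bins = equal_width_bins(ages, n_bins=3)
# Bin edges are 18, 38, 58, 78, so bins == [0, 0, 1, 2, 2]
```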
Data Augmentation
Purpose: To artificially increase the size and diversity of the dataset. Method: Involves creating modified versions of the data, such as through image rotations, flipping, or noise injection, to enhance the robustness of models.
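The image operations mentioned above can be sketched on a tiny image stored as a list of rows. The nested-list representation, the function names, and the Gaussian noise model are assumptions of this example; real pipelines typically use an array or image library.

```python
import random

def flip_horizontal(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]

def add_noise(image, scale=0.1, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `scale` to every pixel."""
    rng = random.Random(seed)
    return [[pixel + rng.gauss(0, scale) for pixel in row] for row in image]

image = [[1, 2],
         [3, 4]]
augmented = [flip_horizontal(image), rotate_90_clockwise(image), add_noise(image)]
# flip_horizontal(image) == [[2, 1], [4, 3]]
# rotate_90_clockwise(image) == [[3, 1], [4, 2]]
```

Each augmented copy keeps the original label, so one labeled image yields several training examples.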
© Hypatia.Tech. 2024 All rights reserved.