Data Preprocessing for Machine Learning
Standardization
Standardization rescales data to have a mean (\(\mu\)) of 0 and standard deviation (\(\sigma\)) of 1 (unit variance). Example: Z-score normalization where an entry \(x\) is replaced by \((x - \mu) / \sigma\).
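A minimal sketch of z-score standardization using NumPy (the sample values are made up for illustration):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = (x - x.mean()) / x.std()  # z-score: result has mean 0 and std 1
```

After the transform, `z.mean()` is 0 and `z.std()` is 1, regardless of the original scale.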
Handling Time Series Data
Time series data requires special preprocessing like windowing, lag variables, or detrending to prepare for machine learning algorithms.
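One way to sketch the lag/windowing idea: turn a univariate series into supervised-learning pairs, where each window of past values predicts the next value (series and window length are illustrative):

```python
import numpy as np

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
lag = 2  # window length: use the last 2 observations as features
# each row of X is a window; y is the value immediately after that window
X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
y = series[lag:]
```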
Missing Values Imputation
Imputation fills in missing or null values within a dataset. One common technique is to replace missing values with the mean, median, or mode value of the column.
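A minimal mean-imputation sketch with NumPy (column values are made up; `np.nanmean` ignores the missing entries):

```python
import numpy as np

col = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
mean = np.nanmean(col)                      # mean of the observed values only
filled = np.where(np.isnan(col), mean, col) # replace NaNs with that mean
```

Median or mode imputation works the same way, just with a different summary statistic.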
Train/Test Split
The train/test split divides the dataset into two parts: one for training the model and the other for testing its performance. A common split is 80% for training and 20% for testing.
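The 80/20 split can be sketched by shuffling indices and cutting at 80% (a fixed seed is used here so the example is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)              # fixed seed for reproducibility
X = np.arange(10)
idx = rng.permutation(len(X))               # shuffle before splitting
cut = int(0.8 * len(X))                     # 80% train, 20% test
train, test = X[idx[:cut]], X[idx[cut:]]
```

In practice a library helper (e.g., scikit-learn's `train_test_split`) does the same thing with extra options such as stratification.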
Feature Scaling
Feature scaling brings all numerical features to the same scale. Min-max scaling is an example where data is scaled to a fixed range, usually 0 to 1.
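Min-max scaling to the range [0, 1] is a one-liner; the sample values are illustrative:

```python
import numpy as np

x = np.array([5.0, 10.0, 15.0, 20.0])
scaled = (x - x.min()) / (x.max() - x.min())  # maps min -> 0, max -> 1
```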
Feature Encoding
Feature encoding transforms categorical variables into numerical values which can be used in mathematical models. For example, one-hot encoding represents categorical variables as binary vectors.
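A pure-Python sketch of one-hot encoding (the category list is sorted so the column order is deterministic; the color values are made up):

```python
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))            # ['blue', 'green', 'red']
# each value becomes a binary vector with a single 1 in its category's column
one_hot = [[int(c == cat) for cat in categories] for c in colors]
```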
Label Discretization
Label discretization converts continuous labels into discrete classes. Techniques like binning can be used to divide the range of continuous values into intervals, turning a regression problem into a classification one.
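Equal-width binning of a continuous label with `np.digitize` (the bin edges and values are illustrative):

```python
import numpy as np

y = np.array([2.3, 7.8, 5.1, 9.9, 0.4])
edges = [3.3, 6.6]                 # interior cut points for three bins
labels = np.digitize(y, edges)     # 0 = low, 1 = mid, 2 = high
```

The continuous regression target `y` is now a three-class classification target.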
Outlier Detection
Outlier detection identifies data points that deviate markedly from the rest of the dataset so they can be removed or otherwise treated before they skew the analysis. Example: Z-score and IQR (interquartile range) are common methods for finding outliers.
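A sketch of the IQR rule: points outside \([Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR}]\) are flagged as outliers (the data here is made up, with 100.0 as the obvious outlier):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
# keep only points inside the 1.5*IQR fences
mask = (x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)
cleaned = x[mask]
```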
Handling Imbalanced Data
Handling imbalanced data involves adjusting the dataset so that the classes are more balanced. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) can be used.
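SMOTE itself (which synthesizes new minority samples by interpolation) lives in the imbalanced-learn library; as a simpler stand-in, here is a random-oversampling sketch that duplicates minority samples until the classes are balanced (data and labels are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [1.3]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])      # class 1 is the minority (3 vs 5)
minority = np.where(y == 1)[0]
need = int((y == 0).sum()) - len(minority)  # how many extra samples to add
extra = rng.choice(minority, size=need, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```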
Text Data Processing
Text data is preprocessed by techniques such as tokenization, stemming, and lemmatization. Methods like TF-IDF are used to convert text into a numerical format suitable for machine learning models.
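A small sketch of the TF-IDF building blocks: whitespace tokenization, document frequency, and a smoothed IDF (this is the smoothed variant scikit-learn uses by default; the corpus is made up):

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokens = [d.lower().split() for d in docs]            # whitespace tokenization
n = len(docs)
df = Counter(t for doc in tokens for t in set(doc))   # document frequency
# smoothed idf: log((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
```

A term appearing in every document (like "the") gets the minimum IDF of 1.0, while rarer terms score higher.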
Dimensionality Reduction
Dimensionality reduction reduces the number of input variables in a dataset. Principal Component Analysis (PCA) is an example: it transforms the data into a smaller set of uncorrelated variables (principal components) that retain most of the variance.
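A minimal PCA sketch via SVD of the centered data matrix (random data with an arbitrary shape, keeping the top 2 components):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # 100 samples, 5 features
Xc = X - X.mean(axis=0)                     # center each feature at 0
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T                   # project onto top-2 components
```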
Data Transformation
Data transformation includes applying mathematical functions to data. For example, log transformation can be used to reduce the skewness of positively skewed data.
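A log-transform sketch: `np.log1p` computes \(\log(1 + x)\), which compresses large values and also handles zeros safely (the skewed sample values are made up):

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])  # positively skewed
logged = np.log1p(x)                            # log(1 + x)
```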
Feature Engineering
Feature engineering creates new features from existing ones to improve model performance. Examples include creating polynomial features from existing attributes.
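A degree-2 polynomial expansion of two features, written out by hand (the input values are illustrative):

```python
import numpy as np

X = np.array([[2.0, 3.0]])
x1, x2 = X[:, 0], X[:, 1]
# degree-2 terms: x1, x2, x1^2, x1*x2, x2^2
poly = np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])
```

The interaction term `x1 * x2` lets a linear model capture a simple nonlinearity.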
Normalization
Normalization scales individual samples to have unit norm (e.g., an L2 norm of 1). It is useful for algorithms that rely on dot products or distances between samples, such as cosine-similarity-based text retrieval or k-nearest neighbors.
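Unit-norm (L2) normalization of a single sample vector (values chosen so the result is exact):

```python
import numpy as np

x = np.array([3.0, 4.0])
unit = x / np.linalg.norm(x)   # divide by the L2 norm; result has norm 1
```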
Binning
Binning converts numerical variables into categorical counterparts by grouping them into bins. For example, age can be transformed into categories such as 'child', 'adult', and 'elderly'.
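The age example as a sketch, with hypothetical cut-offs at 18 and 65:

```python
import numpy as np

ages = np.array([5, 30, 70])
edges = [18, 65]                            # hypothetical category boundaries
names = ["child", "adult", "elderly"]
cats = [names[i] for i in np.digitize(ages, edges)]
```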
© Hypatia.Tech. 2024 All rights reserved.