
Data Preprocessing for Machine Learning

15 Flashcards

Standardization

Standardization rescales data to have a mean (\(\mu\)) of 0 and standard deviation (\(\sigma\)) of 1 (unit variance). Example: Z-score normalization where an entry \(x\) is replaced by \((x - \mu) / \sigma\).
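
The z-score formula above can be sketched in pure Python (the `standardize` helper is illustrative; in practice a library scaler such as scikit-learn's `StandardScaler` does the same job):

```python
from statistics import mean, pstdev

def standardize(values):
    """Replace each entry x with its z-score (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

# The result always has mean 0 and standard deviation 1.
standardize([2.0, 4.0, 6.0, 8.0])
```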

Handling Time Series Data

Time series data requires special preprocessing like windowing, lag variables, or detrending to prepare for machine learning algorithms.
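
As a sketch of lag variables, the hypothetical helper below turns a univariate series into supervised rows of the form (x at t-1, x at t-2, target x at t):

```python
def lag_features(series, lags=(1, 2)):
    """Build supervised-learning rows from a series using lagged values as features."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        # Feature tuple of past values, followed by the current value as target.
        rows.append(tuple(series[t - k] for k in lags) + (series[t],))
    return rows

lag_features([10, 20, 30, 40, 50])
# -> [(20, 10, 30), (30, 20, 40), (40, 50 excluded until enough history) ...]
```

Each row pairs recent history with the value to predict, which is what windowing ultimately produces as well.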

Missing Values Imputation

Imputation fills in missing or null values within a dataset. One common technique is to replace missing values with the mean, median, or mode value of the column.
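
A minimal median-imputation sketch (here missing values are represented as `None`; the helper name is illustrative):

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

impute_median([1, None, 3, 5, None])
# median of [1, 3, 5] is 3 -> [1, 3, 3, 5, 3]
```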

Train/Test Split

The train/test split divides the dataset into two parts: one for training the model and the other for testing its performance. A common split is 80% for training and 20% for testing.
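
Libraries such as scikit-learn provide a `train_test_split` function for this; a minimal hand-rolled version of the 80/20 split looks like:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle, then hold out the final test_ratio fraction for testing."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(10)))
# 8 training samples, 2 test samples
```

Shuffling before splitting matters: without it, any ordering in the data (e.g. by date or class) leaks into the split.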

Feature Scaling

Feature scaling brings all numerical features to the same scale. Min-max scaling is an example where data is scaled to a fixed range, usually 0 to 1.
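
Min-max scaling to a fixed range can be sketched as (helper name illustrative):

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly map values onto [lo, hi] using the observed min and max."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin  # assumes the values are not all identical
    return [lo + (x - vmin) / span * (hi - lo) for x in values]

min_max_scale([10, 20, 40])
# -> [0.0, 0.333..., 1.0]
```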

Feature Encoding

Feature encoding transforms categorical variables into numerical values which can be used in mathematical models. For example, one-hot encoding represents categorical variables as binary vectors.
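
A small one-hot encoding sketch (categories are sorted so the column order is deterministic; the function name is illustrative):

```python
def one_hot(labels):
    """Map each categorical label to a binary indicator vector."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for label in labels:
        vec = [0] * len(categories)
        vec[index[label]] = 1  # exactly one position is "hot"
        vectors.append(vec)
    return categories, vectors

cats, vecs = one_hot(["red", "green", "red"])
# cats == ['green', 'red']; vecs == [[0, 1], [1, 0], [0, 1]]
```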

Label Discretization

Label discretization converts continuous labels into discrete classes. Techniques like binning can be used to divide the range of continuous values into intervals, turning a regression problem into a classification one.
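
Binning a continuous label against fixed edges is a one-liner with the standard-library `bisect` module (the edge values here are arbitrary examples):

```python
import bisect

def discretize(values, edges):
    """Assign each continuous value a class index based on sorted bin edges."""
    return [bisect.bisect_right(edges, v) for v in values]

discretize([0.1, 0.5, 0.9], edges=[0.33, 0.66])
# -> [0, 1, 2]  (three classes from two cut points)
```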

Outlier Detection

Outlier detection identifies anomalous values so they can be removed or down-weighted before they skew the analysis. Example: Z-scores and the IQR (interquartile range) rule are common methods for flagging outliers.
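
The IQR rule flags values more than 1.5 IQRs outside the first/third quartile; a sketch using the standard-library `statistics.quantiles` (which uses the "exclusive" quantile method by default, so exact cut points can differ slightly from other tools):

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

iqr_outliers([1, 2, 3, 4, 5, 100])
# -> [100]
```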

Handling Imbalanced Data

Handling imbalanced data involves adjusting the dataset so that the classes are more balanced. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) can be used.
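
SMOTE interpolates new synthetic minority samples; its simpler cousin, random oversampling, illustrates the rebalancing idea and is sketched below (this is *not* SMOTE itself, just duplication of minority samples):

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

X, y = random_oversample([1, 2, 3, 4], ["a", "a", "a", "b"])
# class counts become a: 3, b: 3
```

The `imbalanced-learn` package provides SMOTE proper, which interpolates between minority neighbors instead of copying them.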

Text Data Processing

Text data is preprocessed by techniques such as tokenization, stemming, and lemmatization. Methods like TF-IDF are used to convert text into a numerical format suitable for machine learning models.
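
A toy TF-IDF sketch over whitespace-tokenized documents, using the textbook idf = log(N / df) variant (library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed formula, so exact weights differ):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute per-document TF-IDF weights for whitespace-tokenized text."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

w = tf_idf(["the cat sat", "the dog ran"])
# "the" appears in every document, so its weight is 0
```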

Dimensionality Reduction

Dimensionality reduction reduces the number of input variables in a dataset. Principal Component Analysis (PCA) is a common example: it projects the data onto a smaller set of uncorrelated variables (principal components) chosen to retain as much of the variance as possible.
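
Full PCA requires an eigendecomposition of the covariance matrix, but the leading principal component can be approximated with power iteration; a hypothetical dependency-free sketch:

```python
def first_principal_component(X, iters=200):
    """Approximate the top PCA direction of row-major data X via power iteration."""
    n, d = len(X), len(X[0])
    # Center each column (PCA operates on mean-centered data).
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    v = [1.0] * d
    for _ in range(iters):
        # w = (Xc^T Xc) v, computed as Xc^T (Xc v) to avoid forming the covariance matrix.
        s = [sum(Xc[i][j] * v[j] for j in range(d)) for i in range(n)]
        w = [sum(Xc[i][j] * s[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # re-normalize each step
    return v

# Data lying along the direction (1, 2) recovers that direction (up to sign).
first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [-1.0, -2.0]])
```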

Data Transformation

Data transformation includes applying mathematical functions to data. For example, log transformation can be used to reduce the skewness of positively skewed data.
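
The log transform in one line, using `log1p` (log(1 + x)) so that zeros are handled gracefully:

```python
import math

def log_transform(values):
    """Compress large values to reduce right skew; requires each x > -1."""
    return [math.log1p(x) for x in values]

log_transform([0, 9, 99])
# -> [0.0, log(10), log(100)] — equal ratios become equal gaps
```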

Feature Engineering

Feature engineering creates new features from existing ones to improve model performance. Examples include creating polynomial features from existing attributes.
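
A sketch of degree-2 polynomial feature expansion for a single row (mirroring what scikit-learn's `PolynomialFeatures` does, minus the bias column):

```python
from itertools import combinations_with_replacement

def polynomial_features(row, degree=2):
    """Append all products of features up to the given degree."""
    out = list(row)
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for i in combo:
                term *= row[i]
            out.append(term)
    return out

polynomial_features([2.0, 3.0])
# -> [2.0, 3.0, 4.0, 6.0, 9.0]  (x1, x2, x1^2, x1*x2, x2^2)
```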

Normalization

Normalization scales individual samples to have unit norm (e.g., an L2 norm of 1). It is useful when the direction of a feature vector matters more than its magnitude, for example when comparing documents with cosine similarity.
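
L2 normalization of a single sample vector, sketched in a few lines:

```python
def l2_normalize(row):
    """Scale a sample so its Euclidean (L2) norm equals 1."""
    norm = sum(x * x for x in row) ** 0.5  # assumes a non-zero vector
    return [x / norm for x in row]

l2_normalize([3.0, 4.0])
# -> [0.6, 0.8], since the original norm is 5
```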

Binning

Binning converts numerical variables into categorical counterparts by grouping them into bins. For example, age can be transformed into categories such as 'child', 'adult', and 'elderly'.
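
The age example as code (the cut-offs 18 and 65 are illustrative assumptions, not standard values):

```python
def bin_age(age):
    """Map a numeric age to a coarse categorical label (thresholds are illustrative)."""
    if age < 18:
        return "child"
    elif age < 65:
        return "adult"
    return "elderly"

[bin_age(a) for a in (9, 30, 70)]
# -> ['child', 'adult', 'elderly']
```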


© Hypatia.Tech. 2024 All rights reserved.