
Feature Engineering Techniques

15 flashcards

Discretization

The process of converting continuous data into discrete buckets or intervals; similar to binning, but with an emphasis on finding optimal bin boundaries. Use it to improve the performance of classification tasks.
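
A minimal pure-Python sketch of equal-frequency (quantile-based) discretization; the helper name `quantile_discretize` is hypothetical, not from any library:

```python
def quantile_discretize(values, n_bins):
    """Assign each value to one of n_bins roughly equal-frequency buckets."""
    ordered = sorted(values)
    # Bin edges sit at evenly spaced quantiles of the data.
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]
    # A value's bin index is the number of edges it meets or exceeds.
    return [sum(1 for e in edges if v >= e) for v in values]

ages = [3, 7, 12, 18, 25, 33, 41, 52, 60, 75]
print(quantile_discretize(ages, 3))  # → [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Libraries typically search for better edges (e.g. supervised, entropy-based splits); quantiles are just a simple default.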

Standardization

Rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance). Commonly used with techniques that assume the data is normally distributed.

x_{standardized} = \frac{x - \mu}{\sigma}
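
The formula can be applied directly in pure Python (the helper name `standardize` is hypothetical):

```python
def standardize(values):
    """Rescale values to mean 0 and unit variance (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation.
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
# The result has mean 0 and unit variance by construction.
```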

One-Hot Encoding

Transforms categorical variables into a numeric form that ML algorithms can use for prediction. Each unique category value becomes a new binary column. Use it when the categorical feature is not ordinal.
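
A self-contained sketch of the idea (the helper name `one_hot_encode` is hypothetical; libraries such as pandas and scikit-learn provide their own versions):

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)  # one column per category
        row[index[v]] = 1            # flip on the column for this value
        rows.append(row)
    return categories, rows

cats, encoded = one_hot_encode(["red", "green", "red", "blue"])
print(cats)     # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```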

Feature Selection

The process of reducing the number of input variables by selecting the most important features that contribute to the predictive power of the model. Use it to reduce overfitting and improve model performance.
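
One of the simplest selection criteria is a variance threshold: a feature that barely varies carries little predictive signal. A pure-Python sketch (the helper name `select_by_variance` is hypothetical):

```python
def select_by_variance(columns, threshold=0.0):
    """Keep only feature columns whose variance exceeds the threshold."""
    kept = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        if var > threshold:
            kept.append(name)
    return kept

data = {
    "constant": [1, 1, 1, 1],   # zero variance: carries no information
    "useful":   [2, 5, 1, 8],   # varies, so it survives the filter
}
print(select_by_variance(data))  # ['useful']
```

Real pipelines also use model-based criteria (feature importances, recursive elimination); variance filtering is just the cheapest first pass.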

Polynomial Features

Generates new features by raising existing features to a power or creating interaction terms between them. Use it to capture non-linear relationships when linear models are used.
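
For two features, a degree-2 expansion looks like this (the helper name `polynomial_features` is hypothetical):

```python
def polynomial_features(x1, x2):
    """Degree-2 expansion of two features: squares plus the interaction term."""
    return [x1, x2, x1 ** 2, x1 * x2, x2 ** 2]

print(polynomial_features(2.0, 3.0))  # [2.0, 3.0, 4.0, 6.0, 9.0]
```

A linear model fitted on the expanded columns can then represent curves and interactions in the original feature space.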

Feature Scaling

Brings features with different ranges to a comparable scale. Use it before applying algorithms that are sensitive to the scale of the data, such as SVM, k-NN, and gradient descent.

Feature Extraction

Transforming the input data into a set of new features by dimensionality reduction techniques such as PCA. Use it to reduce the number of features in high-dimensional data.
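
A minimal PCA sketch with NumPy, assuming the standard recipe (center, eigendecompose the covariance matrix, project onto the top eigenvectors); the helper name `pca_extract` is hypothetical:

```python
import numpy as np

def pca_extract(X, n_components):
    """Project data onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:n_components]]
    return X_centered @ top

X = np.array([[2.0, 1.9], [0.5, 0.6], [1.0, 1.1], [1.5, 1.4]])
reduced = pca_extract(X, 1)  # 4 samples, 2 features -> 1 extracted feature
print(reduced.shape)         # (4, 1)
```

Here the two original columns are strongly correlated, so a single component retains almost all of the variance.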

Binning

Converts numeric variables into categorical counterparts by grouping them into bins. Use it to reduce the effects of minor observation errors and simplify the model.
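
A pure-Python sketch of equal-width binning, the simplest variant (the helper name `fixed_width_bins` is hypothetical):

```python
def fixed_width_bins(values, n_bins):
    """Group numeric values into equal-width bins; returns a bin index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value lands on the upper edge; keep it in the last bin.
        labels.append(min(idx, n_bins - 1))
    return labels

print(fixed_width_bins([1, 4, 7, 10, 13, 16], 3))  # [0, 0, 1, 1, 2, 2]
```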

Category Encoding

General term for converting categorical variables into numeric format. Includes techniques such as one-hot encoding, label encoding, and binary encoding. Choose an encoding method based on model requirements and feature characteristics.
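
Label encoding, the simplest of these, just assigns each category an integer code; a sketch in pure Python (the helper name `label_encode` is hypothetical):

```python
def label_encode(values):
    """Assign each distinct category an integer code (label encoding)."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["low", "high", "medium", "low"])
print(codes)    # [1, 0, 2, 1]
print(mapping)  # {'high': 0, 'low': 1, 'medium': 2}
```

Note the caveat this illustrates: the integer order here is alphabetical, not semantic, which is why label encoding suits tree models better than distance-based ones.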

Handling Outliers

Involves identifying and treating extreme values that could negatively impact the model. Use techniques like trimming, capping, or transformations, based on the context and the model's sensitivity to outliers.
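
A sketch of capping with the common 1.5×IQR rule (the helper name `cap_outliers_iqr` is hypothetical, and the quartiles are approximated by index rather than interpolated):

```python
def cap_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[n // 4]          # rough first quartile
    q3 = ordered[(3 * n) // 4]    # rough third quartile
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]

data = [10, 12, 11, 13, 12, 11, 300]   # 300 is an obvious outlier
print(cap_outliers_iqr(data))          # 300 is capped down to the upper fence
```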

Missing Values Imputation

Deals with missing data by filling it with meaningful values such as the median, mean, or mode. Use it to avoid errors when training models that can’t handle null values.
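
A pure-Python sketch of median imputation, with `None` standing in for missing entries (the helper name `impute_median` is hypothetical):

```python
def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    # Average the two middle values (they coincide when n is odd).
    median = (observed[(n - 1) // 2] + observed[n // 2]) / 2
    return [median if v is None else v for v in values]

print(impute_median([3, None, 7, 1, None, 9]))  # [3, 5.0, 7, 1, 5.0, 9]
```

The median is often preferred over the mean because it is robust to the very outliers that skew real-world columns.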

Text Vectorization

Converting text data into numerical format (vectors) so that ML algorithms can process it. Use techniques like Bag of Words, TF-IDF, or Word Embeddings depending on the task.
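
A Bag of Words sketch in pure Python (the helper name `bag_of_words` is hypothetical; real vectorizers also handle punctuation, tokenization rules, and sparse storage):

```python
def bag_of_words(documents):
    """Count word occurrences per document over a shared vocabulary."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        words = doc.lower().split()
        vectors.append([words.count(term) for term in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat and the dog"])
print(vocab)    # ['and', 'cat', 'dog', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

TF-IDF then reweights these counts by how rare each term is across documents, and embeddings replace counts with dense learned vectors.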

Data Augmentation

Involves artificially expanding the size and diversity of training datasets by creating modified versions of data points. Commonly used in image and speech recognition problems to increase robustness and performance.
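
For numeric data, one simple augmentation is noise injection; a sketch under that assumption (the helper name `augment_with_noise` is hypothetical, and image/speech pipelines use richer transforms such as crops, flips, and pitch shifts):

```python
import random

def augment_with_noise(samples, copies=2, scale=0.1, seed=42):
    """Append noisy copies of each numeric sample to the dataset."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    augmented = list(samples)
    for _ in range(copies):
        for sample in samples:
            augmented.append([v + rng.gauss(0, scale) for v in sample])
    return augmented

data = [[1.0, 2.0], [3.0, 4.0]]
bigger = augment_with_noise(data)
print(len(bigger))  # 6 (2 originals + 2 noisy copies of each)
```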

Log Transformation

Applies the natural logarithm to each data point to reduce skewness. Use it on right-skewed distributions to make the data more normally distributed and stabilize variance.
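
In practice `log(1 + x)` (log1p) is often used instead of the plain logarithm so that zeros stay defined; a sketch (the helper name `log_transform` is hypothetical):

```python
import math

def log_transform(values):
    """Apply log1p, i.e. log(1 + x), so zero values remain defined."""
    return [math.log1p(v) for v in values]

incomes = [0, 9, 99, 999]
transformed = log_transform(incomes)
# Each tenfold jump in income now adds a roughly constant amount (~2.3).
```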

Normalization

Rescales the values into the range [0, 1]. Often used when features have different scales and we want to ensure no feature dominates simply because of its scale.

x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}
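
The formula in pure Python (the helper name `min_max_normalize` is hypothetical):

```python
def min_max_normalize(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```

Unlike standardization, the result is bounded, but a single extreme value compresses everything else toward 0, so it pairs badly with unhandled outliers.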


© Hypatia.Tech. 2024 All rights reserved.