Feature Engineering Techniques
Discretization
The process of converting continuous data into discrete buckets or intervals, similar to binning but with an emphasis on finding optimal bin boundaries. Use it to improve the performance of classification tasks.
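A minimal sketch, assuming scikit-learn's KBinsDiscretizer (the card doesn't name a library); the quantile strategy picks bin edges so each bin holds roughly equal counts:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[22], [25], [31], [35], [41], [58], [62], [70]])
# 3 quantile-based bins, returned as ordinal codes 0/1/2
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
print(disc.fit_transform(ages).ravel())  # e.g. [0. 0. 0. 1. 1. 2. 2. 2.]
```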
Standardization
Rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance). Commonly used in techniques that assume the data is normally distributed.
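A minimal sketch with scikit-learn's StandardScaler (an assumed choice): the output has mean ≈ 0 and standard deviation ≈ 1 per column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Xs = StandardScaler().fit_transform(X)  # (x - mean) / std, per column
print(Xs.mean(), Xs.std())  # ~0.0, ~1.0
```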
One-Hot Encoding
Transforms categorical variables into a numeric form that ML algorithms can use for prediction. Each unique category value is converted into a new binary column. Use it when the categorical feature is not ordinal.
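A minimal sketch using pandas.get_dummies (one of several one-hot implementations); each color value becomes its own binary column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
# One binary column per unique category value
print(pd.get_dummies(df, columns=["color"]))
```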
Feature Selection
The process of reducing the number of input variables by selecting the most important features that contribute to the predictive power of the model. Use it to reduce overfitting and improve model performance.
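A minimal sketch of filter-style selection with scikit-learn's SelectKBest (the dataset and k are illustrative), keeping the 2 features most associated with the target by ANOVA F-score:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# Keep the 2 features with the highest ANOVA F-scores against y
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)
print(X.shape, "->", X_best.shape)  # (150, 4) -> (150, 2)
```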
Polynomial Features
Generates new features by raising existing features to a power or creating interaction terms between them. Use it to capture non-linear relationships when linear models are used.
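A minimal sketch with scikit-learn's PolynomialFeatures: from features [a, b] it derives [1, a, b, a², ab, b²] at degree 2:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with features a=2, b=3
print(PolynomialFeatures(degree=2).fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]] -> [1, a, b, a^2, a*b, b^2]
```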
Feature Scaling
Brings features with different ranges onto a comparable scale. Use it before applying algorithms that are sensitive to the scale of the data, such as SVM, k-NN, and gradient descent.
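A minimal sketch (assumed setup) that scales inside a scikit-learn Pipeline, so the scaler is fit only on the data the model trains on and k-NN sees comparable feature ranges:

```python
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
# Scaling happens first, then k-NN operates on comparable ranges
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
model.fit(X, y)
```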
Feature Extraction
Transforms the input data into a new set of features using dimensionality-reduction techniques such as PCA. Use it to reduce the number of features in high-dimensional data.
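A minimal sketch with scikit-learn's PCA, projecting 4-dimensional data onto its 2 leading principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
# New features are linear combinations capturing the most variance
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (150, 2)
```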
Binning
Converts numeric variables into categorical counterparts by grouping them into bins. Use it to reduce the effects of minor observation errors and simplify the model.
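A minimal sketch with pandas.cut; the income values, bin edges, and labels are illustrative:

```python
import pandas as pd

income = pd.Series([18_000, 35_000, 52_000, 77_000, 120_000])
bins = pd.cut(income, bins=[0, 30_000, 60_000, 150_000],
              labels=["low", "mid", "high"])
print(bins.tolist())  # ['low', 'mid', 'mid', 'high', 'high']
```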
Category Encoding
General term for converting categorical variables into numeric format. Includes techniques such as one-hot encoding, label encoding, and binary encoding. Choose an encoding method based on model requirements and feature characteristics.
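A minimal sketch of ordinal (label-style) encoding with scikit-learn's OrdinalEncoder; the small < medium < large ordering is an assumption for illustration:

```python
from sklearn.preprocessing import OrdinalEncoder

sizes = [["small"], ["medium"], ["large"], ["medium"]]
# Explicit category order so the codes respect small < medium < large
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(enc.fit_transform(sizes).ravel())  # [0. 1. 2. 1.]
```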
Handling Outliers
Involves identifying and treating extreme values that could negatively impact the model. Use techniques like trimming, capping, or transformations based on the context and the model sensitivity to outliers.
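A minimal sketch of IQR-based capping (winsorizing) with pandas: values beyond 1.5 × IQR from the quartiles are clipped to the fence values:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an extreme value
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Clip to the classic Tukey fences
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(capped.tolist())  # 95 is capped to the upper fence
```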
Missing Values Imputation
Deals with missing data by filling it in with meaningful values such as the median, mean, or mode. Use it to avoid errors when training models that can't handle null values.
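A minimal sketch with scikit-learn's SimpleImputer, replacing NaNs with the column median:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
print(SimpleImputer(strategy="median").fit_transform(X).ravel())
# [1. 2. 2. 4.] -- the NaN becomes the median of [1, 2, 4]
```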
Text Vectorization
Converts text data into numerical vectors so that ML algorithms can process it. Use techniques like Bag of Words, TF-IDF, or word embeddings depending on the task.
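A minimal sketch with scikit-learn's TfidfVectorizer, turning two short documents into a TF-IDF matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog barked"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document
print(vec.get_feature_names_out())  # ['barked' 'cat' 'dog' 'sat' 'the']
print(X.shape)  # (2, 5)
```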
Data Augmentation
Involves artificially expanding the size and diversity of training datasets by creating modified versions of data points. Commonly used in image and speech recognition problems to increase robustness and performance.
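A minimal sketch of image-style augmentation in plain NumPy (real pipelines usually rely on libraries such as torchvision; this only illustrates the idea): a horizontal flip and additive noise yield two new variants of one sample:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))               # stand-in for a grayscale image
flipped = np.fliplr(image)                 # horizontal flip
noisy = image + rng.normal(0, 0.05, image.shape)  # small Gaussian noise
print(flipped.shape, noisy.shape)          # (28, 28) (28, 28)
```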
Log Transformation
Applies the natural logarithm to each data point to reduce skewness. Use it on right-skewed distributions to make the data more normally distributed and stabilize variance.
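A minimal sketch using numpy.log1p, i.e. log(1 + x), which also handles zeros safely:

```python
import numpy as np

x = np.array([1, 10, 100, 1_000, 10_000], dtype=float)
# Compresses large values far more than small ones, reducing right skew
print(np.log1p(x))  # ~[0.69 2.40 4.62 6.91 9.21]
```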
Normalization
Rescales the values into a range of [0,1]. Often used when features have different scales and we want to ensure no feature dominates due to its scale.
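A minimal sketch with scikit-learn's MinMaxScaler, mapping each column onto [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])
# (x - min) / (max - min), per column
print(MinMaxScaler().fit_transform(X).ravel())  # [0. 0.333 0.667 1.]
```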