
Gradient Descent Variants

8 flashcards


Nesterov Accelerated Gradient (NAG)


NAG is a momentum variant in which the gradient is evaluated not at the current parameters but at a look-ahead point computed from the accumulated momentum. Anticipating the future gradient typically yields faster, more stable convergence than plain Momentum.

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})
\theta = \theta - v_t
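
A minimal Python sketch of one NAG update step following the formulas above; the grad callable, the quadratic example, and the hyperparameter defaults are illustrative assumptions, not part of the card:

    def nag_step(theta, v, grad, lr=0.01, gamma=0.9):
        # Gradient is evaluated at the look-ahead point theta - gamma * v_{t-1}.
        g = grad(theta - gamma * v)
        v = gamma * v + lr * g      # v_t = gamma * v_{t-1} + eta * g
        return theta - v, v         # theta = theta - v_t

    # Example (assumed): minimize J(theta) = theta^2, whose gradient is 2 * theta.
    theta, v = 5.0, 0.0
    for _ in range(100):
        theta, v = nag_step(theta, v, lambda th: 2 * th)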


Adam


Adam combines ideas from Momentum and RMSprop: it keeps exponentially decaying moving averages of both the gradients and the squared gradients. Its key difference is the use of bias-corrected first- and second-moment estimates to update the parameters.

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta)
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J(\theta))^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta = \theta - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
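
A minimal Python/NumPy sketch of one Adam step following the formulas above; grad, the 1-based step counter t, and the default hyperparameters (common choices, not stated on the card) are assumptions:

    import numpy as np

    def adam_step(theta, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g         # first-moment estimate m_t
        v = beta2 * v + (1 - beta2) * g ** 2    # second-moment estimate v_t
        m_hat = m / (1 - beta1 ** t)            # bias correction: moments start at zero
        v_hat = v / (1 - beta2 ** t)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v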


RMSprop


RMSprop modifies Adagrad to prevent the learning rate from diminishing too quickly. It normalizes each update by an exponentially decaying moving average of squared gradients, which helps on non-stationary objectives and complex problems.

\theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla_\theta J(\theta)
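
A minimal Python/NumPy sketch of one RMSprop step. The decay rate of the moving average E[g^2]_t is not given on the card, so rho = 0.9 (a common default) is an assumption here, as are grad and the other hyperparameters:

    import numpy as np

    def rmsprop_step(theta, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
        g = grad(theta)
        avg_sq = rho * avg_sq + (1 - rho) * g ** 2    # E[g^2]_t, decaying average
        return theta - lr * g / np.sqrt(avg_sq + eps), avg_sq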


Batch Gradient Descent


Batch Gradient Descent uses every sample in the dataset for a single parameter update per iteration. Because each update requires a pass over the entire dataset, it is very slow on large datasets, but the full-batch gradient yields stable convergence.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta)
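
A minimal Python sketch of one full-batch update; grad (the gradient of J over the whole dataset) and the learning rate default are illustrative assumptions:

    def batch_gd_step(theta, X, Y, grad, lr=0.01):
        # One update from the gradient computed over the entire dataset.
        return theta - lr * grad(theta, X, Y)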


Stochastic Gradient Descent (SGD)


SGD updates the parameters one training example at a time. The resulting gradient noise can help escape local minima but makes convergence fluctuate; each iteration is far cheaper than in Batch Gradient Descent.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})
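
A minimal Python/NumPy sketch of one SGD epoch; grad (a per-example gradient) is an assumption, and shuffling each epoch is standard practice rather than something the card specifies:

    import numpy as np

    def sgd_epoch(theta, X, Y, grad, lr=0.01):
        # Update after every single example (x^(i), y^(i)), in shuffled order.
        for i in np.random.permutation(len(X)):
            theta = theta - lr * grad(theta, X[i], Y[i])
        return theta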


Momentum


Momentum accumulates past gradients to smooth out the updates. It accelerates SGD in the relevant direction and dampens oscillations.

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)
\theta = \theta - v_t
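
A minimal Python sketch of one Momentum step following the formulas above; grad and the defaults (gamma = 0.9 is a common choice, not stated on the card) are assumptions:

    def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
        v = gamma * v + lr * grad(theta)    # v_t = gamma * v_{t-1} + eta * grad
        return theta - v, v                 # theta = theta - v_t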


Adagrad


Adagrad adapts the learning rate per parameter, performing larger updates for infrequently updated parameters and smaller updates for frequent ones. This per-parameter learning rate makes it well suited to sparse data.

\theta = \theta - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla_\theta J(\theta)
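
A minimal Python/NumPy sketch of one Adagrad step, where G is the per-parameter running sum of squared gradients; grad and the defaults are illustrative assumptions:

    import numpy as np

    def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
        g = grad(theta)
        G = G + g ** 2                      # running sum of squared gradients
        # Parameters with large accumulated G get smaller effective learning rates.
        return theta - lr * g / np.sqrt(G + eps), G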


Mini-batch Gradient Descent


Mini-batch Gradient Descent is a compromise between Batch and Stochastic Gradient Descent: it updates the parameters using a small subset (mini-batch) of the data. It balances convergence stability against speed and makes efficient use of vectorized computation.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta; X^{(i:i+n)}; Y^{(i:i+n)})
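
A minimal Python/NumPy sketch of one mini-batch epoch; grad, the batch size n = 32, and the per-epoch shuffling are illustrative assumptions:

    import numpy as np

    def minibatch_gd_epoch(theta, X, Y, grad, lr=0.01, n=32):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), n):
            b = idx[start:start + n]        # one mini-batch X^(i:i+n), Y^(i:i+n)
            theta = theta - lr * grad(theta, X[b], Y[b])
        return theta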
