
Gradient Descent Variants

8 flashcards


Nesterov Accelerated Gradient (NAG)


NAG is a momentum variant in which the gradient is evaluated not at the current parameters but at a look-ahead point computed from the accumulated momentum. Anticipating the future gradient typically yields faster, more stable convergence than plain Momentum.

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})
\theta = \theta - v_t
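
A minimal Python sketch of one NAG update step following the formulas above; the grad callable, the quadratic example, and the hyperparameter defaults are illustrative assumptions, not part of the card:

    def nag_step(theta, v, grad, lr=0.01, gamma=0.9):
        # Gradient is evaluated at the look-ahead point theta - gamma * v_{t-1}.
        g = grad(theta - gamma * v)
        v = gamma * v + lr * g      # v_t = gamma * v_{t-1} + eta * g
        return theta - v, v         # theta = theta - v_t

    # Example (assumed): minimize J(theta) = theta^2, whose gradient is 2 * theta.
    theta, v = 5.0, 0.0
    for _ in range(100):
        theta, v = nag_step(theta, v, lambda th: 2 * th)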


Adam


Adam combines ideas from Momentum and RMSprop: it keeps exponentially decaying moving averages of both the gradients and the squared gradients. Its key difference is the use of bias-corrected first- and second-moment estimates to update the parameters.

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta)
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J(\theta))^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta = \theta - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
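
A minimal Python/NumPy sketch of one Adam step following the formulas above; grad, the 1-based step counter t, and the default hyperparameters (common choices, not stated on the card) are assumptions:

    import numpy as np

    def adam_step(theta, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g         # first-moment estimate m_t
        v = beta2 * v + (1 - beta2) * g ** 2    # second-moment estimate v_t
        m_hat = m / (1 - beta1 ** t)            # bias correction: moments start at zero
        v_hat = v / (1 - beta2 ** t)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v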


RMSprop


RMSprop modifies Adagrad to prevent the learning rate from diminishing too quickly. It normalizes each update by an exponentially decaying moving average of squared gradients, which helps on non-stationary objectives and complex problems.

\theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla_\theta J(\theta)
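
A minimal Python/NumPy sketch of one RMSprop step. The decay rate of the moving average E[g^2]_t is not given on the card, so rho = 0.9 (a common default) is an assumption here, as are grad and the other hyperparameters:

    import numpy as np

    def rmsprop_step(theta, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
        g = grad(theta)
        avg_sq = rho * avg_sq + (1 - rho) * g ** 2    # E[g^2]_t, decaying average
        return theta - lr * g / np.sqrt(avg_sq + eps), avg_sq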


Batch Gradient Descent


Batch Gradient Descent uses every sample in the dataset for a single parameter update per iteration. Because each update requires a pass over the entire dataset, it is very slow on large datasets, but the full-batch gradient yields stable convergence.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta)
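
A minimal Python sketch of one full-batch update; grad (the gradient of J over the whole dataset) and the learning rate default are illustrative assumptions:

    def batch_gd_step(theta, X, Y, grad, lr=0.01):
        # One update from the gradient computed over the entire dataset.
        return theta - lr * grad(theta, X, Y)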


Stochastic Gradient Descent (SGD)


SGD updates the parameters one training example at a time. The resulting gradient noise can help escape local minima but makes convergence fluctuate; each iteration is far cheaper than in Batch Gradient Descent.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})
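
A minimal Python/NumPy sketch of one SGD epoch; grad (a per-example gradient) is an assumption, and shuffling each epoch is standard practice rather than something the card specifies:

    import numpy as np

    def sgd_epoch(theta, X, Y, grad, lr=0.01):
        # Update after every single example (x^(i), y^(i)), in shuffled order.
        for i in np.random.permutation(len(X)):
            theta = theta - lr * grad(theta, X[i], Y[i])
        return theta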


Momentum


Momentum accumulates past gradients to smooth out the updates. It accelerates SGD in the relevant direction and dampens oscillations.

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)
\theta = \theta - v_t
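
A minimal Python sketch of one Momentum step following the formulas above; grad and the defaults (gamma = 0.9 is a common choice, not stated on the card) are assumptions:

    def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
        v = gamma * v + lr * grad(theta)    # v_t = gamma * v_{t-1} + eta * grad
        return theta - v, v                 # theta = theta - v_t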


Adagrad


Adagrad adapts the learning rate per parameter, performing larger updates for infrequently updated parameters and smaller updates for frequent ones. This per-parameter learning rate makes it well suited to sparse data.

\theta = \theta - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla_\theta J(\theta)
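
A minimal Python/NumPy sketch of one Adagrad step, where G is the per-parameter running sum of squared gradients; grad and the defaults are illustrative assumptions:

    import numpy as np

    def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
        g = grad(theta)
        G = G + g ** 2                      # running sum of squared gradients
        # Parameters with large accumulated G get smaller effective learning rates.
        return theta - lr * g / np.sqrt(G + eps), G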


Mini-batch Gradient Descent


Mini-batch Gradient Descent is a compromise between Batch and Stochastic Gradient Descent: it updates the parameters using a small subset (mini-batch) of the data. It balances convergence stability against speed and makes efficient use of vectorized computation.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta; X^{(i:i+n)}; Y^{(i:i+n)})
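
A minimal Python/NumPy sketch of one mini-batch epoch; grad, the batch size n = 32, and the per-epoch shuffling are illustrative assumptions:

    import numpy as np

    def minibatch_gd_epoch(theta, X, Y, grad, lr=0.01, n=32):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), n):
            b = idx[start:start + n]        # one mini-batch X^(i:i+n), Y^(i:i+n)
            theta = theta - lr * grad(theta, X[b], Y[b])
        return theta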
