March 1, 2025

MUHAMMAD GHIFARY

<aside> 💡

This article was written with the help of OpenAI’s deep research, an agentic AI tool that employs reasoning to synthesize extensive online information and execute complex multi-step research tasks.

</aside>

Deep learning continues to advance, with optimization algorithms playing a crucial role in enhancing model efficiency. Recent developments in optimization techniques have introduced new methods that improve convergence rates, reduce computational costs, and enhance model generalization.

This article discusses the latest advancements in deep learning optimization, including novel optimizers such as Lion and Sophia, as well as second-order techniques that contribute to more efficient training processes.

New and Improved Optimization Algorithms

1. Lion (Evolved Sign Momentum)

Lion is a recently developed optimizer identified through AutoML (Chen et al., 2023). It updates weights based on the sign of the gradient rather than its magnitude, reducing computational overhead while maintaining effective training performance. Lion has been shown to achieve improved accuracy in image classification tasks, particularly in Vision Transformer (ViT) models (Dosovitskiy et al., 2020), with reduced computational requirements compared to AdamW (Loshchilov & Hutter, 2019).
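The core idea can be illustrated with a minimal sketch of a single Lion update (following Chen et al., 2023); the variable names and defaults here are illustrative, not from a specific library:

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update (sketch). The step direction is the sign of an
    interpolated momentum, so every parameter moves by exactly +/- lr,
    and no second-moment statistics need to be stored (unlike AdamW)."""
    update = np.sign(beta1 * m + (1.0 - beta1) * g)  # sign, not magnitude
    w_new = w - lr * (update + wd * w)               # decoupled weight decay
    m_new = beta2 * m + (1.0 - beta2) * g            # momentum tracks gradients
    return w_new, m_new
```

Because only the sign is used, Lion stores a single momentum buffer per parameter, which is the source of its reduced memory and compute overhead relative to AdamW.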

2. LAMB (Layer-wise Adaptive Moments for Batch Training)

LAMB is designed for large-batch training and has been particularly useful for transformer-based models such as BERT (You et al., 2020). It incorporates layer-wise normalization to maintain stable learning dynamics in large-scale training scenarios, enabling efficient training with significantly larger batch sizes compared to traditional optimizers.
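LAMB's key mechanism is a per-layer "trust ratio" that rescales the raw Adam-style update by the ratio of the layer's weight norm to the update norm. A minimal sketch (function names are illustrative):

```python
import numpy as np

def lamb_trust_ratio(w, adam_update):
    """LAMB's layer-wise scaling (sketch): the step for each layer is
    rescaled so it stays proportional to that layer's own weight norm,
    which keeps updates balanced across layers at very large batch sizes."""
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(adam_update)
    if w_norm > 0 and u_norm > 0:
        return w_norm / u_norm
    return 1.0

def lamb_step(w, adam_update, lr=1e-3):
    """Apply one LAMB-style update given a precomputed Adam-style update."""
    return w - lr * lamb_trust_ratio(w, adam_update) * adam_update
```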

3. AdaFactor: Memory-Efficient Optimization

AdaFactor, a variant of Adam, reduces memory overhead by storing only row- and column-wise squared gradient sums rather than full matrices (Shazeer & Stern, 2018). It reduces memory usage to sublinear in model size while delivering similar convergence as Adam. This makes it particularly beneficial for training large-scale models, such as T5, on resource-limited hardware.
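The factored trick can be sketched for a 2-D weight matrix: instead of a full matrix of squared-gradient averages, AdaFactor keeps only row and column sums, and reconstructs the full estimate as a rank-1 outer product. The helper names below are illustrative:

```python
import numpy as np

def factored_v(r, c):
    """Reconstruct the second-moment estimate from its row factor r and
    column factor c: V ~= outer(r, c) / sum(r). For an n x m parameter
    this stores O(n + m) values instead of O(n * m)."""
    return np.outer(r, c) / np.sum(r)

def adafactor_accumulate(r, c, g, beta2=0.999, eps=1e-30):
    """One sketch of AdaFactor's accumulator update: exponential moving
    averages of the row sums and column sums of the squared gradient."""
    g2 = g * g + eps
    r_new = beta2 * r + (1.0 - beta2) * g2.sum(axis=1)  # per-row sums
    c_new = beta2 * c + (1.0 - beta2) * g2.sum(axis=0)  # per-column sums
    return r_new, c_new
```

Note that when the true squared-gradient matrix is rank-1, the factored reconstruction is exact; in general it is an approximation that works well in practice.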

4. AdaBelief

AdaBelief, another variant of Adam, adapts step sizes according to how much the gradient deviates from an expected trend (Zhuang et al., 2020). It treats the exponential moving average of past gradients as a prediction of the next gradient. This approach merges the benefits of adaptive methods and SGD: AdaBelief attains fast convergence like Adam but with SGD-like generalization. On ImageNet, AdaBelief achieved accuracy on par with SGD (unusual for an adaptive optimizer). It is also noted for stability in GAN training, outperforming a well-tuned Adam on CIFAR-10 GANs.
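The only change relative to Adam is in the second-moment accumulator: it tracks the squared deviation (g − m)² rather than g². A minimal single-step sketch (names and defaults are illustrative):

```python
import numpy as np

def adabelief_step(w, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaBelief step (sketch). The accumulator s tracks (g - m)**2,
    the deviation of the gradient from its EMA prediction m, so steps are
    large when the gradient matches the trend and small when it deviates."""
    m = b1 * m + (1.0 - b1) * g
    s = b2 * s + (1.0 - b2) * (g - m) ** 2   # "belief" in the prediction
    m_hat = m / (1.0 - b1 ** t)              # bias correction, step t >= 1
    s_hat = s / (1.0 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s
```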

Other Adam improvements: AdamW (Loshchilov & Hutter, 2019) decoupled weight decay from the gradient update, improving regularization and becoming a default in vision transformers. AMSGrad (Reddi et al., 2018) was introduced to handle the convergence failure of Adam in some simple cases; it enforces a non-increasing second-moment term, provably restoring convergence in theory. RAdam (Rectified Adam) addressed Adam’s reliance on learning-rate warmup by analytically adjusting the variance of the adaptive learning rate, leading to more stable training without manual warmup (Liu et al., 2020). Lookahead (Zhang et al., 2019) is another innovation in which a fast optimizer’s updates are periodically “averaged” into a slow-moving set of weights, improving stability and often final accuracy. Each of these refinements targets a specific weakness in Adam (e.g., instability, generalization, or the need for hyperparameter tricks) to make optimization more robust.
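The AdamW distinction is easy to miss in prose, so here is a minimal sketch of one step with the decay decoupled (variable names are illustrative): with classic L2 regularization, wd·w would be added to the gradient and then rescaled by the adaptive denominator; AdamW instead applies it directly to the weights.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW (sketch): weight decay is applied to the weights directly,
    outside the adaptive update, so regularization strength is not
    distorted by the per-parameter 1/sqrt(v_hat) scaling."""
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)  # decay added here
    return w, m, v
```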

Advancements in Learning Rate Adaptation and Stability

Momentum-based Optimizers

Momentum-based approaches, including Nesterov acceleration and sign-based methods such as Lion and signSGD, help improve training stability and convergence speed. These techniques help mitigate the effects of noisy gradients.
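Nesterov acceleration, for example, evaluates the gradient at a look-ahead point that anticipates the momentum move. A minimal sketch under one common formulation (function and parameter names are illustrative):

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov-momentum step (sketch): the gradient is taken at the
    look-ahead position w + mu*v rather than at w, which corrects the
    velocity before it is applied and damps oscillation."""
    g = grad_fn(w + mu * v)     # gradient at the anticipated position
    v = mu * v - lr * g         # update velocity
    return w + v, v
```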

Gradient Clipping for Stability

Gradient clipping techniques, including per-layer norm clipping and adaptive clipping, have been widely adopted to prevent instability caused by exploding gradients in deep networks.
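The most common variant, clipping by global norm, rescales all gradients together so their combined L2 norm never exceeds a threshold; per-layer and adaptive schemes apply the same idea with per-tensor thresholds. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm does
    not exceed max_norm; gradients below the threshold pass unchanged.
    Returns the (possibly rescaled) gradients and the pre-clip norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads], total
    return list(grads), total
```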