Why Nesterov Momentum Works: A Spectral Analysis Perspective
In this post, we dive into the mechanics of Nesterov momentum using spectral analysis to uncover why it accelerates optimization. We’ll start with the core iteration scheme, simplify it step by step to derive a recursive update rule, and then provide an intuitive explanation. Finally, we’ll make the analysis rigorous for quadratic objectives, comparing vanilla gradient descent, the Heavy Ball (Polyak) method, and Nesterov momentum. This approach reveals how Nesterov introduces curvature-aware corrections that speed up convergence without extra computational cost.
Consider the following Lion-$\mathcal{K}$ iteration, which blends Polyak and Nesterov momentum elements:
\[\begin{align} M_{t+1} &= \beta_{\text{p}} M_t - (1 - \beta_{\text{p}}) \nabla F(X_t),\\ \tilde{M}_{t+1} &= \beta_{\text{n}} M_{t+1} - (1 - \beta_{\text{n}}) \nabla F(X_t),\\ X_{t+1} &= X_t + \eta(\nabla \mathcal{K}(\tilde{M}_{t+1}) - \lambda X_t), \end{align}\]where $\eta > 0$ is the learning rate, $\beta_{\text{n}}, \beta_{\text{p}} \in [0,1]$ are momentum coefficients, $\lambda \geq 0$ is the weight decay coefficient, $\mathcal{K}: \mathbb{X} \rightarrow \mathbb{R}$ is a convex function, and $\nabla \mathcal{K}$ denotes its gradient (or subgradient if nondifferentiable).
For simplicity, we’ll analyze the deterministic case without weight decay and with $\nabla \mathcal{K}$ as the identity map: $\nabla \mathcal{K} = I$ and $\lambda = 0$. In this setting the scheme reduces to gradient descent on $F$ with momentum, with no $\mathcal{K}$-transformation of the update.
When choosing $\nabla \mathcal{K}=\text{msign}$, this recovers the Muon optimizer with Nesterov momentum. We emphasize that, in the general strongly convex setting, there is no proof that Polyak momentum achieves accelerated convergence, whereas Nesterov momentum provably does.
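To make the simplified setting concrete, here is a minimal NumPy sketch of the iteration with $\nabla \mathcal{K} = I$ and $\lambda = 0$; the function name, hyperparameters, and the random quadratic test problem are illustrative, not part of any particular implementation.

```python
import numpy as np

def lion_k_identity_step(x, m, grad_fn, eta, beta_p=0.9, beta_n=0.9):
    """One step of the simplified iteration with grad K = identity and no weight decay."""
    g = grad_fn(x)
    m_new = beta_p * m - (1 - beta_p) * g        # Polyak-averaged momentum M_{t+1}
    m_tilde = beta_n * m_new - (1 - beta_n) * g  # Nesterov-style interpolation tilde{M}_{t+1}
    x_new = x + eta * m_tilde                    # grad K = I, lambda = 0
    return x_new, m_new

# Illustrative test problem: F(x) = 0.5 * x^T A x for a random SPD matrix A.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)
L = np.linalg.eigvalsh(A)[-1]

x, m = rng.standard_normal(5), np.zeros(5)
for _ in range(200):
    x, m = lion_k_identity_step(x, m, lambda z: A @ z, eta=1.0 / L)
print(np.linalg.norm(x))  # should be small: the iterate approaches the minimizer 0
```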
After algebraic manipulation, we arrive at:
\[\begin{align} X_{t+1} &= X_t + \beta_{\text{p}} (X_t - X_{t-1}) - \eta \left[ (1 - \beta_{\text{p}}) \beta_{\text{n}} + (1 - \beta_{\text{n}}) \right] \nabla F(X_t) + \eta \beta_{\text{p}} (1 - \beta_{\text{n}}) \nabla F(X_{t-1}) \\ &= X_t + \beta_{\text{p}} \underbrace{(X_t - X_{t-1})}_{\text{momentum}} - \eta \beta_{\text{p}} (1 - \beta_{\text{n}}) \underbrace{(\nabla F(X_t) - \nabla F(X_{t-1}))}_{\text{gradient correction}} - \eta (1 - \beta_{\text{p}}) \nabla F(X_t). \end{align}\]The derivation is deferred to the appendix. This recursion expresses $X_{t+1}$ in terms of $X_t$, $X_{t-1}$, $\nabla F(X_t)$, and $\nabla F(X_{t-1})$, without explicit $M$ or $\tilde{M}$ variables. It’s ideal for analyzing convergence via spectral methods.
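As a sanity check on this recursion, one can run the original three-variable iteration and the $X$-only recursion side by side on a small quadratic and confirm the iterates coincide; the sketch below assumes the same simplifications ($\nabla \mathcal{K} = I$, $\lambda = 0$) and uses arbitrary illustrative hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)           # Hessian of the quadratic test objective
grad = lambda z: A @ z
eta, bp, bn = 0.05, 0.9, 0.8      # illustrative eta, beta_p, beta_n

# Original three-variable iteration (grad K = I, lambda = 0).
x, m = rng.standard_normal(4), np.zeros(4)
xs = [x.copy()]
for _ in range(10):
    g = grad(x)
    m = bp * m - (1 - bp) * g
    m_tilde = bn * m - (1 - bn) * g
    x = x + eta * m_tilde
    xs.append(x.copy())

# X-only recursion, seeded with the first two iterates above.
x_prev, x_cur = xs[0], xs[1]
for t in range(1, 10):
    x_next = (x_cur + bp * (x_cur - x_prev)
              - eta * bp * (1 - bn) * (grad(x_cur) - grad(x_prev))
              - eta * (1 - bp) * grad(x_cur))
    x_prev, x_cur = x_cur, x_next
    assert np.allclose(x_cur, xs[t + 1])  # matches the original iterate X_{t+1}
print("X-only recursion reproduces the original iteration")
```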
The recursion highlights two key components: a momentum term $\beta_{\text{p}} (X_t - X_{t-1})$, which averages past updates to dampen oscillations, and a gradient correction term $-\eta \beta_{\text{p}} (1 - \beta_{\text{n}}) (\nabla F(X_t) - \nabla F(X_{t-1}))$, which adjusts for changes in the gradient.
Under the approximation
\[\frac{\nabla F(X_t) - \nabla F(X_{t-1})}{\eta} \approx \nabla^2 F(X_t) \dot{X}_t,\]where $\dot{X}_t \approx (X_t - X_{t-1})/\eta$ is a discrete velocity, the correction term captures second-order curvature information $\nabla^2 F(X_t)$ along the direction $\dot{X}_t$. This provides Hessian-like insights without additional computation, accelerating progress in ill-conditioned directions.
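For a quadratic $F(x) = \frac{1}{2} x^\top A x$ this relation is exact rather than approximate, since $\nabla F(X_t) - \nabla F(X_{t-1}) = A (X_t - X_{t-1})$; the short check below (with an arbitrary SPD matrix $A$) makes that concrete.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
A = B @ B.T + np.eye(3)                     # Hessian of F(x) = 0.5 x^T A x
x_prev, x_cur, eta = rng.standard_normal(3), rng.standard_normal(3), 0.1

x_dot = (x_cur - x_prev) / eta              # discrete velocity
grad_diff = (A @ x_cur - A @ x_prev) / eta  # (grad F(X_t) - grad F(X_{t-1})) / eta
print(np.allclose(grad_diff, A @ x_dot))    # True: the relation is exact for quadratics
```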
Setting $\beta_{\text{n}} = 1$ eliminates the correction (since $1 - \beta_{\text{n}} = 0$), reducing to pure Polyak momentum. Thus, $\beta_{\text{n}} < 1$ is what enables Nesterov’s “lookahead” effect.
An alternative Nesterov form, the implicit velocity scheme, evaluates the gradient at a lookahead point: $\nabla F(X_t + \beta (X_t - X_{t-1})/\eta)$. By Taylor expansion:
\[\nabla F(X_t + \beta (X_t - X_{t-1})/\eta) \approx \nabla F(X_t) + \beta \nabla^2 F(X_t) \dot{X}_t.\]This again injects directional curvature. Both forms are equivalent in convergence rate, as shown in this paper.
To quantify these effects, we’ll perform a spectral analysis for quadratic objectives $F(x) = \frac{1}{2} x^\top A x$, where $A \succ 0$ is symmetric positive definite with eigenvalues in $[\mu, L]$ ($0 < \mu = \lambda_{\min}(A)$, $L = \lambda_{\max}(A)$, and condition number $\kappa = L/\mu$). The minimizer is $x_\star = 0$ with $F(x_\star) = 0$, and $\nabla F(x) = A x$.
Rewrite the recursion compactly as
\[X_{t+1} = X_t + \alpha (X_t - X_{t-1}) - \beta (\nabla F(X_t) - \nabla F(X_{t-1})) - \gamma \nabla F(X_t),\]where $\alpha = \beta_{\text{p}}$, $\beta = \eta \beta_{\text{p}} (1 - \beta_{\text{n}})$, and $\gamma = \eta (1 - \beta_{\text{p}})$.
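The reparameterization is just a change of variables; as a small convenience for the numerical checks below (the function name is ours, purely illustrative), it can be written as:

```python
def recursion_coeffs(eta, beta_p, beta_n):
    """Map (eta, beta_p, beta_n) to the compact coefficients (alpha, beta, gamma)."""
    alpha = beta_p
    beta = eta * beta_p * (1 - beta_n)
    gamma = eta * (1 - beta_p)
    return alpha, beta, gamma

print(recursion_coeffs(eta=0.1, beta_p=0.9, beta_n=0.8))  # roughly (0.9, 0.018, 0.01)
```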
Introduce an auxiliary variable $V_t = X_t - X_{t-1}$ to form a linear system:
\[\begin{pmatrix} I & \mathbf{0}_{n \times n} \\ -\alpha I + \beta \nabla F & I \end{pmatrix} \begin{pmatrix} X_{t+1} \\ V_{t+1} \end{pmatrix} = \begin{pmatrix} I - \gamma \nabla F & I \\ -\alpha I + \beta \nabla F & \mathbf{0}_{n \times n} \end{pmatrix} \begin{pmatrix} X_t \\ V_t \end{pmatrix},\]where $\nabla F$ is shorthand for the gradient map applied to the first block. For the quadratic case, $\nabla F(X) = A X$, so the system is genuinely linear: \[\begin{pmatrix} I & \mathbf{0}_{n \times n} \\ -\alpha I + \beta A & I \end{pmatrix} \begin{pmatrix} X_{t+1} \\ V_{t+1} \end{pmatrix} = \begin{pmatrix} I - \gamma A & I \\ -\alpha I + \beta A & \mathbf{0}_{n \times n} \end{pmatrix} \begin{pmatrix} X_t \\ V_t \end{pmatrix}.\]
Inverting the left matrix yields the iteration matrix:
\[\begin{pmatrix} X_{t+1} \\ V_{t+1} \end{pmatrix} = \begin{pmatrix} I & \mathbf{0}_{n \times n} \\ \alpha I - \beta A & I \end{pmatrix} \begin{pmatrix} I - \gamma A & I \\ -\alpha I + \beta A & \mathbf{0}_{n \times n} \end{pmatrix} \begin{pmatrix} X_t \\ V_t \end{pmatrix} = \underbrace{ \begin{pmatrix} I - \gamma A & I \\ \gamma A (\beta A - \alpha I) & \alpha I - \beta A \end{pmatrix} }_{M(\alpha, \beta, \gamma)} \begin{pmatrix} X_t \\ V_t \end{pmatrix}. \tag{$\dagger$}\]The spectral radius $\rho(\alpha, \beta, \gamma) = \rho\big(M(\alpha, \beta, \gamma)\big)$, i.e. the largest eigenvalue modulus of $M(\alpha, \beta, \gamma)$, governs the asymptotic linear convergence rate: smaller $\rho < 1$ implies faster convergence.
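To evaluate the rate numerically, one can assemble the $2n \times 2n$ matrix from $(\dagger)$ and compute its spectral radius directly. The sketch below uses an arbitrary SPD test matrix and illustrative coefficients.

```python
import numpy as np

def iteration_matrix(A, alpha, beta, gamma):
    """The block matrix M(alpha, beta, gamma) from (dagger) for F(x) = 0.5 x^T A x."""
    n = A.shape[0]
    I = np.eye(n)
    top = np.hstack([I - gamma * A, I])
    bottom = np.hstack([gamma * A @ (beta * A - alpha * I), alpha * I - beta * A])
    return np.vstack([top, bottom])

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)
L = np.linalg.eigvalsh(A)[-1]

# Illustrative choice corresponding to eta = 1/L, beta_p = 0.9, beta_n = 0.8.
alpha, beta, gamma = 0.9, (1.0 / L) * 0.9 * 0.2, (1.0 / L) * 0.1
print(spectral_radius(iteration_matrix(A, alpha, beta, gamma)))  # < 1: linear convergence
```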
Since $A = Q \Lambda Q^\top$ (orthogonal $Q$, diagonal $\Lambda$), in the eigenbasis of $A$, $M$ decouples into $2 \times 2$ blocks $G(\lambda)$ for each eigenvalue $\lambda \in [\mu, L]$. The full spectral radius is the maximum over these blocks. The $2 \times 2$ block is
\[G(\lambda) = \begin{pmatrix} 1 - \gamma \lambda & 1 \\ \gamma (\beta \lambda - \alpha) \lambda & \alpha - \beta \lambda \end{pmatrix}.\]The characteristic polynomial is
\[\det(rI - G(\lambda)) = r^2 - [1 + \alpha - (\gamma + \beta) \lambda] r + (\alpha - \beta \lambda) = 0,\]with discriminant
\[\Delta(\lambda) = [1 + \alpha - (\beta + \gamma) \lambda]^2 - 4 (\alpha - \beta \lambda).\]The leading coefficient of $\Delta(\lambda)$ (as a quadratic in $\lambda$) is positive, so its maximum over $[\mu, L]$ occurs at an endpoint, $\lambda = \mu$ or $\lambda = L$. With $\gamma = 1/L$, we have $\Delta(L) = (\alpha - \beta L)(\alpha - \beta L - 4)$, which is negative whenever $0 < \alpha - \beta L < 4$; in that regime the binding endpoint is $\lambda = \mu$.
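The per-eigenvalue picture can also be scanned directly: the sketch below evaluates $\rho(G(\lambda))$ and $\Delta(\lambda)$ over a grid of $\lambda \in [\mu, L]$, with an illustrative spectrum and coefficients chosen so that $\Delta(\lambda) \leq 0$ on the whole interval.

```python
import numpy as np

def block_radius(lam, alpha, beta, gamma):
    """Spectral radius of the 2x2 block G(lambda)."""
    G = np.array([[1 - gamma * lam, 1.0],
                  [gamma * (beta * lam - alpha) * lam, alpha - beta * lam]])
    return np.max(np.abs(np.linalg.eigvals(G)))

def discriminant(lam, alpha, beta, gamma):
    return (1 + alpha - (beta + gamma) * lam) ** 2 - 4 * (alpha - beta * lam)

mu, L = 1.0, 100.0                            # illustrative spectrum, kappa = 100
alpha, beta, gamma = 0.82, 0.3 / L, 1.0 / L   # chosen so that Delta(lambda) <= 0 on [mu, L]
lams = np.linspace(mu, L, 1000)
print(max(block_radius(lam, alpha, beta, gamma) for lam in lams))       # overall rate, ~0.904
print(max(discriminant(lam, alpha, beta, gamma) for lam in lams) <= 0)  # True: complex regime
```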
First, vanilla gradient descent: set $\beta_{\text{p}} = 0$ and $\beta_{\text{n}} = 1$, so $\alpha = \beta = 0$ and $\gamma = \eta$. The iteration collapses to $X_{t+1} = (I - \gamma A) X_t$, with spectral radius
\[\rho(0, 0, \eta) = \|I - \gamma A\|_2 = \max\{|1 - \gamma \mu|, |1 - \gamma L|\}.\]The standard step size $\eta = 1/L$ gives $\rho = 1 - \mu/L$. The optimal $\eta = 2/(\mu + L)$ yields $\rho = (\kappa - 1)/(\kappa + 1) \approx 1 - 2/\kappa$ for large $\kappa$.
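A two-line check of these rates (with an illustrative $\kappa = 100$):

```python
mu, L = 1.0, 100.0  # illustrative spectrum, kappa = 100

def gd_radius(eta):
    return max(abs(1 - eta * mu), abs(1 - eta * L))

print(gd_radius(1 / L))         # 0.99   = 1 - mu/L
print(gd_radius(2 / (mu + L)))  # ~0.980 = (kappa - 1) / (kappa + 1)
```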
Next, the Heavy Ball (Polyak) method: set $\beta_{\text{n}} = 1$, so $\beta = 0$ and $\alpha = \beta_{\text{p}}$, and choose the step size so that $\gamma = \eta (1 - \beta_{\text{p}}) = 1/L$. The iteration matrix simplifies to
\[G_{\text{HB}} = \begin{pmatrix} I - \gamma A & I \\ -\alpha \gamma A & \alpha I \end{pmatrix}.\]In the eigenbasis, each $2 \times 2$ block is
\[G_{\text{HB}}(\lambda) = \begin{pmatrix} 1 - \gamma \lambda & 1 \\ -\alpha \gamma \lambda & \alpha \end{pmatrix}, \quad \lambda \in [\mu, L].\]The spectral radius is
\[\rho_{\text{HB}} = \max_{\lambda \in [\mu, L]} \rho(G_{\text{HB}}(\lambda)).\]The eigenvalues solve the characteristic polynomial
\[\det\begin{pmatrix} 1 - \gamma \lambda - r & 1 \\ -\alpha \gamma \lambda & \alpha - r \end{pmatrix} = r^2 - (1 + \alpha - \gamma \lambda) r + \alpha = 0.\]If $\Delta(\lambda) = (1 + \alpha - \gamma \lambda)^2 - 4\alpha \leq 0$ for all $\lambda \in [\mu, L]$, which (since $\lambda = \mu$ is the binding endpoint) corresponds to $\Delta(\mu) \leq 0 \Leftrightarrow 1 - \sqrt{\mu/L} \leq \sqrt{\alpha}$, the eigenvalues are complex and their modulus is $\sqrt{\alpha}$. Tuning $\alpha$ so that $\sqrt{\alpha} = 1 - \sqrt{\mu/L}$ gives a uniform contraction of \(\rho(G_{\text{HB}}(\lambda)) = \sqrt{\alpha} = 1 - 1/\sqrt{\kappa},\) improving over GD's $1 - 1/\kappa$ rate.
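Numerically, with the tuned choice $\alpha = (1 - \sqrt{\mu/L})^2$ and $\gamma = 1/L$, every $2 \times 2$ block indeed contracts at the same rate $\sqrt{\alpha}$; the sketch below verifies this on an illustrative spectrum.

```python
import numpy as np

mu, L = 1.0, 100.0                      # illustrative spectrum, kappa = 100
gamma = 1.0 / L
alpha = (1 - np.sqrt(mu / L)) ** 2      # tuned Heavy Ball momentum, sqrt(alpha) = 1 - sqrt(mu/L)

def hb_block_radius(lam):
    G = np.array([[1 - gamma * lam, 1.0],
                  [-alpha * gamma * lam, alpha]])
    return np.max(np.abs(np.linalg.eigvals(G)))

radii = [hb_block_radius(lam) for lam in np.linspace(mu, L, 500)]
print(np.allclose(radii, np.sqrt(alpha)))  # True: every block contracts at the same rate
print(np.sqrt(alpha))                      # 0.9 = 1 - 1/sqrt(kappa)
```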
Now include the Nesterov correction ($\beta_{\text{n}} < 1$): $\alpha = \beta_{\text{p}}$, $\beta = \eta \beta_{\text{p}} (1 - \beta_{\text{n}}) > 0$, and, as before, $\gamma = \eta (1 - \beta_{\text{p}}) = 1/L$. Ensuring $\Delta(\mu) \leq 0$ requires
\[1 - \sqrt{\frac{\mu}{L}} \leq \sqrt{\alpha - \beta \mu}.\]Combined with the endpoint argument above (and $\Delta(L) < 0$), this gives $\Delta(\lambda) \leq 0$ on all of $[\mu, L]$, so the eigenvalues of $G(\lambda)$ are complex with modulus $\sqrt{\alpha - \beta \lambda}$. When $\alpha, \beta$ are tuned so that equality holds at $\lambda = \mu$, i.e. $\sqrt{\alpha - \beta \mu} = 1 - \sqrt{\mu/L}$, each eigendirection contracts at rate
\[\rho(G(\lambda)) = \sqrt{\alpha - \beta \lambda} \leq \sqrt{\alpha - \beta \mu} = 1 - \sqrt{\frac{\mu}{L}},\]which matches Heavy Ball's uniform $\sqrt{\alpha} \approx 1 - 1/\sqrt{\kappa}$ at $\lambda = \mu$ and is strictly better whenever $\lambda > \mu$ (since $\beta > 0$).
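The following sketch compares $\rho(G_{\text{HB}}(\lambda))$ and $\rho(G(\lambda))$ with $\beta > 0$ across the spectrum, treating $(\alpha, \beta, \gamma)$ as free parameters tuned so that the worst-case rate at $\lambda = \mu$ matches Heavy Ball; the specific numbers are illustrative.

```python
import numpy as np

mu, L = 1.0, 100.0
gamma = 1.0 / L
target = 1 - np.sqrt(mu / L)            # worst-case contraction at lambda = mu

def block_radius(lam, alpha, beta):
    G = np.array([[1 - gamma * lam, 1.0],
                  [gamma * (beta * lam - alpha) * lam, alpha - beta * lam]])
    return np.max(np.abs(np.linalg.eigvals(G)))

# Heavy Ball: beta = 0, tuned so that sqrt(alpha) = 1 - sqrt(mu/L).
alpha_hb = target ** 2
# Nesterov-style: beta > 0, alpha shifted so that sqrt(alpha - beta * mu) = 1 - sqrt(mu/L).
beta_nes = 0.5 / L
alpha_nes = target ** 2 + beta_nes * mu

for lam in [mu, 10.0, L]:
    print(lam, block_radius(lam, alpha_hb, 0.0), block_radius(lam, alpha_nes, beta_nes))
# Same rate at lambda = mu; the beta > 0 variant contracts faster for larger lambda.
```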
This spectral lens shows Polyak momentum uniformly reduces the radius from $1 - 1/\kappa$ (GD) to $1 - 1/\sqrt{\kappa}$ by damping oscillations. Nesterov goes further, adding a $\lambda$-dependent shrinkage via the gradient correction, tailoring acceleration to each eigendirection. This curvature awareness explains Nesterov’s edge in practice, especially for ill-conditioned problems, while keeping the per-iteration cost almost identical to vanilla momentum.
Appendix: Deriving the Recursion
Let’s simplify the system step by step to eliminate the auxiliary variables $M_t$ and $\tilde{M}_t$, yielding a recursion solely in terms of $X_t$.
Step 1: Apply the simplifications. With $\nabla \mathcal{K} = I$ and $\lambda = 0$, the update for $X_{t+1}$ becomes:
\[X_{t+1} = X_t + \eta \tilde{M}_{t+1}.\]Step 2: Express $\tilde{M}_{t+1}$ in terms of $M_{t+1}$. From the second equation:
\[\tilde{M}_{t+1} = \beta_{\text{n}} M_{t+1} - (1 - \beta_{\text{n}}) \nabla F(X_t).\]Step 3: Substitute into the $X_{t+1}$ equation. We get:
\[X_{t+1} = X_t + \eta \left[ \beta_{\text{n}} M_{t+1} - (1 - \beta_{\text{n}}) \nabla F(X_t) \right].\]Solving for $M_{t+1}$:
\[M_{t+1} = \frac{1}{\beta_{\text{n}} \eta} (X_{t+1} - X_t) + \frac{1 - \beta_{\text{n}}}{\beta_{\text{n}}} \nabla F(X_t).\]Step 4: Similarly, express $M_t$ from the previous step. Analogously:
\[M_t = \frac{1}{\beta_{\text{n}} \eta} (X_t - X_{t-1}) + \frac{1 - \beta_{\text{n}}}{\beta_{\text{n}}} \nabla F(X_{t-1}).\]Step 5: Plug into the first equation. Substitute the expressions for $M_{t+1}$ and $M_t$:
\[\frac{1}{\beta_{\text{n}} \eta} (X_{t+1} - X_t) + \frac{1 - \beta_{\text{n}}}{\beta_{\text{n}}} \nabla F(X_t) = \beta_{\text{p}} \left[ \frac{1}{\beta_{\text{n}} \eta} (X_t - X_{t-1}) + \frac{1 - \beta_{\text{n}}}{\beta_{\text{n}}} \nabla F(X_{t-1}) \right] - (1 - \beta_{\text{p}}) \nabla F(X_t).\]Step 6: Rearrange for a recursion in $X$ only. After algebraic manipulation, we arrive at:
\[\begin{align} X_{t+1} &= X_t + \beta_{\text{p}} (X_t - X_{t-1}) - \eta \left[ (1 - \beta_{\text{p}}) \beta_{\text{n}} + (1 - \beta_{\text{n}}) \right] \nabla F(X_t) + \eta \beta_{\text{p}} (1 - \beta_{\text{n}}) \nabla F(X_{t-1}) \\ &= X_t + \beta_{\text{p}} \underbrace{(X_t - X_{t-1})}_{\text{momentum}} - \eta \beta_{\text{p}} (1 - \beta_{\text{n}}) \underbrace{(\nabla F(X_t) - \nabla F(X_{t-1}))}_{\text{gradient correction}} - \eta (1 - \beta_{\text{p}}) \nabla F(X_t). \end{align}\]This is exactly the recursion stated in the main text, expressed in terms of $X_t$, $X_{t-1}$, $\nabla F(X_t)$, and $\nabla F(X_{t-1})$ alone, without explicit $M$ or $\tilde{M}$ variables.
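The Step 6 algebra can be double-checked symbolically; a short SymPy sketch (scalar case, with symbols standing in for the iterates and gradients) confirms that solving the Step 5 equation for $X_{t+1}$ reproduces the claimed recursion.

```python
import sympy as sp

# Scalar stand-ins for X_{t+1}, X_t, X_{t-1}, grad F(X_t), grad F(X_{t-1}).
x2, x1, x0, g1, g0 = sp.symbols('x2 x1 x0 g1 g0')
eta, bp, bn = sp.symbols('eta beta_p beta_n', positive=True)

# Step 5: substitute the expressions for M_{t+1} and M_t into M_{t+1} = bp*M_t - (1-bp)*grad F(X_t).
lhs = (x2 - x1) / (bn * eta) + (1 - bn) / bn * g1
rhs = bp * ((x1 - x0) / (bn * eta) + (1 - bn) / bn * g0) - (1 - bp) * g1
x2_solved = sp.solve(sp.Eq(lhs, rhs), x2)[0]

# Step 6: the claimed recursion.
x2_claimed = (x1 + bp * (x1 - x0)
              - eta * bp * (1 - bn) * (g1 - g0)
              - eta * (1 - bp) * g1)
print(sp.simplify(x2_solved - x2_claimed))  # prints 0: the recursion is correct
```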