Longest Path Search

Another short proof of OCO and miscellanea

Posted 2024-03-27

A short post with two silly, but ultimately useful, results. The first is yet another proof of OGD being low regret. The second is that the last iterate, or even the average iterate (or, indeed, any convex combination of the iterates), can be very far away from the optimal point, even though the regret is low, without further assumptions on the functions we are optimizing over.

Why yet another proof? No idea. OCO still feels incredibly magical to me on many levels and getting several clean proofs is usually useful for understanding that kind of thing.

On the other hand, I'm not sure this gets anyone closer to that goal, but it's (probably) worth pursuing anyways.

(Yet another) Proof of the regret of online gradient descent

Using an identical setup and notation to the previous post on OCO, note that, given subgradients $g_t \in \mathbf{R}^n$ bounded in norm by $L > 0$, we only have to bound the quantity

$$
\sum_{t=1}^T g_t^T(x_t - x^\star).
$$

Here, for $t = 1, \dots, T$, we define

$$
x_{t+1} = x_t - \eta g_t,
$$

starting from some initial point $x_1 \in \mathbf{R}^n$, and $x^\star \in \mathbf{R}^n$ is any point in some bounded subset, containing $x_1$, with diameter at most $M > 0$.

The new proof is almost silly. Let $\|z\|^2 = z_1^2 + \dots + z_n^2$ be the squared sum-of-squares (Euclidean) norm; then

$$
0 \le \|x_{T+1} - x^\star\|^2.
$$

But $x_{T+1} = x_T - \eta g_T$, so

$$
\begin{aligned}
0 \le \|x_{T+1} - x^\star\|^2 &= \|x_{T} - x^\star\|^2 - 2\eta g_T^T(x_T - x^\star) + \eta^2 \|g_T\|^2 = \dots\\
&= \|x_1 - x^\star\|^2 - 2\eta \sum_{t=1}^T g_t^T(x_t - x^\star) + \eta^2 \sum_{t=1}^T \|g_t\|^2.
\end{aligned}
$$

Rearranging gives

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le \frac{1}{2\eta}\|x_1 - x^\star\|^2 + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|^2.
$$

The rest is mechanical of course, but it's worth writing anyways. Using our bounds on the diameter of the domain and the gradients, we get

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le \frac{M^2}{2\eta} + \frac{\eta T L^2}{2},
$$

and setting $\eta = M/(L\sqrt{T})$ gives the result

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le ML\sqrt{T}.
$$
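
(For completeness: with this choice of $\eta$, the two terms in the previous bound balance,

$$
\frac{M^2}{2\eta} = \frac{ML\sqrt{T}}{2}
\qquad \text{and} \qquad
\frac{\eta T L^2}{2} = \frac{ML\sqrt{T}}{2},
$$

so their sum is $ML\sqrt{T}$.)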

Funnily enough, note that this bound is at least as tight as the previous one, even though the previous proof relied on an exact rewriting.
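
As a quick numerical sanity check (not part of the argument above), here is a small Python sketch, where the comparator set is taken to be a Euclidean ball of diameter $M$ centered at the origin so that the best fixed point in hindsight has a closed form; the names and constants are purely illustrative.

```python
# Numerical sanity check of the OGD regret bound (illustrative sketch).
# Assumes the comparator set C is the Euclidean ball of radius M/2
# centered at 0, so the best fixed point in hindsight has a closed form.
import numpy as np

rng = np.random.default_rng(0)
n, T, M, L = 5, 1000, 2.0, 1.0
eta = M / (L * np.sqrt(T))

x = np.zeros(n)                  # x_1 = 0 lies in C
grads, iterates = [], []
for _ in range(T):
    g = rng.normal(size=n)
    g *= L / np.linalg.norm(g)   # enforce ||g_t|| <= L
    grads.append(g)
    iterates.append(x.copy())
    x = x - eta * g              # (unprojected) gradient step

# Best fixed point in hindsight over the ball of radius M/2.
s = np.sum(grads, axis=0)
x_star = -(M / 2) * s / np.linalg.norm(s)

regret = sum(g @ (xt - x_star) for g, xt in zip(grads, iterates))
print(regret, M * L * np.sqrt(T))   # regret should not exceed M*L*sqrt(T)
```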

Projected online gradient descent

Another nice point is that this proof immediately carries over to the projected case, unlike the previous one. Say the optimal point $x^\star$ lies in some closed, convex set $C \subseteq \mathbf{R}^n$ with diameter at most $M$; then let $\Pi(x)$ denote the projection of $x$ onto $C$, defined

$$
\Pi(x) = \argmin_{z \in C} \|z - x\|^2.
$$

(We can say the projection, since this is unique from the convexity of $C$.) A fun exercise is to note that, for any two points $x, y \in \mathbf{R}^n$, the projection satisfies

$$
\|\Pi(x) - \Pi(y)\|^2 \le \|x - y\|^2.
$$

In other words, the projection $\Pi$ is nonexpansive.
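
As a quick check of this exercise (not a proof), here is a small sketch that takes $C$ to be a Euclidean ball, where the projection has the familiar closed form; `proj_ball` is just an illustrative helper.

```python
# Numerical check that projection onto a Euclidean ball is nonexpansive.
import numpy as np

def proj_ball(x, r=1.0):
    """Project x onto the Euclidean ball of radius r centered at 0."""
    nx = np.linalg.norm(x)
    return x if nx <= r else (r / nx) * x

rng = np.random.default_rng(1)
for _ in range(10_000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert np.linalg.norm(proj_ball(x) - proj_ball(y)) <= np.linalg.norm(x - y) + 1e-12
print("nonexpansiveness held on all sampled pairs")
```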

Anyways, in this case, note that the gradient update step is given by

$$
x_{t+1} = \Pi(x_t - \eta g_t);
$$

i.e., we project $x_t$ onto the set $C$ after every gradient update. The proof, using the fact that $\Pi$ is nonexpansive, is essentially identical to the previous one, since

$$
0 \le \|x_{T+1} - x^\star\|^2 = \|\Pi(x_T - \eta g_T) - x^\star\|^2 \le \|x_T - \eta g_T - x^\star\|^2.
$$

(Note the second inequality there, which uses nonexpansiveness together with the fact that $\Pi(x^\star) = x^\star$, since $x^\star \in C$.) The rest of the proof proceeds identically to the previous one, except that, instead of exact equalities, we have inequalities coming from the nonexpansiveness of $\Pi$.
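
Concretely, the projected update is a one-line change to the earlier sketch: take the gradient step, then project. A minimal version, again using a Euclidean ball of radius $M/2$ as an illustrative choice of $C$:

```python
# Projected OGD sketch: plain gradient step followed by projection onto C
# (here C is the Euclidean ball of radius M/2, an illustrative choice).
import numpy as np

def proj_ball(x, r):
    nx = np.linalg.norm(x)
    return x if nx <= r else (r / nx) * x

rng = np.random.default_rng(2)
n, T, M, L = 5, 1000, 2.0, 1.0
eta = M / (L * np.sqrt(T))

x = np.zeros(n)                          # x_1 = 0 lies in C
for _ in range(T):
    g = rng.normal(size=n)
    g *= L / np.linalg.norm(g)
    x = proj_ball(x - eta * g, M / 2)    # gradient step, then project onto C
```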

No bounds on distance

The above proof is kind of interesting: we used the (trivial) inequality

$$
\|x_{T+1} - x^\star\|^2 \ge 0,
$$

to get a fairly tight bound on the regret. This 'suggests' (to me, at least) that the bound is likely somewhat tight in many cases; might it be 'close to tight' in general? By this I mean, is it the case that

$$
\|x_{T+1} - x^\star\|^2 \approx 0,
$$

for some notion of $\approx$?

Certainly, this is true in the case where the $g_t$ are the subgradients of some fixed function evaluated at the $x_t$ (that is, when $g_t \in \partial f(x_t)$ for some $f$) and $x^\star$ is, say, the unique minimizer of $f$. Might it be true in the more general scenario where the $g_t$ are adversarial?

Unfortunately the answer is no.

Indeed, we can show that there is a (silly) set of gradients $g_t$ such that the distance between the last iterate (or, really, any iterate, or any convex combination of the iterates, such as the average) and the $x^\star$ that maximizes the regret is the diameter of the set.

Let $C$ be any compact convex set with diameter $M$; then there are points $x_1$ and $y$ in $C$ such that $\|x_1 - y\| = M$, and we take $x_1$ to be the starting iterate. Now, let $g_t = 0$ for $t=1, \dots, T-1$ and $g_T = \alpha(x_1 - y)$, where $\alpha = L/\|x_1 - y\|$, which we can do as adversaries. Then the $x^\star$ that maximizes the regret is the one that solves

$$
\min_{z \in C} \left(\sum_{t=1}^T g_t^T z\right) = \alpha \min_{z \in C} \left((x_1 - y)^T z\right),
$$

with minimizer $x^\star = y$. (Any $z \in C$ with $(x_1 - y)^T(z - y) < 0$ would satisfy $\|x_1 - z\| > M$, contradicting the diameter bound, so $y$ is indeed a minimizer.) Since $x_T = x_1$, it's game over, as $\|x_T - x^\star\| = \|x_1 - y\| = M$. It's clear that any convex (indeed, affine) combination of the $x_t$ is always equal to $x_1$, so, unlike in the stochastic case, averaging does not help either.
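
Here is this construction in code, taking $C$ to be a Euclidean ball of diameter $M$ (radius $M/2$) with $x_1$ and $y$ at opposite ends of a diameter; the choice of ball and all names are, again, just for illustration.

```python
# The adversarial construction: zero gradients until the last round, then
# one gradient of norm L pointing from x_1 toward y.  C is taken to be the
# Euclidean ball of radius M/2 centered at 0 (an illustrative choice).
import numpy as np

n, T, M, L = 5, 1000, 2.0, 1.0
e1 = np.zeros(n); e1[0] = 1.0

x1 = (M / 2) * e1             # starting iterate
y = -(M / 2) * e1             # the point at distance M from x_1
g_last = (L / M) * (x1 - y)   # alpha = L / ||x_1 - y||, so ||g_T|| = L

# Since g_t = 0 for t < T, every iterate x_1, ..., x_T equals x_1.
x_T = x1

# The regret-maximizing comparator minimizes g_T^T z over the ball, i.e.,
# the point of C furthest in the direction -g_T, which is exactly y.
x_star = -(M / 2) * g_last / np.linalg.norm(g_last)

regret = g_last @ (x_T - x_star)     # equals M * L, well below M*L*sqrt(T)
print(regret, M * L * np.sqrt(T))
print(np.linalg.norm(x_T - x_star))  # equals M, the diameter of C
```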

Linearity is hard

The main problem with the above is that (a) we have to choose a move before the last gradient is revealed to us, and (b) the minimizers (or maximizers) of linear functions are extremely sensitive to the parameters of the problem. (For instance, over the unit ball, the minimizer of $c^Tz$ is $-c/\|c\|$, so an arbitrarily small perturbation of $c$ near $0$ can move the minimizer anywhere on the sphere.) This means that even a very small change in the gradients can cause drastic movements in $x^\star$, which prevents us from getting any control over where it ends up.

One 'simple-ish' technique for getting rid of this sensitivity is to assume that the functions $f_t$ are strongly convex. It is not quite 100% clear to me if, in this case, the last iterate (or, say, the average iterate) is close to the unique minimizer $x^\star$, but it does seem likely.

I unfortunately cannot find an easy source for this on my (admittedly cursory) glance, but if there is one, that would be great!
