Funnily enough, note that this bound is actually tighter than the previous one by a factor of 2, even though the previous proof relied on an exact rewriting.
Another nice point is that this proof immediately carries over to the projected case, unlike the previous one. Say the optimal point $x^\star$ lies in some closed, convex set $C$ with diameter at most $D$; then let $\Pi(x)$ denote the projection of $x$ onto $C$, defined

$$
\Pi(x) = \operatorname*{argmin}_{y \in C} \|y - x\|_2.
$$
(We can say the projection, since this is unique from the convexity of $C$.) A fun exercise is to note that, for any two points $x$ and $y$, the projection satisfies

$$
\|\Pi(x) - \Pi(y)\|_2 \le \|x - y\|_2.
$$
In other words, the projection is nonexpansive.
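As a quick numerical sanity check of nonexpansiveness (not a proof, and the example setup here is mine), take $C$ to be a Euclidean ball, where the projection has a closed form:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    # Euclidean projection onto C = {y : ||y||_2 <= radius}.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

rng = np.random.default_rng(0)
for _ in range(10_000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # ||Pi(x) - Pi(y)|| <= ||x - y||, up to a little floating-point slack.
    assert np.linalg.norm(proj_ball(x) - proj_ball(y)) <= np.linalg.norm(x - y) + 1e-12
```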
Anyways, in this case, note that the gradient update step is given by

$$
x_{t+1} = \Pi(x_t - \eta g_t),
$$
i.e., we project into the set $C$ after every gradient update. The proof, using the fact that $\Pi$ is nonexpansive, is essentially identical to the previous proof since

$$
\|x_{t+1} - x^\star\|_2 = \|\Pi(x_t - \eta g_t) - \Pi(x^\star)\|_2 \le \|x_t - \eta g_t - x^\star\|_2.
$$
(Note the inequality there: it is exactly nonexpansiveness, after writing $x^\star = \Pi(x^\star)$, which holds since $x^\star \in C$.) The rest of the proof proceeds identically to the previous one, except instead of exact equalities, we have inequalities coming from the nonexpansiveness property of $\Pi$.
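In code, the whole method is a few lines; here is a minimal sketch (the names, like `grad_oracle`, are mine, and the point is that the adversary gets to pick $g_t$ only after seeing $x_t$):

```python
import numpy as np

def projected_ogd(grad_oracle, proj, x1, eta, T):
    # Projected online gradient descent: x_{t+1} = Pi(x_t - eta * g_t),
    # where g_t is revealed only after we commit to the move x_t.
    x, xs = x1, [x1]
    for t in range(T):
        g = grad_oracle(t, x)  # adversary sees x_t, reveals g_t
        x = proj(x - eta * g)  # gradient step, then project back onto C
        xs.append(x)
    return xs

# Example: C is the unit ball, and the adversary always plays g_t = x_t.
proj = lambda z: z if np.linalg.norm(z) <= 1 else z / np.linalg.norm(z)
xs = projected_ogd(lambda t, x: x, proj, np.ones(2), eta=0.1, T=50)
```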
The above proof is kind of interesting: we used the (trivial) inequality

$$
\|x_{T+1} - x^\star\|_2^2 \ge 0
$$
to get a fairly tight bound on the regret. This 'suggests' (to me, at least) that this bound is likely somewhat tight in many cases; might it be 'close to tight' in general? By this I mean, is it the case that

$$
x_T \approx x^\star,
$$

for some notion of $\approx$?
Certainly, this is true in the case where the $g_t$ are the subgradients of some fixed function $f$ evaluated at the iterates $x_t$ (that is, when $g_t \in \partial f(x_t)$ for some convex $f$) and $x^\star$ is, say, the unique minimizer of $f$. Might it be true in the more general scenario where the $g_t$ are adversarial?
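(As a quick aside before answering: the fixed-function case is easy to check numerically. With the toy choice $f(x) = \frac{1}{2}\|x - a\|_2^2$, picked by me purely for illustration, whose unique minimizer is $x^\star = a$, the last iterate converges geometrically.)

```python
import numpy as np

# Every g_t is the gradient of the single fixed function
# f(x) = 0.5 * ||x - a||_2^2, whose unique minimizer is x_star = a.
a = np.array([1.0, -2.0, 0.5])
x, eta = np.zeros(3), 0.1
for t in range(500):
    g = x - a        # gradient of f at the current iterate x_t
    x = x - eta * g
print(np.linalg.norm(x - a))  # ~3e-23: the last iterate is essentially x_star
```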
Unfortunately the answer is no.
Indeed, we can show that there is a (silly) set of gradients such that the distance between the last iterate $x_T$ (or, really, any iterate, or any convex combination of the iterates, such as the average) and the $x^\star$ that maximizes the regret is the diameter of the set.
Let $C$ be any compact convex set with diameter $D$; then there are points $x, x' \in C$ such that $\|x - x'\|_2 = D$. Now, let $g_t = 0$ for $t = 1, \dots, T-1$ and $g_T = x - x'$, where $x_1 = x$, which we can do as adversaries. Since the regret is then just $\sum_t g_t^T(x_t - u) = g_T^T(x_T - u)$, the $x^\star$ that maximizes regret is the one that solves

$$
\text{minimize} \quad (x - x')^T u \quad \text{subject to} \quad u \in C,
$$

with minimizer $x^\star = x'$. (To see this, note that $2(x - x')^T(u - x') = \|x - x'\|_2^2 + \|u - x'\|_2^2 - \|x - u\|_2^2 \ge \|u - x'\|_2^2 \ge 0$ for any $u \in C$, since $\|x - u\|_2 \le D = \|x - x'\|_2$.)
with minimizer . Since then it's game over since . It's clear that any convex (indeed, affine) combination of the is always equal to , so, unlike in the stochastic case, this does not help either.
The main problem with the above is that (a) we have to choose a move before the last gradient is revealed to us, and (b) the maximizers of linear functions are extremely sensitive to the parameters of the problem. This means that even a very small change in the gradients can cause drastic movements in $x^\star$, which prevents us from getting any control over the distance between the iterates and $x^\star$.
One 'simple-ish' technique for getting rid of this sensitivity is to assume that the functions $f_t$ are strongly convex. It is not quite 100% clear to me whether, in this case, the last iterate (or, say, the average iterate) is close to the unique minimizer $x^\star$ of $\sum_t f_t$, but it does seem likely.
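As a toy illustration of why strong convexity should at least kill the sensitivity in (b) (my own back-of-the-envelope check, not a proof of the last-iterate claim), add a strongly convex term to each loss in the construction above, say $f_t(u) = g_t u + (m/2)u^2$; the comparator minimizing $\sum_t f_t$ then moves by only $D/(Tm)$ when the adversary plays the same trick:

```python
import numpy as np

# f_t(u) = g_t * u + (m / 2) * u**2, so sum_t f_t is minimized at
# u_star = -sum(g) / (T * m), which is 1/(T*m)-Lipschitz in each g_t.
m, T, D = 1.0, 100, 1.0
g = np.zeros(T)
g[-1] = -D                   # same adversary: one nonzero, last-minute gradient
u_star = -g.sum() / (T * m)  # comparator for the strongly convex losses
print(u_star)                # 0.01 = D / (T * m): barely moves, unlike before
```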
I unfortunately cannot find an easy source for this from my (admittedly cursory) glance at the literature, but if there is one, that would be great!