This operator is called the 'shrinkage' operator because of its action on its input: if $|x_i|$ is greater than our given $\lambda$, then we shrink it by that amount. Note, then, that successively applying (in the same manner as SGD) the update rule

$$
x^{k+1} = \mathrm{prox}_{\lambda\|\cdot\|_1}\left(x^k\right)
$$

correctly yields the minimum of the given convex function, i.e. $x = 0$. Of course, this isn't particularly surprising, since we already know how to optimize the $\ell_1$-norm function (just set everything to zero!), but it will help out quite a bit when considering more complicated functions.
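To make this concrete, here's a minimal numpy sketch of the shrinkage step (the name `shrink` is mine), along with the repeated application described above:

```python
import numpy as np

def shrink(x, lam):
    """Shrinkage (soft-thresholding): the prox operator of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -1.5, 0.2])
for _ in range(10):
    x = shrink(x, 0.5)  # each entry moves toward 0 by at most 0.5 per step
print(x)  # [0. 0. 0.], the minimizer of the l1-norm
```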
Now, given a problem of the form

$$
\text{minimize}\quad f(x) + g(x),
$$

where $f$ is differentiable and $g$ is convex, we can write down a possible update in the same vein as the above, except that we now also make progress on both $f$ and $g$ at the same time:

$$
x^{k+1} = \mathrm{prox}_{t_k g}\left(x^k - t_k \nabla f(x^k)\right).
$$

Here, $t_k$ is defined to be the step size for step $k$. It turns out we can prove several things about this update but, perhaps most importantly, we can show that it works.
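As a sketch, the generic iteration might look like the following (with hypothetical oracle arguments `grad_f` and `prox_g` standing in for whatever gradient and prox computations your particular $f$ and $g$ admit):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, t=0.1, iters=500):
    """Minimize f(x) + g(x): a gradient step on f, then a prox step on g.

    prox_g(v, t) should return argmin_x g(x) + ||x - v||^2 / (2 t).
    """
    x = x0
    for _ in range(iters):
        x = prox_g(x - t * grad_f(x), t)
    return x

# e.g. the lasso, with f(x) = ||Ax - b||^2 / 2 and g(x) = lam * ||x||_1:
A, b, lam = np.random.randn(30, 10), np.random.randn(30), 0.1
x_hat = proximal_gradient(
    grad_f=lambda x: A.T @ (A @ x - b),
    prox_g=lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0),
    x0=np.zeros(10),
    t=1.0 / np.linalg.norm(A.T @ A, 2),  # 1 / L, with L the Lipschitz constant of grad_f
)
```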
Anyways, this is all I'll say about the proximal gradient step, as there are several good resources on the proximal gradient method around that will do a much better job of explaining it than I probably ever will: see this, for example.
I assume some familiarity with SVMs, but the program given might require a bit of explanation. The idea of an SVM is as a soft-margin classifier (there are hard-margin SVMs, but we'll consider the former variety for now): we penalize the error of being on the wrong side of the decision boundary linearly, with zero penalty for being on the correct side of the decision boundary. The only additional thing is that we also penalize the size of the margin itself, so that it doesn't depend overly on any particular variable (e.g. as a form of regularization).
The usual quadratic program for an SVM is given below, where the $\xi_i$ are the slack variables indicating how much the given margin is violated, $C > 0$ is some arbitrary positive constant, and $w$ is the hyperplane and constant offset found by the SVM (e.g. by allowing the first feature of a positive/negative sample to be $1$):

$$
\begin{array}{ll}
\text{minimize} & \|w\|_2 + C\,\mathbf{1}^T\xi \\
\text{subject to} & w^Tx_i \ge 1 - \xi_i \quad \text{for each positive example } x_i \\
& w^Tx_i \le -(1 - \xi_i) \quad \text{for each negative example } x_i \\
& \xi \ge 0.
\end{array}
$$

We can rewrite this immediately, given that the class our data point $x_i$ corresponds to is $y_i \in \{-1, +1\}$, to a much nicer form

$$
\begin{array}{ll}
\text{minimize} & \|w\|_2 + C\,\mathbf{1}^T\xi \\
\text{subject to} & y_i(w^Tx_i) \ge 1 - \xi_i \\
& \xi \ge 0,
\end{array}
$$

and, noting that the objective is homogeneous of degree one, we can just multiply the constraints through by $-1$, which yields the immediate result (after flipping some signs and inequalities)

$$
\begin{array}{ll}
\text{minimize} & \|w\|_2 + C\,\mathbf{1}^T\xi \\
\text{subject to} & 1 - \xi_i - y_i(w^Tx_i) \le 0 \\
& -\xi_i \le 0,
\end{array}
$$

which, after changing the regularizer to an $\ell_1$-norm regularizer (which is equivalent for an appropriate regularization hyperparameter, say $\lambda$ [1]), yields

$$
\begin{array}{ll}
\text{minimize} & \lambda\|w\|_1 + \mathbf{1}^T\xi \\
\text{subject to} & 1 - \xi_i - y_i(w^Tx_i) \le 0 \\
& -\xi_i \le 0.
\end{array}
$$
This is the final program we care about, and the one we have to solve using our proximal gradient operator. In general, it's not obvious how to fit the inequality constraints into a proximal gradient step, so we have to define a few more things.
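As a quick sanity check, this final program can also be handed straight to a generic solver; here's a sketch using cvxpy on made-up data (the data and $\lambda$ are arbitrary), which is handy as a reference solution to compare against later:

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 3)  # made-up data; the first feature could be the constant 1
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))

lam = 0.1
w, xi = cp.Variable(3), cp.Variable(20)
cp.Problem(
    cp.Minimize(lam * cp.norm1(w) + cp.sum(xi)),
    [1 - xi - cp.multiply(y, X @ w) <= 0, xi >= 0],
).solve()
print(w.value)
```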
For now, let's define the set indicator function

$$
I_S(x) = \begin{cases} 0 & x \in S \\ +\infty & \text{otherwise}, \end{cases}
$$

which is convex whenever $S$ is convex; we can use this to encode the above constraints (I drop the indices for convenience in what follows) such that a program equivalent to the above is

$$
\text{minimize}\quad \lambda\|w\|_1 + \mathbf{1}^T\xi + I_S(w, \xi),
$$

where $S$ is the feasible set of the previous program. This is exactly what we wanted! Why? Well:

$$
\min_{w,\,\xi}\ \left(\lambda\|w\|_1 + \mathbf{1}^T\xi + I_S(w, \xi)\right) = \min_{(w,\,\xi)\,\in\, S}\ \left(\lambda\|w\|_1 + \mathbf{1}^T\xi\right),
$$

since any point outside of $S$ picks up an infinite penalty and so can never be a minimizer. So now, we just need to find the proximal operator for $I_S$ (which is not as nice as one would immediately think, but it's not bad!).
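(In code, the indicator is nothing fancy; here's a sketch, with `in_S` a hypothetical feasibility check for our SVM's constraint set:)

```python
import numpy as np

def indicator(x_in_S):
    """I_S(x): zero inside the set, +infinity outside."""
    return 0.0 if x_in_S else np.inf

# feasibility check for the SVM's constraint set, given data (X, y):
def in_S(w, xi, X, y):
    return np.all(1 - xi - y * (X @ w) <= 0) and np.all(xi >= 0)
```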
Now, let's generalize the problem a bit: we're tasked with the question of finding the proximal operator of $I_S$ such that $S$ is given by some set of inequalities, $S = \{x \mid Ax \le b\}$, for some given $A$ and $b$.[2] That is, we require

$$
\mathrm{prox}_{I_S}(v) = \operatorname*{argmin}_x\ \left(I_S(x) + \tfrac{1}{2}\|x - v\|_2^2\right),
$$

which can be rewritten as the equivalent program (where the factor of $\tfrac{1}{2}$ is dropped, since it's just a proportionality constant)

$$
\begin{array}{ll}
\text{minimize} & \|x - v\|_2^2 \\
\text{subject to} & Ax \le b.
\end{array}
$$
It turns out this program isn't nicely solvable using the prox-gradient operator (since there's no obvious way of projecting onto $S$ and also minimizing the quadratic objective). But, of course, I wouldn't be writing this if there weren't a cute trick or two we could do: note that this program has a strong dual (i.e. the optimal values of the dual program and the primal program are equal) by Slater's condition, so how about trying to solve the dual program? The Lagrangian is

$$
L(x, \eta) = \|x - v\|_2^2 + \eta^T(Ax - b),
$$

with dual variable $\eta \ge 0$, from which we can derive the dual by taking derivatives over $x$:

$$
\nabla_x L(x, \eta) = 2(x - v) + A^T\eta = 0 \quad\Longleftrightarrow\quad x = v - \tfrac{1}{2}A^T\eta,
$$

and plugging in the above (and simplifying) yields the program

$$
\begin{array}{ll}
\text{maximize} & -\tfrac{1}{4}\|A^T\eta\|_2^2 + \eta^T(Av - b) \\
\text{subject to} & \eta \ge 0
\end{array}
$$

(or, equivalently, minimizing the negative of that objective subject to $\eta \ge 0$), from which we can reconstruct the original solution by the above, given the optimal dual variable $\eta^\star$:

$$
x^\star = v - \tfrac{1}{2}A^T\eta^\star.
$$

This program now has a nice prox step, since $\mathrm{prox}_{I_{\{\eta \ge 0\}}}(\eta) = \max(\eta, 0)$, taken elementwise (the 'positive part' of $\eta$, in other words). This latter claim is left as an exercise for the reader.
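In code, the whole dual trick might look something like the sketch below (the function name and step-size choice are mine; we run the prox-gradient iteration on the negated dual objective, using the $\max(\eta, 0)$ prox step above):

```python
import numpy as np

def project_polyhedron(A, b, v, iters=2000):
    """Project v onto {x : Ax <= b} by prox-gradient on the dual.

    We minimize (1/4) ||A^T eta||^2 - eta^T (A v - b) over eta >= 0;
    the prox step for the constraint eta >= 0 is eta -> max(eta, 0).
    """
    eta = np.zeros(A.shape[0])
    t = 2.0 / np.linalg.norm(A @ A.T, 2)  # 1 / L for the smooth dual term
    for _ in range(iters):
        grad = 0.5 * (A @ (A.T @ eta)) - (A @ v - b)
        eta = np.maximum(eta - t * grad, 0.0)  # gradient step, then prox step
    return v - 0.5 * A.T @ eta  # reconstruct the primal solution

# sanity check: projecting onto the nonnegative orthant, {x : -x <= 0},
# project_polyhedron(-np.eye(3), np.zeros(3), np.array([1.0, -2.0, 0.5]))
# should return roughly [1, 0, 0.5]
```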
Putting the above together yields a complete way of optimizing the SVM program: first, take a single proximal step on the initial objective; then find the minimum-distance projection onto the polyhedron given by the inequality constraints, using the second method; then take a new step on the original program, and repeat.
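As a sketch of the full pipeline, reusing `shrink` and `project_polyhedron` from above (this is a fairly literal transcription of the procedure just described, so treat it as illustrative rather than as a tuned implementation):

```python
import numpy as np

def svm_prox(X, y, lam=0.1, t=0.01, iters=500):
    """Train the l1-regularized SVM program by alternating a prox step on
    the objective with a projection onto the feasible polyhedron."""
    n, d = X.shape
    # stack the constraints as A z <= b, with z = (w, xi):
    #   1 - xi_i - y_i w^T x_i <= 0   and   -xi_i <= 0
    A = np.vstack([
        np.hstack([-y[:, None] * X, -np.eye(n)]),
        np.hstack([np.zeros((n, d)), -np.eye(n)]),
    ])
    b = np.concatenate([-np.ones(n), np.zeros(n)])
    w, xi = np.zeros(d), np.ones(n)
    for _ in range(iters):
        w = shrink(w, t * lam)  # prox step on lam * ||w||_1
        xi = xi - t             # gradient step on 1^T xi
        z = project_polyhedron(A, b, np.concatenate([w, xi]))
        w, xi = z[:d], z[d:]
    return w
```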
Of course, this serves as little more than an academic exercise (I'm not sure how thrilled either Boyd or Lall would be at using dual programs in 104, even in a quasi-black-box optimizer), but it may be of interest to people who take interest in relatively uninteresting things (e.g. me) or to people who happen to have really freaking fast proximal gradient optimizers (I'm not sure these exist, but we can pretend).
[1] We're also finding this hyperparameter by cross-validation, rather than a priori, so it doesn't matter too much.

[2] Note that this is equivalent to the original problem, actually. The forwards direction is an immediate generalization. The backwards direction is a little more difficult, but can be made simple by considering, say, only positive examples.