Instructor (Stephen Boyd): Hey, I guess we'll start. Let me just make a couple of announcements. I guess we've combined the rewrite of the preliminary project proposal with the mid-term project review, and I think that's due this Friday. Good, okay, it's due Friday. So that's going to be that. You're welcome to show me or the TAs a draft if you want a format scan. The TAs are now as nasty as I am, and they can scan something, just walk down it and say, 'That's a typesetting error. I'm not going to read any farther.' Or they'll read down a half-page and say, 'What are the variables?' and then throw it at you with a sneer.
So I've trained them pretty well. The danger, of course, is that they start applying that to the stuff I write, which has happened in the past. They say things like, 'This isn't consistent. You use this phrase on this page and this one on the other one.' And you look at the two, and you say, 'Yeah, that's true, all right.' The executive decision was made: we're going to switch around.
It's not the natural order of the class, in fact, but it fits better with people who are doing projects. A bunch of people are doing projects that involve non-convex problems, so today we switched around: we're going to do sequential convex programming first, and then we'll switch back to problems that are convex and things like that. We'll do large-scale stuff next.
The good news is that we can actually pull this off, I think, in one day. We're going to come back later; there'll be several other lectures on problems that are not convex and various methods. There's going to be a problem on relaxations. We're going to have a whole study of L1-type methods for sparse solutions. Those will come later. But this is really our first foray outside of convex optimization.
So it's a very simple, basic method. There's very, very little you can say about it theoretically, but that's fine. It's something that works quite well. Don't confuse it with — although it's related to — something called sequential quadratic programming. That's something you'll hear about a lot if you go to Google and things like that. Those are things that would come up. But I just wanted to collect together some of the topics that come up.
Okay, so let's do sequential convex programming. Let's see here. There we go. Okay, so I guess it's been sort of implicit for the entire last quarter and this one: the whole point of using convex optimization methods on convex problems is that you always get the global solution — I mean, up to numerical accuracy. So modulo numerical accuracy, you always get the global solution, and it's always fast. And that's fast according to theorists: if you want a complexity theory, there are bounds that grow like polynomials in various things — problem size, log(1/ε), where ε is the accuracy. And in practice as well, these methods are very fast and extremely reliable.
Okay, now for general non-convex problems, you really have to give up one of these: either you give up 'always global' or you give up 'always fast.' That's the big bifurcation. I mean, there are things in between, but roughly, this is the idea. On one side you have local optimization methods. These are methods that are fast — plenty fast, just as fast as convex optimization methods — but they relax what they mean by a solution. They find some kind of a local solution to a problem. It might not be the global one — it can well be the global one, but you won't know it. That's local optimization.
At the other extreme, you have global optimization methods, and we'll talk about these later in the class. These actually find the global solution, and they certify it. So when it stops, it basically says, 'I have a point with this objective value, and I can prove that I'm not more than 1e-3 away from the optimal value, period.' Right? Now, in convex optimization, we do that very easily with a duality certificate.
So you take a dual feasible point. When you finish your problem, you say, 'Here's this point. It's feasible, and it achieves this objective value. Here's a dual feasible point, which proves a lower bound that is epsilon away from the objective of your feasible point,' and that's how you prove it. For non-convex problems, the proofs — the certificates, as we'll see — are bigger and longer. We'll see what they are.
Okay, now what we'll talk about here is a specific class of local optimization methods, based on solving a sequence of convex programs. The semantics of everything we're going to do today has to be understood: we're not actually solving any problems at all when we 'solve' a problem. I'll stop saying quote, unquote, but you'll hear it — I'll aspirate the q and there'll be an air puff after the t, and that's the quotes.
So it has to be understood: when we say we 'solve' this problem or 'here's the result we get,' these are not the solutions of the problems, as far as we know. They might be, and they are pretty good. But it's to be understood that we are now in the world of non-convex optimization, and there may be bounds you get, but that's a separate story.
Sequential convex programming goes like this. It's going to solve a sequence of convex problems. And in fact it fits in this big hierarchy, where if someone says to you, 'How do you solve this convex problem?' it's actually a sequence of reductions, and the reductions go like this. I just want to put this whole thing in the main hierarchy.
The reduction goes like this. Someone says, 'Well, how do you solve this problem with inequality constraints and all that kind of stuff?' And you say, 'No problem. I use a barrier method, which basically reduces the solution of that problem with constraints to a smooth convex minimization problem.' And someone says, 'Well, yeah, how many of those do you have to solve?' And you say, '20.' And they say, 'Really, 20?' And you go, 'Okay, sometimes 40.' That's the right way to say it.
Then it's, 'How do you solve a smooth minimization problem?' And you say, 'Well, I solve each of those by solving a sequence of quadratic convex problems.' That's exactly what Newton's method is. And someone says, 'Yeah, how many of those do you have to solve?' And you go, 'I don't know, 5 each time,' or something roughly. And then they say, 'How do you solve a quadratic minimization problem with equality constraints?' The answer is, 'I solve linear equations.' If they keep going, then you can explain all the methods for exploiting structure in linear equations.
So anyway, you can see that each level is based on reducing the solution of that problem to solving a sequence of the ones below it. And ultimately, it all comes down to linear equations. In fact, if you profile any code — almost any code — all it's doing is solving linear equations, nothing else. Okay? People who work on numerical linear algebra are always very happy to point that out.
Okay, so this fits on top of that hierarchy. You have a problem that's not convex and you want to 'solve' it, and you do that by reducing it to a sequence of convex programs. Now the advantage of this is that the convex portions of the problem are handled exactly and efficiently, because in a lot of problems — and we'll see just from our baby examples — a lot of the problem is convex. Huge parts of the objective are convex, lots of constraints — just very typical constraints, like upper and lower bounds on variables. So a lot of these are going to be convex.
And those are going to be handled exactly. That's the good news. On the other hand, I've already said this a couple of times: we have to understand, this is a heuristic. It can fail to find even a feasible point, even if one exists. And even if it finds a point and does something, there's absolutely no reason to believe it's the global optimum. I think I've said that enough times, but in the context of this course, it's worth repeating a bunch of times.
Now, in fact the results also can and often do depend on the starting point. You can look at that from the glass-half-empty or glass-half-full approach. The half-empty approach says, 'Well, it shows you this is all stupid heuristics. If it depends on where you started from, then why should I trust you?' I sort of agree with that. Here's the half-full approach: 'Oh, no problem. That's fantastic. We'll run the algorithm many times from different starting points, and we'll see where it converges to.'
If it always converges to the same point and the person we're defending this to isn't too snappy, we can say, 'Oh, we think this is the global solution, because I started my method from many different points, and it always came to the same thing.' If you're an optimist again, and it converges to many different points, you say, 'No problem. I'll run it 10 times. I'll take the best solution found in my 10 runs. So you see, it's an advantage — it's a feature that you do this.'
Now actually, these methods often work really well. That means they find a feasible point with a good — if not optimal; actually often it is optimal, you just don't know it — objective value. That's the background. So we're going to consider a non-convex problem: a standard optimization problem, where you can have equality constraints that are not affine, and you can have non-convex inequality constraints and so on. And the basic idea of sequential convex programming is extremely simple.
It goes like this. At each step, you're going to maintain an estimate of the solution. We're going to call that x^k, where the superscript k denotes the iteration counter. And you're going to maintain something called a 'trust region.' A trust region — I'll get to what it is in a minute, or we'll see how it acts — is basically a little region that includes x^k. So it surrounds x^k, and it's the region over which you propose to find a slightly better solution. That's the idea.
So if the trust region is small, it says you're only looking very locally, and x^(k+1) is going to be local. If it's bigger, then it means you're willing to take bigger steps. So here's what you do. Given a trust region, you ask your objective and each inequality constraint function to form a convex approximation of f_i over the trust region. And we'll see there are going to be lots of ways to do that, some simple, some complex, and so on. Then for the equality constraints, you ask each possibly non-affine constraint function to generate an affine approximation of itself.
Now obviously, if the objective or a bunch of the inequalities are convex, a perfectly good approximation of a convex function is itself — forming a convex approximation of a convex function is very easy, and the same for affine. So I simply replace the objective, the inequality constraint functions, and the equality constraint functions with their appropriate approximations, I impose this last condition, which is the trust region constraint, and now I solve this problem. That's a convex problem. So that's the idea.
And the trust region is not a real constraint in the problem, obviously, because T^k is made up by us. It's here to make sure that these approximations are still, roughly speaking, valid. That's the idea. By the way, this is really a set of ideas — the algorithm we show here is simplified. When you actually do these things, depending on what your particular problem is, you'll switch all sorts of things around: different approximations, different trust regions. It all depends on the problem. So it's really more of an idea.
A typical trust region would be a box. Now, if a variable appears only in convex inequalities and affine equalities, then you can take that ρ_i to infinity, because you don't need to limit a variable that appears only in convex inequalities and affine equalities. (A convex equality — by that, of course, I mean an affine equality.)
How do you get these approximations? Well, let's see. The most obvious is to go back to the 18th century and look at things like Taylor expansions. In fact, that's all of calculus. That's what calculus is — nothing but an answer to the question, 'How do you get an affine approximation of a function in an organized way?' And the answer is you torture children for 14 years memorizing stupid formulas, centuries after anyone has remembered what any of it is for. So that's the answer to that question, how do you get that approximation.
So here's your basic 18th-century approximation. If you want a convex approximation, then there's something interesting you can do here. The second-order Taylor expansion would have the Hessian right here — that's this. But if the Hessian is indefinite and you want a convex approximation, you can simply drop the negative part of the matrix. By the way, how do you get the positive and negative parts of a matrix? It's an expansion — like the SVD, but not quite; it's the eigenvalue expansion. You take the eigenvalue expansion, you set the negative eigenvalues to 0, and that's the projection onto the PSD cone — that's what the positive part is.
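As a concrete sketch of that projection (numpy only; the particular indefinite matrix here is just a made-up example):

```python
import numpy as np

def psd_part(P):
    """Projection of a symmetric matrix onto the PSD cone:
    eigendecompose, then zero out the negative eigenvalues."""
    w, V = np.linalg.eigh(P)                 # eigenvalues w, orthonormal eigenvectors V
    return (V * np.maximum(w, 0.0)) @ V.T    # rebuild with negative eigenvalues set to 0

P = np.array([[1.0, 2.0],
              [2.0, -3.0]])                  # symmetric but indefinite
P_plus = psd_part(P)                         # PSD; equals P when P is already PSD
```

If P is already positive semidefinite, `psd_part` returns it unchanged, which is exactly the "a convex function approximates itself" remark above.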
Now, these are very interesting: these are local approximations. That's the whole point of calculus. And they don't depend on the trust region radii. But calculus is vague, and it works like this. If you say, 'Yeah, how good an approximation is this?' you say, 'Oh, it's very good.' And you say, 'Well, what does that mean?' Well, very good means if the step is small, then the error is on the order of small squared. You say, 'Yeah, but what constant in front of the small?' You say, 'I don't know. It depends on the problem.' It's vague. It basically has to do with limits and all that kind of stuff.
Actually, I'll show you something more useful, and this is more modern. In my opinion, this is the correct way — not to knock calculus or anything. The reason I like to emphasize all this is that all of you have been brainwashed for so long, especially when you were very young, calculating derivatives. So if someone says 'affine approximation,' what comes out comes from the little part of your brain right in the center that controls breathing and stuff like that —
depending on when and how long you were tortured learning how to differentiate t² sin t. Okay, so that's why I like to emphasize that there are new methods. They're relatively modern. They came up first in estimation and other areas. They're called particle methods, for a bunch of reasons — because they came up first in estimation. I'll show you what they are.
Here's a particle method. This is the modern way to form an affine or a convex approximation, and it goes like this. You choose some points in the trust region. There are problems with this, and I'll talk about them in a minute — I'm not saying throw the calculus out. There's a big literature on choosing the points; actually it's not even a literature, because they don't really say anything. It's a lore, a craft, actually. I've heard people say things like, 'What'd you use?' 'Oh, I used the sigma points or something.' What's a sigma point? It's some heuristic for determining some points.
Anyway, here are the types of things you could use. You could use all vertices — that depends on the dimension of the problem; if the dimension of the problem is like 5, all vertices is 32 points. You could do the center and all vertices and stuff, because you want to sample the function in a little box. This actually works much better when each function, even if the total number of variables is large, only involves a couple of variables. That's what you're hoping for.
Otherwise, you can do some vertices, you can use a grid, you can generate random points. There's a whole lore of this. And you simply evaluate your function, and now you have a pile of data: the argument and the value of the function. And what you do is you fit this pile of data with a convex or affine function, depending on what was asked. So that's the picture.
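A minimal numpy sketch of the particle idea, with a made-up black-box function, random points in the trust region, and a plain least-squares affine fit (the function, sample count, and sampling scheme are all illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # stand-in for a black-box function we can only evaluate
    return np.sin(3.0 * x[0]) + x[1] ** 2

def fit_affine(f, x0, rho, m=100):
    """Fit f(x) ~ a @ x + b over the box {x : |x - x0|_inf <= rho}
    by least squares on m sampled 'particles'."""
    X = x0 + rng.uniform(-rho, rho, size=(m, x0.size))   # particles in the trust region
    y = np.array([f(x) for x in X])
    A = np.column_stack([X, np.ones(m)])                  # columns: x, then a constant
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]                            # slope a, offset b

x0 = np.array([0.5, 0.5])
a, b = fit_affine(f, x0, rho=0.1)
# note: the fit need not pass exactly through (x0, f(x0)) -- that's the point
```

This is a regional model in the sense described below: the fit depends on ρ, and it is not forced to interpolate the current point.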
Of course, it's going to come up: the fitting itself is a convex optimization problem as well, not surprisingly. There are a lot of advantages to this. One is that it actually works pretty well with non-differentiable functions, or functions for which evaluating the derivatives is difficult.
So for example, suppose you want to do an optimal control problem with some vehicle — some air vehicle or whatever — and someone says, 'Here's my simulator,' and it's some giant pile of code. I guarantee you it will have all sorts of look-up tables, horrible polynomial approximations, and things like that. You'll look deep, deep, deep into this differential equation, and no one will even know what it is: some function that calculates the lift or the drag as a function of angle of attack and dynamic pressure or something like that, and it's going to be a look-up table obtained from wind tunnel tests. That's what it's going to be.
If it's anything else, it's because someone else fit it for you. But you shouldn't let someone else fit something for you, because they didn't care about convexity, and you do. That's why you shouldn't just let them fit things. The joke is, actually, a lot of times people fit things and by accident the fits are convex. That often happens.
So this works really well, especially for functions for which evaluating derivatives is difficult, or which are given by look-up tables and all that kind of stuff. Now, when you get a model this way, it's actually very interesting. You shouldn't call it a local model. A local model basically refers to calculus. If someone says, 'Here's my local model,' and you say, 'Well, how accurate is it?' they can say, 'Extremely accurate, provided you're near the point around which it's developed.' So it's kind of a vague answer.
This one is also not a global model. A global model says, 'No, that is the power. That's an accurate expression for the power over three orders of magnitude. That's the power, I'm telling you.' That's a global model. I call these regional models, because the model depends on the region. So let me just draw some pictures here. We should have done this for the lecture, but oh well — we just wrote these.
Let's draw a picture. Could you raise this just a few inches? I think we'd be better off. Cool, that's good. Even better — that's perfect.
Here's some function that we want to fit — it doesn't really matter what. If you take this point and someone says, 'Make me an affine model,' then of course calculus returns the tangent-line approximation. A regional model would do something different. For a regional model, you have to say what the region is, and the model is going to depend on the region asked for. So if you say, 'Would you mind making me a regional model over — that's too small, sorry — let's make it here to here,' so we want to make a regional model over that range.
I don't know what it's going to be, but it's going to look something like that. Now, here's the cool part: it doesn't even have to go through the point. And you can see that you're going to get much better results with this, in terms of getting a better next point. It also means that the trust regions need to tighten before you can make some claim about high accuracy or whatever. But that's the idea. Is this clear? I think these are totally obvious but really important ideas.
So how do you fit an affine or quadratic function to data? Well, actually, affine is 263 — it's least squares. In the simplest case it's least squares, and I'm not even going to discuss it. By the way, you don't even have to do least squares. Once you know about convex optimization, you can make it anything you like. If you want to do a minimax fit, fine. If you want to allow a few weird outliers, throw in an L1 or Huber fit or something. So you know what to do. If an error of ±0.1 doesn't bother you, but beyond that you start getting irritated, put in a dead zone.
All this goes without saying, so use everything you want to use. I should add that at every iteration, for every function appearing in your problem, you're going to be doing this fitting. So is that going to be slow? Yes, if it's written in MATLAB. But if it's done properly, you're solving extremely small convex problems. If this is done correctly — if it's written in C — these things will just fly. These are sub-millisecond problems. So it's not a problem unless you do it in some very high-level interpreted thing.
Here's an example showing how you fit a convex quadratic. You would do it this way. It's an SDP, because you have a positive semidefinite constraint here, and this here is, of course, the convex quadratic that you're fitting. Your variables are P, q, and r. These are data, and this objective here is convex quadratic in the variables. So it's a least squares problem with a semidefinite constraint.
Now, another method, which we will see shortly, is quasi-linearization. You can certainly say it's a cheap and simple method for affine approximation. To be honest with you, I can't think of anything good about it except that it appeals to people's laziness. We use it later in an example, but here it is. It's kind of dumb. It basically says this: you write your non-affine function h(x) as A(x)x + b(x), where you allow A and b to vary with x. Now that's kind of vacuous, because one example is just to take A to be 0 and say that b is h.
As you can see right out of the gate, this method is not looking too sophisticated — and it's not. But then what you do is you say, 'Well look, if x hasn't changed much, I'll just replace A(x) with its previous value, and b with its previous value.' So this is even dumber than calculus, because this isn't even a local approximation. It's not a good approximation, but it's often good enough.
So here's an example. Here's a quadratic function. I'm going to rewrite it — there are many ways to do this — as A(x)x + b. If I quasi-linearize it, I simply evaluate A at the last iterate. Whereas in the Taylor approximation, the correct way to do this, you take the gradient, and these are not the same if you multiply them out. It's quite weird, but they're not the same.
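A tiny numpy check of that discrepancy, for the illustrative case f(x) = ½ xᵀPx + qᵀx (the particular P, q, and point are made up): quasi-linearization writes f(x) = a(x)ᵀx with a(x) = ½Px + q and freezes a at the previous iterate, while the true gradient is Px + q — the factor of ½ is exactly the difference.

```python
import numpy as np

P = np.array([[2.0, 1.0],
              [1.0, -4.0]])        # symmetric (indefinite, but that doesn't matter here)
q = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ P @ x + q @ x

xk = np.array([1.0, 2.0])          # previous iterate

# quasi-linearization: f(x) = a(x) @ x with a(x) = 0.5 * P @ x + q, a frozen at xk
a_ql = 0.5 * P @ xk + q
ql = lambda x: a_ql @ x

# Taylor: the gradient of f is P @ x + q -- note the missing factor of 1/2
grad = P @ xk + q
taylor = lambda x: f(xk) + grad @ (x - xk)
```

Both models agree with f at the iterate xk itself, yet their slopes differ, so away from xk they make different predictions.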
Let's do an example. Our example is going to be a non-convex quadratic program. We're going to minimize this quadratic over the unit box — by the way, it's a famous NP-hard problem. It doesn't matter; it's a famous hard problem. Here P is symmetric but not positive semidefinite. By the way, if P were positive semidefinite, this would be a convex problem, completely trivial, and you'd get the global solution. So it's only interesting if P has a couple of negative eigenvalues — one will do the trick.
We're going to use the following approximation. This is the Taylor expansion, but we remove the negative part of P, because that would contribute a negative curvature component. So here's an example in R^20. We run SCP — sequential convex programming — where the trust region radius is 0.2. Now, all the action happens in the box ±1: the range of each x_i is from −1 to 1, which is a distance of 2. So this basically says that in each step, you can only travel 10 percent of your total range.
So in the worst case, it will take you at least 10 steps just to move across the range, to move from one corner of the feasible box to the other. You start from 10 different points, and here are the runs. This is versus the iteration number — that's maybe 20 or something. And you can see that by around 25 steps these things have converged, and indeed they converge to very different points.
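Here's a numpy-only sketch of the whole loop on a 2-variable instance (the lecture's example is in R^20 and hands each subproblem to a convex solver; here a brute-force grid search over box ∩ trust region stands in for that solver, and P, q, and the starting point are all made up):

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[1.0, 3.0],
              [3.0, -2.0]])        # symmetric, indefinite -> nonconvex QP
q = np.array([-1.0, 1.0])

def f(x):                          # minimize f over the unit box [-1, 1]^2
    return x @ P @ x + q @ x

w, V = np.linalg.eigh(P)
P_plus = (V * np.maximum(w, 0.0)) @ V.T   # PSD part: negative curvature dropped

def model(x, xk):
    """Convexified Taylor model around xk."""
    d = x - xk
    return f(xk) + (2.0 * P @ xk + q) @ d + d @ P_plus @ d

rho = 0.2                          # trust-region radius: 10% of the range
x = rng.uniform(-1.0, 1.0, 2)     # random starting point in the box
f_start = f(x)
grid = np.linspace(-1.0, 1.0, 41)
for _ in range(40):
    # "solve" the convex subproblem over box ∩ trust region by grid search
    cand = [np.array([u, v]) for u in grid for v in grid
            if np.max(np.abs(np.array([u, v]) - x)) <= rho + 1e-12]
    cand.append(x)                 # keep the current point as a fallback
    x = min(cand, key=lambda z: model(z, x))
```

For this particular quadratic, the convexified model majorizes f and agrees with it at x^k, so each accepted step cannot increase the true objective; that monotonicity is a special feature of this instance, not a general SCP guarantee.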
So here, if we were to run 10 times, that would be the best value we got — it's around −59, nearly −60. That's this line. Now, we don't know anything about where the optimal value is. This lower line is a lower bound; I'll show you how to derive it in just a second, but it's a lower bound from Lagrange duality. So what we can say at this point is that we have absolutely no idea where the optimal value is, except that it's in this interval — between −59 and whatever this is, −66.5. So there's a range in there, and we have absolutely no idea where in it the optimum is. My guess is it's probably closer to this one than that one, but it'll be somewhere in there.
You want a better lower bound? Wait until the end of the class: we'll do branch and bound, and we'll show you how to bring the gap to 0, at the cost of computation — and the computation will be a totally different order of magnitude. It won't be 25 convex programs; it will be 3,000 or something like that. But you'll get the answer, and you'll know it. So that's it.
Let me just show you how this lower bound comes up. You know about this; it's just Lagrange duality. You write the box constraints as x_i² ≤ 1, and you form a Lagrangian, which looks like this. This you can minimize over x — it's a quadratic in x for any λ — and if you minimize it, this is roughly what you get. There are some stupid boundary cases, which don't matter; anyway, you get the same answer. This matrix here has to be positive definite — that's actually not quite true, because it can be positive semidefinite with some other condition holding, but ignore that boundary case. If this matrix has a negative eigenvalue, then for sure the infimum of the quadratic is −∞. That's for sure.
Now, if it's positive definite, you set the gradient equal to 0, you evaluate, and you get this expression here. And this expression is, as it has to be, concave, because we know g is always concave. We know, by the way, that the Lagrange dual of a non-convex problem is convex. Therefore, roughly speaking, it's tractable. You solve it, and you get a lower bound on the original problem. So this is the dual problem. And it's easy to solve: you can convert it to an SDP, or why bother — let CVX do it for you, because that function exists there. It's literally one or two lines of CVX to write this down. And then you get a lower bound.
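A sketch of evaluating that lower bound in numpy for a small made-up instance: for minimize xᵀPx + qᵀx subject to x_i² ≤ 1, the dual function is g(λ) = −¼ qᵀ(P + diag(λ))⁻¹q − 1ᵀλ when P + diag(λ) ≻ 0, and −∞ otherwise. Here λ is just picked by hand rather than maximized with CVX, so the bound is valid but not the best one.

```python
import numpy as np

P = np.array([[1.0, 3.0],
              [3.0, -2.0]])        # indefinite: the primal QP is nonconvex
q = np.array([-1.0, 1.0])

def f(x):
    return x @ P @ x + q @ x

def g(lam):
    """Lagrange dual function of: minimize f(x) s.t. x_i^2 <= 1."""
    M = P + np.diag(lam)
    if np.min(np.linalg.eigvalsh(M)) <= 1e-12:
        return -np.inf             # Lagrangian unbounded below in x
    return -0.25 * q @ np.linalg.solve(M, q) - lam.sum()

lam = np.array([4.0, 4.0])         # any lam with P + diag(lam) > 0 gives a bound
lower_bound = g(lam)               # <= f(x) for every feasible x, by weak duality
```

Maximizing g over λ ⪰ 0 (the one- or two-line CVX problem mentioned above) would give the best such bound.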
And if you try it on this problem instance, you get this number here. In a lot of cases, actually, when these things are used, you don't even attempt to get a lower bound. When you're doing things like sequential convex programming, you're generally doing it in a different context. You don't have a lot of academics around asking you, 'Is that the absolute best you can do? Can you prove it?' and stuff like that. You're just basically saying, 'Can I get a good trajectory? Can I get a good design? Does the image look good?' or whatever.
Okay, that's it. Now you know what sequential convex programming is. But it turns out that to actually make a lot of it work, there are a few details. We're going to go over the level-0 details. There are level-1 ones as well, and level-2, but we'll just look at level 0. These are the ones you're going to run into immediately if you do this stuff.
By the way, how many people here are doing projects that involve non-convexities or something like that? Okay, so a fair number. These are the highest-level issues that you have to know about immediately. The first is this. You can form this convex approximate problem — let's go back to what it actually is; there you go — but what if it's not feasible? Which could happen: you start with a point that's way off, way, way in the wrong neighborhood, just totally off, and you have a small trust region, so you can't move very far. That means this is probably not going to be feasible.
So if you were to run this, then depending on the system — for example CVX — it will simply come back and tell you it's infeasible. Instead of returning you an approximate x, it will return a bunch of NaNs, and it will return you some dual information certifying that the problem you passed in was infeasible. That's just in case you didn't trust it. It will return you a certificate, but that's not interesting here.
So what you're really going to have to deal with is that you want to make progress. That's the first thing. The other issue is the following: even if this problem is feasible and you take the step, how do you know you're making progress on the main problem? To evaluate progress, you're going to have to take into account a bunch of things. First of all the objective — that's the true objective, not the approximate objective.
Obviously, you want this to go down, and we'll see cases where that has to happen. If the point is feasible, then it's easy to measure progress, by the objective, because that's the semantics of the problem. The semantics of the problem is literally: if you have two feasible points and you want to compare them, the one with the better objective is better, end of story. So that's how you do that.
The problem is: what if the approximate problem is infeasible, or the exact constraints are violated? You have to have some measure that tells you whether you're making progress. So you might say that making progress means these violations should go down. That's sort of one measure of making progress. This is like having a Lyapunov function. There are lots of other names for these things — 'merit function' is another name used to describe a scalar-valued function that tells you whether or not you're making progress. That's a standard and coherent method.
Now, if youíre not making progress, it could be for lots of reasons. It could be that your trust region is too big. And so whatís happening is your function is wildly non-linear over the trust region. Youíre least squares thing is providing some least squares fit, but youíre making horrible violations. By the way, when you call the method on the functions that says return an affine approximation, it can also return the errors, right. And if the errors are huge, thereís probably no point in forming and solving the convex program. Iím talking about a more sophisticated method. Instead, the collar will actually say, ĎOh, thatís too big reduce your trust region by a factor of 2 and try it again,í or something like that.
One reason is that your trust region is too big. So youíre making poor approximations. And then what happens, you form a linearized system, and in the linearized system, you have to make progress because itís a convex problem. Unless it just says youíre optimal at that point. It will suggest a new point, which will make progress. Any way you form a measure of progress that is convex, it will make progress period. But thereís no reason to believe that true non-linear problem is.
Now on the other hand, if the trust region is too small, several things happen. One is the approximations are good, because you're basically taking little mouse steps in x, so your approximations are quite good. But the progress is slow; you're going to solve a lot of convex programs. And there's one other problem, and this is a bit weird, but it's this. You are now more easily trapped in a local minimum, because you're just going to basically slide downhill and arrive at one point.
If your trust region is big at first and your function has all sorts of wiggles in it, since you're looking down on your function from high up and getting a very crude approximation, you're actually going to do better. If you have really tiny rhos, you're going to get caught in the first local minimum. As you're figuring out, all of this non-convex optimization is some combination of art, heuristics, and other stuff.
Student:[Inaudible] for the rho then?
Instructor (Stephen Boyd):Yes, we're going to get to that. There is, yes. I'll show you actually what would be the simplest thing that might work. And the truth is none of this works, I mean in the true sense of the word. But if you put quotes around it, then I can tell you what 'works,' and that means something comes out and you get something that's reasonable, close to feasible. Is it optimal? Who knows? But you get a nice picture, a nice trajectory, a nice estimate of some parameters or whatever it is you want.
So a very simple way of constructing a merit function is this. This is the true objective, not the approximated one. To the true objective, you add a positive multiple of the violation. That's the total, true violation: this is the inequality constraint violation, and that's the equality constraint violation. Something very important here: these are not squared.
They're not squared, and let me say a little bit about that. They would be squared if the context of what you were doing was conventional optimization, where differentiability is a big deal. They would be squared, and someone would say, 'Well, why are you squaring hi?' and you'd say, because now it's differentiable. But if you know about convex optimization, you know that non-differentiability is nothing to be afraid of. That's why we do this.
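The merit function just described can be sketched in a few lines; this is a toy illustration (the function names and the small test problem are made up for the example, not from the lecture):

```python
# phi(x) = f(x) + lam * ( sum_i max(f_i(x), 0) + sum_j |h_j(x)| )
# for the problem: minimize f(x) s.t. f_i(x) <= 0, h_j(x) = 0.
# Note the violations are NOT squared: the nondifferentiable
# max(., 0) and |.| terms are what make the penalty "exact".

def merit(f, ineqs, eqs, x, lam):
    """Exact penalty merit function at the point x."""
    violation = sum(max(fi(x), 0.0) for fi in ineqs)   # inequality violation
    violation += sum(abs(hj(x)) for hj in eqs)         # equality violation
    return f(x) + lam * violation

# Toy instance: minimize x^2 subject to x >= 1, written as 1 - x <= 0.
f = lambda x: x * x
ineqs = [lambda x: 1.0 - x]
phi = merit(f, ineqs, [], 0.5, 2.0)   # 0.25 + 2 * 0.5 = 1.25
```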
But there's also something else very important about this. It turns out this is what's called an exact penalty function. So this is, by the way, called a penalty method. It's the opposite of a barrier method. A barrier method forces you to stay in the interior of the feasible set by adding something that goes to infinity as you approach the boundary. So you don't even dare get near the boundary. Well, you will eventually get near the boundary; your level of daring is going to depend on how small this is.
In the barrier method, by the way, some people write it here, or we would put it over here with the t and make t get big, but it's the same thing. That's the barrier method. A penalty method is the opposite of a barrier method. It allows you to wander outside of the feasible set, but it's going to charge you, and this is the charge. So that's a penalty. Now, this is what's called an exact penalty method, and let me tell you what that means. It means the following. It's kind of intuitive, and it can be shown, that if you increase lambda here, the penalty for being outside the set, you're going to violate the constraints less and less. It's obvious, and you can show this.
But here's the cool part. For some penalties, the following occurs. For lambda bigger than some lambda critical, some finite critical value, the solution, the minimum of this thing, is exactly the solution of the problem. It's not close. It's not like, oh, if I charge you $50.00 per violation unit, you get close, and then at 80 you get closer, but you're still violating a little bit, and then I say, 'Okay, now it's a thousand dollars per violation unit,' and you violate 1e-3. No, it works like this. It's $50.00 per violation unit, you violate; I get it up to 200, and your violation is 0, and you have therefore solved exactly the original problem.
Everybody see what I am saying here? By the way, it's very easy to show this for convex problems. This is non-convex, and in any case, it's irrelevant because nobody can minimize this exactly. So all of this is a heuristic on top of a heuristic on top of a heuristic, but anyway. So that's an exact penalty method. By the way, this is related to L1 type things. You've seen this there. If you add an L1 penalty - we did homework on it, didn't we? Yeah. So if you have an L1 penalty and you crank up the lambda big enough, it's not that the x gets small.
You go over a critical value, and it's not just that the x gets small: for a finite value of lambda, it's 0. It's just 0, period. So this is the same story. Now, we can't solve this non-convex problem; it's no easier or harder than the original problem. So instead, we'll use sequential convex programming, and we'll simply minimize this problem. Now the cool part about that is you can't be infeasible here.
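The exactness claim can be checked on a one-variable toy problem (everything here is made up for illustration): minimize x subject to x >= 1, with penalty phi(x) = x + lambda * max(1 - x, 0). The critical value is lambda = 1; above it, the unconstrained minimizer of phi is exactly the constrained solution x = 1, not merely close to it.

```python
# Minimize phi over a coarse grid just to see where the minimizer sits.
def argmin_phi(lam, grid):
    return min(grid, key=lambda x: x + lam * max(1.0 - x, 0.0))

grid = [k / 100.0 for k in range(-100, 301)]   # x in [-1, 3]
x_weak = argmin_phi(0.5, grid)    # lam < 1: penalty too weak, the
                                  # minimizer runs off to the left edge
x_exact = argmin_phi(2.0, grid)   # lam > 1: minimizer is exactly 1.0,
                                  # the solution of the original problem
```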
I mean, assuming the domains of all the fs and hs are everywhere, which is not always the case, but roughly speaking, you cannot be infeasible now. You can plug in any old x. You can take a tight trust region method. If you move just a little bit, the hope is that this violation will go down. But the point is anything is possible. You can violate constraints, equality and inequality constraints; you just pay for them. That's what this is.
So that deals with the feasibility thing, and in fact, a lot of people just call this a phase 1 method. So it's a good method. I should add - I'm going to ask you a couple of questions about it, just for fun, to see how well you can do street reasoning on convex optimization. So here it is.
When you minimize this, that's a convex problem, and here's what I want to know. I've already told you one fact: if lambda is big enough, all of these will be less than or equal to 0, and all of these will be exactly 0. It's just like L1. So let me ask you this. When lambda is not big enough and you solve these, and some of these are non-zero, what do you expect to see?
So lambda is not big enough to cause all of these to be less than or equal to zero and all of these to be zero. What do you expect to see, just very roughly?
Student:A small violation for most of the fi's, but just a few hanging around then.
Instructor (Stephen Boyd):That's a good guess. So a small violation for all of these guys and a few - now, why did you say a few here?
Instructor (Stephen Boyd):Cool, L1 exactly, because, for example, if these are affine - and this really is - that's really an L1 norm. And so L1, and that part of your brain is next to your sparsity part; I mean, the neurons are growing between the sparsity and the L1 thing. That's a very good guess. Actually, I'll tell you exactly what happens. What happens is a whole bunch of these will be 0, and you'll have sparse violations. That comes directly from L1, and you get exactly the same thing here.
So it's actually very cool. You'll have a problem - this is good to know outside of this context, just for engineering design. You have a convex problem. You have a whole bunch of constraints. It's infeasible. That's irritating. At that point, one option is to say, sorry, it's just not feasible. That's not a failure of your method, if it's convex optimization, because there is no feasible solution; it's not that there's one and you failed to find it.
So if you minimize something like this, what will happen is really cool. Suppose you have, let's say, 150 constraints: 100 of these and 50 of these. Here's what will happen if you're lucky. This thing will come back, and it will say, 'Well, I've got good news and bad news. The bad news is it's not feasible. There is no feasible point. The good news is, of your 50 equality constraints, I satisfied 48 of them, and of your 100 inequality constraints, I satisfied 85 of them.' Everybody see what I'm saying?
So this is actually a heuristic for satisfying as many constraints as you can. By the way, that's just a weird aside. It doesn't really have to do with this particular problem, but it's a good thing to mention.
The question was, how do you update the trust region? Let's look at how this works. So here's what's going to happen. I'm going to solve a convex problem. That will suggest an x k+1. I will then simply evaluate this thing. This might have gone down. These might have gone up. Who knows? The only thing I care about is this phi, and the question is, if it went down, I made progress, and roughly speaking, I should accept the move. If phi doesn't go down, if it went up, that was not progress, and it could be not progress for many reasons.
It means that your trust region was too big, you had very bad errors, you solved the approximate problem, and it wasn't close enough. It gave you bad advice, basically. So here's a typical trust region update. By the way, this goes back to the '60s. It's related to various trust region methods, not in this context, but the same idea; it's identical to what was done back in the '60s. So what happens is this. You look at the decrease in the convex problem. This you get when you solve the convex problem. And it basically says, if those approximations you handed me were exact, you would decrease phi by some amount delta hat. This is always positive here, always.
It could be 0, which means basically that according to the local model you're optimal; there's nowhere to go. Okay? But this is positive. This predicts a decrease in phi. Then you actually simply calculate the actual exact decrease. Now, if your model of phi is very close to the true phi, these two numbers are about the same. So if your approximations are accurate, these two numbers are the same. By the way, that generally says your trust region is too small, and you should be more aggressive. So if in fact your actual decrease is at least some fraction alpha of the predicted decrease, typically like 10 percent, then you accept this x tilde; that's your next point. And you actually crank your trust region up by some success factor.
By the way, this is just one method. There are many others. That's the nice part about heuristics. Once you get into heuristics, it allows a lot of room for personal expression, so you can make up your own algorithm, and it's fun, I guess. Notice that in convex optimization there's not much room for personal expression, if you think about it carefully. You really can't say, use my LP code because it's better. 'What do you mean?' 'Oh, you're going to get way better results with mine, way better.' You can't, because any LP solver that's worth anything gets the global solution. So they're always the same. I mean, you can have second order fights about which one is faster and slower and all that, but that's another story; you certainly can't talk about quality of solution, because they all get the exact global solution.
In non-convex optimization, you can actually talk about the quality. You can actually stand up and say, 'My algorithm is better than yours,' and it will actually mean that I'll sometimes get better solutions. Here you increase it. Typically, you might increase it 10 percent or something like this. Now, if the actual decrease is less than 10 percent of the predicted decrease, then what it says is you'd better crank your trust region down. The typical method is to divide by 2. This is just a typical thing. As I said, there's lots of room for personal expression, and you can try all sorts of things.
So this is a typical thing to do. By the way, one terrible possibility is that delta is negative. Now if delta is negative, it means this thing - it says, I predict you will decrease phi by 0.1, and then when you actually calculate it, it's entirely possible that phi not only did not go down, it went up. That's definitely not progress. But that's the way to do this. This is a trust region update. Like I say, these are things that are easily 40 or 50 years old - not in this context, but in the context of other problems.
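The update rule just described might look like this in code; alpha, the expand and shrink factors, and the return convention here are typical choices, not fixed by the method:

```python
def update_trust_region(delta_hat, delta, rho,
                        alpha=0.1, beta_succ=1.1, beta_fail=2.0):
    """delta_hat: predicted decrease in phi from the convex subproblem
    (always >= 0 by convexity); delta: actual decrease in phi (can be
    negative if phi went up). Returns (accepted, new_rho)."""
    if delta >= alpha * delta_hat:
        return True, beta_succ * rho   # accept the step, expand rho
    return False, rho / beta_fail      # reject the step, shrink rho

# Example: predicted decrease 50, but phi actually went UP by 5,
# so the step is rejected and the trust region is halved.
accepted, rho = update_trust_region(50.0, -5.0, 90.0)
```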
So now, we're going to look at an example. It's an optimal control problem. Actually, I know we have a couple of people working on some of these. It's for a two-link robot, and it's not vertical; it's horizontal, because there's no gravity here. That doesn't matter; it's just to kind of show how it is. So there are two links here, and our inputs are going to be a shoulder torque and an elbow torque. So these are the two inputs we're going to apply to this, and we're going to want to move this around from one configuration to another or something like that. We'll get to that.
So here are the dynamics. The details of this not only don't matter, but there's a reasonable probability, maybe 5 percent, that they're wrong, because we don't know mechanics, but somebody who knows mechanics can correct us if they're wrong. Anyway, it's an inertia matrix multiplied by the angular acceleration, and that's equal to the applied torque. That's a two-vector; that's tau one and tau two. Here is, allegedly, the mass matrix. I can tell you this: it has the right physical units. So that part we can certify. It's probably right.
Now the main thing about these is that M of theta and W of theta are horribly non-linear. They involve products of sines and cosines of angles and things like that. So it's not pretty. So this is a non-linear DAE, a differential algebraic equation. Of course, simulating it is nothing, because we know how to numerically simulate a non-linear DAE. It's nothing. Let's look at the optimal control problem. We're going to minimize the sum of the squares of the torques, and we'll start from one position, at rest - that's a two-vector - and we want to go to a target position, again at rest.
We'll have a limit on the torque that you can apply. So there's a maximum torque you can apply. This is an infinite dimensional problem, because tau is a function here, but let's look at some of the parts. That's at least a convex functional. These are all cool. This is cool too. So the only part that is un-cool is the dynamics equality constraint, which is non-linear. If it were linear, this would not be a problem. So we'll discretize it. We'll take N time steps, and we'll write the objective as something like approximately that, and we'll write the derivatives as symmetric differences.
By the way, once you get the idea of this, you can figure out how to do this in much more sophisticated ways, where you use fancier approximations of derivatives and all that, but this doesn't matter. So we'll approximate the angular velocities this way, and we'll get the angular accelerations from this. That's approximate, and the initial conditions correspond to something like this. For two steps, you're at the initial condition, and for the final two steps, you're at the final one. The two steps say that your velocity is 0. That guarantees the velocity being 0, and we'll approximate the dynamics this way.
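The symmetric (central) difference approximations mentioned here, on a uniform grid with step h, look like this; this is a generic sketch, not the lecture's exact discretization:

```python
def angular_velocity(theta, i, h):
    # central difference: thetadot_i ~ (theta_{i+1} - theta_{i-1}) / (2h)
    return (theta[i + 1] - theta[i - 1]) / (2.0 * h)

def angular_accel(theta, i, h):
    # second difference: thetaddot_i ~ (theta_{i+1} - 2*theta_i + theta_{i-1}) / h^2
    return (theta[i + 1] - 2.0 * theta[i] + theta[i - 1]) / h ** 2

# Sanity check on theta(t) = t^2, where both formulas are exact:
# thetadot(t) = 2t and thetaddot(t) = 2.
h = 0.1
theta = [(k * h) ** 2 for k in range(5)]
```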
Now, that's a set of horrible non-linear equations, because remember this mass matrix. If this were a fixed matrix and if this were fixed, this would be a set of linear equations. So it's written like that, and the first and lazy thing to do is quasi-linearization. That's kind of the obvious thing to do here. So let's do that. Here's our non-linear optimal control problem, and by the way, this kind of gives you the rough idea of why things like sequential convex programming make sense. That's convex. These are convex. These are all convex. The dynamics are not convex. That's the problem.
We'll use quasi-linearized versions. You'd get much better results, I believe - we haven't done it - if you used linearized versions. And if you used trust region versions, you'd probably get even better still, but in any case, we'll just simply - and by the way, the physical meaning of this is very interesting. The physical meaning is this. This is a set of linear equality constraints on theta. But it's not quite right. It basically says it's the dynamics, but you're using the mass matrix and the Coriolis matrix from where the arm was on the last iteration, because that's what this is.
Now, hopefully in the end, it's going to converge, and the new theta is going to be close to the old theta, in which case you're converging into physics consistency or something like that. That's the idea. We'll initialize with the following. We'll simply take the thetas - we want to go from an initial point to a target point - and we'll just draw a straight line. By the way, there's no reason to believe such a line is achievable. Actually, just with the dynamics, there's no reason to believe it's achievable, and certainly not with torques that respect the limits. There's no reason to believe that.
Okay, so this is the idea. So we'll do the numerical example now, with a bunch of parameters and things like that, and this says that the arm starts like this, with the other one bent that way or something like that. And then that's going to swing around like this, and the other one will rotate all around. There's a movie of this, but I don't think I'm going to get to it today. Anyway, it's not that exciting, unless your eye is very, very good at tracking errors in Newtonian dynamics. There are probably people who could do that. They could look at it and go, 'Ow, ooh, ahh.' And you go, 'What is it?' They go, 'You're totally off.' And then at the end, they could look at it and go, 'Yep, that's the dynamics.'
So the various parameters are: we'll bump up the trust region by 10 percent every time we accept a step. We'll divide by 2 if we fail. The first trust region is going to be huge; it's going to be +/- 90 degrees on the angles. All the action is going to go down over like +/- pi, or +/- 180 degrees. And then we take lambda = 2. That was apparently large enough to do the trick. So the first thing you look at is the progress, and remember what this is. This combines both the objective and then a penalty for violating physics. You have a penalty for violating just physics, and in fact, that second term - I'll say what it is in a minute.
That second term, the equality constraint residual, is in Newton-meters. At every time step, it's two equality constraints on two torques. The residual is in Newton-meters. It's basically how much divine intervention you need to make this trajectory actually happen. It's the torque residual; it's how much unexplained torque there is in it. So this says, yes, it's progressing. By the way, this doesn't go to 0, of course, because it contains that first term. We'll see how that looks.
So here's the convergence of the actual objective and the torque residuals. So let's see how that works. You start off with an objective of 11. That's great. In a few steps, though, it's up to 14. You think that's not progress. The point about this 11 is, yes, that's a great objective value. There's only one minor problem: you're not remotely feasible. So it's a fantastic objective value. It's just not relevant, because you're not feasible.
Presumably what happens - in fact, we can even track through this and figure out exactly what happens. So on the first two steps our trust region was too big. A new point was proposed. We evaluated the progress. We didn't make enough progress. We'll see exactly what happened. We divide by 2 twice. So in fact, it was +/- 90, +/- 45, +/- 22.5. At that point, the trust region was small enough that we actually got a step to decrease things. And what it did was, since phi goes down, this thing went way down, and yet your objective went up. What that strongly suggests is that your torque residual went down, and in fact, it did. This is on a log plot, so you can't see it, but that little bump there is a big decrease in torque residual.
You can see here, this is the torque residual going down like that. So when we finish here after 40 steps, we have something that - surely, by the way, this is much smaller than the error we made by discretizing. So there's an argument that we should have stopped somewhere around here, but it's just to show how this works. You can guess that this is the value of the objective that we get in the end. You can see that by 20 steps you actually had a reasonable approximation of it. So this can be interpreted as the following: this is making progress towards feasibility - you've got rough feasibility by about 15 or 20 steps - and this is just keeping feasibility and fine-tuning the objective.
What you see here are pictures of the actual and predicted decrease in phi, the blended objective. And you can see here, on the first step, you predicted a decrease of 50, but in fact, when you actually checked, what happened? Not only did phi not go down, it actually went up. That's negative; it went up by 5. So you reject it. On the next step, you divide again. The trust region goes from +/- 90 to +/- 45. You try again on +/- 45 here, and once again it's a failure. Now you go down to 22.5. You predict a decrease of 50. You got a decrease of 50. That was accepted.
This shows how the trust region goes down. It starts at +/- 90 degrees. You fail once, twice. You're down to 22.5 degrees. Now you go up to 26 degrees. I guess you now fail three times in a row, something like that. I can't follow this. That's about three times in a row, maybe. Then you go up, and fail once, and that's the picture of how this goes. And then this is the final trajectory plan that you end up with. These are not obvious. I don't know of many other methods that would do this.
So you end up with this torque, and it kind of makes sense. You accelerate one for a while, and then everywhere it's negative over here, that's decelerating. And you accelerate this one to flip it around, and you have a portion at the end where it's decelerating like that. And then here's the final trajectory. The initial trajectory just went like this: it was a straight line from here up to here, and this one is a straight line from here to here. So I guess theta 2 actually turned out to be pretty close to our crude initial guess.
That's an example. Let me say a little bit about this example. Once you see how to do this and get the framework going, anything you can do with convex programming can now be put in here, and that's the kind of thing you couldn't do with a lot of other methods. You can do robust versions now. You can put all sorts of weird constraints on stuff. Since the taus actually appear completely convexly - that's not a word, but let's imagine that it is - in the problem, and there are no trust region constraints on them, you can put a one norm on the taus instead of a sum of squares. In which case, you'd get little weird thruster firings all up and down the trajectory.
That would be something that classical methods wouldn't be able to do for you. This specific problem right here, you could do by various methods involving shooting and differential-equation-constrained problems. You could do it. It'd be very complicated, by the way. This would be up and running way, way faster than those other methods. But anyway, when you start adding other weird things like L1 objectives and all that kind of stuff, it's all over. All that stuff just goes away, and this stuff will work just fine.
We're going to look at just a couple of other examples of this. They're special cases, really, or variations on the same idea. Here's one thing that comes up actually much more often than you imagine. It's called difference of convex programming, although this has got lots and lots of different names; one is called DCC or something like that, but anyway. So here it is. You write your problem this way. Forget the equality constraints - just leave those alone - and you write every function as a difference of two convex functions. And you might ask, by the way, 'Can you express every function as a difference of two convex functions?'
What do you think? The answer is you can. I mean, we can probably find five papers on Google that would go into horrible details and make it very complicated, but you can do it, obviously. This is utterly general, but it's not that useful except in some cases. It's only useful in cases where you can actually identify very easily what these are. By the way, this comes up in global optimization too. We'll see that later in the class.
The name is weird because it doesn't quite make sense. I don't know a better name for this. We can help propagate it, by the way, if you can think of a name. Difference of convex functions programming - that's closer, but it's not really working. So I don't know. Anyway, if someone can think of one. There's a very obvious convexification of a function that's a difference of convex functions. It goes like this. Suppose I ask you to approximate f(x) - g(x) with a convex function. We can do a little thought experiment. If I came up to you and I said, 'Please approximate f by a convex function,' what would your approximation be? f. It's convex.
Now suppose I ask you to approximate a concave function. Let's go back and think locally. How do you approximate a concave function by a convex function? You can guess: linear. Are you sure you couldn't do any better? Well, you could easily show it's the best. It depends on your measure of best or whatever, but you could write some things down, and surely people have written papers on this and made a big deal out of it.
Very roughly speaking, depending on how you define best, the best convex approximation of a concave function is an affine function. If you're approximating something that curves down, and you put in anything that curves up, you're going the wrong way, and you're going to do worse than just making it flat. You're allowed to do that, actually. In the second half of an advanced 300-level class, I'm allowed to say stuff like that.
What you do is simply this. You simply linearize g, and you get this thing. This of course is convex, because it's convex, and that's constant - well, this is affine. That's the affine approximation of g. Now here's the cool part. When you linearize a concave function, your linear function is above the concave function at all points. Everybody agree? Your function curves down; your approximation is like this. You're a global upper bound. And what that means is the following. You have replaced g with something that is bigger than g everywhere, and that means that your f hat is actually - have I got this right?
Instructor (Stephen Boyd):Minus g. So this means that f hat is bigger than f(x) for all x. That's what it means. Let me interpret what this means. Roughly speaking, a convex function is one where smaller is better in an optimization problem. So it's either an objective or it's a constraint function. If it's an objective, smaller means better. If it's a constraint function and you're positive, it means you're closer to feasibility, roughly. If you're negative, it means you're more feasible. Now, that doesn't quite make sense, but that's good enough. It certainly means you're still feasible.
So this is really cool. This says there are no surprises with the trust region - there's no trust region. If you simply form the function with this and globally minimize, here's what's cool: you just take a full step. Follow the logic. There is no trust region; you just remove it entirely. This says that when you optimize with f hat and then actually plug in the true one, your results with the non-convex problem can only be better. All your constraint functions are actually smaller. They go down, and your objective goes down.
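Here is a minimal one-dimensional sketch of that logic (a made-up toy problem, not from the lecture): minimize F(x) = x^2 - 2|x|, written as convex x^2 minus convex 2|x|. Linearizing the subtracted convex term at x_k gives the convex upper bound x^2 - 2*sign(x_k)*x, which touches F at x_k, so minimizing it with a full step - no trust region - can only decrease F:

```python
def ccp_step(xk):
    # subgradient of the subtracted convex term 2|x| at xk
    g_slope = 2.0 if xk >= 0 else -2.0
    # minimize the convex majorizer x^2 - g_slope * x exactly:
    # derivative 2x - g_slope = 0  =>  x = g_slope / 2
    return g_slope / 2.0

x = 0.3
for _ in range(5):
    x = ccp_step(x)        # full steps, no trust region
# x lands on the local minimizer x = 1, where F(1) = -1
```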
So that's how this works. This method is much simpler, and it's sometimes called the convex-concave procedure. It has been invented by many people, independently and periodically, and surely will be re-invented roughly every 10 years. I have no idea who invented it first, but I think I can guess where. I won't do it; I'll check actually. I do know: it was invented there, Moscow of course. I know that because it's a part of potential methods.
So here's an example out of the book, in the chapter on approximations. Here's the problem. You're given a bunch of samples from a distribution. You've subtracted off the mean or something. They come from something with a covariance matrix sigma true. Our job, given a bunch of samples, is to estimate the covariance matrix. So the negative log likelihood function, which we are going to minimize - you just work this out; it's right out of the book anyway - is log det sigma + trace(sigma inverse Y), and Y is the empirical covariance of your samples. So it's that. There's no reason to believe that the yi's sum to 0. They won't.
You want to minimize this function, and we look at the different parts. This is good, because that's a convex function of sigma. Unfortunately, that is bad; this is a concave function of sigma. Now, the usual approach here is to not estimate sigma but to actually estimate sigma inverse. You're welcome to; there's a change of variables. So look at sigma inverse as the variable, a matrix R. This is Tr[RY]; that's linear in R. This is log det R inverse; that's convex. Everybody got this?
If you want to do covariance estimation, there's a very fundamental message. The message is this: you shouldn't be doing covariance estimation, at least from the optimization perspective. You should be doing information fitting. The inverse of the covariance matrix is the information matrix; that's what you should be optimizing. At the end, you can invert it.
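A scalar special case shows the change of variables (this one-dimensional reduction is my illustration, not the lecture's matrix problem): with empirical second moment y, the negative log likelihood is log(sigma) + y/sigma in the covariance variable, but -log(r) + r*y in the information variable r = 1/sigma, which is convex with minimizer r = 1/y:

```python
import math

def nll_sigma(sigma, y):
    # negative log likelihood in the covariance variable (nonconvex form)
    return math.log(sigma) + y / sigma

def nll_info(r, y):
    # same quantity in the information variable r = 1/sigma (convex in r)
    return -math.log(r) + r * y

y = 2.0
r_star = 1.0 / y   # stationary point of the convex form: -1/r + y = 0
```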
Now, the problem with that is that the only constraints you can handle are constraints which are convex in the inverse of the covariance matrix. By the way, that includes a lot of constraints, and some really interesting ones. One is conditional independence. If you take samples of the data and you want a Bayes network, you want sigma inverse to be sparse, because a zero in an entry of the inverse of a covariance matrix means conditional independence.
So how many people have taken these classes? A couple, okay. How come the rest of you haven't taken these classes? What's your excuse? You just got here or something this year?
Instructor (Stephen Boyd):That's your excuse? All right, you're excused. Have you taken it?
Student:Same excuse, came here last fall.
Instructor (Stephen Boyd):Just got here, I see, but you're going to take this next year.
Instructor (Stephen Boyd):Good, right. It's Bayes networks. All right, that was all an aside. So this problem you solve by just working with sigma inverse. However, we're going to do this. I want to say the following. I want to say, 'No, no, no. Please solve this problem for me, but I'm going to add the following constraint: all entries in the covariance matrix are non-negative.' Now of course, the diagonals are obviously non-negative. That's obvious. But this says all of the entries of the variable y are positively correlated.
I pick this just to make the simplest thing possible that is not a convex function in the inverse of the matrix, as far as I know. So the inverses of element-wise non-negative covariance matrices - that's not a convex set. We're back to that. That means we have to solve this problem. It is not convex, although it's very cool. That's convex, this constraint. This term is convex. This is concave. So a concave is a negative convex, so it's difference of convex programming. We can't use DCP - that's disciplined convex programming. DCFD, that's not working. Someone has got to figure out a name for this.
So this is a difference of convex functions; that's convex minus convex. And so if you linearize the log determinant, you get the trace of the inverse times the difference. This is now constant. This is affine in sigma, and that's convex in sigma, and so we'll minimize that. Here's a numerical example. This was just started from five problems, and these are the iterations, and this is the negative log likelihood. Now, by the way, in this problem, I'm willing to make a guess that this is actually the global solution. To be honest, I don't know; I didn't calculate a lower bound or anything like that, but it's just to show how these things work.
Instructor (Stephen Boyd):Yes.
Student:[Inaudible] sample of the various [inaudible].
Instructor (Stephen Boyd):It depends. You don't want to be too smart with these methods in initialization. You can start with an immediate initialization, but then you want to rerun it from different ones, just to see which regime you are in. Is it always finding the same one? In which case, you can suspect it's the global solution, maybe. Who knows? Or if you get different solutions, you just return the best one you find.
Yes, you could initialize it with Y if you wanted and go from there. How was this one initialized?
Instructor (Stephen Boyd):Random, for these 10, there you go. That's how this was done. And actually, you want to do that anyway. You don't want to run any of these just once. You try them and see what happens, see if you get different ones. I think we'll quit here and then finish this up next time.
[End of audio]
Duration: 78 minutes