Instructor (Stephen Boyd):Weíll start in Ė weíll finish off interior point methods today. So if you go down to the Pad, weíll start, and I mentioned this last time, but Iíll Ė I think I did it a little bit quickly, so weíll do Ė talk about feasibility in phase one methods.
In these methods, a feasibility method or a phase one method, the job is to figure out whether a set of convex inequalities and equalities is feasible or not. So thatís Ė and this is sort of the very traditional way, where you divide it into a phase one and the phase two.
Youíve already had a glimpse at methods that donít need that. For example, the infeasible Start-on-Newton method does not require you to start with a feasible point. You only start with a point that is in the domain of the functions, but it doesnít have to satisfy the equality constraints.
Okay, so the feasibility method. The classic one is this, is you introduce a new variable, s, and you minimize s subject to f i of x is less than s. And this can easily be put in the same Ė in the standard format, except now you can Ė now you can easily come up with an initial point that is strictly feasible here, because, basically, you take any x here at all, and then you simply take s large enough, and you have a strictly feasible point for this problem.
Now, when you solve this problem, the optimal value is either positive or negative. Well, it could be zero. If itís negative, it means youíve actually produced a strictly feasible point. If itís positive, it proves there is no strictly feasible point. If it Ė if the outcome value is zero, thereís no strictly feasible point, but itís tricky at that point. You donít know whether or not thereís a feasible point. Thatís right on the boundary.
Okay. Now this is Ė in some sense, you can interpret s here, if itís a positive number, you can interpret s as the maximum infeasibility, because thatís what it is here. In fact, when you minimize this, and the number is positive, youíre going to interpret that number as the minimum possible value of the maximum constraint violation. Thatís what this s Ė for example, if s star was .1.
If itís .1, it means that all of your constraints are satisfied within .1. Now, of course, thatís sort of pushing the maximum infeasible Ė constraint violation down and, in that case, you would not be surprised, I would hope, to find out that a whole bunch of inequalities are violated by about .1 if s star is .1. So thatís what you get here.
And sure enough, thatís exactly the case. So hereís an example where this is reversed. This is showing you the margin. So here you have a violation of, I donít know, something like .2, and you can see that 50 variables are slammed right up against whatever that minimum maximum Ė that minimum value of maximum margin is here, right here. So there Ė itís just slammed up against it here.
And the number of constraints that are satisfied is something like 39 out of 100 here, and thatís all of these. Everything to the right of zero is an inequality that is satisfied, not violated. So thereís 39 of them.
Now another method is this. If you minimize the sum of the infeasibility Ė so you introduce Ė here there was a global scalar infeasibility variable, s, which of course, if one of these is tight, is the maximum violation.
Here each constraint gets its own private violation parameter or Ė weíve seen this many, many times before. This is the first line in the support vector machine, for example, and in fact, whenever you have a bunch of inequalities or something like that that youíd like satisfied, but you want it handled gracefully, when in fact they cannot all be satisfied, you would put in Ė youíd put in a little fudge factor here or something like that that would allow you to satisfy the inequality.
Then the hope would be to drive these to zero. Now when you drive these to zero Ė these are positive, so one transpose s you should even read as Ė well, I donít know. You can write it this way. I mean, itís true. Itís what it is. And this should fire up a whole section of your brain, not yet identified, which should be devoted to sparse solutions and things like that.
So what youíd expect in such a phase one problem, here, you should expect exactly the following. That a small number of constraints will be violated, assuming itís not feasible. If itís feasible, then of course, you ended up with a feasible point, obviously, because you just said s equals zero and youíre done.
But you should Ė would expect, in this case, to end up with a small number of constraint violations and, in fact, thatís exactly what happens, and hereís what you get. This is for the exact same problem instance over here. Youíve actually satisfied Ė a whole lot of them are right at zero. We expect that, but over to the right here is about 80 percent of Ė so actually only 20 are violated here.
Then use Ė is there no solution that satisfies more than 80? Actually, we donít know. Thatís a hard problem to tell. So Ė I mean is there Ė there probably is a Ė there probably is a point that violates, letís say, instead of 21, it violates 17 or something like that, but the point is, as a heuristic, this works very, very well to do this.
By the way, you might ask here or you might say that, between these two, you might say, well now wait a minute. Here we introduced one new variable, here. Here we introduce m new variables, and so if youíre trained in old school thinking, which is, as I say, wrong thinking, you would say, wait a minute, that has more variables. This has n plus Ė this phase one has n plus one variables. This one has n plus m new variables. N plus m is bigger than n plus one, and thatís bad.
Okay, so Ė to which Iíll just say this. I wonít go through the details, but Iíll say this. If you know what youíre doing, thatís the operative clause, this problem can be solved with exactly the same complexity as this one, okay?
By the way, itís the same reason that l one regularization, if you add that onto the end of a problem, it look Ė you introduce all these new variables and you say, my God, look at all those variables flying all around.
It turns out, if you know what youíre doing, which is to say, you do smart linear algebra in computing the search direction, complexity is identical. Thereís no cost. No additional cost.
So here you can even guess why. Letís talk about sparsity in here, and then Iíll quit on this topic, but letís talk about sparsity. S three. Thatís a variable. Please tell me where s three appears in the KKT system? Does it appear in the equalities? No.
Okay, letís see. S three appears where? Itís coupled with just f three here, so that tells you something about the sparsity pattern here, and youíll find out, if you do this correctly, you will indeed get the KKT majors when you write it out here, itís going to look much bigger, but if you eliminate the s row, the sí first, youíll get the Ė I should say the dsí, because youíre actually calculating a step, youíll find out that the complexity is the same as this one.
Okay. Now another thing thatís kind of obvious is, if youíre solving a phase one problem, the amount of time it takes Ė itís not unfair to say that, basically, if someone asks you, how long does it take an interior point method to solve a problem? The basic answer is like 30 steps.
Someone says, really? All the time? And if theyíre looking nervous, you say, okay, fine, it can be 60. And if they look really nervous, then you say, fine, it can be 80, but donít go Ė you donít need to go about 80.
If itís an LP or a QP or something like that, you can say 25. These are just the numbers you say, and someone says, really, it doesnít matter like if itís image processing or finance or a machine? And you go, no, itís just 25, roughly.
Now, for feasibility, thatís also the correct zerothor to answer. S if someone says, how long does it take to detect whether a set of matrix inequalities or whatever is feasible, the answer is 25 steps, right? And youíre safe. So it could be 50 or something like that. But if you think carefully about it, it cannot be uniform Ė that cannot uniformly be the case, because, as you get problems where the feasible set is super tiny, there has to be something in the complexity that relates to that.
And, indeed, that is the case empirically. Itís also the case theoretically, so if you look very, very carefully at sort of the complexity and analysis of interior point methods, hidden deep in the wonderful results and everything, thereíll be a little constant, and youíll say, well whatís that constant?
And then the person explaining it will say, donít worry about that. The constant turns out to be something related to how close to infeasible the problem is, right? Of course, if you think carefully about that, entirely violates the whole contract of complexity theory, because complexity theory says, how many steps does it take to solve this in the worst case? And you canít slip Ė youíre not supposed to slip in a number that depends on the data, because once you slip in a number that depends on the data, then I can make anything I like. I could say, I can solve it in log n steps, and theyíd say, really? And you go, yes, no problem.
You just put a little constant in front of it, and youíd say, whatís that constant? And you go, itís just a constant. They go, what does it depend on? Youíd say, the problem data. And the constant would be the following. It would be the actual number of steps required to solve it divided by log n. Everybody following this?
So the whole thing is itís a bit fishy. Now they would protest this vigorously and say that they can something about that constant, but the truth is, these complexity analyses are not as pure as you might imagine they are at first glance.
Okay. So there is something that involves how close to infeasible a problem is and how many steps it takes to solve it. I mean, both Ė and Iím talking both in theory and in practice. Now, in practice, though, itís quite tricky.
Hereís a picture showing where you take a parameter, you take a set of inequalities, and you change this parameter, gamma, where, at some point, if you crank gamma down small enough Ė well, if delta b is positive, if I make gamma small enough here, then this will be infeasible. If gamma is large enough, this will be feasible. We can adjust the boundary to be zero. Weíll make delta b positive, okay?
So delta b is positive. I adjust gamma so that this set of inequalities is feasible but not strictly feasible, okay? That set. This is strictly feasible for gamma positive and infeasible for not. Now what happens is this. As gamma varies, you get numbers of steps like this, and you can see that, basically, up to very close to this point where the set goes from infeasible Ė empty to Ė strictly feasible to infeasible or whatever, you can see that the number of steps, it grows slightly.
By the way, this is just a sloppy method. These numbers can actually be much smaller here. And you can see that, as you get very close, you could look at the numbers here. Theyíre very, very close. You can see that it actually grows like this.
Couldnít even find a case where it grows faster than this, because then you get into very small numerical things. You have things where itís feasible, but with something like only in the sixth or eighth digit, or something like that, at which point it seems not to be a practical question anyway, at that point.
So this is just Ė this is this thing blown up on a different scale. By the way, I should ask you something, here, about this. What do you think the Ė an infeasible Start-on-Newton method looks like for this?
Well, itíll never tell you Ė it will never tell you infeasibility. It can be adjusted to, but letís skip that. So letís very gamma, and this is the strictly feasible range, and letís do infeasible Start-on-Newton, right?
Which simply applies Newtonís method to the problem. Infeasible Start-on-Newton. And the minute is takes a full step, itís feasible, okay? And then we stop. Then we terminate.
So the question is, what do you imagine this looks like? This would be the number of steps required in an infeasible Start-on-Newton method.
Well letís think of it this way. Letís start with a problem that is like super duper Ė super feasible, over here. Itís like the huge, huge, vast feasible set. How long do you think itíll take in this case? Itís just a Ė itíll be just a couple of steps, right?
So itís going to look Ė itís going to be like this. Then this is gonna grow much more strongly, I think, than in this case. Itís gonna look like that. So if you get down here, this is already things like Ė this could be 1,000 steps or something like that. So it will grow Ė this curve will look quite different from this phase one. The phase one will actually look like this, right? And just settle out at 20 steps. I guess this goes lower than 20 steps, like that. Itíll look something like this. Okay.
Not that this matters, but just so you know. Okay. Now weíll look at the complexity analysis of these methods. And this is something like, I guess, some people would call this the crowning achievement of the Ď90s or something like Ė I donít know. There are probably people who feel that way. I mean I think itís very cool.
What effect it has on practice is not entirely clear. Iíll say something about that when we Ė at various times, but first letís look at it. So letís go back to the whole idea of the barrier method, right? So the whole idea of the barrier method is something like this.
You minimize t Ė letís just take a linear objective. C transpose x plus this barrier Ė you minimize this for some Ė for one value of t, and then you crank up t higher and do it again. So this is what weíre Ė I guess this is what weíre doing.
Now, you can show all sorts of things. For example, you can show that the Hessian, here, of this thing, is actually going to be Ė itís actually becoming singular, okay? So if t gets big enough, the Hessian here is Ė the condition number is going to infinity.
By the way, that should set up all sorts of alarms right there, okay? And in fact, what you wouldnít be surprised of and you should not be surprised is, if when you solve this, it worked something like this. For some value of t, like t equals one, this took eight Newton steps. Then you said t equals t, and it takes like 12 Newton steps. Then t equals four, and it takes 20 Newton steps. And the number of Newton steps is growing. And someone will say, wow, your Ė the number of work per centering is growing. Why?
And youíd say, well, look, come on. Itís getting a Ė youíre getting a harder problem to solve. The higher t is Ė I guess maybe I should do it this way. The higher t is, the more this thing looks like Ė you remember this picture. Looks something like this, right? It starts looking like that, and itís got more Ė itís got a higher third derivative, which, indeed, you can check. As t gets small, this thing has a higher third derivative. Maximum third derivative. Higher Liepschischzt Gaussian on the Hessian.
So you could say, look, itís completely consistent with everything we know that, as you crank t up, the problem Ė these problems become harder to solve. It takes more steps.
Now, in fact Ė in theory Ė sorry, in practice, it is observed that this does not hold and, in fact, you should have observed it by now. Actually, how many people have finished their homework? Okay, so they wouldnít check out like numbers of Newton steps as you crank t up. Did it grow?
By the way, if you make t super-huge, itís going to grow, but for another reason. But what are the numbers of centering steps, typically? Five? 10? Did it grow as t got bigger?
Instructor (Stephen Boyd):How much.
Student:It went up to 10.
Instructor (Stephen Boyd):It went from five to 10. Okay. Well Iím not Ė Iím talking about catastrophic. It doesnít grow catastrophically, or you didnít push it that hard. Okay. Well, okay. So Ė all right, so itís observed, basically, here, that what Ė that, here, when you increase t by some factor, the number of Newton steps doesnít increase.
So Ė I mean thatís good. Itís an empirical Ė itís a great thing. Itís sort of just an empirical observation and, in fact, thatís the key to it. By the way, if youíd used another barrier, it could well have. So if you wanna know why log barrier and all that, and why Ė this is one of the reasons.
What weíre gonna do now is actually see if the same thing is predicted by the theory. The numbers are totally different and, I think, fairly useless, the numbers, but nevertheless, itís always nice to have the theory agree with the practice, at least qualitatively if not in numbers.
So letís see how that works. Weíre going to assume the following. That f Ė the easiest assumption is f zero and phi are self-concordant, okay?
So Ė thereís at least one case where you actually have to look at the whole thing, the composite. Thatís the entropy maximization. But, in this case, weíll just look at t. Weíll assume f zero and phi are self-concordant.
Now, what that means is this. If t is bigger than or equal t one, t f zero is self-concordant and, therefore, so is t f zero plus phi. Thatís self-concordant. Okay. And then itís going to include a lot of problems.
Now we can actually say what the Newton iterations per centering step is, right? And remember what this is. This is five, which is my symbol for log log one over epsilon, okay? So thatís what this is. Itís six, whatever. Something like that.
And the number of steps is this. Itís equal to your initial starting point, and weíre minimizing, of course Ė well, letís see. Where are Ė yes, weíre minimizing mu t f zero plus phi, this thing, starting from the solution of the problem for t f zero plus phi, because in one update we take t star equals mu, and thatís what we do. So we multiply that by mu.
Okay, so this is the new f minus f star. F star is gonna be this thing here. Divided by gamma. Gamma is some absolute constant. I canít remember what it was, but I think it was one over 345 or something, if we do our analysis. I think, if you work very, very hard, you can make this one over .09 or something like that. Okay.
Now if you do this, this all very nice, but thereís one minor problem, which is this. This is a problem, generally, in the self-concordant analysis complexity of Newtonís method. If someone says, how many steps does it take? Itís f minus f star divided by gamma plus c. And someone says, but whatís f star? Or how would you find f star? Youíd find f star by running Newtonís method, but once youíve run Newtonís method, you donít need bound on the number of Newton steps, because now you know the exact number of Newton steps required. Everyone see what Iím saying?
So one could easy Ė you could easily imagine that this is not too useful. However, as usual, you wanna lower bound a convex function, whatís going to come to the rescue is duality.
So what weíll do is this. This thing here, thatís this top thing, thatís equal to, and you can write it out this way, and thatís less than or equal to, and Iím not going to go through the details or whatever, but this is what you get. This is mu t Ė actually, here you can argue it Ė it is duality, but you can argue it just completely directly.
I think weíre using the fact that weíre bounding this thing Ė thereís a simple inequality, and actually Iím not gonna go through the details here. Actually, it just bounds log mu and one and mu squared over two and all that kind of stuff, but Iím not gonna go through it.
So what happens is you bound this thing by Ė I guess, when all the smoke clears, you get this, and this is an upper bound. So this is the Ė this is f of the starting part. That was the previous Ė that was f Ė itís t f of x plus mu t f of x plus phi of x minus phi of x plus Ė minus the same thing for x plus. You get this thing, and itís less than this. And then you stop here. Thatís the dual function. And you end up with a very, very simple thing, which is m times mu minus one minus log mu.
If you plug this in here, you find out that the number of Newton steps is less than this thing right here. So thatís it. I guess you could do this if you wanted. Okay.
All right. Itís a very, very simple thing, this thing. It just says that the number of Newton steps Ė and, actually, the interesting part is this. It depends only on mu. Mu is your multiplicative update factor in your parameter here. So if mu is 10 or whatever, it simple says thereís a number of iterations, of Newton steps, that you can bound, that thereís a bound on it, and it will never take more than that.
So this whole idea of the problem getting harder and harder as you crank t up, now the theory tells you thatís false. The upper bound does not grow. It depends on nothing, notice here. Depends Ė has nothing to do with the dimension of the problem. It has nothing to do Ė it has to do with the number of inequalities, and mu, which is how aggressively you update. Thereís a question?
Student:What is the m again?
Instructor (Stephen Boyd):M is the number of inequalities. Inequality constraints. N is the number of variables. So it depends on absolutely nothing else. C and gamma are absolute constants. This Ė I think this oneís one over 345 or Ė and this oneís like five or six or something like that. Theyíre just absolute constants, right?
So thatís nice, and actually itís kind of cool. If you look at this, you will see that it is in fact Ė did we plot it? No, we didnít plot this, but this is an increasing function of mu, and the mu is small. And, in fact, this thing looks like the following. Near zero, it looks like mu and then plus one half of mu squared or something like that Ė sorry. In fact, it even cancels the log, because the expansion of this is log mu is about one minus Ė mu minus one plus Ė then the next term is a minus one-half mu Ė so this thing here looks like one half mu squared for mu small.
And that says Ė actually, that makes perfect sense. I wish I had plotted this. It looks like this, this function. It looks like that. The Ė so what it says is, if you take mu small, the bound is extremely small, because this is sort of like small squared. Of course, then thereís the c, so the extremely small hardly matters, right?
So if mu is chosen small enough Ė this thing is small. As mu goes to infinity, this thing grows, and it grows kind of like mu. Linearly. Thatís the upper bound. Okay?
So this thing is growing with mu, and that makes perfect sense. Thatís the number of Ė bound on the number of Newton steps required per centering step. Mu is the Ė is how Ė is the parameter that sets how aggressively you update t, so if mu is large, this thing should be large, because you are aggressively updating your parameter and should expect to spend more Newton steps on it.
This thing, in contrast, is decreasing with mu. Here, if mu increases, this thing goes down. Why? Well this is simple. This just tells you itís the number Ė this is, basically, mu is literally the duality gap reduction you get per centering step, and so the number of times youíll have to do that is just this number. Itís very simple.
You multiply these two together, and you actually the total Ė a bound on the total number of Newton steps to solve the problem, and sure enough, hereís what it looks like. And this is for some typical values of gamma and c. M equals 100, so 100 inequalities, and weíre going to use something like a 10 to the five reduction in duality gap here. Actually, sorry, thatís the absolute gap is gonna be 10 to the minus five.
So what happens is it looks something like this. And, sure enough, itís kind of cool. It shows you that, for mu small, what happens is this is really small, but the constant is Ė this is basically negligible, and thatís a constant, and then this thing gets big.
On the other hand, for mu large, this gets big even though this goes down, and this goes down like a log sort of a thing, and this goes up linearly, so thatís not a winning combination to make mu large. And, sure enough, you plot this and you get this. You get something like that.
Whatís wrong here, of course, are the numbers. So what this says is that the Ė this would tell you that the optimal value of mu, for this problem, is 1.03 Ė I donít know, two? It looks like 1.02 to me. And it says that the maximum number of steps is three Ė letís see, I donít know, 8,000. Everyone get approximately that?
So the numbers are wrong, but it says, if youíre a complexity theorist, it says, the best thing to do here, by far Ė I mean, for a complexity theorist, this is that mu equals 1.02. Thatís a real homotopy method. Thatís a 2 percent increase in t, and then, presumably, at that point, youíd do some number of iterations.
And the total number of iterations it will take is no more than 8,000, okay? So thatís great. Actually, if you put Ė you can put mu as a function of m and weíll actually get a nice complexity bound. Of course, this is off. You know what good values of mu in practice range from like two to 50 or something like that, for all of these problems, and the number of steps here has a minimum value thatís around 25, 30, something like that, and it does Ė itís not sharp either.
So, in practice, well itís on a totally different scale, right? On a totally different scale. Well, youíve seen what it looks like. We had it earlier, so you can see what it Ė so this is sort of the theory scale, and hereís the practice scale.
And, in fact, whatís not shown here is that this would actually go up eventually, like that. So this is the practice Ė this is, in practice, the number of steps required, as a function of how aggressively you update, looks like that, and the theory scale looks like this. And everything is different about them. I mean, they have the same qualitative factors. Roughly qualitative, right? This settles out at about 30 steps, but any mu, from whatever it is, five or two to a couple of hundred, itís gonna work pretty well here.
And here, this makes you think thereís a sharp minimum. Thereís not. I mean, obviously. Okay. Now if you plug in, here, mu is one plus one over a square root m, and work out what happens and do some bounding, itís not too hard, but what you get is this. That the total number Ė this whole number, right here, if you plug that in, can be bounded by Ė itís a constant times squared m log of this thing.
Now, by the way, this ratio is exactly Ė is basic Ė well, it can be interpreted this way. This is your ignorance reduction factor, is what it is, right? Because, once youíve centered with t zero, your estimate of the duality gap is m over Ė your duality gap is m over t zero. Thatís your initial duality gap. Your duality gap is a measure of how ignorant you are as to what the optimal value is.
This thing runs until that ignorance is pushed below the level epsilon, so this quantity is exactly interpreted as a Ė as an ignorance reduction ratio.
Log of that, of course, tells you the number of bits of information you have learned. Which is to say Ė yes, thatís what it means. If there were log two, this would literally be the number of bits of information you learned about it, because if someone, at the beginning, says, please tells me about the optimal value of the problem, youíd say, well, itís between my current value, my current f zero of x, here, and, and then youíd have to draw an interval whose width was equal to Ė and whose width is this, thatís the current duality gap, there. And youíd say, whereís the optimal value in here? And youíd have to say, I donít know. So thatís the width of your ignorance.
When you terminate Ė well, of course itís gonna have to be somewhere in here. Itís gonna look like Ė youíre gonna terminate this way. Thatís gonna be your suboptimal point, and the width of this will be less than epsilon, so the ratio of these is the reduction factor. And if this were log two, it would be something like the number of steps Ė the number of, like, by section steps it would take to get there. Okay. So thatís what this is.
And hereís the interesting part. It grows like square root m. So these are Ė this is, by the way, the best complexity estimate people have, to date, on these things. No one has done any better.
So if anyone Ė if someone asks you, again, whatís the Ė how many steps is an interior point method, you look at them, you look at their books, you kind of figure out what kind of a person they are. If they wanna know how well does it work in practice, what do you say? How many steps?
Instructor (Stephen Boyd):25, 30. And you look at them a little bit. If theyíre a little on the nervous side, you say 50, right? No, no. Because you want it to always be less than that, otherwise, you say 25, itís 40, and they go like, you lied to me. So you look at their personality. You could say 20 to 80 if you want.
If theyíre a complexity theorist, what do you say? You say square root m is the answer, right? And they say, doesnít it depend on the accuracy? And you go, yes, it does, but weakly, by a log here. Right?
By the way, this number here, if you want, you could just stick that in with the o, as far as Iím concerned, because basically you could just take Ė you could say, look, Iíll reduce the ignorance by 10 to the eighth. 10 to the tenth. I mean, thatís fine. Something like that is gonna be just fine, and this will be a very small number then. 25 or something. I donít know, whatever it is.
Okay. All right. So thatís the final story. Let me ask you a couple of questions about it. N, the number of variables, doesnít come in. Why not? I mean thatís the first thing you see here, is it has absolutely nothing to do with the dimension of the problem, in terms of the number of variables. Where does n comes in?
Student:It comes in in computing the Hessian.
Instructor (Stephen Boyd):Absolutely, right. So this is the number of Newton steps, and itís actually really cool that you could have this separation, where the number of Newton steps, your bound, is completely independent of n. It can be 10 variables, 10 million variables. Same complexity estimate. And so it doesnít depend on n at all.
N comes in, you multiply this to get the total number of flops or computational effort or whatever you wanna call it. You multiply this times the number, the number of flops or whatever that your computation measure is, per Newton step. Everybody got that? So thatís the total effort. So no one is saying it takes the same amount of time to solve a 10 million variable problem as a 10 variable problem.
What this is saying Ė actually, where the complexity theorists and the people who do this in practice agree is that, in both cases, it takes a number of steps. Both of them agree that that number of steps has nothing Ė that you get the same number for 10 variables and 10 million. The practical person will just say 30 in both cases, and the complexity theorist will say square root m in both cases, and it will have nothing to do with n.
So letís do a couple of Ė we can do a Ė well, in the simplest case itís gonna grow something like n cubed. Thatís a very rough number, but n cubed is a good number, because if you just blindly form the Hessian or the KKT system and solve it or something like that.
If you wanted a really Ė an upper bound, I guess you could do something like this. Just solving the KKT system by your favorite method would be this.
So if you really wanted the full thing, youíd multiply this by Ė this is something Ė youíd have something like this, and then youíd write n plus m cubed. Okay? So this would be the full case. That would be the Ė and by the way, people would then say, if you look in the literature, if you go to Google and type complexity solving, SOCP or LP or something like that, youíll find lots of papers that have an m to the 3.5, or they may call it n, because they reverse the m and n, but anyway. Youíll find a complexity Ė a 3.5 complexity algorithm or something like that. Okay?
Just a couple of comments here, just Ė I should mention. The people who Ė everyoneís fooling around trying to make a method over here, where this thing goes below square root m or something like that, where the complexity goes below square root m, and people argue about methods where this is square root m and itís other thing Ė that kind of thing over here.
However, between these two terms, considering what weíve been talking about the last three weeks, which do you think is the one where the greatest opportunities for making your method faster appear?
Iíll give you a hint. Itís not this term. Okay. Itís here, obviously. So, in fact, if you wanted to work on something, donít sit down and try to figure out some new self-concordant, exotic barrier, blah, blah. This kind of thing, where you can reduce the theta factor. I mean, unless this is the kind of thing you like to do. Thatís fine. Just donít imagine it has anything to do with solving problems faster, anything, or large ones, or anything like that.
The real Ė all your efforts should go into this, and this is just a baseline, stupid KKT matrix backslash whatever. I guess, in the simplest case, minus g zero. Thatís what this is. So everything goes in here. This can be brought down hugely by doing intelligent, linear algebra, by exploiting the structure. All right. So that finishes that up.
Letís look at how this works with generalized inequalities. So for generalized inequalities, thatís a problem where you have a conic inequality. You have vector inequalities, and these are with respect to a cone.
Now the ones that weíre actually gonna really Ė the only ones that really matter to us, in fact, are semi-definite programming and second order cone program. So these are the only two Ė the only two cones that really matter. These are the Ė the Laurents are second order cones, and semi-definite programming.
By the way, there are ways to write semi-definite programming in a classical form, but theyíre not Ė they seem to just make things harder. It can be done, however. You want me to show you how? Iíll just show you. It doesnít matter, but youíd find papers written on this.
If you have an inequality like, letís say, a Ė a of x is some affine function, and you have an inequality that, for example, says this, that thatís positive semi-definite, right? So this would be a Ė this would be an STP. What youíd do is this, is youíd take LII of a of x. This is the ith element on the diagonal of the Cholesky factor of a. It can be shown that thatís concave.
So I can write Ė actually, not only is it concave, but itís Ė I can write it this way. I can write this inequality in a classical way this way, and I can apply a barrier method to this, just like youíve done. Of course, now you have to find out the Hessian and everything of the Ė the Hessian of the ith element of the Cholesky factor of a positive definite matrix. It could be done, but you will get a headache. And actually, it turns out these methods turn out to be what weíre talking about now anyway. Okay.
But let me go back to this. Okay. So Ė and weíll assume everything holds like itís strictly feasible and Slaterís condition holds is a zero duality gap and all that kind of stuff.
So, to generalize this, what we need to do is the following. We need, since we have a generalized notion of positivity, namely itís with respect to a cone, we need to have something like a log barrier.
To make a log barrier, you need to have a logarithm on this cone. Now, if the cone is, for example, the positive semi-definite cone or positive definite cone, you need something that acts like a logarithm, and so these are called generalized logarithms, and itís this.
Itís gonna have Ė itís gonna generalize any of the logarithm, so itís gonna have a domain which is the interior of the cone, the same way the log has a domain which is the interior of r sub-plus, which is r sub-plus plus, which is the positive ray.
Okay, so the domain here is gonna be the interior of the cone. Itís got to be concave, like log is, and Ė so itís gotta be concave, so itís gotta be concave, and it has to have the following property. If you look at on a ray Ė so if you look at Ė if you examine it on a ray, so you multiply by y by s, it should look like this. It should look Ė the value of that point, it should grow like the log on any ray.
So if you go on a ray, it should actually go like the log on s, like this. Okay? This number, theta, by the way, is called the degree of the Ė is the degree of the generalized logarithm.
So an example would be something like this. If you have the non-negative orthant Ė the normal log is a generalized log with theta parameter one, for example. Itís kind of obvious.
Non-negative orthant, if you take the sum of the log it says we had before, the degree is just n. Now, the interesting thing is what plays the role of the logarithm on positive Ė a positive semi-definite cone?
And the answer is log determinant. I mean it was Ė it could hardly have been otherwise, right? So log determinant plays the role. And for the second order cone, the logarithm is the log of the last entry minus the sum of the squares of the others.
Of course, this is on the point where the sum of these things is strictly less than or equal to y n plus one squared, okay? So this is the picture, and itís on this Ė this is defined on this set.
Actually, this is defined Ė this formula is defined Ė the formula, at least, makes sense when y n plus one squared is negative, but the domain of this is restricted to this, okay?
So thatís Ė these are the logarithms, is log det. Okay. Now, you can show a bunch of things here. Iím not gonna do this, but weíll Ė Iíll just say what they are. The first one is Ė letís see. The gradient of a generalized logarithm is positive with respect to the cone, so the Ė for the ordinary logarithm, the gradient is one over x or one over y, and thatís obviously positive, but this is true in general.
So the gradient of log det Ė you should remember what that is. The gradient of log det is the inverse of the matrix, okay? So I think youíve seen that somewhere, but the gradient of log det is the inverse of the matrix and, indeed, the inverse of a positive definite matrix is positive semi-definite. Itís also positive definite.
The other one is this one, which is very interesting. It says that y transpose times the gradient should equal theta, and we should actually check that for log det. For log det, what is the gradient here? Itís actually y inverse, right? And y is a matrix, and how do you form the inner product of two matrices?
I could put a big black dot in the middle, but thatís kind of just a notational Ė what, in fact Ė how do you Ė whatís the inner Ė whatís the standard product?
Instructor (Stephen Boyd):Yes. Itís trace of this. You can put the transpose there or not, but these are symmetric, so I elect not to put the transpose. Anyway, it wouldnít make any difference. And what is it?
Instructor (Stephen Boyd):Whatís that?
Instructor (Stephen Boyd):Whatís trace y y inverse?
Student:Sigma y Ė sigma m [inaudible] so m.
Instructor (Stephen Boyd):M, thank you. Good, thank you. Okay. So thatís m. Great. And which is, indeed, the order of the logarithm, okay? So Iím just saying, these look fancy, these look like fancy formulas, but in fact weíre only interested in a handful of them, so itís kind of silly.
And for there, where itís completely trivial to verify these things. Okay. All right. So this is fine, and this all just works. By the way, the second order cone has a theta value of two. If I remember, Iíll give you a Ė I can tell you what the second order cone Ė what Ė thereís another meaning to theta, a rough meaning, geometric meaning. Iíll get to it in a minute.
Okay. Now weíre just gonna generalize everything from the interior point method. Just absolutely everything. So will log barrier? Same thing. It just looks like this. Itís Ė I form five xs. Instead of sum, minus sum log of the margins Ė negative margin Ė well, margins, I form Ė I just replace ordinary log with this kind of vector log or whatever Ė however you wanna call this. A generalized logarithm.
I do this, and this function is gonna be Ė itís gonna be convex. Itís gonna Ė that follows from the standard composition rules. This is in a more complicated case, because this is convex. This thing is concave, with respect to Ė minus f is concave with respect to this generalized cone k i, and thatís k i convex, and so on.
Okay. And then the same Ė you can define the central path, which looks like this, and this looks very much like what we had before, where this was the sum of the Ė minus the Ė itís the sum of minus log minus f i.
Actually, it doesnít look very much like it. Itís identical, this thing. So thatís good. Okay. Now you start asking questions. You can define the central path, you could Ė then you look at the original one, and you say, look, on the central path you can actually find dual points. Letís see if we can Ė if that works here too.
If you write out the gradient of this, with respect to x, you have to write out Ė the gradient of that is easy. The gradient of that is a pain, but you can get it from this thing, and the gradient of that is equal to this, and thatís equal to zero.
Now, the argument here is absolutely identical. What happens is you take lambda i is this thing. Thatís exactly the same as what we had before, except now itís in the form of a general logarithm whereas before it was a specific logarithm.
All of this will Ė is the same, and in fact, you find out, then, that if you have centered x Ė in other words, if youíve computed a point on the central path then, whether you like it or not, you have actually constructed dual feasible points.
So youíll get a dual feasible point. Thereís a formula for the dual feasible point. It looks like that. And, sure enough, if you work out what the gap, what the duality gap is, associated with this dual feasible point, you will get that Ė remember what it is in the general Ė in the specific case, the scalar case. Itís exactly m over t. Itís the number of inequalities divided by this parameter t.
Here itís gonna be the same thing. Itís sum theta i one over t. By the way, this generalizes perfectly. If each inequality is scalar, theta is one, and when you add these up, you get m over t, so you get the same thing. So everything just works beautifully.
By the way, this has lots of implications. This means that the code you just wrote for linear programming, or will have just written by the end Ė well, actually, I donít know that I announced that, but everyone got this email. If you want, you can turn in your homework like tomorrow or something. I guess that went out, didnít it? It went out. Yes.
So Ė but what it means is that it would take you about five minutes to change that code to be a semi-definite programming solver. Then itís kind of cool, because youíll actually have a solver for something that, 15 years ago, was considered highly nontrivial, and 20 years ago, would have been not recognized as an easily solved problem, and youíll have written one. Itíll be 50 lines, with a lot of comments, and itíll work really, really well. So the same thing you wrote would work there.
Okay. So letís look at semi-definite programming. Letís see how this works. Hereís the problem. Youíre gonna minimize c transpose x subject to this LMI, linear matrix inequality. It looks like that. Now, the log barrier is log det minus f inverse, here. The central path minimizes this, and if you take the derivative of this with respect to x i, youíd get the following.
Here youíd get t c i. Thatís the component here. And when you take the derivative of this, with respect to x i, you get the following. You get minus f of x inverse times, and then f Ė since f of x is this thing, you get f i, and you get a trace in there, like that, and I elect the Ė no, thatís what you get. You just get just that.
So you get this thing, and then this has to vanish. If this vanishes, here, then itís the same at that. It gives you this. Then youíd wanna get a dual point, somehow, from this, and it turns out itís nothing more than minus one over t f of x x star of t inverse. Thatís gonna be feasible for the dual. This is the dual of this Ė the dual STP. And the duality gap on the central path is exactly p over t. P is the size of the matrices involved here, so f and z.
So thatís it. The barrier method is very straightforward. In fact, letís look at it. Hereís the barrier method. You start with a strictly feasible x, some given value of t, mu positive, some tolerance, some tolerance, and youíd do a centering step, and you update, and you do a stopping criterion, and then you t times equals mu if you want. Okay?
Now, if you look at this, youíll find that itís identical, absolutely identical, to the barrier method in the ordinary case, with one change. This had been m and is now sum theta i of t. Okay?
So that means, actually, if you wrote your solver nicely enough, it means it solve Ė you can write a general cone solver, and if you wanted to, you could write something that is like STP T3 or SeDuMi or whichever of the cone solvers you choose to use or whatever, for example, when you use CBX. So you could write one, and it wouldnít be long.
It wouldnít work as well as the real thing, which has thousands of person years Ė person hours, not years. Sorry. Person years of development effort into it, but it wouldnít be bad Ė it would be shockingly Ė youíd be shocked at how well it would work, actually, if you made your own cone solver.
Just from the LP solver you just wrote. It would work not badly at all. Okay. Everything else is the same, so Iím not gonna go through the detail. I mean, the complexity analysis is the same. You just Ė this just becomes m, and nothing changes.
Then you might ask, well how does it work? Well, hereís the way it works. Hereís a second order cone program here with 50 variables, 50 SOC constraints in R6, right? So who knows, whatever. Actually, I think this might be to make it parallel to Ė no, itís even bigger. It doesnít matter. The plots always look exactly the same.
And hereís what happens. Hereís the duality gap, and you can see itís the same thing. In fact, it looks shockingly Ė it almost looks like weíre reusing figures here, but Ė because it looks too close to Ė like that, okay? So thereís a GP, and hereís an STP. Okay. And you can see things like this that, depending on what the mu update is, it can take on the order of what, 30 steps or something like that. Thatís the picture.
Hereís what happens as you vary mu. And, indeed, we didnít do this long enough, but this would have gone up at some point, like that. Hereís an STP, and it looks the same. So, again, itís like 30 steps if you optimally choose Ė if you choose mu well or something like that. Itís the same story.
Actually, itís a good Ė this is a good time to stop and actually think about what the meaning of these two is. You have to stop and understand, these are problems that, 20 years ago, would not have been recognized as even possible to solve, in some sense. Right? These would be Ė these were very complicated problems. If you absolutely had to solve one, there were exotic methods. They would take a very long time, but it was considered difficult to do.
I promise you, although we wonít force you to do it, I promise you that your LP code would be adapted with a very small number of line changes, and you would see things like this, and you would then be Ė you would have written, from scratch, a solver that solves something that, 20 years ago, was not even recognized as an easily solved problem.
Itís actually pretty impressive. So these look very simple. I mean, itís not a big deal or anything like that. Now it looks simple. It was not considered simple 20 years ago.
And thereís, by the way, still plenty of fields where people donít know about second order cones or matrix inequalities.
Student:So whereís the side on the mu?
Instructor (Stephen Boyd):On the generic ones? You mean like, for example, SeDuMi or STP T3? Iím gonna tell Ė Iíll say a little bit about how they do this. They donít Ė in fact, itís not exactly this algorithm, right? So Iíve shown you the simplest possible barrier method.
If you actually Ė what theyíre actually doing something much more complicated, which is homogenous primal duals, but in fact, if you go through it, I promise, if you look at Ė if you read any of these papers or anything like that, or actually look at the code, itís all open source, you can Ė thereís no mystery in there.
If you look at it, you will find all the things youíve been seeing. Youíll see the KKT matrix, and things like that. It might even Ė some of it Ė itíll be a little bit more complicated, and there is something effectively equivalent to their choice of mu.
For linear programming, itís done by something called the Merotra Predictor Corrector, something or others. They actually calculate two things, and itís this Ė then thereís a magic formula thatís proved to work unbelievably well, so Ė across a huge variety of different problems, right? So thatís how itís done.
I think someone else asked about that at one point. They actually kept repeatedly asking what Ė how do the high-end things update mu. And I said adaptively. Itís using that method. So theyíll look Ė you will not find, actually, if you go and look for a barrier Ė if youíd like barrier method, something like that, youíll only find things that, basically, are in my website or something like that.
So you wonít find a production Ė a fancy production code that uses a simple barrier method. In fact, if I go and tell people, this is what we use, or something, theyíre very Ė theyíre like, thatís so early Ď90s or I donít know. I mean Ė so theyíre snobby about it, because theyíre moved on to the super-fancy homogenous embedding primal dual blah, blah, blah methods.
Which, by the way, guess how many steps they take? Kind of the same number, like 20 or whatever. So itís funny. Theyíll say, well yes, but our methods take 20 and yours take 35, but do you remember that second term? The second Ė thatís Ė the second term is related to how smart you are in linear algebra. Of course, that makes the huge difference, right?
Okay. Hereís a family of STPs, just generating STPs of different sizes. By the way, this thing, I think itís added here just to show you that we wonít lie. That shouldnít be there. I have no idea what that is. It should not be there. So Ė but youíre seeing it. It should kind of look like this. It should settle out at like 30 or something like this, and there should be Ė and the variant should not be high. Here.
So this is just a picture of what Ė how the STP complexity grows. And remember what a complexity theorist says. A complexity theorist says, this grows like the square root. Like that. Thatís the bound.
We will finish up. Weíll answer your question. And this is Ė Iíll just make some vague comments about it. I claim, if you fully understand the barrier method, which it seems to me you should, because you will have looked at it and you will have implemented one, at which point, it seems to me, you understand it, assuming it works. Youíre actually ready to look at Ė I mean, if you care to, youíre ready to look at sort of the world of advanced methods.
So the real advanced methods, the ones that are currently in fashion are primal dual interior point methods. These are described in the book. You can look at them. Theyíre actually even shorter. They donít have inner and outer Ė they donít have inner and outer Ė you donít have centering steps and then update t. Itís just, literally, one iteration, and the idea is pretty simple. They just Ė itís basically the same as updating t at each iteration instead of what weíre doing.
What weíre doing, in a barrier method, is something like this. Is weíre starting here. We set a target point, we donít know where it is, and now we Newton for a while and end up over here. Then, once weíre there, we reset our target. We times equal Ė we t times equals 10, and we go on a new Newton adventure and land here, and thatís what weíre doing.
Now itís kind of obvious that those last four Newton Ė those last couple of Newton steps here are a complete waste of time, because weíre now polishing off, weíre getting accuracy in the sixth through twelfth digits of the centering point, which has nothing to do with what we really wanna do.
What we really wanna do is calculate this point. So itís completely obvious that this Ė that our method Ė whatís shocking is that it still works, is that you could shave like a whole bunch of iterations off right away, and you could do this by incomplete centering, for example.
In other words, you stop when you get close here, or another way to do it is actually to increment t as you go. And, in fact, if you read fancy methods about it, which you are now, I claim, fully prepared to do if youíre so inclined, you can read Ė you can read about all the fancy methods, and all the fancy methods will have a picture that looks like this. If they donít, they should. Actually, they donít, but some will. Okay?
So thatís the central path, and the goal Ė this is Ė youíll have some measure of being off centrality, and theyíll have all sorts of measures. You can look at these things. Again, if you want to, I claim you will see everything youíve seen. All the log barriers, the gradient, the Hessians, everything, the KKT systems. Itíll all be there, and theyíll Ė this, they call a neighborhood of the central path, and the idea, now, is to just sort of make steps that go like that.
And so the idea is you stay in the Ė in the neighborhood of the central path. So thatís what this looks like. Something like that. And so youíll see these things in the fancier methods, and the primal dual ones as well. They sound fancy and all that, but in fact, if you look at them, it Ė the effort, each step, is identical to the effort youíre solving a KKT system. Itís not a big deal.
If you profile a code that is running one of these methods, they take 20, 25 steps, all itís really doing at each step is solving a KKT system 20 times. Thatís it. Maybe 30. So thatís it.
Okay. By the way, youíve been using one or two all quarter, because youíve been using CBX, which compiles your problem to a cone program with second order cone and STP constraints, and then it calls STP T3 or SeDuMi, one or the other. Each of those is just one of these fancy primal dual interior point methods. Thatís all it is.
And you could probably even look at some of the diagnostics coming out of it, which you probably havenít looked at before, but you can actually start Ė now youíll know what it means when it Ė youíll see all sorts of things. Youíll see dual, residual dual. Youíll have the primal feasibility residual. Youíll also see things like when it Ė at least, when it reports the number of steps, it shouldnít be more than 60 or 80.
I donít Ė it can be up to 50 or something like that, and itís often 18 or something like that if itís a simple problem. But now you can start looking at the stuff that was spewed out.
Okay. So what weíre gonna do now is Ė actually, Iím just gonna Ė weíre gonna finish, actually, today, and Iím just gonna give some of the conclusions, and Iíll just say, I donít know, wrap a lot of this up.
I think some of this was Arguis did in the section yesterday. So I mean the main thing here is Ė I mean the focus, my focus, has been on modeling, so thatís, to me, the Ė I think the interesting part is basically what are the problems you can solve using these ideas.
Well so it should be completely clear by now if it wasnít before that thereís lots of problems in engineering design, statistics, machine learning, economics. These Ė they can be expressed as mathematical optimization problems. I mean thatís Ė thatís clear.
A lot of people take that and say, no, no, no, thereís Ė everything is multi Ė whatever, multi-objective. And well, so what, okay, no problem. So thatís fine. So the main complaints people mostly make about this is something like, if you go to some field where theyíve had bad experiences with optimization, and by the way, there are a lot of them. A huge number of them. So itís extremely common to arrive in a practical field, talk to people, and they go, like, we tried optimization in the Ď60s. It really, really sucked. And then they give you a long story, all of which is kind of misconceptions, but it wasnít helped by the people from optimization who went to try to help them, right?
So it would be Ė theyíd say things like, well, we had Ė our constraints are not fixed, and youíre like, well, we can handle that, and then they say things like, well, the other thing is thereís uncertainty in our data, and when we optimize, then we plugged it back into the real thing, and when you change coefficients, the whole thing doesnít work anymore.
Well thatís just called bad optimization. Okay. Now, the important part is actually tractability here, so you can write down any Ė anybody can write an optimization problem down. It is a normal part of intellectual development to realize this and to go into kind of a Ė to go into a state where, for the next two weeks, you look up, and you go like, my God, everything is an optimization problem. Absolutely everything. Itís amazing. Every class Iím taking, every lesson Iíve ever taken, everything I might ever wanna do is an optimization problem.
At some point, you calm down, or not, because some people are arrested in that cognitive state. But if you Ė you calm down when you realize, yes, but so what? It turns out, most of those problems you canít solve.
So this is very rough, but I would say tractability kind of requires convexity. I mean thereís Ė this is very rough. Thereís people who would argue with this. Itís extremely easy to come up with Ė come up with counterexamples either way, okay? So non-convex problems that you can solve with low complexity. I mean, one would be this, you know.
I mean, if I asked you to maximize x transposed p x subject to x transpose x equals one where p is a symmetric matrix, I mean, thatís a perfect example where itís not convex by a long shot because I didnít say p was positive Ė negative definite, if Iím maximizing, and this constraint is, for sure, not convex, and yet you can solve this very easily. The solution is whatever the maximum eigenvalue. The maximum eigenvector of p, okay?
So thereís an example of a non-convex problem that you can Ė thatís tractable. The point is that thereís very Ė thereís just an isolate Ė a very isolated, small number of these things, okay?
And there are other ones that are combinatorial optimization problems that are solvable and so on. Okay. So Ė but itís not too bad to say something Ė to make a statement like that. As a zero authoritor statement, I would stand by this.
Now, if you have a method for non-convex problems, these find local or sub-optimal solutions, or are very expensive. This is called Ė this is global optimizations, if you guarantee to find the one Ė theyíre very Ė they can be very expensive.
By the way, these absolutely have their uses. I mean so these are extremely useful in lots of areas. If youíre doing circuit design or something like that, obviously, if you can have a convex formulation and assert that what you just found is not just the best circuit design you could find, but the best anyone could find, thatís a better thing Ė thatís better than not being able to make that assertion, but itís very useful to be able to say, I just reduced the power of your circuit 30 percent.
And if someone says, is that the global solution, youíd go, I donít know, but thatís Ė 30 percent is 30 percent. Itís great. So itís to be understood that these are very, very valuable.
The other real thing is this. Until, I donít know, 15, 20 years ago, it was kind of presumed that very few problems were convex. Thatís because no one was looking for them, but it turns out this is just not the case.
So letís say a little bit about the theoretical consequences of convexity. Sort of the one you learn on the street would be this, right? Local optima are global, okay? Thatís the thing youíd learn on the street.
The interesting thing is duality theory for Ė in convexity actually is now useful. You have a very extensive duality theory, and this is just a systematic way to drive lower bounds on optimal value, so -- by the way, thatís even for non-convex problems. That Ė thereís a lot of stuff going on on that right now.
You get beautiful necessary and sufficient optimality conditions that extend the usual ones, the Lagrange multiplier things you see in your Ė whatever your second class on Calculus or whatever.
And you get interesting ideas, like when a problem is infeasible, you actually get a certificate proving infeasibility, and then these are related to sensitivity analysis, all the duality.
Okay. More interesting, perhaps, are the practical consequences. So I would even put this Ė I would actually put that back in the theory, and since this is practical consequences, Iím gonna go like this. Iím gonna go like that, and Iím gonna even be stronger and say that. 30 to 50 steps. Thatís the number of steps it takes, period. I mean thatís fine.
Now the other thing you should know is that things like basic Newton, theyíre very easy to implement, and they work for small and larger problems, and they work for larger problems if you exploit structure. In fact, the whole point, obviously, if youíre doing this implementing an LP solver, is just to demystify it.
Would you ever use your Ė the one you just developed? No, because you put two hours into yours. Is that fair? Three? Two? Two. I think less. I mean, I think if you didnít make any mistakes, itís less than two. So you have one per Ė well between one and two person hours into it, and trust me, STP T3 has a whole lot more.
On the other hand, actually, there are cases where your little LP solver, you might wanna use. You might wanna use your little LP solver, for example, if youíre doing a real time implementation, and you wanted to solve an LP, letís say, in a millisecond. A little small LP, right?
The problem is that thereís so much junk in all these big ones made for solving huge, large-scale problems, that by the time all the libraries and things are loaded in, a millisecond is gone. Thatís an exaggeration, but itís close. By the time all the data structures are set up and all the memory has been allocated, a millisecond is gone. Not quite, but anyway. You actually might use your little baby thing, because in fact it would Ė if you wrote that in C, and all that, and maybe LA PAC calls and all that, you would have something that would run super-fast.
So, there actually Ė thereíd be no reason not Ė in that case, you might actually use it. However, the point was just to demystify it. To say that Ė to point out to you itís Ė these are not very hard Ė very simple methods actually work shockingly well.
Okay. So let me Ė letís talk a little bit about how to use convex optimization. So letís see. So what you would do is something like this. If youíre in some applied Ė I guess, in this class, weíve focused on problems that actually were posed more or less exactly as convex problems, because you start there, right? Thatís where you start.
Then, later, you branch off. Once youíre kind of comfortable with that Ė I mean, we donít give you problems that are not convex or whatever. I mean, with a couple of Ė well, we donít. So, after a while though, donít get the idea that it only works if someone throws in some non-convex constraint that youíre Ė that thereís nothing you can do and all that kind of stuff, and throw everything out, because thereís all sorts of cool stuff you can do.
You can ignore Ė you can comment out constraints you donít like. Thatís got the fancy name of relaxation. Thatís one option. So if thereís a non-convex constraint you donít like, you could relax it, comment it out. And the other thing is, of course, you can just do approximate modeling, right?
If something looks like Ė if something looks like that and has this small section where itís non-convex, you can fit a convex thing to it. And of course, what should you do is start with simple models and small problem instants, and basically an efficient solution. That is, donít attempt to exploit any structure.
So if youíre doing image processing, you start with 30 by 30 images, obviously. Because if you canít make what you want happen in that 30 by 30 image or something like that, then thereís no point worrying about how to make it run for a one k by one k image. I mean just for example.
Now itís always useful, when you work on one of these, is to work out, simplify and interpret the optimality conditions in dual. Actually, what very often happens is you get a beautiful, commutative diagram, and it wonít be the one you think of, or youíll get a really interesting thing.
Youíre working on some problem. Youíll work out the dual, just mathematically, just you turn the crank. Form the Lagrangian, minimize it from the dual function, write the dual down. You stare at the dual for a long time. It is very often the case that the dual has a practical Ė has some kind of other practical interpretation.
So, for example, this is universally true in areas like finance. Itís true in mechanical engineering, electrical engineering, almost any application I can think of, and some of it is something really cool. Like youíre doing experiment design, and the dual turns out to be one geometric problem, like a minimum volume ellipsoid thing, and itís just kind of Ė so whether or not that helps you with your original one doesnít matter. Itís just kind of cool to know that this is the case.
So, for example, if youíre doing some maximum likelihood problem with an exponential family, you work out the dual, and youíll find out itís a maximum entropy problem or something involving Collac Liveler diversions, and youíll just say, well that Ė the last you could say is just, thatís cool, because youíve just connected one practical problem to another.
Whether or not it has any practical advantage for you is another story. Okay. Now the other important thing I wanna mention is the following. Even if the problem is quite non-convex, you can use convex optimization. You can use this in sub-problems, and you can actually repeatedly form and solve a convex approximation turn point.
And Iíll just say one little bit about that, because these things work shockingly well. I would have snuck one of these into a homework exercise, but the class ended, and I was told I couldnít assign anymore homework, so Ė but I will Ė Iíll show you what I mean by that.
Letís just take Ė letís just suppose Ė yes, I mean, itís something like Newtonís method, but itís kind of cool. So thatís it. You have a set of Ė a huge set of non-linear equality constraints, okay? But we also have all the other things. I have a polyhedron like this. I wanna minimize, letís say, f zero of x. Letís say thatís convex. These are convex, and the only problem is I have some set of non-linear equality constraints.
Now, you know this. This takes you outside of convex optimization instantly, because the only equality constraints allowed are linear, okay? So this is not convex.
You encounter this and you say, well I canít solve that, and you go, yes, but you took that class. It was 10 weeks, and it was like 10 hours a week of work, and you canít solve that? You know what Iím saying.
All right. So the ways to do it Ė I mean this is Ė what Iím talking about now, these are street fighting methods, so no Ė youíre not gonna get a theorist stamp of approval on this, but Iíll tell you how this Ė what you would do. Itís very, very simple.
You start with an initial point x zero, okay? You go over to f, and you get an approximation of f near x zero, okay? And you could do this many ways. You could take the Jacobian. That would be Ė thatíd be kind of the 17th century approach, and indeed, all the way through the 20th century approach, 21st even in some departments.
Okay? So you could Ė Calculus has been used for 400 years. Why not? Calculate the Jacobian and make an affine model of f. What you Ė you replace this with the affine model, okay? And you add one constraint very important to this, which is this. So I replace this with f of x zero plus d f of x zero, thatís a matrix, times x minus x zero, and I say thatís equal to zero.
Now whatís cool about that is thatís a linear constraint, and I add one more thing, which is this. I write x minus x zero in some norm is small, like something like that. Okay? Everybody see what Iíve done here?
This is a trust region constraint. This says, donít consider x very far from where you are, and the reason is, although such xs could be quite good, they Ė this model will no longer be good. So youíd do this. Now, whatís gonna happen here is this might be infeasible, and so you might add a residual term or something like that. I wonít get into that, but youíd be shocked at the number of highly non-linear problems that will yield to 15, 20 steps of this.
Now whatís nice about this is the following. All the convex stuff, like all these inequality constraints and the f, thatís being handled for you by convex optimization, so you donít even have to worry about that, okay?
So these things work Ė can work really well. Have you computed the global solution? Does it always work? No. Have you computed the global solution? No. And so on, and so forth, but they work really well. This is called sequential convex programming. Thatís one way youíve heard of it. You may hear of it from that Ė with that title.
Okay. So Iíll just say a little bit about some of the last topics, and then weíll quit Ė or topics that we didnít cover. So we didnít talk about methods for very large-scale problems, so thereís a whole world of that. Some if itís covered in 364 b, so sparse stuff will take you up to 10,000 variables if itís Ė if the Gods of sparse matrix orderings are smiling on you and your problem, you can take that up to 100,000.
Now, of course, if a problem is banded, for example a signal processing problem or something like that, you can take that to a million variables. You might even sometime soon.
Anyway, so banded Ė banded problems, you would need to Ė I mean thatís this class. Thatís not an advanced topic, okay? However, thereís really Ė thereís big ones for iterative things, and theyíll take you to the million variable, 10 million variable realm. Thatís for Ė thatís 364 b.
We didnít talk about sub-gradient calculus and convex analysis too much. Thereís other huge families of methods that are quite pretty. Theyíre very nice, but they donít seem to me to be in a course where the focus is on modeling and actually using things. They do absolutely have their uses, so these are sub-gradient methods and things like that.
And we didnít talk about distributive convex optimizations, so everything weíve been talking about is localized. So, in other words, they run on Ė of course, you can distribute your linear algebra, and that would be fine, but in fact, thereís beautiful classes of methods that work, for example, for network flow and things like that, and you just get distribute Ė they distribute perfectly. These are in 364 b.
So I think, at this point, weíll quit. Iíll thank the crew of TAs, some of Ė several of whom are sleeping, I presume, so Ė at the moment, but I guess theyíll get the message later. So, boy, the put a lot of effort in. So just remember, when youíre doing the final, if you think youíre putting a lot of time into it, trust me. They put a lot more time into it. Anyway, so I have to thank them, and thank you for putting up with 10 weeks of this, and weíll quit here.
[End of Audio]
Duration: 75 minutes