Instructor (Stephen Boyd):Well today, weíll finish up Newtonís method, probably wonít take up much of the day though. But Ė then, weíll move on to our absolute last topic, which is Interior Point Methods for Inequality Constraint Problems. So weíre studying methods, Newtonís method, for solving the following problem. You want to minimize f of x, which is smooth, subject to a x = b. So weíre assuming here that a x = b is feasible. I mean, otherwise, the problem is, the whole problem is infeasible; and that we have a starting point x zero that satisfies a x zero = b. So we assume we have a feasible point.
And weíre actually gonna Ė the first method weíre gonna look at, which is the basic Newtonís method, is a feasible method. That means that every iterate will be feasible. Theyíll all satisfy a x = b. The reason for that is simple: The Newtonís step, at a point x is the solution Ė itís actually the Ė this is the Newton step. Itís found by solving this set of equations here and thatís the Newtonís step; we looked at what it means last time. Now the second block equation says a delta x Newton = zero. So it basically says that this search direction is in the Null space of a.
Now what that means is very simple. It says that, if you have a point that satisfies a x = b, and you add any multiple, thatís the step-length times an element in the Null space of a, obviously you continue to satisfy a x = b. So feasibility is guaranteed, all iterates are going to be feasible no matter the step-lengths, and so on. This is the basic Newton method. Okay. So Newtonís method looks like this, with equality constraints. I guess youíll be implementing one; some of you may already have. In fact, we know some of you already have because youíve written with questions. So. But all of you will be shortly writing up a Newtonís method with equality constraints.
So it works like this. Youíre given a point in the domain of f, of course, with a x = b and you compute this Newtonís step, thatís by solving this set of equations. This set of equations has got lots of names; itís Ė I guess, mostly this KKT system, is what this is because thatís Ė nah, itís not the KKT system, this thing. I guess this is called the linearized KKT system; thatís called the KKT matrix because thatís where it comes up. A lot of people would just call it the Newton system. Do you solve? You compute the Newton step. If the Newton decrement is small, you quit. Otherwise, youíd do a line search by Ė you just do a standard backtracking line search.
So you try t = 1, if that gives you enough decrease, fine. Otherwise, youíd choose beta times t, beta squared and so on Ė beta squared times 1, otherwise known as beta squared. And then, you finally update. Now this method has the following attributes. The first is, of course, itís always that point x is always feasible because every step, everything direction, all of these things are in the Null space of a. And if the original a satisfies a x = b, thatís preserved because of this. So itís a feasible method. And of course, itís a descent method, so f goes down with each step. And thatís ensured by the line search.
Okay. Now this method is affine invariant, thatís easy to show. If you change coordinates, itís Ė you generate exactly the same iterate, so you get a communicative diagram there. It has all the features youíd expect to see of Newtonís method, which is the affine invariants. Whatís interesting about is, in fact, this is identical to applying unconstrained Newtonís method to an eliminate Ė after you use elimination of variables, linear elimination of variables, so itís absolutely identical. That means Ė it means we donít have to have a new convergence theory or anything like that. You donít have to see a new proof because youíve already seen it.
It also means that all the practical inform Ė stuff youíve seen about Newtonís method, which is like the fast convergence, the quadratic, all that sort of stuff, thatís absolutely preserved here. Okay. So you might ask if itís the same, why do you have to go to all the trouble of formulating it this way, and not by elimination of variables. And actually, itís a good question and Iíll answer it right now. The reason is this: Thereís no reason Ė thereís a lot of people who would think eliminating variables is better, and thereíre actually cases where it is better. And Iíll tell you, right now, when and why it is not.
If this matrix here has structure and then you eliminate variables, eliminating variables can destroy the structure. So actually, if thereís one thing you should take Ė a couple of Ė you should take away, like I donít know Ė weíll make a little tri-fold card. Thatís too much, even. Weíll make a little pocket Ė a wallet card of the things you should remember when you leave this class. But one of them should be it is not always better to solve a smaller set of equations. In fact, in many, many cases, itís far better to solve a much larger set of equations, provided A.) Those equations are structured and B.) You intelligently exploit the structure.
Okay. So I mean, these are kind of obvious, but youíd be shocked how many people donít know these things, or know them, but kind of donít apply them to their everyday life, or something like that. Okay. So all right. This is Newtonís method and youíll be implementing one that works and so on and so forth, has all of the attributes youíve seen, Selfkin coordinates all identical. Then you get an absolute complexity-bound on the number of steps required, and so on.
Thereís a variation on Newtonís method thatís very interesting. It wonít play a big role for us, but itís actually important to point out. So here it is. Thereís a variation on Newtonís method called an infeasible Newton method. And in an infeasible Newton method, itís actually very, very interesting how it works. You want to minimize f of f x subject to a x = b, something like that. In the feasible Newton method, x Ė the original x is feasible and then every iterate after that is feasible. So you will always satisfy a x = b.
Thereís another version of this, which works like this. Youíre actually allowed, for intermediate steps, to actually not be feasible. So I can start with any x I like. So hereís an example. Youíre solving this one, or you will solve shortly, on your homework, this. Youíll minimize minus sum log x i Ė I think. I canít remember; is this what youíre solving? I think it is. Yeah, something like subject a = b. Okay? So actually, finding a Ė I think this is what your solving, right? Yeah, it is.
So here, in an unfeasible method, youíre allowed the following. By the way, in an infeasible method, you cannot be outside the domain of f. So you can have Ė you must have x positive, actually, with a feasible method as well, obviously. You must have x positive at each step. However, you are absolutely allowed to have a x = b violated. So in other words, you can start with x zero itís just like 1. Okay? Then youíre nicely in the domain here, but for example, unless youíre very lucky, you donít satisfy a x = b.
Now if you have a method like that, thereís a couple of things that get trickier. The convergence gets a little bit trickier, not much. But the first thing is you cannot have a descent method. In other words, in this problem, you cannot expect the objective to go down because the objective actually is meaningless until youíre feasible. So some kind of combination Ė what should go down is gonna be some combination of the objective or some measure of sub optimality and the residual. By the way, once x is feasible, then it makes sense to talk about the objective.
But until then, it hardly matters. And indeed, itís clearly not a feasible, sorry, a descent method because if you start with an x for which f is very small, smaller than the optimal value, itís gonna have to increase as iterations go. Okay? So watching this algorithm, youíre gonna want to look at two things. Youíre gonna want to look at things like a x minus b. Thatís the so-called primal residual, and the other thing you look at is the dual residual, which will give you the optimality conditions. So here, the way it works is this.
In this case, itís no different, you simply linearize the optimality conditions and you get a set of equations that look like this. By the way, itís identical, if youíre feasible, this is identical to the Newton direction for a feasible point because in that case, the primal residual is zero. Okay? So whatever this is, this Newton step at infeasible point, it is a generalization of the Newton step when youíre feasible. Okay? However, it allows a non-zero point here and this Ė so you can actually Ė letís see; we can actually work out a couple of interesting things.
We can see the following: If it always the case that x plus delta x Newton is feasible and you can see that directly by this bottom row. This bottom row says x delta x = minus a x minus b. So in other words, if you multiply Ė well, like well, multiply it by this, I get a x plus a delta x Newton. And from this bottom equation here, a delta x Newton is minus a x plus b, and Iím done. Okay? So a x is b. So if you take a full Newton step, you will be feasible. Okay? By the way, in this method, once Ė if you ever take a full Newton step, the following iterate will be feasible. Once itís feasible, the Newton step youíre calculating by the infeasible start is identical to the Newton step for the feasible Newton method.
Okay? So all right, thatís actually a good thing to know. Iíll say a bunch more about that. So in other words, what happens is Ė we havenít even gotten to the infeasible-start Newton method Ė but whatíll happen is, if you ever take a step of size 1, a full-undamped step people call that, then the next iterate will be feasible and all future iterates after that will be feasible. So youíll achieve feasibility. Okay. And letís see what Ė and people, you can interpret this many, many ways.
You could say, simply, that youíre linearizing the non-linear optimality conditions. You can also Ė maybe thatís the best one; thatís maybe the simplest way to do it. Right. Some people would also call this a primal-dual Newton method because youíre updating both Ė youíre updating both the primal variable and this dual variable at the same time. Of course, thatís true of Newtonís method, as well. But still, I guess thatís the idea. So this infeasible start Newton method looks like this. You start with a point; it has to be in the domain of the function, but it does not have to satisfy a x = b. Thatís Ė it does not.
And you start with some initial Legrange multiplier; actually, it turns out the initial Legrange multiplier is totally irrelevant, but that doesnít matter. You compute the primal and dual Newton steps; you compute these. Then, hereís the important part. You cannot do a line search on the function value; makes absolutely no sense. In fact, in the feasible Ė infeasible start Newton method, the objective can, and will and, in fact indeed in some cases, it must increase. So it canít go Ė and the reason is very simple. Well here, Iíll just give you a Ė letís just do a really, really simple example.
Here. Hereís your equality constraint. Gives you Ė thatís the feasible set. And hereís your Ė here are the sub-level sets of your objective. So here, Iíll just draw the optimal point; thereís the optimal point. Okay? What if you start right there? Well f is now at its unconstrained minimum; thatís the optimal point. So f has to go up, going from here to here. So obviously, a descent method is not gonna do the trick, I mean, at all. Right. By the way, in a problem like this, if the function were quadratic, what would the Newton step be here? The infeasible start Newton method Ė step, what would it be from here?
Instructor (Stephen Boyd):What would it be? It would be Ė the step youíd take would be 1, but the Newton step, which is actually the increment, would point exactly to the optimal point because what youíre doing is, youíre making a quadratic approximation. The functionís quadratic, itís not an approximation. Then youíre solving the quadratic problem by the linear equations, and you will get this. You will take a step of one, and youíll converge to global solution in one step. If itís non-quadratic, it might not point there, it might point you know, over here. By the way, the infeasible start Newton method will always point to point on this line because if you make a single step of size 1, you are feasible, and therefore, youíre on this line.
So it might even point Ė I mean, I donít know, seems unlikely, but you know, it could point over here. I mean, I could make some sick function that would do that. It would point over here. But if you made a full step, you are then feasible and all subsequent steps will be on this line Ė will stay in this line. You make a half-step; you might be here, and so on. So thatís the picture. You must Ė the line search is actually done on the norm of the residual and let me tell you what that is. The residual is r dual and r primal. Itís like this. Itís this norm. Okay?
So this is; I guess people call it a Marek function or Leethanol function for this process. Thatís a function, which proves convergence of something by Ė in this case, it proves that r d and r p goes to zero. It certifies r p and r d going to zero because itís a function that is, you know, non-negative, non-negative positive only and zero only if youíre zero, and goes down at each step. So this is what goes down. Itís not very different. And then r primal and r dual, thatís just these. No, Iím missing, thatís not quite it. Sorry. This Ė you have to have plus Ė there it is. Sorry, itís right there.
Thatís Ė this is r dual and thatís r primal. Okay? So when you run one of these, youíll want to look at r primal and r dual. Hereís what you want to see. You want to see r primal go to zero and stay zero. Actually, if it goes to zero and later becomes non-zero, you have a bug. And then, r dual will then, very rapidly, go to zero. Okay? Itíll, in fact, quadratically in the end-game. So you have the norm. By the way, itís the norm, it is not the norm squared, here. So just to make a point here. Obviously, both of those go down but if you worked out the following, if you worked out the d d t of the norm of the residual as a function of the step-length, at t = zero and in the Newton direction, you will actually get, I believe itís minus 1 or something like that. Or itís actually minus the residual. There you go, itís that. I havenít Ė this is worked out in the book, but you can work this out; itís this thing. What this says is that, if you take a step of alpha, you ex Ė sorry, if you take a step of t, the predicted decrease is simply this. It just scales the residual down; thatís your predicted decrease. You have to degrade that because this is a lower bound by convexity; itís a lower bound. And you scale it back and alpha can be any of the constants you like, point 1, point 0 1. Actually, it appears to not have any diff Ė make any difference what value of alpha you use, over a huge range. So this is what goes down as a residual. Thatís your line search. By the way, sometimes when you get feasible, you can switch your line search back to f because now, it coincides precisely with the original Newton method and so on.
Okay. This is what I was trying to Ė this is Ė I was trying to recover this, here so that was the Ė this is the descent cond Ė thatís the directional derivative of this at t = zero plus is this thing. Okay. Actually, you donít have to worry about all these things. This is just sort of to get a rough idea of how the method works. It is Ė this is actually a very, very good method choice, in some cases. If you have a point thatís like not quite feasible, or something like that, but itís easy Ė for example, if youíre doing a so-called warm start. Warm start means youíve solved a problem just like the last one, except now it changes a little bit. So youíre doing Ė doesnít really matter what youíre doing Ė but you just solved a problem just like the last one. The data changes a little bit; one more piece of information comes up in Ė comes in, in an estimation problem, something like that. One more data point becomes available; one more time-step has evolved or something like that. Then what you do is; you can use this with warm start. And in fact, if youíre lucky, it would be something like, you know, just two Newton steps, or one would get you right back where you want, incorporating the new data. So if you want to do an online estimation, update stuff as new information becomes available, for example in maximum likelihood, this would work very, very well. Okay. Letís talk about solving KKT systems. This is very important because well, I guess a lot of people donít seem to know it. I guess people who do these things know it, but then Ė I donít; they donít get around much or something like that. So to solve a KKT system, and thatís for infeasible start Newton method or for standard Newton method, with equality constraints, you have to solve a system of equations that looks like that. And thatís called a KKT system just because this thing comes up. By the way, it doesnít only come up in optimization; it comes up in mechanics; it comes up in Ė comes up in tons of places. Iíve seen it Ė I guess I opened the first book of some book on simulating flows in porous media and, like on Page two, I saw this. So I mean, they didnít call it a; they didnít call it h; but it was, basically, this system. So this system of equations is probably worth knowing something about. So hereís how you could solve it. I mean, the first thing is just treat the whole thing as dense; itís m plus n. Right? Because this is, thatís m by n and thatís m high. And so in the worst case, you just commit to this and solve that, just using absolutely standard methods. Which is to say, in other words, this is the h Ė letís see if I can get it right Ė h, you know a prime a and then zeroes of the right size backslash v w. So thatís youíre Ė well, that was all wrong. Minus and this should be g h. Okay? So this is Ė thatís your just default basic method. This will simply treat this matrix as dense. It is true thereís a block of things here that are zeroes and this is fine. Itíll do whatever pivoting it needs to make this stable and itíll Ė it may even do the right thing. One thing it cannot possibly do is, it canít do a Cholesky factorization on this This matrix is symmetric but it is, for sure, not positive definite because the zeroís on the diagonal here. And in fact, your homework which we Ė I forgot to announce that. Weíve backed off and said that you donít have to do prob Ė is it 10.1b, I think it is? Yeah, so. Some people have already done it and we heard from them, last night. Anyway. Were you one of them?
Instructor (Stephen Boyd):No, okay. So all right. So itís certainly not positive that you canít use a Cholesky factorization because this has negative eigen values. Okay so if you Ė one way to do this, probably the simplest, is just an LDL trans Ė I mean sort of the correct method, if youíre just gonna solve the equations, is LDL transpose factorization. Thatís a generic factorization for a matrix which is symmetric, but not positive definite. If it was positive definite, of course youíd use a Cholesky factorization and youíd use a LDL transpose. Now in fact, there are Ė thereís a whole sub-field now on KKT solvers. And you can go to Google and type ďKKT solver,Ē youíll get maybe 30 papers on it, and codes if you want.
So this comes up enough that itís studied, carefully. But if you wanted to use a generic methods, youíd use something like an LDL transpose factorization. By the way, this might not be a bad idea, if this is sparse like crazy, here. So if your a is huge and sparse, if h is sparse Ė and what does h sparse mean, in terms Ė about f? I mean, actually itís not all uncommon that h is diagonal in these systems. In fact, I guess thatís your homework. In your homework, h will be diagonal. Okay? So then this thing has just got tons of sparsity; itís a screa Ė I mean structure; itís just screaming at you. If you have a di Ė you know, ignoring it, thatís a little block of zeroes. Youíd ignore that; thatís maybe okay.
But a giant, like diagonal matrix just sitting, there and if you ignore that structure, thatís kind of irresponsible. Thatís over the line, I think. Anyway, LDL transpose factorization will probably work quite reliably here and would take you up to some very, very, very large systems at this work. You can also do it by elimination and the elimination is interesting. It works Ėitís Ė well you eliminate the first block and you get this. So the first thing you have to solve is this equation. This is the Ė one of these youíll see is the Schur complement is going to come up, somewhere. Well, I guess this is the Schur Ė well, one of these is the Schur. Thatís the Schur complement.
This is zero minus Ė itís zero minus this times the inverse of that times that. So that is the Schur complement. Actually, itís the negative Schur complement because thereís a minus sign there. Hereís whatís interesting if you do this method. This is interesting in cases where h is easily inverted. If h is dense, and thereís 1,000 variables and 100 inequalities, this is actually not at all a good method to use. You might even just use LDL transpose, here. Well, I donít know. This would be equivalent to it. However, if h is, for example, diagonal, meaning that f is separable, then this is Ė h inverse is a non-issue, if h is banded.
If h is sparse enough, h inverse is something very, very easily computed. Actually, I should say correctly. Although, thatís what people would say; they donít mean it. You donít actually compute h inverse, obviously. What youíd do is youíd have a Cholesky factor of h sparse Cholesky factor, hopefully. And you would back substitute the columns of a transpose. Okay, but you know that, so I wonít say that. So people would still say form this, when youíre not really Ė you do form this matrix. What you donít form is you donít form that, for sure.
Okay. I mean, unless h is diagonal, then you would form it. But if h is banded, you would not because h inverse is full. Okay. So youíd form this. The interesting thing is that, the Ė when you solve the reduce system, you get a negative definite system of equations. I flipped the sign on it because if itís negative definite, it means you flip the sign, itís positive definite. And this is the Schur complement Ė well, assuming h is invertible, then this can be solved by Cholesky. So in fact, if you solve this by elimination method, you will have two calls to a Cholesky factorizer.
The first one is the h, itself. By the way, that might not even be a real call, if h is diagonal then thereís no call because diagonal is trivial. But in other cases, thereíd be a call that youíd do a Cholesky on h. Youíd form this matrix; thatís the Schur complement and then, youíd actually solve this by some kind of Cholesky factorization. So thatís the Ė so I guess people Ė so thatís another way. So people would call that, maybe, the elimination in Cholesky, would call Ė but youíre gonna make Ė you may make Ė you may be making up to two calls to Cholesky factorization.
Okay. Now, in some cases, h is singular. But in this case, you can do things like this: You can add any multiple of a transpose a with a q in the middle like this. Solve this equation, and itís the same. So sometimes, itíd be something like h would be all positive and, like, one or some little block of zeroes, it doesnít matter. But this can be done. Thatís all subsumed in solving this set of equations intelligently. So letís look at an example. It happens to be exactly the example that youíre gonna solve on your homework. Of course, that does imply that all of the source code for this, you can find because itís all posted. So. I think every example in the book, all the source code is there for you. If you want to see how we did it, you can but weíd encourage you to try to do it yourself. But thereís no reason for you to bang your head on the wall, so youíre welcome, at some point, to look at ours. See if you like ours. Actually, you can find it anywhere; itís tons of places. Okay. So letís look at Ė so here, you want to minimize minus the sum log x i. Thatís the so-called log barrier for x strictly positive subject to a x = b. We have a point x, which satisfies a x = b and x is positive. So we start with that, if we do the prime, the whatever you call it, the standard Newton method. Now, the dual problem for this is this. Itís maximize minus b transpose new plus sum log a transpose new plus n. So. The n, of course, doesnít matter but if you want the optimal value of this equaling the optimal value of that, then it matters. And in this case, the function in the Legrangean, which involves log Ė itís a x plus Ė itís actually the sum of the logs plus nu transpose a x minus b. That Legrangean is strictly convex in the x i and that means you can, if you solve the dual, you can recover the primal optimum from, by minimizing the Legrangean. Okay? So thatís how that works here. So I this case, you can solve either one Ė either one you can reconstruct the solution, in fact basically, if you solve the primal problem, youíre gonna solve the dual one, whether you like it or not and vice-versa. By the way, this is quite interesting, this topic because Ė in fact, I even for a long time, was sort of ignorant about this. A lot of times, people get all excited or bent out of shape over, like the dual or something like that, because what will happen is this. Theyíll have a problem. I donít know; theyíll have this problem here. X i will have size, you know, I donít know, 10, 000 Ė x will have the size 10,000 and thereíll be 100 equality constraints. And theyíll say, ďWow, you could solve a problem with 10,000 variables, or look at that, you could solve this dual and it only has 100 variables. Boy, I would really rather, like, Iíd really rather to solve a problem with 100 variables, than 10,000.Ē Okay? And that would be true, if you donít know what youíre doing, speaking with respect to numerical linear algebra. Itís absolutely true, for sure. In that case, if everything cost you n cube or n is the size of the system, thereís no doubt you want to be solving 100 variable systems, not 10,000. Okay? Whatís actually gonna work out here is, it turns out is, itís actually absolutely identical. Solving the primal, dual, any of these, all of these methods, the order is the same. Itís only the same if you know what youíre doing with respect to linear algebra. If you donít, if you donít exploit sparsity, then you will find gross differences. By the way, if youíre in that Ė if you make that mistake, and Iíve been there myself, so I speak as a reformed someone who once didnít know the subtleties. You can make stunning claims. You can say things like, ďWow, I found the best way to solve this problem. Itís amazing. You solved the dual instead. Isnít that genius, isnít it?Ē Right? Now by the way, there are cases where there might be some convenience to solving the dual, like it might Ė you might end up with a distributed algorithm. But in terms of numerical linear algebra, youíre Ė itís just wrong. Solving the dual is the same as the primal, if you know what youíre doing. Okay. So. These are the two problems. Youíll solve this one. Well, you can solve whichever one you like. So hereís some examples. The first is Ė and they have different initialization requirements, and these are quite interesting. And in fact, that is actually the only thing that should drive the choice of method, assuming youíre gonna do everything right. Initialization. So if youíre gonna do Newton method with equality constraints, that requires a positive x that satisfies a x = b. Now by the way, this is checking whether a linear program has a strict Ė a set of linear Ė well, a certain, a standard form of linear program has a strictly feasible starting point. That can be hard as Ė that can be hard. Thereís no reason to believe thatís simple. If I just walk up to you and say, ďHereís a; hereís b; please find me a positive x that satisfies x = " Thatís as hard as a linear program, basically. Because Iím basically saying, ďFind me Ė check feasibility of the standard form linear program.Ē So on the other hand, it depends on what the problem is, right? If this is some resource allocation problem, or something like that, it might be completely easy to say what Ė to get a point like that. For example, if itís a portfolio allocation problem, I mean, it might be that, hereís an initial x zero. Put Ė if you have ten assets, $1.00 to invest, ten cents on every asset. It might be a stupid portfolio allocation but one thing is certain; itís feasible. Iím just saying that, you know, depends on the situation there. It can easily be Ė in some cases, itís easy to get such an x zero and others to stop. Okay. So if you do this, hereís what happens. This is a plot of f minus p star so itís, in fact, this is the sub-optimality. And this scalar, you can see is extremely broad; it goes from like, whatever, 300 down to 10 to the minus 10. And this is the thing you want to look for. If you donít see this Ė if you donít see that Ė if you donít see these plots rolling over, then you havenít achieved quadratic convergence. So by the way, sometimes well, you can get things where itís much more subtle. I mean these, itís just sort of very much in your face, here. These are the right numbers, you know 10, 15, 20 steps. These are, you know, big problems and let me say a little bit about this. Sometimes, itís much more subtle and it just curves a little bit. And if youíre banging your head, thatís enough. But if it kinda looks straight, then you really can hardly claim that youíve reached Ė that youíre achieving quadratic convergence. It probably means something is wrong. So by the way, out here, what would the step-length be? What would you expect the step-length to be out here?
Instructor (Stephen Boyd):1. Right. So real quadratic phase is characterized by unit steps and quadratic convergence. That would be out here. What would you expect the step-lengths to be here?
Instructor (Stephen Boyd):What? Well they might Ė they certainly could be less than 1, but they can also be 1. You donít Ė I mean, you donít Ė the answer is you donít know. But you wouldnít be surprised if you started some, seeing if there were lots of steps out here where you were less than 1. Okay? So this is just four different Ė four different, maybe, starting points, here. Itís the four different trajectories. Okay thatís the first. This is the method youíll implement. Now you can also Ė you can also solve the dual here, but totally different size. The dual has got far fewer variables, right? Well, not far fewer; itís got m and Iím assuming sort of here, well m has to be less than n, but itís got fewer variables.
So itís got m variables. Okay. Oh by the way, hereís an extreme case. Well, letís go all the way to the caricature. Ready? Hereís the extreme case. Find the analytic center with one inequality Ė one equality. So thatís just a transpose x = b x positive. Okay? Here, I guess thereís a couple of things you could say. Letís see how to solve it. The dual has how many variables?
Instructor (Stephen Boyd):One. How do you solve a one-dimensional convex optimization problem?
Instructor (Stephen Boyd):How?
Instructor (Stephen Boyd):No, you could a line Ė I mean you could do it by section, if you like. I mean, you could do anything you like at that point. Okay. And youíd think, ďOh, wait a minute, wait. Iíve got a 1,0000,000 variable problem with 1.Ē So this might be an image; thatís an equality constraint. By the way, itís a stupid problem, but anyway, you have an image with an equality constraint and you want to calculate the analytic center, which would be some Ė and you could wrap some big story around it, like, ďItís the maximum entropy image.Ē or something with this Ė just something, who knows, satisfying one equality constraint.
Now, you could do it by bisection, here. Your dual has one variable and youíd say, ďThatís ridiculous. How could the primal, that has 1,000,000 variables Ė how could that be as easy to solve as the dual?Ē It seems suspicious, doesnít it? So weíll get to it. Actually, itís pretty straightforward, how that works out. The KKT system would then look like this; the Newton system would look like this, where thatís diagonal. Right? So it would look exactly like that. Indeed, it is 1,000,000 by 1,000,000. I would not recommend storing it as a full matrix, just so you know.
But if you stored this exactly in this form, you should look at that and say, ďWell, that might be 1,000,000 equations by Ė with 1,000,000 variables.Ē How fast can you solve this, roughly?
Instructor (Stephen Boyd):Itís in m steps, okay? So actually, what are we talking about, for 1,000,000?
Instructor (Stephen Boyd):Milliseconds, at most. Okay? So anyway, so these are very important to know, these things, and to recognize them. I guess, thereís a lot of people who know you could solve that fast. Although, a lot of them havenít totally internalized it. But what they havenít also done, they havenít, like, propagated this knowledge into the rest of their life. So theyíre doing signal processing or statistics, or whatever, and they havenít put it all together. They donít know what a statistics problem looks like, that happens to be one that you can solve unbelievably efficiently, like that.
So, okay. All right, letís look over here. So this is a Newton applied to the dual. By the way, this is interesting, totally different initialization requirement. But you need to Ė here, is you need to find a vector nu for which a transpose nu is a positive vector because thatís how you need to initialize this problem. Itís completely unconstrained. Okay? And youíre gonna Ė so you need this, totally different problem, initialization requirement. And this one has, you know, nine steps to converge, and again, you see this thing roll over. Thatís the good Ė that gives you that good Newton feeling.
Okay, if you apply infeasible start Newton method to this problem Ė so I initialize it with, letís say x = just all ones. Okay? Just all Ė I just plug in all ones. Youíre very happily in the domain, here but you donít satisfy a x = b. And that takes Ė here are the four things, here. Oh, what I havenít marked here is where primal feasibility is achieved. And so letís make some guesses. If I told you that primal feasibility was first achieved on this run here, what would you tell me?
Instructor (Stephen Boyd):You believe that? When do you think it was achieved? Iíll go back; you say when. Probably somewhere around here. Why would I say that? Well, Iíd say it for this reason. Letís Ė if you put everything together, it works like this. If you take a step of 1, you will be primal feasible at the next step and you will remain so. So youíll satisfy a x = b, if you take a step of 1. We associate quadratic convergence with two phenomena. No.1) Quadratic convergence. Thatís this rolling over of this Ė thatís this thing. Thatís the famous doubling the number of digits of accuracy at each step. Thatís this roll-over. And we associate it with a step of 1. So when you sort of start seeing this roll-over, you have taken a step Ė youíre probably taking steps of 1, in fact, almost certainly. That means youíre feasible. So although I havenít marked it here, a primal feasibility is probably achieved somewhere in here. You know, it could be up here, I donít know. And same for this one. Primal feasibility, maybe, is achieved like here, Iím guessing. Maybe here, I donít really know. Maybe Iíll go back and replot these with filled and open circles, to indicate when primal feasibility is achieved. By the way, number that the Ė note the number of steps is not, you know, itís changing within a factor of 2. You know, itís 9 steps vs. 15 and 20 and thereís one thatís 21 steps. This is not worth getting all bent out of shape about; itís not worth anything. Okay. Now, letís look at whatís required in each of Ė to solve for the step in each method. If you use the first method, this is the one Ė by the way, this is your homework problem. You have to solve this system here. Now, in this case, you end up solving a times the diagonal matrix times a transpose w = b. Thatís the only step here that costs you anything. Thatís diagonal, here. So this is Ė you have to form this matrix and then solve this system. Okay? Now, in the second case, for the dual problem, the dual problem is completely unconstrained. Here it is. Itís this right Ė you just simply maximize this function. You calculate the gradient and the hessian; you calculate the hessian inverse times the gradient, end of story. Now the hessian of this, I think you know actually from the last homework. The right way to do it is, really under no circumstances, should you sit down and start calculating partial squared this partial nu i partial nu j. That works, in theory; itís a huge pain. Instead, you use a composition rule. The hessian of this is the hessian of a function which is a sum of logs thatís gonna be diagonal hessian. Then you multiply on the left and right by a and a transpose, and I forget which one goes where. But one of them goes on one side and the other on the other. Thatís the chain rule for the hessian. Well, thereís the answer; you do this. Now, so to compute Newton method for the dual, you get this. Now, whatís kind of cool about this is, itís actually quite interesting. To solve this primal 1 Ė sorry, this primal Newton, and Newton diagonal and see I have to be very careful. This is the Ė if youíre applying Newton to the dual, here, this is applying Newton to the primal. You donít solve the exact same set of equations, but the structure of the equations is identical. They all have the Ė they both have the form. Theyíre different matrices, but they have the form a diagonal a transpose. Down here, a diagonal a transpose, different diagonal. Everybody see this? So you solve the same thing. In this extreme case, that I talked about, a is a row vector thatís, you know, 1 by 1,000,000, then when you solve this system, this is basically a number and Ė but then itís here, too. So itís identical. Okay. If you use block elimination, to solve the KKT system here, thatís to solve the infeasible start Newton method, itís the same story. You solve the same thing here, but the right-hand side is different, but the matrix is the same and thatís what determines how fast it is. So in each case, you have to solve a d a transpose w = h, with a diagonal positive diagonal. So thatís actually all thatís gonna happen in all of these cases. So if a is sparse, then it basically says the computational complexity of the three methods is absolutely identical, per Newton step. If you add to that, the fact that the Newton method is gonna take somewhere between five and 20 steps or something like that, and you call that just a constant, then it means that all the methods are just the same. Thereís no Ė and in particular, it means thereís no dramatic Ė thereís no dramatic improvement youíre gonna make here, just none. So, okay. Everybody get all this? So by the way, if you donít know how to solve these linear equations, itís everythingís different. Right? The three methods are dramatically different but all youíre doing is displaying your ignorance of how to solve linear equations.
Student:[Inaudible] apart from history of solving [inaudible] problem, you donít have any constant. Is it a benefit because itís easier to find the starting point?
Instructor (Stephen Boyd):No, no, no. Not at all, I mean, you could use the infeasible start Newton method. [Crosstalk]
Instructor (Stephen Boyd):You donít have a feasible point.
Student:Because you could have a feasible start point then youíd have more [inaudible] conditions.
Instructor (Stephen Boyd):No, Iíve been Ė I know what youíre saying. I would like Ė I mean, I also had this urge and, in the past, did not know these things. Yeah, you want to say one of these is Ė you would like to say one of these methods is better, right? Well, weíve just discussed the computational complexity issue. If you know Ė if you know what your doing, thatís a big ďif,Ē itís identical. So the cost per Newton step is the same in all cases, not because youíre solving the same set of equations. Because youíre solving Ė youíre not solving the same set of equations. What youíre doing is, youíre solving a set of equations with exactly the same structure. Thatís why itís comparable. Okay?
Then, you could ask, ďDoes one take more in steps than another?Ē Well, I donít know, you know. No not really, I mean within a factor of 2, theyíre all the same. Okay? So you know, if you Ė here, in this example, the dual method looks, you know solving Newton to the dual looks better. I can make another example where itís worse. Okay? Okay. So then, everything comes down to initialization and, there, I can think of situations, practical situations, where all three methods are the best one. So or thereís just no difference between them.
So I hate to say it, but I think they all have, at the moment, they all have equal status. The main thing is the initialization, as to which one is better or whatever, depends on the situation. If you have an obvious dual feasible point, fine. If you have an Ė if thereís an obvious primal feasible starting point, fine. If you have a guess of a point, which is nearly feasible, but not quite, you might want to use infeasible start Newton method and so on. So okay.
So now, weíll look at some other examples. Theyíre fun and they all have the same theme, which is, you know, you look at a Newton method and then weíll talk about forming it. And, but what youíre really doing is, youíre looking for structure to solve here. So letís look at network flow optimization. So network flow optimization is, this is a single commodity flow; itís a beautiful problem. You should all know about it. So you have a network, a directed graph like this; it doesnít really matter, I mean. And you have flow on the edges. And a flow Ė these arrows actually only tell you the orientation of the flow.
So itís like an electri Ė in a circuit analysis, you make an arrow here and that tells you that, if current flows that way, you will regard it as positive. If current flows this way, youíll regard it as negative. So thatís the Ė now, it could be that, in fact, things can only flow this way. There are many cases where thatís the case. But generally, when I mark these, this is what you have. So you have Ė and you have things like you Ė this could be a single commodity network or something like commodity flow. So these Ė then you have external inputs to the network. Oh, you have one very important requirement. The requirement is that in elect Ė in circuit theory, you call it KCL, which is Kirkoff Current Law and it basically says that, at each point, flow is conserved. The sum of the flows at any node, is zero. Okay? So thatís what it says. And sum of the flows means, if things Ė it says, some people would say the sum of the in-flow is equal to the sum of the out-flow. Another Ė but this is the same and you get this in lots of things. You get in physics; you get it all over the place, in anything. Itís mass conservation, for example, or itís just a conserva Ė it just basically says the flow is conserved at nodes. And if itís a differential equation, I guess youíd say itís a divergence free or something like that, whatever it is there. Okay. So and you could think of this as any way you like. These could be flows of packets on a network, and the fact that itís conserved says the buffer is bounded or, you know, thereís either no buffer or the bufferís bounded and itís not growing or something like that. And itís in steady state. It could be actual current; these could be amperes and it says youíre in steady state, or something like that. Okay. Okay, so this is Ė thatís flow conservation. You have some external sources and syncs. So itís generally, if youíre pumping stuff in, you call it a source. If youíre pulling stuff off, itís called a sinc. Actually, Iím just drawing these as if all the variables were positive here, for example. But of course, this could be a source if this flow is negative. Now, in this case, you have to have the sum of the source as zero. Thatís very easy to work out. Otherwise, it doesnít make any sense. So this is Ė thatís a network. Now, thereíll be lots if you fix the external flows. So if this is a distribution network or a current network, or something like that Ė weíll make it a product distribution network. If this one is pumping in something, this is pumping in something; this is pumping in something, then what youíre pulling off , here at the sinc has to be equal to the sum of these. But there are lots of flows in here that satisfy the flow conservation equations. So you can actually route the flow any way youíd like. Routing the flow, basically means, what comes in here, some you ship there; some you ship there; some you ship there and so on. Okay? So thatís the picture. Anyway, this is just written as a x = b. B is the source vector here, at each node and a is the incidence matrix. So thatís the matrix Ė youíve seen this, shouldíve seen it, that looks like this, where you have nodes and edges like that. And you have a plus 1 or a minus 1 to indicate where that edge goes, the from and to node. So thatís just a x = b. Now, among all the possible feasible flows Ė by the way, thereís all sorts of interesting things you can say . The Null space of a, for example, is called the circulation because, if I add something to Null space here Ė so the circulation might be something like that. Notice that it does not affect the external flow. It adds to this flow, adds to this one, subtracts from this one and subtracts from that one. It subtracts because Iím going Ė my circulation is going against these things. So thatís an element of the Null space. You might ask why on earth would you want to have anything to do with one of these? Well, weíll see why. What you want to do is, you have a cost now, on every edge, and you want to satisfy the flow equations, and you want to minimize the total cost. So by the way, let me just quickly ask something. Right here, thereís something that should pop out if youíre fully trained in all the neural pathways or hooked up correctly. Created. You should be able to look at that and without even thinking, 50 milliseconds later, a picture of the KKT structure should show up in your head. And what is it?
Instructor (Stephen Boyd):Okay. When you see that sum of scalar functions, what is the hessian?
Instructor (Stephen Boyd):Thank you. So you should just look at that and see in your head, this, without anybody saying anything else. Thatís what you should see, when you see this. Okay? So I mean, it should be just an autonomous response. Okay. All right. So anyway, weíll go through this; this is the reduced incidence matrix. And letís look at the KKT system for this. The KKT system looks like that. So h is diagonal. Thatís the important part, is that of course, you should always see this when you see an equality constraint minimization problem, but in this case, thatís very important; thatís diagonal.
You can solve this via elimination. You have a h inverse a transpose a transpose w = this thing. Now, of course, when you actually go to write the code, you have to get all the right-hand sides correct, and all that stuff or itís not going to work at all. But in your first pass over a problem, you want to think about the structure. And so basically, in this problem, the inverse shouldnít bother you at all because thatís a diagonal matrix. So itís Ė although normally an inverse would set up a green Ė a red flag, which is, you need to look at that. Is that hard, easy to do and so on? In this case, it just goes away. It just Ė yeah, of course, you have to actually carry it out correctly in your code. But non-issue.
All right. So basically, what it says is, if you want to solve a network flow problem, each step of Newtonís method is gonna require you to solve a systems of form a diagonal a transpose. By the way, thatís a theme; you might have noticed it. You will hear about a diagonal a transpose, actually just everywhere. You might ask like, you know, what do you do when, you know, if you wanted to ask, ďWhat does spice do?Ē You know what spice is, right? Circuit simulator program? You might ask, ďWhat does spice do?Ē Would you like to know what it does if you profile it? Itís solving systems that look like a d a transpose. Thatís the only thing it does.
Everything. And Iím talking about for all calculations it does. Everything, you know, [inaudible] as if it Ė actually, how many people know what Iím talking about? Anyway, or know about spice? Okay. So you get the same, you know, if you do a thermal simulation, right, youíre solving a d a transpose, with a finite element method. If you solve a Poisson equation, what do you think youíre doing? Youíre solving a systems perform a d a transpose. So itís just, you know, thereís no reason Ė I mean, you might as well accept the fact that itís everywhere. And you will Ė you have seen it; you will see it; you will use it. Youíll use it, many times when you donít even know youíre using it.
Okay. So Iíll have to think about where youíre using it in the machine learning, or something like that, but Iím sure you are somewhere. I just donít know where exactly, right now but Iíll think of it and tell you. Iím sure thereís somewhere where you do this, some base network calculation or something. But anyway. Okay. So you have to solve that. By the way, the sparsity pattern of a diagonal a transpose is extremely easy to work out when a has a graph incidence matrix. Itís very, very easy. It turns out that a diagonal a transpose i j is not zero, if and only if Ė thatís a nodes by nodes matrix that youíre solving
And thatís true, if and only Ė thatís non-zero if and only if it finds j r connected by an arc. So that even tells you, now, about the sparsity pattern youíre gonna see. In fact, I can ask you some questions now and weíll see how well youíve entered Ė this isnít Ė I just want you to guess or whatever and you tell me. Suppose I have a network. I want to do network flow optimization. On my network, the fan-out or the, sorry, the degree of each node on the network is like small, like 3 or 5. But itís got 10,000 nodes. Do you think we can solve Ė we can do optimal flow fast on it? Yeah, probably because if you form a transpose a, it basically says Ė it tells you something about the number of row Ė non zeroes per row in this thing, right? And the answer is, you know, each rowís got only Ė youíre only Ė if each node is only connected to, like, three others, thatís the number of non-zeroes per row or column. Okay? So thatís a, you know, how fast can you do it? I donít know; it depends on the gods of heuristics sparsity ordering methods and whether theyíre smiling on you that day or not. But the point is, itís likely you can solve that problem very, very fast. Now, let me ask you this. Letís do a Ė letís have a node, a single node is connected to a huge number of others. What does that do in sparsity pattern in a h inverse a transpose? Everybody see what Iím saying? So in fact, let me draw my network. I have a network here, which is kind of sparse; everybody is connected to just a few others, you know. I donít know, it doesnít really matter. Okay? But Iíve got some super node here and itís actually connected to a whole bunch of them. What does it do?
Student:[Inaudible] it to the bottom, wonít it just preserve your arrow structure?
Instructor (Stephen Boyd):Okay, as you permute it. Okay, very good. So your proposal is that this should be ordered last. Okay? And what is the sparsity pattern on a h inverse a transpose, if you order this last? In this case, whatís the sparsity pattern on this thing? Itís arrow Ė yeah, well no, itís like sparse arrow or something like that. Itís Ė you get a sprinkling of entries from these guys. But from this one guy attached to all of them, that is gonna give you a dense row and column in this matrix. Okay? Now by the way, you could order it to be the end, but in fact, any decent heuristic for ordering is gonna put that guy at the end, anyway.
So but youíll know itís gonna work. In fact, youíll know that if you had a network with 10,000 nodes, 10,000 nodes and the degree of most edges is 3, but 100 of them are connected to giant piles of nodes Ė weíll call those ďsuper nodesĒ or whatever. It doesnít matter, whatever you want to call them, right? Because youíre maybe optimizing the flow on a peer to peer network or something like that Ė you would know what the structure is. The structure will be dense, with 100 much denser rows and columns and youíll know, actually, that can be solved very, very well. Maybe youíll know, but anyway. Okay. So.
I didnít expect everyone to follow all of this, but this is Ė actually, this is the kind of thinking you want, I think you want to do for this stuff. You want to just be on your toes, connecting sparsity and structure to the actual problem at hand. And the problem at hand, itíll be things like, you know, bottlenecks Ė you know, bottlenecks, super nodes, thatís for this one. But for every problem, thereís gonna be something like that.
Okay, weíll look at one more. Itís an even more advanced one, but it comes up in a lot of things. You want to calculate the analytic center of a matrix inequality. Let me, just first, motivate it a little bit. This comes up all the time, but I can tell you what this is. This is the following problem. I have a co-variance matrix and I tell you some things about Ė well, letís see. I have a random variable z, letís say, and letís make it galcion or something like that; letís just say itís galcion. Okay? So itís n zero x Ė actually Iím gonna call this co-variance matrix x. Okay?
And hereís what Iím gonna tell you: Iím gonna tell you some variances of linear functionals of z. So I will tell you, for example Iíve got a vector c, c i and c i transpose z. Iíll tell you the variance of that. Now this is nothing but c i transpose x c i. Everyone agree with that? Itís nothing but that. Therefore, thatís the same as the trace of x c i c i transpose. The linear Ė the trick Ė the variance of this linear combination is a linear functional of x, a co-variance matrix. Everybody agree? Okay.
And now, so Iíll give you an example. I have a random variable and I say, ďWell, the first component has variance 12; the second has variance 3; the fourth has this variance; the fifth has that. Oh and by the way, the following, you know, the sum Ė this linear combination of them has this variance.Ē You can even do things like give the off-diagonal elements. I can say Ė I can say, actually element 3 and 5 are correlated 37 percent; element 6 and 12 are correlated minus 19 percent. So I just give you a bunch of partial information about a co-variance matrix. And then I say, ďPlease compute for me the maximum entropy distribution that satisfies galcion; that satisfies those known constraints on the distribution.Ē Everybody understand that? Well, that would be this problem. Maximize the determinant, subject to some equality constraints. Okay. I mention this just to say this problem comes up. Okay? This comes up. Okay, so now, letís talk about, letís solve this problem. Well, first of all, you look at it naively. You would say how many variables are there? Well, letís see, x is a matrix variable; itís square; itís n by n. And so, therefore, the number of entries it has is this in it. Okay? Now, if thereís p of these equality constraints, it basically says that if I form the KKT conditions or whatever, Iím gonna have a matrix thatís this big. Okay? So and therefore, our baseline cost is that. Okay? Now that grows like n q, n to the 6th Ė Iím sorry thatís n to the 6th. Everyone agree? Because this is n squared inside a cube. This would be Ė this would start getting tough. You canít do a matrix bigger than 100 by 100 with this scheme. Not to mention, by the way, this scheme would be an enormous pain in the ass to actually write up. Because, here, youíd have to find some way that, to encode, you know, this thing, the vec operation there. Okay? Everybody see what Iím saying? You could Ė I could assign this problem. You could write a Newton method on it, just like that, theoretically. Itís be a pain in the ass, so we wouldnít do it, but you could Ė this is like, no problem. However, the complexity is gonna be n to the 6th, which is not a great complexity. Okay? So letís actually see how you do this the intelligent way, you know, exploiting all the structure in here. By the way, the structure, at this time, is not going to be sparsity. So this is just an example. I mean, I wouldnít expect you to know this. A lot of people donít know it, actually to this date, donít know it. But weíll go through it. Weíll do a little bit of it just to give you the flavor of it. The Newton step is solving a set of Ė well, look, the Newton step solves a set of n, basically, n squared over 2 plus p equations in n squared over 2 plus p unknowns. The unknowns are delta X Newton and the dual variables nu, of which there are p of them. So thatís what you end up Ė you have to solve a set of equations like that. Itís a dense set of equations, by the way, completely dense in general. Right? So thereís no sparsity trick is gonna help here. Those equations, however, look like this. They look like that. So thatís a matrix equation, here. So itís x inverse delta x x inverse. This is, in fact, something like the hessian, or itís complicated but it looks like that. And thereís a very simple way to derive this, actually. What you do is, you just write out the optimality conditions, which is this, here. Then you linearize this at the current x and the current nu and write out that equation. If you do that, youíll get this. This Ė youíd get the same thing if you looked up and figured it out what the hessian of log dead is, and youíd, I donít know, get a big headache and stuff like that. But then, ultimately, youíd find out the Newton step can be expressed as a set of matrix equations that looks like that. So thatís how Ė and you have to solve for this and that. Now, this doesnít look good. However, thereís a block in this system that we can solve fast, and Iíll show you what it is. If I fix w, can you now solve fast for delta x, here, if I fix this part? If I fix this part, then Iím just solving x inverse delta x x inverse is equal to a fixed matrix. How do you get delta x? Thatís a set of n n plus 1 over 2 equations in n n plus 2, n plus 1 over 2 unknowns. If you just go to your generic backslash solver or whatever, thatís order into the 6th. However, in that case, itís very easy. You simply take this thing minus that, and you multiply on the left by x, on the right by x. Okay? All of those operations, matrix multiplication, those are order n cubed. So if you know what youíre doing, you can get Ė if the ws were fixed, you could get delta x here in n cubed operations. N cubed is a whole lot more favorable than n to the 6th, just for the record. Okay? So that tells you that thereís Ė although thereís a dense block in this system, itís one that is easily solved by special structure. By the way, if youíre curious what the matrix structure is, itís called kronecker. Itís a kronecker product structure or tensor product structure, something like that. So thatís why this is fast to solve. So you solve it by block elimination. You eliminate delta x to get this. You substitute this into the second equation, and you get this. And now, you look at this and you realize thatís a set. This, you can write this. This here is something that looks like this. Itís, here, g w = b. The entries of g are this thing. If you work out what the computational complexity is Ė you have to calculate every row, every entry of g and then, solve this. Thatís cheap; thatís p cubed. But the computational complexity comes out to be this. Thatís gonna be one of the leading terms is gonna be this. It depends on the order of m Ė n and p, but they all look like this. The cool part is, itís order n cubed. I mean, so whatever the complexity is, it for sure beats the generic one because this thing actually starts out with an n cubed term; itís n to the 6th. Itís also got a p cubed term. Weíve got p cubed here, but our complexity clearly is way, way, way better. So. So I didnít expect people to follow all that, but itís just to show you that these Ė once you learn these ideas of looking for structure, and block elimination, and things like that, it will come up in lots of places and itíll be surprising. By the way, I didnít know Ė thereís been periods when I didnít know this. Iíve solved problems without [inaudible] and that was a mistake. In other words, Iím saying I made a mistake. We did it for years and so on, before we knew all these things. I do want to articulate these things because sometimes, you see these things and they just look like some secret trick. And you think, ďWhy would anyone think to do that? Itís just some one-off little trick.Ē Anyway, so thatís Ė itís not. Itís just block elimination, with some sophisticated methods for solving structured equations. Okay, so that finishes our discussion of Newton method. And weíre on to our last topic, which is interior point methods. And let me explain first the big picture, how thatís gonna go. Itís gonna go like this. Youíre gonna have a problem with inequality constraints. Letís forget the equality constraints because they donít matter. So weíre gonna have a problem with inequality constraints, here. Weíre gonna solve it, thatís our next step, by reducing it to solving a sequence, weíll solve a sequence of smooth unconstrained problems. By the way, Iím gonna put something above it. Like, Iím gonna put a little node above it. Iíll fill that node in later. But weíre gonna have an inequality constraint problem. We solve it Ė weíre gonna solve it by solving a sequence of smooth unconstrained problems. How do we solve smooth unconstrained problems? Newtonís method. What does that mean? We actually are solving a sequence of Ė this arrow means solve this one by solving a sequence and I mean a small number, like 10 or 15, or something like that. I donít mean some huge long limit. I mean solve a sequence of these. So you solve 5, 10 smooth ones. The smooth ones we do by Newtonís method. Newtonís method is the same as solving a set of quadratic unconstrained problems. Now to solve a quadratic unconstrained equation, thatís the same as Ė these are linear equations. Oops, there you go. These are linear equations. Okay? So and we know thereís lots to say about this. By the way, this node up here, would be Ė we have a non Ė you might start with a non-differentiable constrained problem. Right? Like you want to minimize the L 1 norm subject to a x = b. Letís say you wanted to do that, for example. Okay? The objective is not differentiable; you canít solve it. I mean, you canít use Newtonís method on it or anything like that. It doesnít fit this category. So there, we do a transformation. So this one is a transformation. To minimize the one norm, you do a transformation that goes like this. Youíd minimize the sum of the t i subject to absolute value x i is less than t i, which youíd write as, I guess, x i less than t i minus x i less than t i. Right? Everybody knows that. And so you take a problem which is non-differentiable unconstrained Ė no sorry, equality constrained and you convert it; you add new equality constraints. Now youíve added all these new variables and constraints and you might start getting all nervous and weird about it. But if you just remember that, I guess you know, the size of the problem is not what matters. What matters is the structure. Then youíll be okay. And actually, youíll even have a sense, when youíre doing this, that when your adding these inequalities, theyíre very local. And so theyíre gonna contribute to sparsity in very, very favorable ways. Okay. So but Iíll assume you start here. So smooth unconstrained, you do that by Newtonís method. In each step of Newtonís method, you are actually solving a quadratic unconstrained problems, otherwise known as solving a set of linear equations. So this is how this Ė this is the hierarchy. The first thing is just to know how itís done. The second is to understand that, at the end of the day, when you profile a code running it, itís only solving linear equations. Thatís why you need to know how to solve linear equations. And the more you know how to solve them and about exploiting structure and things like that, the better off you are. Because the equations you get down here are gonna inherit structure all the way up from this one at the top, I mean L 1 optimization being a perfect example. Okay? So youíll Ė so. A lot of people donít know this, but this is Ė but you will. So. Weíre now going to talk about this step. How do you solve an inequality-constrained problem by solving a sequence of smooth unconstrained problems? Or equality constrained, I mean theyíre basically the same. Thatís what weíre gonna do. So weíre gonna solve this problem. All the fs are twice differentiable. By the way, to get to this point, you may have to do [inaudible] in an L 1 problem; you may have to do a transformation, to get to this point. So if your fs are not differentiable, you might have to do a transformation. But Iím assuming youíve done that and now, we have the inequality constrained, smooth inequality constrained minimization problem. Okay. Well, weíll start by assuming that we have Ė the problem is strictly feasible and so I can find a point that satisfies equality constraints; all the f is are negative. So thatís Slater condition, and that means that optimal Ė the Ė if I write down the primal and dual Ė if I write down the KKT conditions, theyíre necessary and sufficient conditions for optimality. So whatíll be examples of problems like that? Well, obviously l p, q p, q c q p, g p, there are all obvious, g p and convex 1, of course. Examples that are not quite like this, you have to be careful with. Things like s d ps and s o c ps, you can actually formulate them this way, but itís quite tricky. And itís better to this by cone programming, which weíll get to soon. Okay, another example would be entropy maximization. So entropy maximization says this: Minimize negative entropy, subject to some inequality constraints and equality constraints. Right? So if x really is a distribution here on some finite set, then one of the constraints, of course, is that x is a base sum to 1. So one row in a is all ones. And then, the other rows in a, these are just expected value of any function of a random variable. So I can have a random variable on 10,000 points and I could say Ė I could give you some moments. I could give you some probabilities that are fixed. I can give you inequal Ė I could give you intervals. I could say the probability that x is in this set is between point 2 and point 3; those would translate to this. I could tell you all sorts of things about the skewness, and the kurtosis, and whatever else I like. Those will all translate to linear equality and inequality constraints. And then, I might say, ďPlease compute for me the maximum entropy distribution that satisfies those conditions, those prior conditions on the probability.Ē And that would be this problem. This is a good example of a problem like that because here, the inequality constraints are affine here. So theyíre obviously smooth and the objective is also smooth. Okay. All right. So weíll get to this idea of the log barrier. And if Ė itís actually an embarrassingly simple idea. It works like this: I want to convert that problem q a to one I can handle. What can I handle? Well, we just finished looking at equality constrained smooth minimization. So what Iím gonna do is take this irritation function, associated with a hard constraint. I mean, it looks like this. Itís zero and then it goes up to plus infinity. Okay? Itís the indicator function of the feasible Ė of actually r minus. Okay? And what Iím gonna do is, Iím just gonna replace that with a function that goes to infinity as you go here and then, but itís otherwise smooth. And what weíll do Ė I mean a typical example would be like a log function. Right? So this is called a barrier function, something like this. And basically, what it says that, instead of being totally cool until you hit zero Ė this is for a constraint, and then basically, going totally insane because thatís what this Ė but thatís just semantics of the original problem. It basically says that it depends on which one you want to pick here. But the point is, you start getting nervous when a constraint starts getting close to infeasible. Thatís what this, you know Ė then how close do you get before you get before you start going nuts? Well, that depends on how good your approximation is. And over here, you might feel more and more comfortable the more margin you have. Iím just anthropomorphizing this, but thatís kind of the idea. So the idea is, you get the someone who, instead of someone who goes insane, is totally cool until a constraint is violated and then goes insane. You want something where, the more margin you have, the cooler you are, moderately. But and you start getting nervous when you start getting close to this. Of course, you still go insane if youíre Ė it itís infeasible here. Right? Okay. So thatís called a barrier. The most common barrier is something like Ė the way to write it out is this. Is you have 1 over t Ė you have a parameter, 1 over t times the sum of the log of the negative things. That is a smooth convex function. You know that, sorry. There, at the minus sign, thatís a smooth convex function. And the parameter t simply sets, basically, the scale of how nervous you are, depending on how close something is to getting feasible. If t is big, it basically says that youíre get Ė that youíre logarithmic barrier here is actually giving you a very good approximation of the true function here. Itís, you know, itís very flat, near zero, then shoots up to plus infinity. If t is smaller, like 1, it means it might be this one, out here. Something like that. Okay. Now that is a smooth problem, this thing. This is Ė sorry, that, well this is. But this is a smooth problem. We can solve that by Newtonís method. Okay? Now, so we can solve this. And now, you can think of all sorts of things. You could set t Ė youíd set t high enough. I mean, so thereís a lot of things to do. I mean, one thing youíd want to do is prove, for example, that as t gets bigger, the solution of this actually approaches the solution of the original problem. I mean, thatís really what you want to do. All right? I guess this is the original problem, here. I minus is the one that goes, like, zero and then plus infinity. So thatís the original Ė you want to show that this approximates that and it kind of makes sense that it would. The other thing is thereís obviously something fishy here. If someone proposes to solve this by this, itís got Ė it should raise up all sorts of like red flags for you. Let me tell you how. Suppose some says, ďLook, I canít solve,Ē itís kind of like this. And this actually happens. Thereís lots of like grown people who say stuff like this, words like this. Thereís an L Ė hereís an L 1 function, and they go, ďI canít Ė you canít solve that.Ē And you go, ďWhy?Ē And they go, ďItís non differentiable. Look.Ē Now letís leave alone the fact that you solve it precisely Ė whatís interesting about L 1 is precisely that point. Itís that point there that makes, when you solve these things, lots of the entries zero. Okay? So letís leave that alone. So the standard method, I call ďthe sandpaper method.Ē So ďthe sandpaper methodĒ is this. Iím not gonna say who does this, or anything like that, but thereís still like, you know, grown people who go around and say this stuff. They go, ďWell, everyone knows you canít solve that. Well, what you do is, you get out your sandpaper and you go towards this tip and you sand it off. You go like that.Ē I mean, a typical one would be to say, ďLook, Iíll write it this way. There and then, minus there, thereís my function.Ē Everybody cool on that? Okay? Now this function, on the right hand side, analytic, totally smooth. Okay? However, the corner has been sanded down at the scale of 10 to the minus 5. Everybody agree? And they go, ďCool. Thatís like Ė thatís just not Ė itís not that Ė thatís not twice differentiable, thatís like 5 times differentiable; itís like 48 times differentiable. So I can use Newton on it.Ē Now, do you go for that? Youíre applying Newton method to something that looks like the absolute value function with just a little bit of light sanding, to get rid of the that corner. Whatís gonna happen? What is the Newton step, for example, like right there? Whatís the Newton step on a sanded, absolute value function?
Instructor (Stephen Boyd):Well, the gradient. Whatís the gradient? Like, minus 1. Whatís the hessian there? Oh, by the way, please donít calculate anything, just use your eyeball. Whatís the hessian?
Instructor (Stephen Boyd):What is it? Is it zero? I donít think so. The hessian of the absolute valueís function is zero. Itís not Ė the hessian of this is not zero, but youíre very close. What is the hessian?
Almost zero. Thank you. Itís almost zero. Whatís the Newton step, at that point then? Itís like some insane huge long step, right? Okay. But more than that, how well is Newtonís method gonna work on a sanded-down absolute value function? Well, what do you have to ask yourself? You have to ask yourself this question. If you form a quadratic model of this function at a point, how good an approximator of the function is it? Well, what do you think? Do you think Newton method is gonna work on this? Well, of course, our theorists tell us, yes it will. And, indeed, itís a proof, of course it will. But how well do you think itís gonna work in practice, Newtonís method on a sanded absolute value function? You get two guesses and then, weíll quit. The first is, like, super-well. And the second is the opposite. It doesnít work, like at all in practice. So which do you think it is?
Instructor (Stephen Boyd):The second one, youíre right. Okay.
Instructor (Stephen Boyd):Because itís obviously Ė and if you Ė and then you say, ďWell, how does that work in the theory?Ē Right down here, the hessian, this third derivative, goes insane. Right? Because the hessian goes from, you know, it goes through something where it goes super high; itís super low; it gets way high then it gets low again. You have to get down to the scale of 10 to the minus 5 to actually see this curvature. So the point is youíre right. Newtonís method, on a sanded down thing, isnít gonna work at all. So Iím just saying that you shouldnít Ė when people do this business, where they go, ďOh, itís non-differentiable. Not a problem, give me my sandpaper.Ē you shouldnít Ė generally speaking, you shouldnít believe stuff like that. This oneís gonna work out, but this one does not. Yeah?
Student:You were saying, before, that if weíre solving a problem like that, that we have to do something to make it differential before we even start this process.
Instructor (Stephen Boyd):Right.
Student:So if thatís not how you do it, then what would you do to make it.
Instructor (Stephen Boyd):No that is what you would do and thatís the correct way to treat L 1 constraints, in most cases. You do exactly what I said; you add new variables, new constraints and then, use these types of methods. So okay, weíll quit here.
End of Audio]
Duration: 79 minutes