Convex Optimization II - Lecture 06

Instructor (Stephen Boyd):All right. We'll start. The first thing I should say is that we're about to assign Homework 4. How was Homework 2? That was due Tuesday, right? And you know we have a Homework 3 assigned. We're gonna assign Homework 4 so that we're fully pipelined. We have at least two active homeworks, typically, at once. What's wrong with that? That's how you learn. You're learning, right? Sure you are. And so the other thing coming up soon is this business of the project. And on the website, we have some date we made up before the class started. And I believe it's even tomorrow. It's tomorrow. So by tomorrow, you should produce some kind of two- or three-page proposal to us. And we'll iterate it. If we don't like it, we won't read past the first paragraph. We'll just give it back to you and you'll keep working on it until we like it. So the way we'll like it is it's got to be simple. There's got to be an opening paragraph. In fact, we were even thinking of defining a formal XML schema or whatever for it. But it's got to be a paragraph that has nothing but English and says kinda what the context of the problem is. So for example, I'm doing air traffic control, or I'm blending covariance matrices from disparate sources, or something like that. And it says a little bit about what the challenges are and what you'll consider. Then when you get into what the problem is, you describe kind of what it is and what you're gonna do. If it goes on for more than a page, if the setup is more than a page, it's gonna be too complicated for the project. Your real interest might be complicated, but then that's not gonna be a good project here. So what we want is, we'll read these and we'll get feedback to you pretty quickly, we hope, on these things. So you should also be just talking to us, just to kinda refine the ideas and stuff like that. And things we'll look for is, it goes without saying, this will be done in LaTeX. Nothing else is even close to acceptable.
I won't even name alternatives.

The other is that the style should look something like the style of the notes that we post online. That should be the style. There are lots of styles. That's one. It's a perfectly good style. So I don't see any reason why you shouldn't just attempt to match that style. That is, no statistical style test should reveal that, for example, I didn't write it. That would be a perfectly good model. There are other styles that are very good and different, but that's a good style and you could go with that. So if you look at your thing and if you find that your notation for an optimization problem differs from mine, change yours. Because, although there are other consistent notations, like there's the Soviet one, and there's other ones, I don't need to see them. It's just easier. And if used consistently, that's perfectly okay. But just to make it simple, just use our notation. So when you describe an optimization problem, it should look like the way we do it in notes, finals, lectures, all that stuff. So I guess they'll be due tomorrow at 5:00 or something like that. If you're still sort of in active discussions with us and haven't quite closed in on that, that's okay, but you should then come to us and let us know you're gonna do it over the weekend or something like that. And we'll consider this just the first cut. Are there any questions about that, about how that's gonna work? Anybody still looking for a project? Because we can assign one. Every lecture I say, "This would make an outstanding project." No one's ever gotten back to me, ever. We can make up a project for you, too. Yeah?

Student:Can we make changes to the initial proposal?

Instructor (Stephen Boyd):Yeah. Don't even think about handing us something that's 15 pages. Actually, ideal would be a half page. I mean, you'd have to have just the right project. By the way, you might be able to pull it off, half a page. Okay. That's fine. That would be great. The shorter the better. And if there's any previous work or something like that, we haven't asked for it, but at some point, you're gonna have to do that. So I mean, I presume at this point, if you have an idea of a project, you've kind of done some Googling and all that. We assume that. That goes without saying. And you can add some references or something like that, if you want. Okay. Any other questions about this? Otherwise, we'll just continue. Oh, no, we'll wait. So last time we looked at localization methods. So the basic idea of a localization method: you can think of these as just generalizations of bisection. They're generalizations to R^n. So you have bisection in multiple dimensions, and there's no simple extension to R^n. Because in R, first of all, convex sets are kinda boring. They're basically just intervals. And there's an obvious center of an interval. In R^n, things are much more complicated. You have many different centers. I don't know if you have that pointer. Where is it? Here we go. Thanks. Okay. So the basic idea is you have some polyhedron, where you know the solution lies, if it exists. And you query a cutting-plane oracle at the point x, and what comes back, in this case, is a neutral cut here. In other words, what it's basically telling you is that you need not even consider anything over here. So the solution is not here, if it exists. And so this is your posterior or updated ignorance set, or whatever you want to call it. It's the localization set: the set of points in which, if there is a solution, you know it must lie.
I should mention one thing: there's more material on this, of course, in the notes, which you should be reading. And we're expecting everybody to be reading. Okay. So bisection on R is the most obvious and well-known example. Let's look at some other ones. We'll look at some of these in more detail today. The first is the CG method. I just got up to that at the end of last lecture. Here we have x(k+1) is the center of gravity. I'll review that quickly and say a few things about it that I didn't get to say last time. Others would be maximum volume ellipsoid, Chebyshev center, analytic center. These would be some of the other ones. And we'll talk about each of those in a little bit of detail. So in the CG algorithm, the CG of the polyhedron is your query point. And there's an amazing fact.

In fact, we may even, if we can find a simple enough proof, stick it on Homework 4, which we're working on. And the amazing fact says this: if you take any convex set in R^n, and take the center of gravity of that set, and then pass any plane through that point, the question is, how unevenly does it divide the volume? So anybody would guess the answer's 50/50? For example, in R, it's 50/50. But the moment you draw a couple of pictures in R^2, you'll see that, in fact, you can make a convex body for which there's no point for which all planes going through it divide the volume 50/50. But it turns out you can't skew it too much. And it turns out that the worst ratio you can get to is 0.63/0.37. So it's roughly not even two to one. Because it's 0.63 divided by 0.37. Not even 2 to 1. It's more like 1.7 to 1. And that means, basically, if you find the CG of a set, and you put the worst hyperplane through that point, it will still chop off at least 37 percent of the volume. So that's what it is. Okay. So in the CG algorithm, that's very nice, because it says the volume of the localization set goes down by a factor of 0.63, at least, at each step. It can go down more. If you have a deep cut, it goes down by more. And of course, the 0.63 is the factor for the absolute worst possible hyperplane at each step. And in fact, it's obviously typically much less, or something like that. So that's that. We'll look at the convergence and look through various things. And we came, at the last minute, to basically the problem with the algorithm. The problem with the algorithm is computing the CG is extremely difficult. So it's a conceptual method, the way it's described. Now you can modify the CG method to work with approximate CG computations. I'm gonna mention some of those, because there's some very cool new work, not from Moscow, from MIT, and it's very, very cool stuff. I'll say some of that. Obviously, you don't have to compute the CG exactly.
So first, there are methods even from the '60s, where a cheaply computable approximation of the CG is given, and then you have a chopping, a slicing inequality theorem, which is not as strong as the 0.63, but is enough to sorta give you convergence and so on. So those you already had in the '60s. So those are approximate CG methods. But let me just mention, actually, a really cool method developed four or five years ago, something like that, at MIT. Very, very cool. Goes like this. You want to calculate the CG of a convex set. And you can use a randomized method. And the randomized method is called hit and run. Has anyone heard of this? It's just very cool. It's an advanced class, so I can go off on a weird tangent and tell you about it. So the way it works is this: you have a point, and you generate a random direction. So you generate a random direction and you make a line. And you need the following: you need the ability to find out where that line pierces your set. So that's what you need. So here's your set. I'll make it a polyhedron, but it does not have to be a polyhedron. Just any old convex set. It doesn't really matter. Here's your set. And you take a point here, and you generate a random direction. Actually, how do you generate a random direction? Uniform on the unit sphere. How do you do that?

Student:[Inaudible].

Instructor (Stephen Boyd):Thank you for bringing that up. I should say, calculating the CG in 2D and 3D is no problem at all. 2D is trivial. All you do is you find a point in the set, you calculate a triangulation of the polyhedron, done. In 3D, same thing. However, it's gonna scale sorta exponentially in the dimension. So if you're interested in 2D and 3D, CG is a non-problem. Okay. So how do you generate a random direction, uniform on the unit sphere? Just a random variable uniform on the unit sphere?

Student:Choose a random vector.

Instructor (Stephen Boyd):Yeah, fine, but how do you choose a random vector? What distribution?

Student:[Inaudible].

Instructor (Stephen Boyd):Why, because you can't think of any other? You happen to be right. You choose a random Gaussian vector. So it's an N(0, I) vector. That's spherically symmetric. And so if you then normalize that, you get a uniform distribution on the unit sphere. So you choose a random direction here, and so here's the line you get. And the only thing you need to do is to calculate these two intersections, which is very easy to do. Now you do the following: you generate a random variable uniform on this interval, and that is your new point. So that's gonna be right here. And you repeat. So you go here, you calculate a random direction, the direction turns out to be this, you pick a random point on there when you get there, and you go on. So I suppose these are the hitting points and this is the running, or who knows. I don't know how that name came up, but that's what it is. Everybody got this? So that's a Markov chain. It's a Markov chain propagating in the polyhedron. And it actually converges to a steady-state distribution that's uniform on the polyhedron. So that's a method for generating a random variable that's uniform. By the way, if you have a method that generates a random variable that's uniform on the set, how do you find the CG?

Student:[Inaudible].

Instructor (Stephen Boyd):Thank you. You just average. Let me ask you another question. Anybody got other ideas for how to generate a random point in a polyhedron with uniform distribution?
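As a rough sketch of the hit-and-run chain just described (the function and variable names here are made up for illustration; it assumes a bounded polyhedron and a strictly interior starting point):

```python
import numpy as np

def hit_and_run(A, b, x0, n_steps=1000, rng=None):
    """Hit-and-run: a Markov chain whose steady-state distribution
    is uniform on the bounded polyhedron {x : A x <= b}.
    x0 is assumed strictly inside the polyhedron."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        # random direction, uniform on the unit sphere:
        # normalize a Gaussian N(0, I) vector
        v = rng.standard_normal(x.shape)
        v /= np.linalg.norm(v)
        # where does the line x + t v pierce the set?
        c = A @ v
        slack = b - A @ x
        t_max = np.min(slack[c > 0] / c[c > 0])
        t_min = np.max(slack[c < 0] / c[c < 0])
        # new point: uniform on the chord through x along v
        x = x + rng.uniform(t_min, t_max) * v
        samples.append(x.copy())
    return np.array(samples)

# the CG is then approximated by averaging the samples:
# cg_estimate = hit_and_run(A, b, x0).mean(axis=0)
```

This is only a sketch; a practical sampler would worry about burn-in and mixing time before averaging.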

Student:Sample a distribution everywhere and reject the ones outside of the set.

Instructor (Stephen Boyd):It couldn't be everywhere, because it has to be uniform. Okay. So what you do is you find a bounding box. Everybody knows how to find a bounding box. How do you find a bounding box for a polyhedron? This is the problem with scheduling class at 9:30. It's before full cortical activity has warmed up. How do you find the bounding box for a polyhedron?

Student:Minimize x1 [inaudible].

Instructor (Stephen Boyd):So you minimize x1 and you maximize x1. You minimize x2 and you maximize x2. And these are LPs. Of course, maybe we're trying to solve an LP in the first place. I don't know what we're trying to do. But still, that's how you do it, by solving 2n LPs. So you make a big bounding box like this, and now you generate a uniform distribution on that. And you do that by taking a uniform distribution on each edge and sticking them together. And then you reject the ones that are outside this thing. And the ones inside, actually, then, are drawn from a uniform distribution on the polyhedron. Any comments on how you imagine that might scale with the dimension? I'm not saying you have the bounding box.

Student:Exponentially.

Instructor (Stephen Boyd):Which way, though? Exponentially what? What would happen exponentially? You're right, something happens exponentially. What is it?

Student:To test if the point is in the polyhedron?

Instructor (Stephen Boyd):No, no. To test if a point is in the polyhedron, you form Ax. You give me x, I calculate Ax. I check if it's less than or equal to b. That's fast. What happens?

Student:Volumes.

Instructor (Stephen Boyd):There you go. The ratio of volumes in the worst case goes down exponentially. And that means you'll generate exponentially many random points before you even get one in the set. So although that does, indeed, produce a uniform distribution on the set, it's not gonna scale gracefully to high dimensions. Okay. So anyway, this is the hit-and-run method. There's lots of things here; you can certainly do a cutting-plane method with something like this. And a more sophisticated version. Okay, so that's the CG method. Yeah?
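A minimal sketch of the bounding-box rejection scheme just discussed (illustrative names; `lo` and `hi` would come from solving the 2n LPs mentioned above):

```python
import numpy as np

def rejection_sample(A, b, lo, hi, n_samples=100, rng=None):
    """Uniform samples on {x : A x <= b}: draw uniformly from the
    bounding box [lo, hi] and reject points outside the polyhedron.
    The acceptance rate is vol(P)/vol(box), which in the worst case
    shrinks exponentially with dimension, so this is for 2D/3D only."""
    rng = np.random.default_rng() if rng is None else rng
    lo = np.asarray(lo, float)
    hi = np.asarray(hi, float)
    accepted = []
    while len(accepted) < n_samples:
        x = rng.uniform(lo, hi)       # uniform on the box, coordinate-wise
        if np.all(A @ x <= b):        # cheap membership test: check A x <= b
            accepted.append(x)
    return np.array(accepted)
```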

Student:[Inaudible] in the vertices, where it intersects?

Instructor (Stephen Boyd):Oh, here? Oh, that's actually quite straightforward. If this is a polyhedron, let me show you how that calculation goes. So the calculation goes like this: here's your polyhedron. It's defined by that inequality, Ax less than b. You're given a base point x0 and a direction, v. And you want to find the maximum and minimum values of t for which A(x0 + tv) less than b holds. That holds for t in an interval, right? So I'm assuming this is a bounded polyhedron. I'm assuming x0 is in the polyhedron, although none of that matters. So that says that A(x0), that's if t equals zero, is less than b. And now you want to find how much you can increase t until this becomes false, and how much you can decrease t below zero until it becomes false. That's the question. So you just write it this way. It's actually kinda dumb. You write A(x0) plus t times Av is less than or equal to b. That's a vector, that's a vector. Now it's very easy to work out the maximum. Let's call this a plus tc is less than or equal to b. Now it's gonna work. Now it's extremely easy. If ci is negative, or zero, increasing t doesn't hurt; we're not going to get into trouble there. So if I want to know how high I can increase t, I only focus on the positive entries of c. So I look at the positive entries of c, divide the slacks b minus a by them, and take the minimum, and that's my maximum t. Anyway, it's a 300-level class. I can just say it's easy to do. This made sense, though? So in fact, when you actually do this, it's two little lines of something. Okay. Maximum volume ellipsoid method: this method is not a conceptual method; it's a real one. Actually works really well.
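The "two little lines" for that chord calculation might look like this (a sketch, assuming as above a bounded polyhedron with x0 strictly inside it; the function name is made up):

```python
import numpy as np

def chord(A, b, x0, v):
    """Interval of t for which A (x0 + t v) <= b.
    Writing a = A x0 and c = A v, row i reads a_i + t c_i <= b_i."""
    c = A @ v
    slack = b - A @ x0                        # positive when x0 is inside
    t_max = np.min(slack[c > 0] / c[c > 0])   # rows that tighten as t grows
    t_min = np.max(slack[c < 0] / c[c < 0])   # rows that tighten as t falls
    return t_min, t_max
```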
And here, x(k+1) is the center of the maximum volume ellipsoid in Pk, which, I don't know if you remember or not, but that can actually be computed by solving a convex problem. By the way, how about the minimum volume ellipsoid method? In that case, you calculate the minimum volume ellipsoid that covers the polyhedron. Let's have a short discussion about the minimum volume ellipsoid method. I don't know if you remember this, but you can kinda guess. Any comments on the minimum volume ellipsoid method? By the way, it would work really well. You would get a guaranteed volume reduction and everything.

Student:[Inaudible].

Instructor (Stephen Boyd):Yeah. In fact, finding the minimum volume ellipsoid that covers a polyhedron defined by linear inequalities is actually hard. So the method is like the CG method: conceptual. I gave you enough hints that you could figure it out. Did you actually remember that?

Student:[Inaudible].

Instructor (Stephen Boyd):Oh, and then you remembered. Good. Okay, that's a short discussion of the minimum volume ellipsoid method. But the maximum volume ellipsoid method is not a conceptual method. It actually works very, very well. And in this case, the volume reduction you get is actually very interesting. And the volume reduction is this: the guaranteed volume reduction is not a fixed number, as it is in the CG method. In the CG method, you will reduce by 37 percent, end of story. Any dimension. In this case, your volume reduction guarantee degrades with dimension. By the way, it degrades gracefully enough that this method can actually be used to construct a polynomial-time convex optimization method. Because a good enough approximation of the maximum volume ellipsoid can be calculated in polynomial time and so on and so forth. And this is also called, it's got four initials, T-K something; Russian names go together. I won't attempt it. Actually, I think it's described in the notes. Now, once you know you get this volume reduction, you can work out a bound on the number of steps required, and in this case, it scales not like n, as it was before, but like n^2. So this one degrades. And this is actually quite a good practical method. Okay. Other ones? You can pretty much make up your own method here. The Chebyshev center method, I thought I'd mention. The Chebyshev center is the center of the largest Euclidean ball in Pk, and that's found by solving an LP, by the way. Calculating the largest ball that fits inside a polyhedron is just an LP. Now this one, however, is not affine invariant. In other words, if I take my original problem and I do a linear change, if I scale the variables, the maximum volume ellipsoid, like the CG, commutes with that: you get a commutative diagram.
In other words, if you take a polyhedron and you subject it to an affine mapping, you multiply by T and add some s, where T is a non-singular square matrix, then the CG will also change coordinates exactly the same way. Because CG commutes with an affine change of coordinates. So that means that the CG method is actually invariant under affine changes of coordinates. Okay. The same is true of the maximum volume ellipsoid. If I give you a set and I ask you to find the maximum volume ellipsoid inside it, and if I then change coordinates by an affine change of coordinates and ask you to do it again, it commutes. So the maximum volume ellipsoid that fits in the transformed thing is the transform of the maximum volume ellipsoid that fit in it. So that was my verbal way of drawing a commutative diagram. Now, this has a number of implications. First of all, for our complexity theorist friends, this is, of course, extremely important.

But it actually has lots of practical implications, and they're quite serious. Basically, if you have a method that's affine invariant, in practice what it means is scaling of variables is a second-order issue. Now in exact arithmetic, of course, it's a non-issue. But it means it's a second-order issue. It's a second-order issue because scaling can hurt you, but only through numerical round-off and things like that. Therefore, you can be off by a factor of 1,000 in the scaling of coordinates and no one gets hurt. Whereas with a method like Chebyshev center, here, I guarantee you, if you have your Chebyshev center method and it works really well, and I scale, for example, the first and the eighth coordinate by a factor of 1,000 to 1, your method is not going to work, basically at all. So that's the basic point there. Now having said that, I should say this: if you're solving a specific problem, you know what x1 is. x1 is measured in meters per second, and it varies between plus/minus five. And x7 is a pitch rate in radians per second, or something like that, or radians per second squared. And you know how it's scaled. So in actual specific practical problems, the whole business of affine invariance may not be that relevant, because in a specific problem, you generally know what the numbers are, what the ranges are, and things like that. Okay. Now we're gonna look at this one in considerable detail, actually, in the next lecture, because this one seems to have a very nice property. It actually works really well in practice. And it also has some nice theoretical properties. So in this case, you take x(k+1) to be the analytic center of Pk. I should mention that this English here, the analytic center of Pk, that's slang. That's informal and it's actually not correct. Because the analytic center is not the analytic center of a geometrical set. You have the analytic center of a set of inequalities.
And obviously, you can have multiple sets of inequalities that describe the same geometric set. In fact, take your favorite polyhedron, and add all sorts of redundant inequalities. The set hasn't changed; the analytic center has. And in fact, the following is true: take a polyhedron, and take any point in the interior of your polyhedron. So pick any point at all. It can't be on the boundary, but take any point in the interior of the polyhedron. By adding redundant inequalities, you can make any point you like the analytic center. Now, you may have to add a large number of inequalities and things like that, that's true. But the point is, you should remember that this is informal and it is slang. Meaning it's just not right. Okay. So the analytic center is this. It's the argmax of the sum of the logs of the slacks: you maximize the sum of the logs of the slacks in your inequalities. By the way, this is the same as maximizing the product of the slacks. So it's exactly the same thing. You maximize the product of the slacks. By the way, there's many other things you could do. You could do things like maximize the minimum slack.
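As a sketch, the analytic center of the inequalities Ax <= b can be computed by Newton's method on the log barrier (an illustrative implementation, assuming a strictly feasible starting point, not the production routine inside any particular solver):

```python
import numpy as np

def analytic_center(A, b, x0, n_iter=50):
    """Maximize sum_i log(b_i - a_i^T x), i.e. minimize the
    log-barrier phi(x) = -sum_i log(s_i) with slacks s = b - A x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        s = b - A @ x                            # slacks, must stay > 0
        g = A.T @ (1.0 / s)                      # gradient of phi
        H = A.T @ ((1.0 / s**2)[:, None] * A)    # Hessian of phi
        dx = np.linalg.solve(H, -g)              # Newton step
        t = 1.0                                  # damp to stay feasible
        while np.min(b - A @ (x + t * dx)) <= 0:
            t *= 0.5
        x = x + t * dx
    return x
```

Duplicating a row of A (a redundant inequality) visibly shifts the result, which illustrates the point that the analytic center belongs to the inequalities, not to the geometric set.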

And if the ai's were all normalized to have the same length, for example one, that would correspond exactly to the Chebyshev center. Because bi minus ai transpose x, divided by the 2-norm of ai, is the distance to that hyperplane. So this works quite well. And we're gonna go into that in great detail later today. Okay. For these cutting-plane methods, there are lots of extensions. I'll mention a bunch of them, and when we look at the center cutting-plane method, we'll look at how these things work. So once you get the idea that the only interface is a cutting-plane oracle, then you can imagine all sorts of cool things. So you could do all sorts of things. You could do things like this. When you call your oracle with x, instead of just returning a single cutting plane, you could easily imagine something that returns multiple cutting planes. And I'll give you an example. Suppose you're solving an inequality-constrained problem. The simple method would be this: you take your constraint functions and you evaluate each one. You have to do that. So you evaluate f1, f2, f3, and the first time you find an f that's positive, you call that fk.getsubgradient. And you get a subgradient back. That's one method for generating a cutting plane. But in fact, what you could do is sweep all the constraints, and then take the most violated constraint. But another option is to give all the constraints, just give them all at once. And you just update your polyhedron and everything's fine. By the way, if you return multiple inequalities, you could actually return shallow cuts. Shallow cuts are fine, as long as one of the cuts you return is actually deep, or neutral. So it's still useful information. So let me just draw a picture to show what that would look like. So that would look something like this.
And you might get some other kind of very colorful name for it. Here's your localization set. Here's a point; actually, it doesn't matter how you calculated it: CG, hit and run, analytic center, it doesn't matter. What would happen is you call it here, and you could do something like this. It could give you a deep cut, which basically says, don't bother looking over here. But there'd be no reason that it couldn't also give you a shallow cut like that, which says, don't bother looking here. No problem. So you just incorporate both. So that's how that works. I mean, that's kinda obvious. Actually, what's interesting about it is that a lot of these things don't really make it work much better. Okay. So instead of appending one, you append a set of new inequalities. You can also do non-linear cuts. So you'll also find this talked about all over the place. And let's see, these would work like this: you could have quadratic cuts, which basically say, instead of returning a half space where the solution, if it exists, must lie, you'd return an ellipsoid, basically. And there'd be plenty of cases where something like this would work, if your inequalities were quadratic or something like that. In this case, the localization set is no longer a polyhedron, but that doesn't really matter. I mean, in particular, things like ACCPM work just fine.

So these are kind of obvious. The ones, however, that are quite useful, and also quite strange and mysterious, involve the idea of dropping constraints. So dropping constraints would go like this: after a while, you generate a big pile of constraints. So what happened was, you started with a whole bunch of inequalities. At each step, you have added at least one inequality to your set. So what's happening is the polyhedron, the data structure that describes your current ignorance, is growing in size as you go. And that means, for example, whatever work you're gonna do on it is also gonna grow in size. I mean, like grow linearly or polyhedrally; sorry, like polynomially. But at least it's gonna grow linearly. As you add these, there's the option of dropping constraints. So then how do you drop constraints? The most obvious thing is the following: here's your current set. And here are some redundant inequalities. So one obvious thing is to call a method on your current polyhedron, call it minimal. And what minimal does is it goes through every inequality, and drops the ones that are redundant. By the way, how do you find a redundant inequality? Suppose I give you a polyhedron and I point to the 15th constraint and I ask you, "Is it redundant?" How do you do that? How about the first one? How do I know if the first inequality is redundant in a polyhedron? Got any ideas?

Student:If it's linearly dependent on other columns?

Instructor (Stephen Boyd):No, that might be wrong, because of the b it connects to. But that would be one. That's how you detect linear equality equations being dependent. It bears some relation, but there's actually really only one real way to do this.

Student:Solve an LP and see if it's strictly feasible?

Instructor (Stephen Boyd):You got it. Here's what you do. You want to know if a1 transpose x less than b1 is redundant. Actually, there's a very sophisticated way to do it, but this is very simple and basic; everyone understands it. Here's what you do. You maximize a1 transpose x over the remaining inequalities. That's an LP. Now, the optimal value of that LP is either less than or bigger than b1. If the maximum is less than or equal to b1, what can you say about the inequality? Redundant. So that's one way to do it. What that method will do is it will either produce a point that satisfies all the other inequalities, but not that one, or it will produce, basically, a certificate proving that inequality's redundant. And actually, if we get back to your suggestion about linear dependence and stuff like that, there's more sophisticated ways. In fact, is it relevant? I don't know. That's a great homework problem in 364a, but I guess this isn't 364a. Anyway, the question would be, how do you find out if the first five inequalities are redundant? You should think about that. Okay. You could implement a method called polyhedron.reduce, and it would just go through and solve one LP for each constraint. So it would check an LP for each constraint and drop the redundant ones. Now, if, by the way, the subgradient calls, the cutting-plane oracle calls, are fantastically expensive, if the cutting-plane oracle calls require a network of computers to fire up and the wind tunnel at NASA Ames to get fired up and all that kind of stuff, then no problem calling polyhedron.reduce. No problem. But in a lot of cases, that would actually be too much: solving whatever, k LPs, would be too expensive. Anyway, okay. There are gonna be cheap ways to drop constraints, and there's safe and unsafe constraint dropping.
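The redundancy test just described, sketched with scipy's LP solver (a hypothetical helper, not part of any library; it assumes the remaining inequalities describe a bounded polyhedron, so the LP has a finite optimum):

```python
import numpy as np
from scipy.optimize import linprog

def is_redundant(A, b, i, tol=1e-9):
    """Is the i-th inequality a_i^T x <= b_i implied by the others?
    Maximize a_i^T x subject to the remaining inequalities (an LP);
    the inequality is redundant iff the optimal value is <= b_i."""
    keep = np.arange(len(b)) != i
    # linprog minimizes, so minimize -a_i^T x to maximize a_i^T x
    res = linprog(-A[i], A_ub=A[keep], b_ub=b[keep],
                  bounds=[(None, None)] * A.shape[1])
    return res.status == 0 and -res.fun <= b[i] + tol
```

A full polyhedron.reduce would simply loop this check over all rows, which is exactly the k-LP cost discussed above.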
Safe constraint dropping goes like this: you drop constraints that, without actually solving an LP, you know are going to be redundant. We'll see how that's done with the analytic center cutting-plane method. Very cheaply. That's one method.

And the other method is the unsafe method, is you just drop constraints, whether or not they are redundant. That method works, by the way, really well. Capital N here. It seems sort of like the word on the streets is that if capital N is five little n, everythingís cool. There are some proofs of this and all that, but theyíre very complicated, as you might imagine. When you are dropping constraints that are not redundant, you are increasing the volume of that set. Thatís our merit function. So now the proofs get very tricky. And you have to show that your constraint dropping method is increasing the volume of the localization set more slowly or roughly than your cutting planes are reducing it. So that would be it. But the fact is that weíll see something like N equals 3 to 5 n works, apparently, very well. Okay. In fact, thereís a shock. Weíll see it next lecture, which is gonna come up very soon. Whatís weird is you can actually drop constraints, and the method will do better. I know itís a bit weird, but itís also true. It can happen. Okay. Another variation on this is so-called epigraph cutting-plane method. If you have Ė it works like this. You put the problem into epigraph form like this, and what youíre really going to do is get a cutting-plane oracle for this problem here. So this is the problem. And what you want to do is you want to separate x and t. So x should be considered as your current proposal for x, or something like that. And t is to be considered here and upper bound on the objective. And youíre gonna separate this from x*, p* where p* is the optimal value. Now the way to do that is this, if the current point is infeasible for the original problem, and violated the jth constraint, you add a cutting plane that is associated with that violated constraint. And this is a deep cut here. I guess this is positive by definition. This is a deep cut here, this one, where you get a subgradient of the violated constraint here. And this is a deep cut here. 
You can also put a zero on the right-hand side there, and that would be a neutral cut for the same problem.
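To make the epigraph cut generation concrete, here is a small sketch of that logic in code. This is my own illustration, not from the lecture: the oracle functions f0, g0, fis, and gis are placeholder names for routines evaluating the objective, a subgradient of the objective, and the constraints and their subgradients. Cuts on the epigraph variables (z, s) are returned as pairs (c, d) meaning c'(z, s) <= d.

```python
import numpy as np

def epigraph_cut(x, f0, g0, fis, gis):
    """Cutting planes for the epigraph form: min s s.t. f0(z) <= s, fi(z) <= 0.
    A sketch; f0, g0, fis, gis are assumed oracle callables.
    Each cut is a pair (c, d) meaning c' (z, s) <= d."""
    vals = [(fi(x), gi(x)) for fi, gi in zip(fis, gis)]
    violated = [(fv, gv) for fv, gv in vals if fv > 0]
    cuts = []
    if violated:
        # infeasible point: deep cut from a violated constraint,
        #   f_j(x) + g_j' (z - x) <= 0   (no component on s)
        fv, gv = violated[0]
        cuts.append((np.append(gv, 0.0), gv @ x - fv))
    else:
        # feasible point: deep objective cut  f0(x) + g0' (z - x) <= s ...
        fv, gv = f0(x), g0(x)
        cuts.append((np.append(gv, -1.0), gv @ x - fv))
        # ... plus  s <= f0(x), since f0(x) upper-bounds the optimal value
        e = np.zeros(len(x) + 1)
        e[-1] = 1.0
        cuts.append((e, fv))
    return cuts
```

So an infeasible point yields one deep cut from a violated constraint, and a feasible point yields the two cuts described above, the second of which needs a subgradient of the objective.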

Now, if x(k) is feasible for the original problem, you can actually add two cutting planes. The first is a deep cut in x and t, and it requires getting a subgradient of the objective. The second says that since the current point is feasible, its objective value is an upper bound on the optimal value, so t, which is supposed to be an upper bound on the optimal value, is clearly less than that, and you can add that one as well. So that's how you run a cutting-plane method in the epigraph. Okay. Now you can also get a lower bound; we'll get to some of these in a minute. This is related to something called Kelley's method. Suppose you've evaluated a function f and a subgradient of f at q points. The function doesn't even need to be convex, but of course it has to have a subgradient at those points. What that means, by definition, is that f(z) is bigger than f(x(i)) + g(i)'(z - x(i)) for all z; each of these is an affine global lower estimate on f. Well, obviously, if f(z) is bigger than each of these, it's bigger than the max, so you can write f(z) >= max over i of f(x(i)) + g(i)'(z - x(i)). This piecewise-linear function we're going to call f hat. It's a very useful way to think of what it means to get subgradients at a bunch of different points: a single subgradient gives you an affine lower bound, and if you get multiple subgradients at different points, for example if it's an algorithm and you're k steps in and at each step you've evaluated a subgradient, you actually have a piecewise-linear lower bound. And that's this thing. I'll draw that. It's kind of obvious what it is, but let me draw it just to give you the picture. Here's the picture of that on R, which is kind of a trivial example. Here's your function, and let's suppose I've evaluated the subgradient, sort of here, and here, and here.
And what you do is simply draw these first-order approximations, like this. There you go, something like that. And then this one looks like that. And this function right here that goes like this, that's a piecewise-linear lower bound on your function. By the way, if you minimize that, you get this point here, and that's clearly a lower bound on the minimum of your function. The point is that once you've evaluated not a couple, but in fact n + 1 subgradients of a function on Rn, then generically you can produce a lower bound on that function. The cost is solving an LP. So what happens is this: you replace f0 and the fi by their polyhedral under-approximators, then solve this problem. Well, it's not literally an LP, but it can be converted to one; you know how to do that.

So you convert this to an LP and solve it. This one's trivial, and for this one you introduce one epigraph variable t, and you get an LP. You solve that, and the optimal value is a lower bound; in fact, it's this value here. By the way, we didn't mention this, so I'll mention it here, because it's an interesting method. If you solve that LP, you get a lower bound on the optimal value, and if you take its solution as your next query point, you get a famous algorithm. So if your next query point is here, and it goes like that, that's called Kelley's cutting-plane method. It's actually quite a good method, from the '50s or '60s, one or the other. What happens is, when you add this cut, your lower bound goes up there, and your next point is in here. And you can see that in this case, a smooth problem, it's gonna do pretty well. It's not hard to show that it converges and so on. A small modification of Kelley's cutting-plane method actually makes it work really well in practice, and it's not clear from this picture why a modification would even be needed. But of course, pictures are generally in R2 and R3, and that's not where the interesting problems are; these aren't real optimization problems. Real optimization problems start at something like R10 minimum. They get mildly interesting in R100, and are real honest problems in R100K or something like that, where your intuition from drawing pictures isn't going to cut it, even close. Okay. Let's see. So we'll go on and do the analytic center cutting-plane method, ACCPM. I'd put it this way: if someone actually needed to run one of these non-differentiable methods with a cutting-plane oracle, then, depending on the situation, this is probably the algorithm of choice. We'll look at that.
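To make Kelley's method concrete, here is a minimal one-dimensional sketch; this is my own toy implementation, assuming a convex function on an interval. In one dimension the LP over the piecewise-linear model can be minimized by checking interval endpoints and cut intersections; in Rn you would hand the LP to a real solver instead.

```python
def kelley_1d(f, subgrad, lo, hi, x0, max_iters=20, tol=1e-9):
    """Kelley's cutting-plane method on an interval (a toy 1-D sketch).
    Maintains f_hat(z) = max_i f(x_i) + g_i (z - x_i), the piecewise-linear
    lower bound described above, and queries its minimizer over [lo, hi]."""
    cuts = []                       # (slope a, intercept c): the line a z + c
    x, best, lower = x0, float("inf"), -float("inf")
    for _ in range(max_iters):
        fx, g = f(x), subgrad(x)
        best = min(best, fx)        # best objective value found so far
        cuts.append((g, fx - g * x))
        # minimize max_i (a_i z + c_i) over [lo, hi]: the optimum is at an
        # endpoint or where two cuts intersect, so just check those points
        cand = [lo, hi]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                (a1, c1), (a2, c2) = cuts[i], cuts[j]
                if a1 != a2:
                    z = (c2 - c1) / (a1 - a2)
                    if lo <= z <= hi:
                        cand.append(z)
        lower, x = min((max(a * z + c for a, c in cuts), z) for z in cand)
        if best - lower <= tol:     # gap between upper and lower bound
            break
    return best, lower
```

For example, kelley_1d(lambda x: x * x, lambda x: 2 * x, -2.0, 2.0, 2.0) closes the gap between the best value found and the certified lower bound in three oracle calls.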
And there are a couple of things we have to talk about. The first is just to remind you what the analytic center is.

Again, this notation is not correct; well, this is wrong. This notation universally defines P to be this set, and the analytic center is not really a function of the set P; it's really a function of the inequalities describing it. But we'll leave this and just understand that it's slang. So what you have to do is maximize the product of the slacks in a polyhedron, and that can be done by Newton's method; this is completely smooth. It's actually self-concordant, so if you're gonna give a complexity theory result for it, you're already well on your way. And here's the ACCPM algorithm. You start with a polyhedron known to contain the target set; by the way, it doesn't even have to be a polyhedron, but it doesn't matter. Then you do this: you compute the analytic center, and you query the cutting-plane oracle. The oracle can tell you that you're done, in which case you quit. Otherwise, it returns a cutting-plane inequality, that's this, and that is appended to the description of P. Now, if this polyhedron is empty, you can announce that capital X, the target set, is empty and quit. Otherwise, you increment k and repeat. So that's ACCPM. Let's see, for constructing cutting planes, I think we've actually looked at a bunch of this before, but we'll go over it anyway. If you have an inequality constrained problem, you do the following: if x is not feasible, you choose any violated constraint and add a deep cut. If it's feasible, you can add a deep objective cut: you maintain the best objective value you've found so far, and you add this. So that's a deep objective cut. To get the analytic center, you have to solve this, and there are lots of ways to do that. But there is one issue, and that's this: you're not given a point in the domain of this function. That's actually the challenge here, and there are lots of ways to get around it.
One, of course, is to use a phase I method to find a point in the domain. That has the advantage that it will, in a graceful way, determine whether this thing is empty. If it's not empty, you can then use a standard Newton method to get the analytic center; you don't have to use an infeasible start Newton method. And there's one more method, which is to solve the dual of this analytic centering problem. If you think about it, that works very nicely: you have a set of inequalities, and the dual has a single variable for each inequality. If you add an inequality, you're actually adding a new variable to the dual, and so there the initialization is easier.
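As a concrete sketch of the centering computation itself, here is what damped Newton on the log barrier looks like when you do have a strictly feasible starting point. This is my own illustration: the stopping rule, step sizes, and problem data below are made-up choices, and a production implementation would exploit structure in the linear algebra rather than forming and solving dense systems.

```python
import numpy as np

def analytic_center(A, b, x0, tol=1e-10, max_iter=50):
    """Analytic center of {x : A x <= b} by damped Newton (a sketch).
    Assumes x0 is strictly feasible; otherwise you'd use a phase I method,
    an infeasible-start Newton method, or the dual, as discussed above."""
    def phi(z):
        s = b - A @ z
        return -np.sum(np.log(s)) if np.min(s) > 0 else np.inf

    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        s = b - A @ x                            # slacks, strictly positive
        grad = A.T @ (1.0 / s)                   # gradient of -sum(log s)
        H = A.T @ ((1.0 / s**2)[:, None] * A)    # Hessian: A' diag(1/s^2) A
        dx = np.linalg.solve(H, -grad)           # Newton step
        lam2 = -grad @ dx                        # Newton decrement squared
        if lam2 / 2.0 <= tol:
            break
        t = 1.0                                  # backtracking line search,
        while phi(x + t * dx) > phi(x) - 0.25 * t * lam2:
            t *= 0.5                             # also keeps x feasible
        x = x + t * dx
    return x
```

For a symmetric box like [-1, 1]^2 this returns the origin, which is the analytic center by symmetry.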

So that's yet another method. The infeasible start Newton method, the truth is I didn't really cover it at all in 364a. It's in the book and stuff like that, but I covered it maybe too fast. So here it is. You want to minimize this separable convex function subject to y equals b minus Ax, so we introduce this new variable y. And the way it's gonna work is this: the infeasible start Newton method does not require the points to be feasible. In the traditional Newton method you must start with a feasible point, and every direction you generate is in the null space of A, so the points remain feasible; it's a feasible method. With the infeasible start Newton method, you don't have to be feasible. So here's the way you would solve this problem. You initialize x as anything you like: zero, the last x you found, some guess, or something like that. And a typical method is to initialize all these y's to be, for example, one. Why not? That's a point nice and squarely in the domain of the objective, which is the sum of logs. When you do that, obviously, this equality constraint is not satisfied, unless you're really, really lucky and your x happens to satisfy it already. So this is the initialization. Here would be a common choice: suppose you had a previous guess for x. You could choose y as follows: you evaluate each bi minus ai transpose x, that's the ith row here, and if that's positive, no harm, leave yi equal to that; if it's positive, it's in the domain of the log. You might want a little threshold there, so you don't accept values that are positive but tiny. Otherwise, you just set yi to be one. What happens then is that some of these equality constraints are already satisfied, but any where this is not the case are not satisfied. So that's how that works. By the way, this infeasible start Newton method is very cool. It has the property that if any constraint, literally coordinate by coordinate, is satisfied at any step, it will be satisfied forever afterward.
So for example, if there are 100 constraints here and 80 of them are satisfied immediately, but 20 are violated. Or let's make it even simpler.

Let's take the cutting-plane method. In a cutting-plane method, you just solved the problem with 99 inequalities; you add one inequality, and now you have 100. You satisfy the 99, because you were actually at the analytic center of those 99. However, the new constraint you add, you've essentially violated for sure: with a neutral cut you might just have it equal to zero, but with a deep cut you're guaranteed to violate it at the current point. Then what's cool is that 99 of these equality constraints are satisfied and only one is not, which is the new one. Everybody understand? Okay. So the method works like this. You define the primal residual as y plus Ax minus b. Of course, in the standard Newton method you are always primal feasible, so you don't even have the concept of rp; rp is just zero initially and at every step, and it stays zero because your Newton step is in the null space of A, so no matter what step you take, you continue to satisfy Ax equals b. In the infeasible start Newton method, that's false: if the original point was not feasible, rp is guaranteed to start off nonzero. And the dual residual has two blocks, A transpose nu, associated with x, and g plus nu, associated with y. And here the gradient g is very simple to calculate: it's simply the vector whose entries are minus one over yi. Now, the Newton step is defined as follows: you linearize the equations rp equals zero and rd equals zero at the current point. This equation is already linear, so there's hardly any linearization going on there; this one is not linear, because g is a nonlinear function of y. And you'll get something that looks like this. Now, H here is the Hessian of the objective, so that's diagonal; you get A and you get A transpose and so on.
Now, that's a large system. Let's see, I can't remember the dimensions of n and m, but y is going to be in Rm, so this is not a small system; it's quite big. However, this should be screaming structure at you. There are tons of zeros, there are identity blocks, this block is diagonal, there's A and there's A transpose. So you should have an overwhelming and uncontrollable urge to apply a smart method to solve these equations. And in fact, you can do this by eliminating various blocks and things like that. Instead of a complexity that is cubic in m plus n, you'll actually do much better.

What happens is, you can actually solve them by calculating delta x this way and delta y that way, and so on. Now, when you calculate delta x, you have a couple of different methods. One is to actually form this matrix; here you can exploit any structure in A in forming it, even if the matrix is dense. If the multiplied-out matrix has structure, you could use a sparse Cholesky factorization, or a dense one, or something like that. That would be one method. The other method for solving this, and this would probably be the method not frowned upon by people who do numerical computation, is to note that delta x is the solution of this least-squares problem. If you actually form the matrix A transpose H A, most people who do numerical computation would consider that a misdemeanor; it's called squaring up. Whereas over here, you never actually form that matrix, whose condition number is the square of the condition number of this thing here. So this is another way to do it. And by the way, the H-halves don't have to scare you, because they're actually diagonal, and this is nothing but a diagonally weighted least-squares problem. So here's the infeasible start Newton method. You start with an arbitrary point x, and y has to be positive: it could be all ones, or that initialization we talked about before. You have your usual backtracking parameters. What you do is calculate the Newton step in delta x, delta y, and delta nu, and you do a backtracking line search. In the infeasible start Newton method, you cannot use the function value to do the line search. That's completely obvious: you're infeasible, so your objective can be very, very good, and it's just totally irrelevant, because you're not even feasible.

Anyway, it would make no sense, and you can easily see the method would fail: initialize it at any point that's infeasible but has an objective value better than the optimal objective value, which of course is higher because it requires feasibility. Then a line search on the function value would never get you to the optimal point. Instead, what you do is a line search on the norm of the residual, and that's easy enough to do. The simple way is something like this: while y plus t delta y is not positive, you multiply t by beta in a while loop. By the way, related to the question you asked earlier, it's actually much more common to do this: if you're writing this in a real language, what you'd really do is take y and delta y, quickly calculate the largest t for which y plus t delta y is still positive, and then take the minimum of that t multiplied by 0.99 and 1. That'd be your starting t. Then you'd drop down to this line, and multiply t by beta until the residual condition is satisfied, and then you update. Now, there are actually some pretty cool things about this method. It has several properties; I'll just mention them, and you'll see them at some point soon, maybe on Homework 5, I don't know. One interesting point is this: if you ever take a step size of one, then all equality constraints become satisfied at the next iterate, and they will always be satisfied from then on. That's one property. Another one is that if any individual equality constraint is ever satisfied at some step, it will be satisfied forever after.

Actually, that's unlikely to occur during the algorithm, so probably the better way to say it is this: any equality constraints initially satisfied will remain satisfied during the algorithm, and then there will be a point at which you take a step size of one, after which all equality constraints will be satisfied. That's how that works. Okay. I think I already said all these things, but there are a couple of things to mention. This method does not work gracefully if there is no feasible point. Infeasible here means there's no point in the domain: there's no positive y with y equals b minus Ax. Then, of course, what happens is the Newton method churns along; it obviously cannot take a step length of one, because if it took a step length of one, the next point would be feasible, and that's impossible, so it doesn't. There are methods to recognize infeasibility and stuff like that, but I think the best way to say it is that the infeasible start Newton method is designed for problems that are going to be feasible; it just doesn't gracefully handle infeasible ones. Let's go back to this pruning, and I can say something about it. This is also material from 364a and it's in the book, but it's a big book, and in 364a we went way fast, so I didn't expect anybody to get it; we're just doing backfill now. These are actually very, very good things to know. So suppose you have the analytic center of a polyhedron; again, that's slang. If you calculate the Hessian of the log barrier at the analytic center, it simply looks like this; it's nothing but this thing here. The gradient at the analytic center is zero, obviously; that's the definition of it. Then it turns out the following is true: that Hessian defines an ellipsoid. In fact, let me say something about the other result first, because there are some very good things to know about these.
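Here is a small dense sketch of the whole infeasible-start Newton method as described above. This is my own illustration with made-up problem data: for clarity it assembles and solves the full KKT system directly, whereas a real implementation would eliminate blocks to exploit the diagonal Hessian. The line search is on the norm of the residual, and y starts at the all-ones vector.

```python
import numpy as np

def infeasible_newton(A, b, x0, tol=1e-10, max_iter=50, alpha=0.01, beta=0.5):
    """Infeasible-start Newton for: min -sum(log y) s.t. y = b - A x (sketch)."""
    m, n = A.shape
    x = np.asarray(x0, dtype=float)
    y, nu = np.ones(m), np.zeros(m)          # y = 1 is squarely in the domain

    def residual(x, y, nu):
        r_dual = np.concatenate([A.T @ nu, -1.0 / y + nu])  # x-block; y-block
        r_pri = y + A @ x - b                               # primal residual
        return np.concatenate([r_dual, r_pri])

    for _ in range(max_iter):
        r = residual(x, y, nu)
        if np.linalg.norm(r) <= tol:
            break
        H = np.diag(1.0 / y**2)              # Hessian of -sum(log y), diagonal
        K = np.block([                       # linearized KKT system
            [np.zeros((n, n)), np.zeros((n, m)), A.T],
            [np.zeros((m, n)), H,                np.eye(m)],
            [A,                np.eye(m),        np.zeros((m, m))],
        ])
        d = np.linalg.solve(K, -r)
        dx, dy, dnu = d[:n], d[n:n + m], d[n + m:]
        t = 1.0
        while np.min(y + t * dy) <= 0:       # keep y in the domain of the log
            t *= beta
        while (np.linalg.norm(residual(x + t * dx, y + t * dy, nu + t * dnu))
               > (1 - alpha * t) * np.linalg.norm(r)):
            t *= beta                        # line search on the residual norm
        x, y, nu = x + t * dx, y + t * dy, nu + t * dnu
    return x, y, nu
```

Note that the primal residual is linear, so the moment a full step t = 1 is taken, y = b - Ax holds exactly and stays satisfied from then on, which is the property mentioned above.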
These are in the book, I think, but I can't really remember. Here it is. Here's a polyhedron, and here's a point z, which is not the analytic center. Calculate the Hessian of the log barrier there, that's this thing up here, like that. And this ellipsoid here is the set of x such that x minus z, transpose, times H, times x minus z, is less than or equal to one. So you'd simply calculate that Hessian right there, that formula, without the stars. This ellipsoid is guaranteed to be inside that set, always, whether or not you're at the analytic center. That's gonna come up later, because we're gonna look at another very interesting method called Dikin's method.

So at any point, if you calculate the Hessian of the log barrier function, then the ellipsoid defined by that Hessian, with a one on the right-hand side, is inside the set. Period. Now let's go to the analytic center; let's say that's here. Then this result holds, because it actually holds at any point. So if I draw the ellipsoid like that, and by the way I made it too small, you'll see why in a second, with a one it's inside the set. And what this other result says is that if I puff that up by a factor of m, the number of inequalities, let's see, one, two, three, how many of these? Six? So if I puff up that unit ellipsoid by a factor of six in this case, it will cover the set. No, I'm gonna miss it; I didn't draw it big enough. But theoretically, if I drew this right and puffed up that ellipsoid by a factor of m, it would cover the set. So if you calculate the analytic center, you actually get an ellipsoidal approximation of your polyhedron, one that's within a factor of m in radius: you have an inner ellipsoid guaranteed to be inside, and puffed up by a factor of m, it's guaranteed to contain the set. By the way, the fact that this is guaranteed to cover the set is interesting, because if I just give you a polyhedron and an ellipsoid and say, "Would you mind checking whether your polyhedron is inside my ellipsoid," that's NP-hard: you have to maximize a convex quadratic function over a polyhedron. So you can't even check it for a general pair, but in this case there's a fast way to verify that it's true. I think this is easy to show; you may even do it yourself, who knows? So here's what this allows you to do. And if you use the analytic center cutting-plane method, you're gonna have calculated this anyway, because the Hessian is just that A transpose, diagonal, A expression. That's all it is.

So you've already calculated all this stuff anyway. And it turns out that you can now define something, an irrelevance measure, and it's this: bi minus ai transpose x star, divided by this norm. It's essentially the distance from the analytic center to the hyperplane, normalized by the ellipsoid metric. And I guess I'm not even sure I need the m here; is that right? No, I guess you do. Maybe we should normalize eta to go between zero and one. Anyway, I don't know. So here's what you're guaranteed: if eta is bigger than m, then the constraint is redundant, and you know that without solving an LP. And by the way, it's super cheap, because I think you've actually already calculated all these numbers at the last step of Newton's method, or you're very, very close to having them. Well, we'll leave it. You have, because you've already factored this to calculate the last Newton step. So if you're doing direct methods, you've factored that, and once you've factored it, computing this is just a norm: you have a Cholesky factor of that, this is the norm of L inverse ai, and that's one back-substitution.
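Here is a sketch of that computation, with a made-up example: a box with two clearly redundant constraints appended. My own illustration; in ACCPM the Cholesky factor of the Hessian would be left over from the last Newton step rather than recomputed.

```python
import numpy as np

def irrelevance_measures(A, b, xstar):
    """Irrelevance measures eta_i at the analytic center xstar (a sketch).
    eta_i = (b_i - a_i' xstar) / ||H^(-1/2) a_i||, with H the Hessian of the
    log barrier at xstar; eta_i >= m certifies that constraint i is redundant."""
    s = b - A @ xstar                        # slacks, positive at the center
    H = A.T @ ((1.0 / s**2)[:, None] * A)    # barrier Hessian: A' diag(1/s^2) A
    L = np.linalg.cholesky(H)                # H = L L'; in ACCPM this factor
                                             # is reused from the Newton step
    W = np.linalg.solve(L, A.T)              # columns are L^(-1) a_i,
    eta = s / np.linalg.norm(W, axis=0)      # one back-substitution per a_i
    return eta
```

On the box [-1, 1]^2 with the redundant constraints |x1| <= 10 appended, the analytic center is the origin by symmetry, every eta comes out at least one, and the two redundant constraints get eta well above m = 6, so they can be safely dropped.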

Student:[Inaudible].

Instructor (Stephen Boyd):Hang on here. If you have a whole bunch of constraints describing P, the analytic center cutting-plane method says you've gotta calculate that Hessian anyway; you have to, to compute the Newton step for the analytic center. So that's done; you're already doing that. Now, what is true is that this is safe constraint dropping: if you drop a constraint because eta is bigger than m, it's safe. You don't have to avoid your complexity theorist friends and things like that. The proof is still very, very simple, because the pruning does not change the set; it changes the description of the set, because you dropped irrelevant inequalities, but it does not change the set. So you can hold your head high if someone grills you and says, "Are you absolutely certain this method will converge?" You can say, "Yes." So that's safe. But there's no reason to believe safe dropping alone will keep the description small, so in fact, what's generally done is you keep 3n to 5n constraints, ordered by eta: you drop the ones with the largest values and keep the 3n to 5n constraints with the smallest values of eta i. Now, let me describe the method another way; in fact, it's a perfectly good way to describe it. It goes like this: here's your polyhedron, like that, and you calculate the analytic center. Now, you change coordinates so that the Hessian of the log barrier at that point is the identity. What that does is round your set; you're making it more isotropic. What you know now is that a ball of radius one fits inside your set, and a ball of radius m covers your set. So that's what's happened. And in those transformed coordinates, the irrelevance measure is merely the distance to each hyperplane. So you can say all sorts of interesting things about it.

Actually, it says that no value of eta can possibly be less than one. It's impossible, because at the analytic center the ball of radius one fits inside your set, and eta is the distance to a hyperplane. So all the etas are bigger than one, and any eta bigger than m is guaranteed to correspond to a redundant constraint. That's the picture in the transformed coordinates. Okay. And you get a piecewise-linear lower bound; we can go through that. In ACCPM you get that trivially, because you get the subgradients, so you can form the piecewise-linear approximations; we know this. But you can also get a lower bound from the analytic centering itself. That's not surprising, because the analytic center cutting-plane method calculates an analytic center, and an analytic center is a step in a barrier method. When you finish a barrier method, you get a dual feasible point, which we know; that's the basic idea of a barrier method. So it's not surprising that you're gonna get a lower bound. The details don't matter, I think, and this is maybe better done in the notes than in slides, but the point is that x(k+1), the analytic center of the current polyhedron, satisfies something like this, and you end up with a lower bound from all the stuff you just calculated. It's based on the idea that you calculated an analytic center, analytic centers are associated with the central path, on a central path you've actually calculated dual feasible points, whether you know it or not, and dual feasible points give you lower bounds on the problem. When you put all that together, that's the lower bound right there, which you've already essentially calculated. Okay. And ACCPM isn't a descent method, so you keep track of the best point found and so on. Let's just take a look and see how this works. So here's a problem instance with 20 variables and 100 terms, so it's a piecewise-linear minimization problem; I think it's the one we've been using all along.
So this is what it is, and I guess the optimal value's around one, so that level means 1 percent accuracy, that means 0.1 percent, and so on. And this is f minus f star. Obviously, you don't know this at the time; in this case it's an LP, and you solve the LP to find f star. And this shows you the progress. We can say a lot of things about this plot if you compare it to the subgradient plots. This appears to be what we'll call linear convergence, and that's actually correct; there are proofs of polynomial-time convergence and everything like that. It looks linear because that's a log scale and there's basically a line that looks like that.

And that means, roughly, that each iteration gives you a constant factor of improvement; you can work it out by calculating the slope there. Now, subgradient methods are very different: there you're talking one over square root of k, or slower, convergence. That doesn't look like this; it looks like that, and then it gets very, very, very slow. Okay. This is what you get if you look at just the best point found so far; we'll come back and look at some of these later. And this shows you the real effort, which is the number of Newton steps required. In fact, you can see here that the infeasible start Newton method at some point jams. Each of these treads is constant, and the width of a tread is the number of Newton steps required to find a feasible point and then calculate the analytic center. By the way, things like this are just appalling; that's 200 Newton steps or something like that. So this is the amateur implementation. The correct method of dealing with this, which would have made this curve go like that, would have used something like a phase I there; phase I would be far superior. This is just to show. And then this shows the bound and everything like that. We're out of time, so we'll come back to this later, although I will just show one more thing, which is this, and we'll talk about it next time: if you drop constraints, you do very well. Okay. We'll just continue here next time.

[End of Audio]

Duration: 78 minutes