ConvexOptimizationI-Lecture08

Instructor (Stephen Boyd):We'll start with an announcement. It should be kind of obvious anyway. You should start reading Chapter 5. I'll go fast, though not that fast, on our next topic, but you should be reading along with us, or even ahead of us, if you want all the details. Okay. So we're really gonna start this topic of duality. I think last time I did nothing but say a few things about it. They were kind of incoherent, but maybe I'll make them coherent today. One way to think about duality is that it's a way to handle the constraints by incorporating them into the objective. That's the basic idea. So let's see how that works. It fits in exactly with what I was talking about last time, which had to do with the concept of irritation versus value for a function: a hard constraint takes you from completely neutral to infinite irritation, as opposed to something soft like an objective, which is just a linear irritation function or something like that, meaning the larger the function is, the more irritated you get, and the smaller it is, the happier you get. That's gonna tie into the idea of duality. All right. So first we start with a couple of definitions. We start with a problem like this: minimize an objective subject to some inequality constraints and equality constraints. We are not going to assume this is convex, and in fact, some of the most interesting applications of the material we're gonna talk about now occur when this problem is not only not convex, but known to be hard. So those are gonna be some of the most interesting applications. We form a Lagrangian, and the Lagrangian is simply this; I mean, it's really kind of dumb. You take the objective, and to the objective you add a linear combination of the inequality constraint functions and the equality constraint functions.
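For reference, in the notation of Chapter 5, the problem is to minimize f_0(x) subject to f_i(x) <= 0 for i = 1, ..., m and h_i(x) = 0 for i = 1, ..., p, and the Lagrangian being described is

L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x).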

So that's it. Here you say, for example, that lambda three is the Lagrange multiplier, or dual variable, or price (I'll justify that term soon) associated with the third inequality, F3 of X less than or equal to zero. That's what lambda three is. And in fact, you can even get a hint at what's going on here. If, for example, lambda three were six, that would mean the following: up here, this says that if F3 is less than or equal to zero, it's perfectly okay; if it's bigger than zero, it's infinitely bad. When you replace this constraint by this charge, lambda becomes the price, or you can call it a penalty, and it basically says F3 can be positive, that's not a problem now, but you're gonna pay for it: for every unit that F3 is positive, you're gonna pay six, in whatever units those are. Now, the flip side is actually quite interesting. You actually get a subsidy for coming in under budget. Up here, as we said last time, there's absolutely no difference between F3 of X, for example, being minus .001 and minus 100, absolutely no difference; both are feasible, and you have no right to say that you like one better than the other, because it's not an objective, it's a constraint. Now, when you convert that one constraint to this sort of thing, something interesting happens. When F3 is less than zero, you actually get a subsidy, because it's six times whatever margin you have. So when you convert this constraint to a term in a Lagrangian, you actually get a subsidy for coming in under budget, and you do like F3 equals minus 100 more than you like minus .0001. You get a subsidy. I'm just hinting at what you're gonna see soon. That's the Lagrangian. Then we look at the so-called dual function, or the Lagrange dual function; it's got other names, but let's look at it. So the Lagrange dual function is actually very, very simple. It just says: minimize this Lagrangian over all X. That's what it says. So you just minimize the Lagrangian. Now, it's actually very interesting, because it's a function of these prices, or dual variables. There are all sorts of ways to interpret this. You can interpret it as sort of the free market approach or something like that.
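In symbols, the dual function just defined is

g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right).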

This is the constrained approach, where you simply say by fiat that FI of X has to be less than zero and HI of X has to be zero. Then you could say: no, you know what, we're gonna let FI float above zero, no problem, we'll charge you for it; the lambda I's are positive. That would be the meaning here. But if FI goes less than zero, we'll actually pay you for it. You'll be subsidized. And this is sort of the optimal cost under these market prices. All of this will become clearer as we move on. Now, here's an interesting fact. If you look at this function as a function of lambda and nu, what kind of function is it, for each fixed X? It's affine. Yeah. It's linear plus a constant. Okay. So it's affine. And the pointwise infimum of a family of affine functions is of course concave. So this function G, you can think of it as the optimal cost, we'll get to that later, as a function of the prices. Even if the original problem is not convex, the Lagrangian is still affine in lambda and nu, so this dual function is absolutely concave. All right. Now, we get to something very simple, but it's one of those things where you get a sequence of 12 simple things, and the right sequence of 12 simple things will lead you to a very interesting thing. So trust me, that's what we're doing here. It looks very simple, but it's quite profound. It says, basically, if you can evaluate G, the dual function, you're gonna get a lower bound on the optimal value of the original problem. That's what it says. So the argument is just embarrassingly simple. Look at this thing and imagine X is feasible. For any feasible X, FI of X is less than or equal to zero, but lambda I is bigger than or equal to zero, so this term is less than or equal to zero, and therefore the whole sum is less than or equal to zero. If you're feasible, HI of X is zero, so it doesn't even matter what the sign of nu I is; that sum is zero. Therefore, the Lagrangian is less than or equal to F0 of X for any feasible X.
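Spelled out, the argument just given is: for any feasible \tilde{x} and any \lambda \succeq 0,

L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_i \lambda_i f_i(\tilde{x}) + \sum_i \nu_i h_i(\tilde{x}) \le f_0(\tilde{x}),

so g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f_0(\tilde{x}), and hence g(\lambda, \nu) \le p^\star.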

Now, for infeasible X's that's false, but for feasible X's it's absolutely true that L of X, lambda, nu is less than or equal to F0 of X, because that term is zero and that one is less than or equal to zero. By the way, note the level of the mathematics being employed here. It's quite deep. It relies on the very deep fact that the product of a non-positive number and a non-negative number is non-positive, and also that you can add such numbers up and still have something less than or equal to zero. So I just want to point out: nothing has been done here. It's just embarrassingly simple. Okay. Now, if you then minimize this over X, the result is obviously still less than or equal to F0 of X tilde. This is true for any feasible X tilde, and there's no conclusion possible other than the lower bound. Okay. So let's look at some examples. Let's do the least norm solution of linear equations. Now, of course this one is silly: we know how to solve this problem analytically, and it doesn't even matter what the solution is; we know everything there is to know about it. But just as an example, let's see how this works. Well, the Lagrangian is this: X transpose X, that's the objective, and we add nu transpose times AX minus B. Here, by the way, you have to write the constraint as AX minus B equals zero. You have to decide what H is; H is either AX minus B or B minus AX, and all that would change is the sign on nu, so it wouldn't matter. I've written it as AX minus B equals zero. We're gonna minimize this over X, and that's completely trivial, because it's a convex quadratic plus something affine in X. You just set the gradient to zero: you get 2X plus A transpose nu equals zero, and that gives the X that minimizes the Lagrangian. All right. Now, we take that X and plug it back in to get G. When you plug it in, you get this. Well, first of all, let's take a look at it.
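Carrying out that substitution: with L(x, \nu) = x^T x + \nu^T (Ax - b), setting the gradient to zero gives x = -(1/2) A^T \nu, and plugging back in,

g(\nu) = -\tfrac{1}{4} \nu^T A A^T \nu - b^T \nu.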

Evidently, this function is concave, because that is a positive semi-definite quadratic form. All we need is positive semi-definite; whether it happens to be positive definite doesn't matter, because I'm making no assumption about A whatsoever here. Everything is true no matter what I assume about A. Okay. So this whole function is concave quadratic, which we knew had to happen, because the dual function is always concave. Now, here's what's interesting, and we've already learned something. It's not a big deal, because I know how to solve this problem, but look at this. What it says is the following: if you come up with any vector nu at all, which is the size of B (the height of B, let's call that M), and you simply evaluate this function, then whatever number you get is a lower bound on the optimal value of this problem. In this case, it's totally useless. Even if X has a thousand variables or something like that, this is goofy, because we know how to solve this problem extremely efficiently and get the optimal solution; we don't need a lower bound on it. But the idea becomes immediately useful elsewhere. Let's look at a standard form LP. So I want to minimize C transpose X subject to AX equals B and X greater than or equal to zero. Just a standard LP. Okay. The Lagrangian is C transpose X plus, and I'm gonna add a Lagrange multiplier for the equality constraint, nu transpose times AX minus B. Then, to put this in standard form, you really have to write the inequality as minus X less than or equal to zero, so those are the FI's over here, and that explains the minus sign on the lambda. Okay. So that's your Lagrangian. What kind of function is the Lagrangian in X? Hey, look at that, that says linear, doesn't it? And it seems to me that that is false, I believe. That's the name for something that's not true. Isn't that correct? You agree with me? The Lagrangian is not linear in X. That's ridiculous. Is it affine? Yes, L is affine in X. It's affine because there's this constant term, minus nu transpose B, here. Okay. Now, this leads us to something. It's actually related to something on the homework from this week. Here's a very, very simple linear program. Ready for it? No constraints. How do you minimize an affine function? More than a small number of people were stumped by this because it's such a simple question. Actually, it's not a stupid question, sorry; it's just that they wanted to make it more complicated. How do you minimize a linear function?

Student:[Inaudible]

Instructor (Stephen Boyd):What's that?

Student:[Inaudible]

Instructor (Stephen Boyd):What's the minimum of a linear function? Just in R2, I have a linear function. What's the minimum of X1 minus X3?

Student:[Inaudible]

Instructor (Stephen Boyd):X2. Sorry. Go ahead. What is it?

Student:It's minus infinity.

Instructor (Stephen Boyd):Of course. It's minus infinity. You have level curves which are hyperplanes, and you just go as far as you can in the direction minus C; it's minus infinity. So there's nothing, there's no mystery here. And what's the minimum of an affine function? Same thing, no mystery. Okay. By the way, notice that that's valid: if G is minus infinity, that's cool, it's just minus infinity, and note that it is indeed a lower bound. However, it's an uninformative lower bound, because if somebody comes up to you and says, "Can you help me get a lower bound on this," you can just automatically say minus infinity. There's exactly one exception. What's that? When the linear part is zero. And that's it. That's the only way. So in fact, if you minimize this over X, you will get minus infinity, with one exception: if C plus A transpose nu minus lambda is zero, then this whole linear part goes away and you get that. So the dual function for the standard form LP is a very interesting function, and we're going to look at it very closely. It's this. It is a function which is linear on this weird affine set, and it's minus infinity otherwise. By the way, that function is concave. Right? You can just visualize it. I mean, it's kind of weird. In R2 it would be like my drawing a line like this, with the values 0, 1, 2, -1 along it, sloping up, and then everywhere off that line the function is minus infinity, so just make it fall off a cliff everywhere else. That function is concave. Well, it has to be concave, because we know G is always concave. Okay. But if you want to visualize it, that's what it looks like: it's a weird thing, it's linear, but off this thin affine set it falls to minus infinity. Okay. So that's what G is. Is there a question?

Student:[Inaudible] LP is affine [inaudible]?

Instructor (Stephen Boyd):Okay. So the question is about this function, which I write this way. You just have to blur your eyes, because I'm asking what kind of function of X it is. Okay. So let's look. That is a constant. That is a constant vector, but if I include the transpose here, it's a constant row vector; therefore, this part is linear in X, and I add a constant, so it's affine. Does that make sense? Okay. Now, this is really interesting. We can actually say what the lower bound property is. So the lower bound is this. This function is the dual function. It is always a lower bound on that LP. Now, of course, if you randomly pick lambda and nu, you're gonna get minus infinity. Just minus infinity. Okay. In which case it's still a valid lower bound, but it's a completely uninformative one. It's the lower bound that works for all problems. It's the universal lower bound. So let's see what we've come up with. It says the following. If you have this linear program here and someone asks what its optimal value is, well, it depends on the context. If a person has a specific A, B, and C, and actually just wants their problem solved, you can run some code and solve the problem if it's not too big, if that's what they're interested in. Okay. However, you can make a very general statement. You can say the following. If you find any vector nu by any means, it doesn't matter how you find it, it's no one's business how you found it, such that A transpose nu plus C is a non-negative vector, then you evaluate minus B transpose nu, and that's a lower bound on this LP. That strings together three or four totally obvious things, but you come up with something that's not obvious. A small numeric sanity check of this bound follows below.
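Here is a minimal sketch of that check in Python, assuming NumPy and SciPy are available; the instance is made up, and linprog is only used to get the true optimal value to compare against.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up standard form LP instance: minimize c^T x  s.t.  Ax = b, x >= 0.
rng = np.random.default_rng(0)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = A @ rng.random(n)          # guarantees some feasible x >= 0 exists
c = rng.random(n)              # c >= 0 keeps the LP bounded below

primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * n)
assert primal.status == 0      # solved; primal.fun is the optimal value

# The claim: any nu with A^T nu + c >= 0 certifies p_star >= -b^T nu.
for _ in range(1000):
    nu = 0.1 * rng.standard_normal(m)
    if np.all(A.T @ nu + c >= 0):              # nu is "dual feasible"
        assert -b @ nu <= primal.fun + 1e-7    # the lower bound holds
```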

Okay. Let's do equality constrained norm minimization. So here you minimize the norm of X subject to AX = B. By the way, we did nearly this problem two examples ago, not quite: we did the case where the objective was the two norm squared. Now it's a completely general norm. Well, the Lagrangian is norm X minus nu transpose times AX minus B, some horrible thing which is affine; the B term is a constant, so it's totally irrelevant to the minimization, and what you have to be able to do is minimize norm X minus nu transpose AX over X. Now, this goes back to the idea of a dual norm, so let's look at that. The question is: what do you get here? And the answer follows pretty much straight from dual norms. In fact, we can do the two norm first, just as a warm-up. For the two norm you'd say: well, look, if norm Y is bigger than one in the two norm, then I can align X with Y, and the linear term has enough gain to overpower the norm term, and I can make the whole thing go to minus infinity. Okay. On the other hand, if norm Y is less than or equal to one, the Cauchy-Schwarz inequality tells me the linear term is no bigger than the norm term, and therefore this whole thing is bigger than or equal to zero. So I could never make it negative. On the other hand, by choosing X equals zero, I can make it zero, so that's clearly the minimum.
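Written out for a general norm, the fact being used here is

\inf_x \left( \|x\| - y^T x \right) = \begin{cases} 0 & \|y\|_* \le 1 \\ -\infty & \text{otherwise,} \end{cases}

which is essentially the definition of the dual norm \|y\|_* = \sup_{\|x\| \le 1} y^T x.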

So this is equal to, and this generalizes now to a general norm: it's equal to minus infinity if the dual norm of Y is bigger than one, and zero otherwise. Okay. And in fact, that's the dual norm; it follows from the definition of the dual norm. So this is what you get. All right. Applying this up here gives us exactly this. A transpose nu is our Y, and here you have it. So once again, this dual function is not totally obvious. I mean, it's not an obvious thing. It's linear, but it's got a weird domain: in this case, the set of points nu where the dual norm of A transpose nu is less than or equal to one. Okay. That's that. And then you can go through the argument, and I won't go through it, but now you've got something totally non-obvious. It basically says: if you can come up with a vector nu for which A transpose nu is less than or equal to one in dual norm, then B transpose nu is a lower bound on the optimal value of this problem. Here's an example. Ready for a dual feasible point? Nu equals zero. Well, let's check: A transpose times zero is zero, and zero is definitely less than one. And now we have a drum roll to find out what lower bound we've come up with. The lower bound is zero, which is actually not particularly interesting here, because the objective is a norm and therefore non-negative anyway. Okay. So that's what it says here. This gives you a parameterized lower bound; it's parameterized by nu. Okay. Now, we're gonna look at a problem we'll see a whole bunch of times. It's actually just a simple example, but it's a perfectly good working example of a hard problem: two-way partitioning. It's embarrassingly simple. It goes like this. I want to minimize a quadratic form subject to XI squared equals one, and this means each XI is plus or minus one. Okay. And let me first say a little bit about this problem, since we're gonna see it a lot, just so you get a rough idea of what it means.

So XI is plus or minus one, so we can really think of it as the following: you have a set of, say, M points, and what you want to do is partition them into two groups. That's one group, and the other group is this. And we encode that by saying here's where XI equals plus one and here's where XI equals minus one. So we're gonna use the variable X, which is a plus-minus one vector, to encode a partition. It's just a numeric data structure that encodes a partition. All right. Let's look at what the objective is. The objective is the sum over all pairs of XI XJ WIJ, so let's just see what it is. If XI and XJ are in the same partition, what happens, what is XI XJ?

Student:One.

Instructor (Stephen Boyd):One. Okay. And then you add that term to the sum, and this is something we want to minimize. Now, if XI and XJ are in opposite partitions, the term is negative, so I think that means WIJ is a measure of how much I hates J. Did I do that right? I believe so, because if WIJ is very large and XI and XJ have the same sign, you're gonna be assessed a big charge in the cost; if WIJ is large and they're in opposite groups, you're gonna decrement the cost a lot and happiness goes up. Okay. So WIJ is basically how much I annoys J, but it's symmetric, so really it's the average of how much I annoys J and J annoys I. Now, if WIJ is small, it means they don't care much. So in fact, this now makes perfect sense: you have a group of people, a social network or something like that, and you want to partition it. There would be obvious instances: if the sign pattern in W were such that everybody liked everybody except one person, it would be very simple; you'd just isolate that one node. But in general, actually finding a solution to this problem is extremely hard. You can't do it. If the size were a hundred or a couple of hundred, you can't do it. It just cannot be done. Okay. So that's the partitioning problem, and for us it's gonna be a canonical example of a hard problem. By the way, there are instances of it which are easy, as I just mentioned when there's some obvious solution, but I'm talking about the general case. My interpretation was sort of a joke, but the point is, it comes up in tons and tons of real applications: it comes up in partitioning problems, it comes up in statistics, just everywhere. So my interpretation was a joke, but it's a very real problem with real applications. Okay. Now, the dual function is this. We simply take X transpose WX, and we add, as the Lagrangian tells us to, a linear combination of the constraint functions. I write them as XI squared minus one, so I get this, and I have to calculate the minimum over X. This is the Lagrangian, and the Lagrangian is quadratic. Okay. It's a quadratic function. So we're gonna have a very short discussion about how you minimize a quadratic function. The first thing to talk about is: how do you minimize a quadratic form? So what is the minimum of a quadratic form?

Student:[Inaudible]

Instructor (Stephen Boyd):What is it?

Student:[Inaudible]

Instructor (Stephen Boyd):It can be negative infinity. I agree. When would it not be?

Student:Within [inaudible]

Instructor (Stephen Boyd):If the quadratic form is positive semi-definite, then the only values it takes on are non-negative, so it couldn't be minus infinity; that's exactly the condition. The minimum of a quadratic form is minus infinity if the matrix is not positive semi-definite: if it has even one negative eigenvalue, the minimum is minus infinity. Otherwise, if it's positive semi-definite, the minimum is zero, because it can't be any lower than zero, and you attain zero by plugging in zero. Okay. Let me tell you what the Lagrange dual function is for you right now. It gives a lower bound on an optimization problem, parameterized by these dual variables lambda and nu. Now, in some cases, if you plug in some lambda and nu, you're gonna get the following lower bound: minus infinity. But it's always a lower bound. In some cases, you plug in lambda and nu and you're gonna get an interesting lower bound that's not obvious. So right now, you should just think of it as a lower bound. Lower bounds can be good, they can be tight, they can be crappy; we're gonna get to all that later. Okay. Now, I just want to tie together the idea of the conjugate function and Lagrange duality. So if you have a problem with an arbitrary objective and just linear inequality and equality constraints, and you work out what the dual function is, it's the minimum of F0 plus a linear function of X; I collect the multipliers together and multiply by X, and then there's a constant. And what this means is the following. If I focus on this and then go and look up the conjugate function, which is the supremum over X of Y transpose X minus F of X, and plug in all the right minus signs, you get this. It's equal to that. So the Lagrange dual function of this problem is expressed exactly in terms of the conjugate of F0.

Now, recall that conjugate functions often have domains that are not everything; in one of our examples the domain was the probability simplex. So that will automatically impose some inequality constraints here when that happens. But here's an example. The maximum entropy problem is: maximize the sum of minus XI log XI subject to some inequalities and equalities. And by the way, that's already a really interesting problem, because it says lots of things. It says: find me the maximum entropy distribution that has these expected values; these are just known expected values. They can be moments, they can be probabilities, anything, and these are inequalities on expected values. So it's really quite a sophisticated problem to pose. What's the maximum entropy distribution on these points that, for example, has the following variance, and has probability in the left tail less than some level? You could go on and on and make it a very sophisticated thing. That's the maximum entropy problem. That's this thing. And if you work out what the dual is, when the objective is the negative entropy here, you will actually get a sum of exponentials. So the dual function for a maximum entropy problem involves a sum of exponentials. Now, if you're in statistics, and I said statistics, not probability, this will be very familiar to you, because it's the connection between exponential families and maximum entropy, and we'll see more of this later. Just a hint. Okay. Now, we get to the dual problem, and writing down the dual problem, I mean, it's the dumbest thing ever. If someone walks up to you and says, "I have a lower bound on my problem, but it's parameterized by this vector lambda and this vector nu," well, the only interesting thing about a lower bound is that it's a lower bound, and if someone has multiple lower bounds, obviously the higher the lower bound, the more interesting it is.

So you can just ask: what's the best lower bound on the original problem that you could establish by Lagrange duality? What is the best lower bound? We don't know if it's sharp. We're just asking what the best one is, and it kind of wouldn't make sense to examine any other anyway. All right. That leads you to this problem right here. Now, I want to point something out. This is always a convex optimization problem, no matter what the primal problem was. Oh, by the way, this is called the Lagrange dual, and sometimes it's just shortened to "the dual problem." In fact, people say "the dual problem" the same way we say "the optimal point," even in situations where we don't know that there's an optimal point. We're actually gonna see that there are multiple duals the way the word is used on the street. There are lots of duals, but we'll get there. For now, it really is the Lagrange dual problem. And it says simply: maximize the dual function subject to the lambdas being non-negative, that's all. Okay. Now, often what happens is that G is minus infinity for some values of lambda and nu. We've already seen that a couple of times. That is an uninteresting lower bound, and it's sure not gonna help you maximize anything. The optimal value here could actually be minus infinity, that can happen, but the point is those values are not interesting. So in fact, often what you do is pull the implicit constraints in G out and make them explicit. Okay. Here, for example, let's go back and look at this. The dual function for this LP is that weird thing I drew somewhere, the sick function that goes up linearly on a line but falls off to minus infinity everywhere off the line, and we're simply going to maximize that subject to lambda non-negative. However, it's easier to take the implicit constraint out, and you end up with something that looks like this.
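Concretely, for the standard form LP, that explicit version is

maximize \; -b^T \nu \quad \text{subject to} \quad A^T \nu + c \succeq 0,

where the multiplier \lambda = A^T \nu + c for the inequality constraints has been eliminated.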

Okay. So here's the so-called standard form LP, and this is what it looks like when you have actually pulled out the implicit constraint. Technically speaking, this is not the Lagrange dual of that. However, people would call it the Lagrange dual: you're given a little bit of license to form the Lagrange dual, do a little bit of trivial rearrangement, and still call the result the dual or something like that. It's equivalent under a very, very simple equivalence. The Lagrange dual of this thing, strictly, is the problem where G is the sick function. I just want to point that out. Okay. And by the way, let's see what happens here. That is also an LP, and let's see what it says. Here I can say something about this problem: if you have a feasible nu here, then minus B transpose nu is a lower bound on the optimal value of this problem. This problem says, "Okay, you have a family of lower bounds; please get me the best lower bound." That's the meaning of this problem. This also has beautiful interpretations in a lot of cases. For example, in engineering design it'll make a lot of sense: any X here, if it's feasible, satisfies the constraints, and something that's feasible would be a possibly suboptimal design. Okay.

The nu will have a beautiful interpretation. A dual feasible nu in that case is a certificate establishing a limit of performance. That's what it is. That's the meaning of nu here. Okay. Actually, we'll see that when you look at a real problem, it will have physical significance. We'll get lots of examples. If the question is how bad something could be, if, let's say, you're at a bank and they want to know the worst thing that could possibly happen, then a lower bound actually gets interesting. Yeah.

Student:Could you please say why that was not, technically, a Lagrange dual on the right?

Instructor (Stephen Boyd):This?

Student:Yes.

Instructor (Stephen Boyd):Sure. That's the Lagrange dual; that's why.

Student:I mean, relative to the LP.

Instructor (Stephen Boyd):No, they're not the same thing. They're not the same thing. This one is maximizing a function which is a weird function: it's equal to minus B transpose nu provided A transpose nu plus C minus lambda equals zero, and it's minus infinity otherwise. Okay? So that's what this is. And by the way, if you were very careful, it would make a difference. Let me explain that. For example: suppose I throw in a nu for which A transpose nu plus C is not a non-negative vector, okay? Then in this problem, when I say, how about this nu, how do you like it, what is sent back to me is the infeasible token: your nu is infeasible. Okay. Over here, it's actually more interesting. Over here, if I throw such a nu in, what comes back is the objective function sending me an out-of-domain token. Now, that's a concave function, and out of domain means it's minus infinity. So two slightly different exceptions are thrown by the two problems. You can call this silly semantics and all that if you like, but it's very important to understand these are not the same problem. By the way, don't focus on these minor things. That's something you can read and think about on your own. Don't let silly little technical things get in the way of the big picture. The big picture is: you have an optimization problem, and you form another one called the Lagrange dual. The Lagrange dual problem, essentially, asks: what is the best lower bound on the optimal value of the first problem that I can get using the Lagrange dual function? That is what's important. Okay. So now we get to the idea of weak and strong duality. Weak duality says that D star is less than or equal to P star. Let me see how this works. In this context, the original problem is called the primal problem, and the Lagrange dual is then called the dual problem. Okay. So that's the primal and the dual. We've already assigned the symbol P star to mean the optimal value of the primal; we're gonna let D star be the optimal value of the dual problem. You always have D star less than or equal to P star. Why? Any dual feasible point gives a lower bound on P star, so the best one is also a lower bound. This is called weak duality. It's called weak because, let me review the deep mathematics required in establishing it: it hinged on properties such as the product of a non-negative number and a non-positive number being non-positive, and sums of non-positive numbers staying non-positive.

So it's weak because you can explain it to somebody in junior high school. I mean, they might not have taken those 14 steps, but the point is there's nothing in it that's hard, so it's called weak. Okay. All right. That's weak duality. It always holds, convex or non-convex. It's absolutely universal. It could be stupid: you could indeed have D star equal to minus infinity, in which case your best lower bound is of no interest whatsoever. Okay. That can happen, but the bound is always true. Okay. Now, if we go to the partitioning problem and ask what the best lower bound is that you can get on the two-way partitioning problem from the Lagrange dual, you will form this problem. That is a semi-definite program. And now things are interesting because, although this is something that was not known 15 years ago, and absolutely inconceivable 20 years ago, I can tell you this: this SDP, you can solve. People can solve it. You can solve it like that for a thousand variables. No problem. And if you knew what you were doing, you could go, easily, to problems with 10,000 and 100,000 variables. The point is, you can solve this SDP and you will get a lower bound on the two-way partitioning problem. That is fantastically useful if you couple it with a heuristic for partitioning. So you run some heuristic; there are lots of heuristics, and some of them work really well, by the way. Now, you don't expect it to work all the time, because you are solving, after all, a problem that's NP-hard in general, but what happens is you'll produce a partition and you'll say, "Here's my partition and here's the number I got," whatever it is, the X transpose WX, and you want to know: could there be a better one? You can solve this SDP, and in fact you'll see that a lot of the time the numbers are pretty close. Okay. At least it's a good thing to know. You would know: I have a partition, and no partition is more than such-and-such better; I'm at most such-and-such suboptimal. A small sketch of exactly this check follows below.
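Here is a minimal sketch of that bound-plus-heuristic workflow, assuming the cvxpy package (with its bundled SDP solver) and NumPy are available; the instance is made up, and at this tiny size we can also brute-force the true optimum to see where the numbers land. The dual function here is g(nu) = -1^T nu when W + diag(nu) is positive semi-definite, and -infinity otherwise, which is what the SDP below maximizes.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 10
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                      # symmetric cost matrix

# Dual of two-way partitioning: maximize -1^T nu  s.t.  W + diag(nu) is PSD.
nu = cp.Variable(n)
sdp = cp.Problem(cp.Maximize(-cp.sum(nu)), [W + cp.diag(nu) >> 0])
lower_bound = sdp.solve()

# A simple heuristic: round the eigenvector of the smallest eigenvalue of W.
_, V = np.linalg.eigh(W)
x = np.where(V[:, 0] >= 0, 1.0, -1.0)
heuristic = x @ W @ x

# At n = 10 we can brute-force all 2^n sign patterns for the true optimum.
best = np.inf
for k in range(2 ** n):
    s = 1.0 - 2.0 * ((k >> np.arange(n)) & 1)   # a +/-1 vector
    best = min(best, s @ W @ s)

print(lower_bound, best, heuristic)    # lower_bound <= best <= heuristic
```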

And you might just say, okay, that's good enough. All right. Okay. Strong duality. This will not rely on junior high math. Strong duality is the statement that the lower bound is tight: there's a lower bound that goes all the way up to the optimal value. That's strong duality, and we'll see what it's equivalent to, but it is not trivial. And by the way, it often doesn't happen. In two-way partitioning problems, for example, if it were true there, you'd have P = NP, because the dual problem we can solve in polynomial time. In fact, there are even inapproximability results: if you know about complexity and you have something that's not even approximable, that tells you that you can't in general even bound the gap, but I won't go into that. Now, here's the interesting part. When a problem is convex, you usually have strong duality. Okay. That's actually amazing, and it's gonna have a lot of implications. It's gonna be connected, by the way, to the separating hyperplane theorem; we'll see what it connects to. There are multiple books, multiple courses, not here, but at some other schools; you can take entire courses, read books, and thousands of papers that elaborate on this one word: "usually." Okay. Now, basically these are called constraint qualifications. A constraint qualification theorem goes like this: if the primal problem is convex, and then you insert your constraint qualification here, then P star equals D star. That's a constraint qualification. You could devote your life to this. On occasion, these issues actually do come up, but maybe less frequently in applications than the people who devote their lives to them would like to think. I'm saying that, of course, because their grad students will watch this and then alert them to it.

So I'm just making trouble. Now, by the way, if you're in this sub-industry of constraint qualifications, then this one is the sledgehammer, the most basic one there could possibly be, the one everybody knows. This is the least squares, or something like that, of the constraint qualification world: it's Slater's constraint qualification, although the historically correct name here would probably be Russian, but we won't get into that. So let's call it Slater's constraint qualification, and it says this: if you have a convex problem like this, and if there exists a strictly feasible point, then P star equals D star. Strictly feasible means not just that you meet the inequality constraints, but that you do so with positive margin on each one. That's the condition. Okay. Now, I should add that, basically, it's completely clear that this covers most problems in engineering, pretty much. As much as people make fun of Slater's constraint qualification, and they can make up examples of why it's not sophisticated enough, and sure enough there are problems where you don't have a strictly feasible point, for most problems that come up in engineering, anything in machine learning, pretty much anything, this makes perfect sense, right?

For example, suppose the third inequality was a limit on power; just think about it. If Slater's condition failed to hold, it would mean there exists a circuit that dissipates 100 milliwatts, but there's no circuit that dissipates 99.999999, because if there were, Slater's condition would hold. Everybody see what I'm saying? If solving your problem relied on such fantastically subtle distinctions, whether strict inequalities hold versus weak ones, then I've got news for you: you're not doing engineering, you're not doing statistics, you're not doing economics; you're doing something like pure analysis. Okay. So that's my little story on it. Again, there are actually cases where these issues come up in practice, but they're pretty rare. And mostly I'm saying this to irritate my colleagues at other universities, who will be alerted to this, watch this tape, and be very angry. But I thought I'd mention it. Okay. All right. So let's go to the inequality form linear program. Here you want to minimize C transpose X subject to AX less than or equal to B. The Lagrangian is C transpose X plus lambda transpose times AX minus B, because I put the B on the left-hand side to make this F less than or equal to zero. I do this and I infimize over X, but we know how to infimize an affine function: you get minus infinity unless the linear part vanishes. So I get this, and this is the dual problem. Notice that this is actually not literally the dual problem. So if there are lawyers present, you would say, "This is a problem that is trivially equivalent to the dual problem," okay, but after a while, if there are no lawyers present, you'd just say it's the dual problem or something like that. So that's it. Okay. Now, Slater's condition: the feasible set here is of course a polyhedron, and by the way, one possibility is that the feasible set is empty, which is also a polyhedron. What Slater's condition says geometrically is very simple. It says: if that polyhedron has non-empty interior, meaning, basically, that there's an interior point, then you have strong duality, so P star equals D star. Okay. So that's the picture.
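For the record, the derivation just sketched gives

g(\lambda) = \inf_x \left( c^T x + \lambda^T (Ax - b) \right) = \begin{cases} -b^T \lambda & A^T \lambda + c = 0 \\ -\infty & \text{otherwise,} \end{cases}

so the (trivially rearranged) dual is: maximize -b^T \lambda subject to A^T \lambda + c = 0, \lambda \succeq 0.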

Let's look at a quadratic program. Let's minimize X transpose PX subject to AX less than or equal to B. That's minimizing a quadratic form over a polyhedron. We're gonna assume P is positive definite; that's so I can avoid the horrible expression for infimizing a general quadratic function with a linear term. It's not that big a deal, but I don't feel like writing it, so this will work out. So here, to get the dual function, you infimize over X the quantity X transpose PX plus lambda transpose times AX minus B. Now, the nice part is that P is positive definite, so I know how to minimize this: the minimizer is P inverse times something; I'm not even gonna write it out, because it's easy to minimize a strictly convex quadratic function. So I minimize it, I plug that X back in, and I get this thing: minus one quarter lambda transpose A P inverse A transpose lambda, and so on, and my dual problem then looks like this. By the way, this one really is the dual problem, because here the domain of the dual function is all of, let's call it R^M. So in this case, you get a lower bound for any non-negative lambda you plug in. You might get a rather stupid one. For example, you might get the lower bound minus seven. Let's talk about the lower bound minus seven. Why is minus seven a valid lower bound for this problem? Because the objective is always non-negative. But the point is, you get a lower bound, and you get this. So that's the dual problem. And by the way, what we're saying here is not obvious at all. You want to solve this quadratic program; we haven't yet told you how to do it or how it's done or anything like that, but we'll tell you this: if you come up with any non-negative vector lambda and you evaluate this concave quadratic function, you get a lower bound on the optimal value of this thing. This has lots of uses. For example, suppose someone says, "I know how to solve this problem," and you ask, "How did you do it," and they go, are you joking? That's, like, "If I told you, I'd have to kill you." I'm patenting it right now in [inaudible]. Okay. I can't tell you how I did it.

And you say, "Well, why should I believe that that's the optimal X? How do you prove it?" They say, "Well, watch this. Check out this lambda. Notice that it's bigger than or equal to zero," and you go, "Yeah." Then you evaluate the dual function at that lambda, and that number is equal to the objective value of their point. That, by the way, ends the discussion. That X is feasible, and you would call that lambda a certificate proving its optimality. Everybody got this? And notice that they didn't have to say how they did it. Everyone got this? And then you'd say, "Hey, how'd you get the lambda," and they go, "Like I'm gonna tell you that either." Now, Slater's condition says the following: if this polyhedron has non-empty interior, then P star and D star are absolutely equal, and there always exists a certificate proving optimality of the optimal X. Always. Okay. By the way, a very small number of non-convex problems also have strong duality. I'm not gonna go into it, because it's complicated and so on. This is actually covered in an appendix of the book, and I would encourage you to read it. This one is not obvious. And actually, there's a whole string of these. There are, like, 15 of them or something, and they're just weird results that have to do with specific problems that are non-convex and just happen, for deeper reasons, to have zero duality gap.

The quadratic ones are the ones we collect in one of the appendices at the end of the book. There are others; you will see them, they're kind of weird, and some of them are quite strange. One I've seen recently involves complex polynomials of degree four, something that turns out to have zero duality gap, and it comes down to something in algebraic geometry, but that's always the way these are. These are not simple. This is just to say: there are non-convex problems with zero duality gap. A few. Okay. Let's look at the geometric interpretation. All right. Let's see if we can do this right. We're gonna do a problem with just one constraint, and I'm gonna draw the graph of the problem. What I'm gonna do is, for each X in the domain, I'll plot this pair. So although the problem may be happening in a hundred dimensions, for every X I'm gonna plot a point in this plane: one coordinate tells you the constraint value and the other tells you the objective value. So, basically, everything over here corresponds to feasible points. Okay. And the height corresponds to the objective value, so quite obviously, that's the optimal value: any point that ends up being colored there is optimal. Okay. So that's the optimal value, P star. Everybody see that? Now, that point up there has a very nice objective value, but it's infeasible, because its constraint function is positive. Okay. So that's P star. Now let's see what the dual is. How do you get Lagrange duality in this picture? Well, Lagrange duality works like this. You minimize F0 plus lambda F1. In this plane, that corresponds to taking a line; what's the slope, is it one over lambda or something like that, let's see, it's slope minus lambda. So I take a line with slope minus lambda.

So for example, if you fix lambda and then ask me to evaluate the dual function, what you do is this. You fix the slope and you march a line of that slope down this way until you just barely leave the set, and that happens right there. Okay. And then G of lambda is this intercept here. Okay. So this is G of lambda, and now the dual problem says, "Optimize over all lambda." So if lambda is zero, you get this: you go down there, and G of zero is this number right here, which is indeed a lower bound on P star; it has to be. Okay. Now, I crank up the slope, and as I crank up the slope, G is rising, and it keeps rising until the line just touches here, at this point. Okay. Now, as I keep increasing lambda, the supporting point is actually here, and the line is rotating around it; it's not a fixed point, it's rolling along the boundary, but because the boundary has sharp curvature there, it's rolling only slightly. And as I increase lambda further, G gets worse and worse. In fact, if lambda is huge, the line looks like this and G is very negative. It's still a lower bound, just a crappy one. Everybody see this? So D star is that point. Questions?

Student:[Inaudible]

Instructor (Stephen Boyd):We're gonna talk about that, but it depends very much on the problem. So for example, with a non-convex primal, in the general two-way partitioning problem, the primal is NP-hard, but the dual is an SDP. That's easy. In that case, the gap can be infinitely far. Now, in the case of a convex problem, it gets interesting. You will see later that the primal and the dual, in effect, both solve the problem, and a lot of people get all excited and go, "Oh, how cool, I can solve my problem via the dual." It turns out that if you really know what you're doing, the complexity of the primal and the dual are equal. You will know what you're doing in about four weeks. Three. Whatever it is. Yes?

Student:How did you rule out the bottom point for P star? You can't just say it was that.

Instructor (Stephen Boyd):How did I do it?

Student:Yes.

Instructor (Stephen Boyd):Well, the first thing to remember is that this picture shows you the objective and the constraint value for every possible point in the domain. Now, that point's not good: for one thing, it's got a high objective, but it's also infeasible. Anybody who landed on the right here is infeasible. So those points are interesting, but they're not relevant as far as the optimization problem is concerned, so we simply look at these. Now, every point that got shaded in here is feasible. Okay. The height tells you the objective value, and so you want the lowest point among these. That's clearly right there, and you go across here, and that's P star.

Student:Why would the right-hand side be infeasible?

Instructor (Stephen Boyd):Because your first coordinate here is your constraint function, and F1 has to be less than or equal to zero. That's what it means to be feasible. Okay. So that's the picture. So here you have a gap. By the way, this picture strongly suggests something very interesting, and you can see why convexity of the problem is gonna come in. When F1 and F0 are convex, this weird set G, now, what I'm about to say is actually not true, but it's close to true, that weird set G is convex. And when the set is convex, you can't have a gap here, the way this non-convex blob has one. Everybody see this? That's what's gonna happen. Now, I'll tell you the truth. G is actually not convex, but its lower left corner, which is what we care about, is. Now I've corrected it and said the truth. By the way, you can also see how Slater's condition works. If you take not G but A, the set of points that you can meet or beat in the bi-criterion sense, basically, you take G and color in all the points up and to the right, then A will actually be convex when F0 and F1 are convex, so A has to look like this. Slater's condition says that A extends some positive amount into the left half-plane. These are the kinds of things you would study in one of those whole courses on this topic. So that's the idea; you can even see how Slater's condition connects to all of this. Okay. I'm gonna mention one more thing. We'll get to one more topic: complementary slackness. So let's assume that strong duality holds, and actually, I don't care if the problem is primal or feasible. Okay. Convex. What I said made no sense whatsoever, so let's start over. What I meant to say was: I don't care if the primal problem is convex. That's what I meant to say; it just came out as a weird permutation. Okay. So I don't care if the primal problem is convex; of course, the dual problem is always convex. So let's assume strong duality holds, and suppose X star is primal optimal and lambda star and nu star are dual optimal. Basically, what this comes down to is: X star is an optimal point, and lambda star and nu star, you can think of them as a certificate establishing optimality of X star. Okay. By the way, we're gonna use these ideas from now on. They're gonna come up computationally. All algorithms are gonna work this way. All modern methods, you haven't done it yet, but whenever you solve a problem, the software doesn't just say here's X and you have to trust the implementation or whatever.

It doesn't work that way, although you haven't seen it returned to you yet. They also return, no exceptions, a certificate proving that it's the solution, so you don't have to trust the implementation. Everybody see what I'm saying? These ideas are gonna diffuse through everything we do. So basically, you think of X star as an optimal point, an optimal design, whatever you want to call it, and this pair as a certificate proving it's optimal, because that's what it is: it gives a lower bound on P star, and X star is a feasible point whose objective value equals that lower bound; therefore, it equals P star. Now, by definition, this thing is the infimum over all X of the Lagrangian with these optimal Lagrange multipliers. Okay. But if it's the infimum over all X, it's certainly less than or equal to the value at any particular X I plug in, and I'm gonna choose to plug in X star. Okay. So I plug in X star, and I have the following, very interesting: F0 of X star is less than or equal to F0 of X star plus a sum in which every term is less than or equal to zero, plus another sum in which every term is zero, so that one is not relevant. And now you say, wait a minute: if this thing is less than or equal to that thing, and that's the same as that, then all three are equal, and we have no choice but to conclude that the sum of lambda I star times FI of X star is zero. Okay. But there's more than that. Wait a minute. This is a sum of numbers, all of which are less than or equal to zero, and the sum equals zero. There's only one conclusion: every single one of those numbers has to be zero. And that says the following: lambda I star times FI of X star is equal to zero for every I. Okay. And that's known as complementary slackness, and what it means is the following: if you have any primal optimal point and any dual optimal point, then if the optimal Lagrange multiplier lambda I star is positive, the corresponding constraint has to be tight, and if a constraint is loose at the optimal point, the corresponding Lagrange multiplier has to be zero. Okay. So this is gonna have lots of implications, and when we give other interpretations of what all this means, it's all gonna tie in, with these things being prices, for example. But we'll quit here for today.
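For reference, the chain of relations just described, under strong duality, is

f_0(x^\star) = g(\lambda^\star, \nu^\star) \le f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star) \le f_0(x^\star),

so equality holds throughout; hence \sum_i \lambda_i^\star f_i(x^\star) = 0, and since each term is \le 0, in fact \lambda_i^\star f_i(x^\star) = 0 for each i.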

Student:[Inaudible]

Instructor (Stephen Boyd):Exactly.

[End of Audio]

Duration: 77 minutes