Convex Optimization I - Lecture 07

Instructor (Stephen Boyd): Let me start with a couple of announcements. Not announcements, actually. One was we're interpreting relative silence from the class as an indication that, for example, CVX is actually installing correctly and that everything is working. Now, there is another interpretation, and that is that no one has started homework three yet. So if you find anything, any inconsistencies in anything, let us know. We already had a couple of typos in the user guide pointed out, but if you find errors otherwise, just feel free to let us know. Actually, it's a good way to get it all debugged very well. Now, I have a very strange question for you. It sort of came up in office hours and then I talked with the TAs about it. Now, hear me out before you laugh or start throwing things or anything like that. We think we know the answer. The question is this. Would you like to have a midterm?

Student: No.

Instructor (Stephen Boyd): Okay. Okay. Actually, it's an honest offer. And let me just say what it is, so this is just if you want, and then you can say no again, but at least I'm saying something very specific, so here it is. Today we're gonna start basically the last sort of theoretical concept in the class, and that's duality. There's a whole section on algorithms and things like that; we'll get to that later. But then there's just gonna be a lot of applications and things like that, just doing this stuff. So in a week or two we'd be in a natural place where the core theory is basically done. That's it. It's done. After that it's more of the same, a lot of applications, things like that. And so the idea is we could, I'm just saying we could, this is an offer, do something like have a boring conventional-style midterm for an hour and a half somewhere, and it would be these boring problems like which of the following functions are convex and all that stuff. But then the final, which will naturally be a mixture, would have a handful of these things, but mostly it would be on problem formulation and things like that. What do you think?

Student: No midterm.

Instructor (Stephen Boyd): All right. All right. That's fine. The TAs predicted that, by the way. Okay. Well, let's continue. The next topic is generalized inequality constraints. By the way, as I said, I know we're going quite fast over this material. We're going at about twice the speed I think we should be going, but that's fine, because things like generalized inequalities we're going to be using in applications basically from now on. So you're gonna have weeks and weeks of seeing real problems that use these things, so if it feels like we're going too fast right now, we are, but you'll have four weeks in the context of applications to absorb all these things. So, generalized inequality constraints. Here we're gonna generalize the inequalities; by the way, later we're gonna generalize the objective. That's actually, in some ways, more interesting, or it's a more significant generalization. But right now, we're gonna generalize the inequalities to be vector inequalities. So here, f_i of x, which until now returned a scalar, with the constraint f_i of x less than or equal to zero, is now gonna return a vector; that's a vector inequality, and it's defined by this cone K_i. So in this case, if f_i is K_i-convex, so it's a convex function with respect to this cone, this generalized inequality, then this is called a convex problem with generalized inequality constraints. And you can generalize almost everything you've seen to this. For example, the set of optimal points for this problem is convex, and all these kinds of things. Everything just goes through here. But there's a special case that's gonna be more important than the general one, and it turns out it's interesting. It's a very special case, but it turns out to be, in some sense, also the most general case, and that's this. It's a so-called conic form problem, and you'll hear a lot about this. I don't think it's that interesting, and I'll tell you why in a minute. Let me show you what it is. You'll hear a whole lot about it.

It's a very special case. Everything is affine, so it's the equivalent of a linear program, basically. In a normal optimization problem, if every function is affine you have a linear program, because you minimize an affine function subject to affine functions less than or equal to zero, which are conventionally called linear inequalities, and affine equalities, conventionally called linear equality constraints. So here, everything is affine. You minimize a linear function. Now, this is a very, very interesting inequality. This is affine, and this just says that Fx + g is less than or equal to zero with respect to this cone. That's all. So this is called a conic problem, a cone problem, something like that. You can type "cone problem" into Google and you'll get tons of stuff. I think it's interesting because it sort of generalizes the idea of linear programming to cone programming. If this inequality is with respect to the nonnegative orthant, so recall that if we erase this, that's the default. The default inequality between vectors is with respect to R_+, and it means componentwise inequality. That's a general LP, that's a general linear program. So cone programming, I think, has caught on for several reasons. It is actually interesting to look at, but it's caught on for a lot of reasons. First of all, if you're trained in traditional operations research, you grow up, by the way, which is not me, just so you know, but if you are trained in, or subjected to, maybe that's another verb one could use, a traditional training, everything you'll hear is about linear programming. The nice thing about cone programming is all you do is you drop this little K down there. By generalizing the cone from the nonnegative orthant you now have something that looks like linear programming, so if that's your training and background it's deeply familiar; you've had three classes on it, and now all you do is generalize this, and it means that everything is gonna be familiar, and so on, and that's what happens. A lot of things are familiar.
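For reference, the conic form problem being described looks like this in the book's notation (reconstructed; not verbatim from the slides):

    minimize    c'x
    subject to  Fx + g <=_K 0
                Ax = b

Taking K to be the nonnegative orthant recovers a general linear program.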

That's the first reason I think it's caught on. But the other reason is also interesting, and you have to watch out for it. It's this: it's caught on because in fact the most highly developed solvers, and we're gonna go into this in some detail in the last third of the class, the most highly developed solvers for convex optimization problems actually are conic solvers. So people who work on solvers naturally think that cone programming, conic programming, and things like that is this huge important topic that everyone would be interested in. Actually, like linear programming, it's not. Okay. It's actually not. If you work on these things, fine, but for people who use these things, it's not that important. It's important in the following sense. There are a lot of interesting problems people really want to solve that transform to conic problems, and then all the people who work on the cone solvers can use their super fancy primal-dual homogeneous methods to solve them, but in some sense, that should not be the business of the general public. That should be the business of the small group of people who work on cone solvers. I'm saying this, by the way, because there's a lot of people who work on cone solvers who will see this tape and I just wanted to insult them, but that's all right. I hope it worked. I'm sure it did, actually. All right. So this is a conic form problem, and in fact, the one you will see most - there's a question?

Student: I'm wondering why there's an i [inaudible] on the K in the first one. Is it because you can do different cones in the same set of constraints?

Instructor (Stephen Boyd): Yes. So the question was why the i in the K_i, and the conjecture was that it's because you can have different cones for each constraint.

Student: You can generalize it to make it -

Instructor (Stephen Boyd): And my answer was yes.

Student: It seems like you could generalize it all to one cone, the highest order cone or something like that?

Instructor (Stephen Boyd): Yeah, you can do all sorts of things. You could lump these all together and make it one inequality with one cone, which is the direct product of all the cones. So you could do lots of things. Okay. Now, in cone programming, what you'll see the most is something called semi-definite programming. And this is actually interesting, well, it depends; what you'll actually see is the combination of two types of cone programming, which I'll talk about in a minute. So semi-definite programming, this is relatively new. It's maybe 15 years old, but like everything else it can be traced back into the 60s, although people had no idea how to solve SDPs. As a specific example, I worked on SDPs before, in fact, before the name had been fixed, in the context of control and things like that, and we were very, very happy when we could solve tiny small ones in an hour. Boy, were we happy. But it didn't occur to us that you could solve them a thousand times faster and things like that, and that the current state now would've been just unimaginable 15 years ago. This has sort of made the transition in the last few years from this kind of esoteric thing that a few people in research worked on to just being all over the place. It's very fashionable. It's also used commercially, I should add; in areas of finance and stuff like that it's quite widely used. So semi-definite programming, or SDP, is very, very simple. It looks like a linear program except for one generalization. So we minimize a linear function, with linear equality constraints - that's so far linear programming. Here's the difference. There's an inequality, and it's a matrix inequality. These are symmetric matrices here. So this thing here is called an LMI, or linear matrix inequality. That's what this inequality is. It's quite sophisticated. A matrix inequality is very complicated. To say that a matrix is positive semi-definite or negative semi-definite is actually a very sophisticated inequality, so from that point of view, they're right, it is sophisticated, but sophistication doesn't come because you have a name attached to it or anything.
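For reference, the semi-definite program on the slide has this form (the book's notation; reconstructed, not verbatim):

    minimize    c'x
    subject to  x_1 F_1 + x_2 F_2 + ... + x_n F_n + G <= 0
                Ax = b

where F_1, ..., F_n, G are symmetric matrices, and <= between matrices means with respect to the positive semi-definite cone; the matrix constraint is the LMI.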

So this is an LMI, and that's it. There's an interesting thing. As drawn here there's only one LMI, but of course you could have multiple LMIs. The fact is that if you have a bunch of LMIs like this, if you have two, you can lump them together into one, and you do that by simply making a block diagonal matrix with the LMIs, because if I have a block diagonal matrix and I say that it is negative semi-definite, it just means each block is negative semi-definite. So that's why we can, without loss of generality, work with just one LMI. Okay. Now, semi-definite programming is a generalization of, actually, everything you've seen so far except geometric programming, so let's see how. It's a proper generalization of linear programming, so how do you represent a linear program this way? This inequality here, that, this one here, is a vector inequality, so this means elementwise. Okay. How do you write that? Well, it's kind of stupid. You take diag, the diagonal matrix with this vector on the diagonal; a diagonal matrix is negative semi-definite if and only if all the entries are non-positive. These embeddings are interesting for lots of reasons. The first is just sort of theoretical. It's nice to know that you have a problem class that subsumes other common problem classes. It's nice to know that in fact you didn't really have to know about SOCP or quadratic programming. Is there a question?

Student: [Inaudible] problem solver or with a [inaudible]

Instructor (Stephen Boyd): That's a great question. So it depends. We're gonna talk about solvers later in the class; it depends on how sophisticated the solver is. Any modern SDP solver, if it handles sparsity, will look at this, diag of Ax minus b, realize it's diagonal, and have no loss compared to solving it as a linear program. So that leads me to my second point. SDP embeddings actually also have substantial practical use, because if you write an SDP embedding of a problem, it means that that other problem class can be represented as an SDP, and it means that one SDP solver can do everything for you. Now, very roughly speaking, that's what CVX does.
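The embedding under discussion, written out (reconstructed): the LP constraint Ax <= b, which is componentwise, becomes the matrix inequality

    diag(Ax - b) <= 0

a diagonal LMI, and a solver that notices and exploits the diagonal structure gives up essentially nothing relative to an LP solver.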

Student: [Inaudible] use that in their problem solver or [inaudible]

Instructor (Stephen Boyd): Nope.

Student: Is it because the [inaudible]

Instructor (Stephen Boyd): Absolutely not. Because it's gonna be a mixture of LMIs, all these constraints, and it cannot call an LP solver, does not call an LP solver, will not call one. So it's absolutely not the case. For example, in CVX or any of these other tools, if you write something that happens to be an LP, it does not call a different solver. It calls exactly the same solver as if you'd written an SDP. Okay. So no difference. So the point there is an SDP embedding actually has a practical use, and SDP is so expressive that it can be used to represent a huge number of constraints and functions and things like that. We'll talk more about that later in the class. I should say also that it's come up now in theoretical computer science. I think it's in lots of areas. Actually, I think we just found out that SDP traces back, with a different name, to some calculations in physics and chemistry in the 80s or something like that, and it goes back into control in the 60s and things like that, so it's got a long history, but that long history didn't really come with good computational [inaudible], so it was just theory. That's changed. SDP is now not just theory, as I guess you will all well know by the time you've finished homework three; you will have solved, whether you know it or not, a lot of SDPs. We'll get to that later. Okay. Here's a more interesting embedding: SOCP, second order cone programming. The question is how do you represent this as a linear matrix inequality? Now, this is highly nonlinear and it's non-differentiable, and the answer is this, and we'll see more about this later; in fact, we have something queued for you, not on homework four, but on the next one, I think. This is a matrix here, where one block, the one-one block of the matrix, is a scalar times the identity. Now, I actually don't know how many people remember Schur complements, but let's just say we, the teaching staff, will remind you about Schur complements soon enough. A block matrix is positive semi-definite - now, this is rough, so this is not quite right - it's positive definite if and only if the one-one block is positive definite and something called the Schur complement is positive definite, and the Schur complement is this minus this times the inverse of that times that, bigger than or equal to zero. Now, for this particular matrix, that says c_i transpose x plus d_i, like this - let me put these together. Maybe if you do it on the pad here and capture both the big LMI on this one and that one. There we go. If you can just capture this and what I'm writing here. This must be bigger than or equal to this thing, A_i x plus b_i, transposed, times the inverse of this guy; well, that's a scalar times the identity, so its inverse is the identity over that scalar, and I'll just put the scalar down here in the denominator, why not. I'm being loose with the notation here.

And then times A_i x plus b_i, and that is that term. Okay. Everybody see this inequality? So this one has to be positive, because it's the last diagonal entry in a positive semi-definite matrix. It can't be negative. That's for sure. So this one I could square and put it over here, or I can multiply both sides of this inequality by that; I get rid of all of that, and I get this. Now, I take the square root. When I take the square root, I have to be very careful. I can only take the square root here if I know c_i transpose x plus d_i is bigger than or equal to zero, but it is, and I get exactly this second order cone constraint. So this is an example of how this works. That's sort of the idea here. Okay. Let's look at some other problems. Here's one. It's eigenvalue minimization. So here, I have a matrix which is an affine function of a vector x, so an affine matrix valued function looks like this. It's a constant matrix, these are all symmetric, plus a linear combination of symmetric matrices here, and I want to minimize the maximum eigenvalue of that matrix. Now, let me just point something out. That's a really complicated problem, right, because you take this matrix - so far, it's fine, every entry of this matrix is an affine function of x - but now, to get the maximum eigenvalue, you form the characteristic polynomial. That's huge, so you don't want to see it. You can't even write it down for a 15 by 15 matrix. You get some huge polynomial of degree n; if n is bigger than five, there's no formula for the roots of that polynomial, and then, despite the fact that you have no formula for the roots, you now have to select the largest of them. Okay. Now, that can switch. As you mess with the matrix, the second largest eigenvalue and the largest one can kind of go like this, and when they switch, this function is non-differentiable. Okay. So I just want to impress upon you how complicated this function is.

It's very complicated, this thing; it's non-differentiable, it's highly nonlinear, and so on. Now, on the other hand, we can actually optimize the maximum eigenvalue very easily as an SDP. It's totally straightforward. It's just this. You minimize t subject to A of x is less than or equal to t times the identity. Now, to write this in the standard form there are manipulations I'm not gonna do, but you write it this way. And this is indeed affine in the variables x and t. That's this thing. So people just write it this way. It's fine. So eigenvalue minimization is quite straightforward. This is something that 20 years ago no one knew was straightforward, no one; 10 years ago a few people knew, and so on. So this is new. This is not LP. We're not talking LP anymore. We're not talking stuff from 1953 anymore. This is recent history. Let's look at another one: matrix norm. Now, matrix norm, as you know, is a very complicated function. It's the square root of the largest eigenvalue of A transpose A. Now, when you look at this, you immediately know you've got a problem here. Lambda max, well, we know that's a convex function. We know that. Right. Not only that, we know it's sort of SDP representable, because we just showed how to minimize the maximum eigenvalue. But the problem here is this essential nonlinearity. This is quadratic, and in an SDP, which after all has the constraint of a linear matrix inequality - you've got the word linear in it - you're not allowed to have products. This has products here of A with A, so you have x_i x_j terms. That's not linear. So the question is how do you turn a nonlinear problem into a linear matrix inequality, and again, it's an SDP embedding, so you take a larger SDP; it's larger, but it's actually gonna be affine, or linear, in the variables. So let's see how that's done. It's done this way. And you have to study this matrix quite carefully. It's in the queue; we're getting it ready for you. You have this block matrix here, and this is t times the identity, and you want to know when this matrix is positive semi-definite.
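As an aside, before the matrix norm derivation continues: the eigenvalue minimization problem above is easy to try in CVX. A minimal sketch, with made-up data A0, A1, A2 (hypothetical, purely for illustration):

    % minimize lambda_max(A(x)), A(x) = A0 + x(1)*A1 + x(2)*A2,
    % via the SDP: minimize t subject to A(x) <= t*I
    n = 5;
    A0 = randn(n); A0 = (A0 + A0')/2;   % symmetrize random data
    A1 = randn(n); A1 = (A1 + A1')/2;
    A2 = randn(n); A2 = (A2 + A2')/2;
    cvx_begin sdp
        variables x(2) t
        minimize( t )
        subject to
            A0 + x(1)*A1 + x(2)*A2 <= t*eye(n);   % the LMI A(x) <= t*I
    cvx_end

In CVX's sdp mode, <= between symmetric matrix expressions is a semidefinite cone constraint; one could equivalently write minimize( lambda_max(A0 + x(1)*A1 + x(2)*A2) ) and let CVX do the embedding itself.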

Well, it's basically you have to have t times the identity positive, which just means t positive, and the other condition is that t times the identity is bigger than A of x transpose times A of x, divided by t, because the inverse of tI is I over t. Okay. And very important here: that is a matrix inequality. Okay. There. t is positive; it's non-negative, so this t can go over here safely, like so. Now, you take a look at that and you realize we're quite cool, because this basically says that t squared times the identity is bigger than this matrix, A of x transpose A of x. In a matrix sense, that means all the eigenvalues of this matrix are less than t squared. Okay. That means the maximum eigenvalue of this matrix is less than t squared, but the maximum eigenvalue of A of x transpose A of x is the maximum singular value of A of x, squared, and you take square roots and you're back to this thing. Okay. So that's the embedding. You'll see more of this. I just wanted you to sort of see it now. Also, it's interesting to see - you don't have to know the details yet, later you will - that if you're actually curious about what CVX is doing, after all, you can look at all the source if you like and find out what's happening, but if you look deep enough in it, and ignore all the silly testing and cases around there, the essence is every function implemented in CVX is an LMI representation. That's not quite true. In fact, let me say a little bit about that. I don't mind lying, but I don't do it that often, and when it's for a good cause it doesn't bother me at all, but I'll just make this a little more accurate. It's not a total lie. What cone solvers really do - and it comes back to your question, to be honest - so if you go to Google and you type cone solver or something like that, you'll find all the basic ones; you'll find there's 20 or 30 by now; there's a commercial one, actually, that you can get in Excel, so it goes on and on. But they all kind of do the same thing. What they do is they solve a conic problem, which is like [inaudible] - note the period - c transpose x, subject to - and there's many ways to write it, but I'll write it this way, I'm gonna make these lower case - something like that.

Okay. And you can put an inequality constraint here; that doesn't matter. So they solve problems like this. Now, here I have to interpret these - these are the Ks you can have. They're typically R_+, so that's linear inequalities; they can be second order cones, or Lorentz cones, so SOC of some size k or something like that; and they can also be linear matrix inequality, PSD cones of some size k. So this is really what they look like. As to whether the linear inequalities and the second order cone inequalities are then further dumped into matrix inequalities and solved, that is actually a question of semantics and things like that, so we'll just leave that out. But this is in fact really what, if someone says cone solver, this is what they mean. This is not that interesting to people who want to use these things. I mean, I don't know what this would be like. This would be like looking at a very low level LAPACK code and looking at the full calling sequence. It's of great interest to people who write low level code. It's very important that it be done right, that smart people thought about it and implemented it. But not all of us need to see the 14 arguments that you call, you know, DGESL with, or something like that. Most of us are perfectly okay with just kind of the backslash, or something in between, like in Python or something, where you'd actually let somebody else deal with all the memory allocation and passing pointers and all that kind of stuff. So I think this cone solver business sort of has that flavor. So now I've been honest about what an actual cone solver in practice is. Okay. All right. Let's move on. What we're gonna do now is we are gonna generalize the objective to be a vector. So, so far, we left the objective as a scalar and we generalized the inequalities to be vectors, and you get things like cone programming and all this kind of stuff. Now, we're onto the objective, and this is a different thing altogether. This is a very different thing, when you change the objective, because now you actually have to redefine the semantics; first of all, it doesn't even make any sense, so the first thing you have to do is say what on earth this even means. What does it mean?

What does it mean if f_0 of x is a vector with two components? What does it mean to say minimize a vector? Okay. And you get all sorts of nonsense about this, so it's actually important to get all the semantics right. So a convex vector optimization problem is just like this, except f_0 of x is a vector and there's a cone associated with it. And let me just say what this might mean. Let's do the scalar case. Let me tell you the semantics of scalar objective optimization. It goes like this. The semantics are extremely simple, and it all comes down to the following. Suppose you have three people come up with a proposed x - I'll tell you everything you need to know about the semantics of scalar optimization. I look at the first x and I check the constraints. If any constraint is violated in the slightest, I dismiss that x and that person, right; it means it's completely unacceptable, and they can't say things like it was almost feasible or something like that. They're dismissed. Okay. Now, the remaining two people both pass the feasibility test. Their x's are both feasible. They meet the constraints. They haven't been dismissed. How do I judge between them? Well, I mean, it's extremely simple. I evaluate the objective function for those two points. Okay. If one is less than the other, I say that's better and I dismiss the other one, because here I have two potential points, each of which satisfies the hard constraints, and this one is better on the objective. Now, if they're the same, then the semantics - actually, what is the semantics if they're the same?

Student: [Inaudible]

Instructor (Stephen Boyd): No, they're not both optimal. I didn't say they were optimal. The answer is, as far as that problem is concerned, you actually don't even have a right to have a preference between one or the other. You don't. Okay. So if you look at that and you say, "Oh, no, I much prefer this solution," then your objective is not reflecting your interest and your utility function. If you like one more than the other, but the objectives are the same, your modeling is not right. Everyone see what I'm saying here? So that's it. Now, the key to this - the scalar case is easy because an objective comes out to a number. It comes out to, like, $17 cost per unit, in which case $16.38 is better than $17, period, if it's a cost. If it's a revenue, it switches. Okay. So the point is that they're all comparable. Now, suppose they're vectors. Now they're vectors, and as you know, vector inequalities are not linear orders. A linear ordering is one in which any two elements can be compared. That's one of the nice things about linear orderings. There are linear vector orderings. Did we assign that problem? I think maybe we didn't. Does anyone know a linear vector ordering? This is just for fun. What?

Student: [Inaudible]

Instructor (Stephen Boyd): Alphabetic, or lexicographic. Someone else was gonna mention something. Same thing? Yeah. So lexicographic ordering on vectors says you take two vectors and you look at the first entry, and if one of the entries is bigger than the other, done, that vector is higher in the ordering. If they're equal, you drop to the second entry, and then you see who wins on the second, and if they're equal, you go to the third. If they're equal all the way through, then the two vectors are the same and there wasn't really a question anyway, right? So that's lexicographic ordering, and it's a total ordering, or a linear ordering is the name for that. It means any two points can be compared. That's why dictionaries work, and blah, blah. Okay. All right. Most orderings of interest are not total orderings - most vector orderings; of course, the ordinary ordering on R is a linear ordering, but most vector orderings are not. And in fact, that's exactly what makes this interesting. So, for example, with matrices, when you're talking about confidence ellipsoids, and you say which is better, this or that - this estimation method has this confidence ellipsoid and this one has that one - which is better?

Well, of course, it can happen that one of the ellipsoids fits inside the other, in which case you can say I like that one better, but of course, you can have the case where they're not comparable, where neither ellipsoid contains the other, and then you have to say something like: this algorithm is better at estimating [inaudible] or whatever, and this one, on the other hand, is better at doing this than that. So that's the idea. This is just to get the rough idea here. Okay. So we actually have to figure out what the semantics is here. Well, the way this is done, you have to look at the set of achievable objective values. Now, that is a subset of a vector space of dimension bigger than one; we're assuming that's the case here. You look at the set of achievable values. That's the set of objective values you can possibly get with a feasible x. And now you say the problem is very simple. You say a point is optimal if its achieved objective value is the minimum of O. You say it's Pareto optimal if it's a minimal value. Now, you probably don't remember these from long ago, but the ideas are this. A minimum point of a set is one where every other point in the set is comparable to it and larger. So this is a minimum point. If this is the set of achievable objectives, this is a minimum point. This is the only case, essentially, when you can say unambiguously this point is optimal. It's the only time you can actually say minimize something and say x solves this, because no one could possibly argue with it in this case. Minimum means you're better than every other point. Minimal means no point is better than you. Now, of course, in a total ordering these are basically the same. They're the same. But the difference here is you can have points - so if you look at this point here, all of these things are honestly beaten by this point.

They're honestly beaten by this point because they're comparable to it, and it beats them on the objective, period. That's this point. But these things over here are interesting - the things in the second and the fourth quadrants. They're not comparable to this point. They don't beat it. Any point over here would beat that point unambiguously, and the whole point about being Pareto optimal, or minimal in this set O, is that there are no such points. That's the definition of being Pareto optimal. So you would say, in fact, if someone said, "Yeah, what about that point and that point," you would actually say that they're incomparable, that neither one is better than the other, and you would say, in fact, that in discussing the two designs there's a tradeoff between the two. You can trade off one objective for the other. Okay. So this is the picture. Some of these ideas are kind of obvious, and you've probably seen them before in other areas.
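Stated formally, following the book's definitions (reconstructed; not verbatim from the slides):

    x optimal:        f_0(x) <=_K f_0(y)  for every feasible y
    x Pareto optimal: f_0(y) <=_K f_0(x)  implies  f_0(y) = f_0(x), for every feasible y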

Student: That curve - a significant portion of that curve also is Pareto optimal?

Instructor (Stephen Boyd): Yeah. You see this little thing in here?

Student: Yeah.

Instructor (Stephen Boyd): That's not Pareto optimal. Okay. And actually, that point - although you couldn't possibly see it, certainly at video resolution you can't see it, or can you? I don't know; it's somewhere right around one line - this point is open and this point is actually filled in, so these are Pareto optimal down here, and all of these are. Right? Another name for Pareto optimal is "not stupid." That's another technical term, because, for example, if someone says, "Yeah, well, that's my choice," okay, the technical name for that is stupid, and the reason is this: if you go down here like this, everything like this - anything in there is a point that unambiguously beats this. So it would be stupid to choose this point. Any of these other points in here would actually be comparable to that point and unambiguously better. So that's the other way to think of it: just not stupid. But the way to do it is very simple. It basically says minimal means you look up in here and you want everyone else to be up there. Those are the places that are unambiguously worse. And Pareto optimal says you look down and left - I mean, in this very simple case. You look down and left, and these are all the places which are unambiguously better than you, and you want to make sure that no one is unambiguously better than you. So that's it. And you've seen this idea many times - my guess is you have. Or even you've thought it and just didn't have a name for it or something. All right. Vector optimization comes up in essentially only two practical contexts. One is where the ordering is with respect to matrices, and that actually comes up all the time. It comes up in experiment design; it comes up in estimators and things like that, where basically the objective is something like a covariance matrix, like an error covariance matrix. Everybody wants a small error covariance matrix. It defines an ellipsoid. Everyone wants a small confidence ellipsoid. That's clear. The problem is those are not totally ordered, right? And we'll talk about what happens in that case. So that's the first case, when these are matrices. That's sort of the subtle case. The other case is actually much simpler, although it comes up maybe more frequently. It's just multi-criterion optimization, and here the cone is nothing but the nonnegative orthant. So here, basically, it's called multi-criterion optimization. Your f_0 has got a bunch of components, and people call this a multi-objective problem - that's another name for it - and you will hear a lot of nonsense put up around all of this.

You hear people say - and of course, it's true - things like, "Oh, all optimization problems are multi-criterion." Well, like, duh, so what? I mean, that's completely obvious, and we'll see that it relates immediately, if you know what you're doing, to ordinary optimization problems. So we'll get to that in a minute. In a multi-criterion problem, you have multiple objectives, and if you want to think about something specific here, you might think of an engineering design, let's say a circuit, and one of these would be something like power, one would be delay, one would be area, just for example. Okay. Or this could be some control system or something like that, and one of them could be, again, something like power, one could be how closely you're tracking, and the other could be something like the roughness of the ride. The point is there are multiple objectives, all of which you want small. Once you think about it, you realize everything has this form. In estimation you'd say, "Well, I want something that matches my observed measurements well" - I mean, it would be foolish not to have something like that - but at the same time, I want an image that's smooth, or I want one that is simple in some sense. Those are all gonna be multi-criterion problems. Okay. And I'll say a little bit about how it all fits together in a minute. Okay. Roughly speaking, you want all of these small. Now, the one thing you can say is this: if someone comes up with a point and you beat them on all objectives - or, I should be more precise, if you meet or beat them on all objectives and beat them on at least one - you're better. Everybody got that? In a core course in economics you go through all of this. You'd see all these things. That's what it means to be better. So Pareto optimal means no one beats you. That's Pareto optimal. It means basically that if you have two Pareto optimal things, you have to have traded something off.

They have to beat you on something; you have to beat them on something else. Okay. Now, Pareto optimal - well, it says exactly that. It says that if one thing is less than or equal to the other, the same objective, in the vector ordering, then they're equal. And if you have different Pareto optimal values, there's a tradeoff between the objectives, and the classic example, from least squares or anything else, would be something like that, so you'd write it this way: you'd say minimize with respect to R_+^2. I mean, this is a bit of a pedantic way to write it, but it is actually exactly what you mean. You say: please minimize, with respect to the cone R_+^2, the following two objectives, both, by the way, convex here. Norm of Ax minus b squared, that's like a matching criterion; that's how well you match something. But at the same time, please do it with a small x. Okay. How do you judge such a problem? Well, what you do is this. x, by the way, could be a vector in 10,000-dimensional space, but there's only two objectives, so in principle, what you do is you check every vector in R^10,000: you plug in every possible x and you get a score, which is a pair of non-negative numbers, and you put a dot at every single point. This is conceptual. You now do this for every x and you get this shaded plot that looks like that. In fact, it's some kind of hyperbola or something; it doesn't matter what it is, right? Now, also, I should add that O has some interesting structure out here, but it's not interesting to us. What's interesting for us is down and to the left. That's the only thing interesting to us. And therefore, whatever O looks like up here is actually of no interest, unless it kind of does some weird thing and wiggles around and comes out down here, but it doesn't, okay? So we're only interested in what's down and left. So in actuality, the only thing that's interesting here, what's shaded here, is the Pareto optimal curve. Now, you saw this if you took 263. This is called regularization and so on, so you've seen all this.
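A sketch of this bi-criterion least-squares example in CVX, sweeping a scalarization weight to trace out the tradeoff curve (the weighted-sum method is justified a few slides below; the data here are made up, purely for illustration):

    % minimize (w.r.t. R^2_+)  ( ||A*x - b||^2, ||x||^2 )
    % scalarized:  minimize ||A*x - b||^2 + gamma*||x||^2
    m = 100; n = 30;
    A = randn(m, n); b = randn(m, 1);       % hypothetical data
    gammas = logspace(-2, 2, 20);           % tradeoff weights
    J1 = zeros(size(gammas)); J2 = zeros(size(gammas));
    for k = 1:length(gammas)
        cvx_begin quiet
            variable x(n)
            minimize( sum_square(A*x - b) + gammas(k)*sum_square(x) )
        cvx_end
        J1(k) = norm(A*x - b)^2;  J2(k) = norm(x)^2;
    end
    plot(J2, J1);    % (sampled) Pareto optimal tradeoff curve

Each value of gamma gives one Pareto optimal point; sweeping gamma moves you along the curve.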

I'd say the technical term for this point is "quite stupid." Let me justify that. It's quite stupid because, first of all, there are better points; that's number one. Number two, it's not even achievable, so that's why it seems to me it's even worse than stupid. But anyway, this should be clear. Here's a generic example. It looks like this. I have two objectives, and let me explain the problem here. I have a vector that represents an investment portfolio, so x_i, for example, might be the fraction - people set this up differently, but a classical setup is this - x_i represents the fraction invested in asset i. So you have n assets. Negative entries - if you want to allow negative entries; here we've ruled them out, but the point is, if you allow a negative entry, it means you have a short position. If that asset is cash, it means you're margining: you're borrowing money to finance the purchase of other assets or something like that. But here, to make it simple, we just made x positive, and you'd say that there are no short positions, and if one of those assets is cash, no borrowing, no margining. Okay? So that's the idea. That's the fraction. And of course, as fractions, they've got to add up to one. Okay. Now, what happens here is you invest in these assets for one period, and there's a return, or the prices change, so you can call p a vector of relative asset price changes. So if you form p transpose x, that gives you the total return on your portfolio. Okay. Now, obviously, if the returns were known - in fact, let's discuss that, because it's a very short discussion. If I told you the returns, what's the best possible portfolio?

Student: [Inaudible]

Instructor (Stephen Boyd): What is it?

Student: You put it all into the best one.

Instructor (Stephen Boyd): Thank you. You put it all into the best one. So if someone tells you, reveals to you, what the returns are, it's extremely simple: you simply put x equal to e_k, where k is the index corresponding to whichever asset had the highest return. Okay. That was the discussion of deterministic finance. It was a short one. So the interesting part is this: it's a random variable. The return is actually a random variable, and you can get much more sophisticated, but the basic setup goes like this. p is a random variable, and therefore p transpose x, the total investment return, is a random variable. Okay.
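In symbols, following the standard setup in the book (reconstructed): with pbar the mean and Sigma the covariance of the price-change vector p, the return r = p'x satisfies

    E r = pbar'x,        var r = x' Sigma x

so expected return is an affine function of x and the variance is a convex quadratic function of x.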

And now, obviously, you want it large. What does it mean to say a random variable is large? By the way, there are tons of ways to say a random variable is large. Okay. Lots. You could say its 90th quantile is large, its 10th quantile is large; you could say all sorts of things. There are many, many ways of doing this. The simplest thing that makes any sense at all is this: you evaluate the expected return and the variance, or you could take the square root of the variance and get the standard deviation. That's sort of the minimum that makes any sense at all, this pair. And now you have a bi-objective problem, because everyone wants high expected return, and everyone will take a lower risk over a higher one; let's just say that people are risk averse. So what that means now is you plot these two things; you'd look at the risk as measured by standard deviation of return - of course, you can get much more sophisticated measures of risk, like downside risk, value at risk; it goes on and on, but the simplest measure is just variance. Then you have all sorts of long discussions about this, like somebody makes the stunning - are you ready for an amazing observation? - how about this: somebody points out, "Yeah, we judge portfolios by risk, but imagine what would happen if your return was much higher than you thought. Would you be unhappy or happy?" And you'd say, "Well, let me think about that for a minute. I guess I'd be happy." And they go, "See, variance is the wrong measure of risk; you care about the downside variance." So these are all totally obvious things. And by the way, if you make an assumption about the distribution, like, if it's Gaussian, then all these distinctions go away. So let's just take variance, or standard deviation, as the measure of risk, and what you'd do is you'd solve this problem - sorry, I haven't said yet how to solve a multi-criterion problem, but this would actually be the Pareto optimal boundary. You'll do plenty of these, just for fun.
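Jumping ahead slightly, since the recipe is given a couple of slides below: the risk-return curve just described can be traced out in CVX by sweeping a risk aversion weight. A sketch, with made-up data pbar and Sigma (hypothetical, purely for illustration):

    % maximize  pbar'*x - gamma * x'*Sigma*x  over the portfolio simplex
    n = 4;
    pbar  = [0.12; 0.10; 0.07; 0.03];               % hypothetical mean returns
    Sigma = diag([0.040; 0.030; 0.010; 0.001]);     % hypothetical covariance (PSD)
    gammas = logspace(-1, 3, 30);                   % risk aversion weights
    ret = zeros(size(gammas)); risk = zeros(size(gammas));
    for k = 1:length(gammas)
        cvx_begin quiet
            variable x(n)
            maximize( pbar'*x - gammas(k)*quad_form(x, Sigma) )
            subject to
                sum(x) == 1;        % fully invested
                x >= 0;             % no short positions
        cvx_end
        ret(k) = pbar'*x;  risk(k) = sqrt(x'*Sigma*x);
    end
    plot(risk, ret);    % risk-return (Pareto optimal) tradeoff curve

Each solve is a QP; the sweep over gamma traces out the Pareto optimal boundary.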

Now, let me say something about what it means to solve a multi-criterion problem. Actually, that's an interesting question. We all know what it means to solve a scalar problem; what does it mean to solve a multi-criterion problem? Well, there are many definitions. One is this: if that multi-criterion problem has an actual optimal point, then for sure solving means you'd better find it. So if there were actually - and it could happen - here, in a finance problem, you could have something where basically you have one asset that beats all of the others in both risk and return, for example, in which case this would be a stupid problem. The Pareto optimal curve would be a single point. Everything else would be down and to the right, meaning that you should choose this portfolio; or, if you wish, there are many portfolios to choose from, but all of them have both lower return and higher risk, with one of those inequalities strict. So that's what it would mean in this case. It could happen. It obviously doesn't happen in practice, but it could. Okay. Now, the other meaning is something like this. You could define the semantics of solving a multi-criterion problem as producing a Pareto optimal point. That's another meaning. Or an interesting one would be to produce one or more - or actually, you want a method for exploring the Pareto optimal points. That's probably what you actually want when you say solve it. So most methods for solving multi-objective problems will have some parameters, or a joystick, in them that allows you to move around on the Pareto optimal surface, which would be called exploring the Pareto optimal surface, that kind of thing. So, the classic method, probably hundreds of years old: scalarization. How does that work? In the general case, actually, it's quite elegant. It works like this. You choose a positive vector - positive in the dual cone, okay, in the dual cone. Now, you might remember this.

When we looked at the dual cone, it was completely context free and had no actual practical meaning, so you probably ignored it, but one way to say that f_0 of x is K-convex - one condition for that is the following: it's that lambda transpose f_0 of x should be convex. That's a scalar function. It's a convex function for all lambda in K-star, the dual cone. So that's one way to say it. In particular, this is a scalar convex optimization problem. Okay. By the way, this already says something. In the case of multiple criteria, we'll see it's something very simple. It's a positive weighted sum of the objectives. I mean, everybody knows that: you want to minimize a bunch of things, you form positive weighted sums and you wiggle the weights around to change the tradeoff. That's obvious. It's much less obvious when we get to matrices. You want to minimize, for example, an estimation error covariance: it says take a positive definite matrix lambda and minimize trace of lambda times sigma. That's what it says. This is not obvious, just to let you know. Okay. So here's what happens. This is a general problem, non-convex, and here's what is really happening. So here, in this problem, if you take a lambda like this and you minimize this, it basically says you go as far as you can in this direction, and you might find that point right there. That's Pareto optimal. Okay. Now, by wiggling lambda you're changing the slope, and it says, for example, you can get this point. Now, what's very interesting here is that this point right here is Pareto optimal, but you cannot get it by any choice of lambda. It's shadowed. There's no way of getting it. If you want to wiggle around on the Pareto optimal surface, you change the weights, and it's kind of obvious how you change the weights. Actually, the choice of weights relates to our very next topic, which is duality, but for now, you can think of the weights as simply a joystick that moves you along the Pareto optimal surface. So that's the right way to do it.
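The scalarization just described, written out (the book's notation; reconstructed, not verbatim): pick any lambda >_{K*} 0 and solve the scalar problem

    minimize    lambda' f_0(x)
    subject to  f_i(x) <= 0,  i = 1, ..., m
                h_i(x) = 0,   i = 1, ..., p

Any solution of this problem is Pareto optimal for the vector problem; for a multi-criterion problem, K* = R^q_+, so this is just a positive weighted sum of the objectives.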

So if you go back to the risk-return tradeoff, now I can tell you how those plots were obtained. They were obtained as follows: we maximized expected return minus gamma times the variance, and I varied gamma from very small to very large, and I solved this QP for each value, and that's how I got the curve. Every one of those points is risk-return Pareto optimal, and in the convex case this in fact gets you every one. By the way, this has a name; this is called the risk-adjusted return. That's the average return, and now you've penalized the return by some adjustment for risk, and then you would say something like: gamma is your risk aversion parameter, because it tells you how much mean return you are willing to give up for variance in your total return. So this would be a risk aversion parameter. For us, it's a weight; it's in the dual cone of R_+^2; it's a weight that simply weighs variance versus return. That's all it is. So that's that. Okay. So this finishes our whirlwind tour through optimization. You should have read chapter four by now, and you should read, of course, the CVX user's guide, and they should be coherent, in fact. Well, I hope they're coherent. If they're not coherent, let us know. We're by no means done; for the rest of the class, every homework, no exception, you will see problems where you formulate them as convex problems, you do numerical solutions, and you work out more theory as we go. So in some sense, we've kind of finished our first coverage of problems; we do problems from now on, period, in applications and things like that. Now, we're going to enter a new topic, which is actually sort of our last theoretically oriented topic. There will be some others later, but this is a very important one. It's duality. It relates exactly to these weights. And I'll start on it. This is in chapter five. You should start reading chapter five. Okay. Duality is a very interesting topic. It's interesting theoretically, and it turns out it has extremely interesting computational implications. It also turns out it's extremely important in all practical cases, so whenever you use optimization, ever, you have to be aware of duality. And by the way, it's not just a theoretical thing; however, if you randomly search the literature on duality, you'll imagine it's just something that people who do theory worry about. It's immensely practical, and we'll see that. Oh, there's something I forgot. I pushed it on a stack and didn't come back to it. I want to come back and say something about multi-criterion optimization versus scalar. I just want to put them under a common parent, and the common parent of scalar and multi-criterion optimization is this. If you want to put them on a common footing, you'd introduce a concept called the hardness of a constraint. Okay. So that would be the word used: a constraint that is infinitely hard, or, let's say, an objective that's infinitely hard. The problem is "constraint" already means hard, and "objective" already means soft, so let's not call them objectives; let's call them functions.

There we go. Functions. f_0, f_1, and so on; we're gonna call them functions. Infinitely hard constraint - so you can get at the concept of hardness by plotting your irritation versus the value of f_i of x. So here's zero; that means you're not irritated at all, and with the semantics of an objective, your plot looks like that. That's your irritation; I guess down here means you're pleased. Okay. And this says basically the following: f_0 can be large; if it has to be large, it has to be large. It irritates you, but if it has to be large, it has to be large. So this would be a soft irritation function, right? So that's the picture there. You can also be pleased, and if you have a choice between two points, one of which was sort of here and one was here, you'd prefer that one, because you're lower, and it means you're less irritated. Everyone see this? Suppose you wanted to change this irritation function; which irritation functions work right away with convex optimization, just instantly? Let me draw an irritation function and you see if you like it. How about this? I'm down there. That's my irritation. And then someone says, "Why?" and you go, "Not a problem. In this region, yes, I always like things to be smaller, always, but you know what, when things start getting bigger than some threshold, it starts irritating me superlinearly." That's what this is. Can I express this as a convex optimization problem? I can, because I would simply take this function of f_0 - this function is convex and it is increasing, and therefore the composition is convex. Okay. Then the objective value would match yours. For a constraint, the canonical irritation function looks like this; it's along the axis, so I kind of have to draw it that way. Okay. That's zero. That's your canonical irritation function. It basically says - let's say, in a classical problem - if f_3 of x is less than or equal to zero, it doesn't irritate you at all. By the way, the same is also true in the semantics of the basic problem: a smaller value of f_3 means nothing to you. Absolutely nothing. You might find someone in practice say something like, "Oh, no, I prefer a smaller -" but the actual semantics of the base optimization problem say that f_3 of x equals minus ten to the minus nine and f_3 of x equals minus 100 are equal for you. This is by no means any better than that. They are both feasible. End of story. That is the same as that. And these two over here are also equally, and infinitely, irritating. They are totally unacceptable, and here you can't say, "Come on, I just barely missed." Now, as a practical matter, you might argue, and if you look carefully at what numerical solvers will return - I mean, obviously, they'll return things that, for round-off reasons, might violate things slightly.

But anyone who asserts that there's any difference between these hasn't formulated the problem correctly. The common parent is the following: this is a very general setup where every constraint has a certain softness or hardness, and if you're willing to label all the merit functions except for one as absolutely hard and the one as soft, then you have a classical problem. That never really happens. If you label them all as soft, you have a multi-criterion problem. What really happens is that some constraints are non-negotiable and others maybe are, and they might be negotiable in different ways. And you might even have weird things; you could say, "Well, you see this power constraint?" - actually, if the power goes up to two watts everything is fine, because I can use this package; I mean, actually, I like less power, but I can go up to two watts with this package. If you go above two watts, boy, that's not good. Okay, let's get back to duality. Maybe I'll just say a little bit about it. It's Lagrange duality, which we'll see in a bit. I'll just say a little bit about what it's used for and all that kind of stuff. Hopefully, this will demystify the often mystical Lagrange multipliers, which are generally taught in an advanced calculus class. So this will completely demystify them. That'd be the first thing. Actually, duality has a lot of interesting things going for it. In fact, what it's gonna do is, from any optimization problem, it will produce a convex problem, which is called the Lagrange dual. Okay. If you start with a non-convex problem, you'll get a Lagrange dual, which is convex. We can solve convex problems, I mean, roughly speaking. We can solve them. Okay. That Lagrange dual problem is actually gonna give a bound on the optimal value of that original problem. Okay. If that original problem is convex, it's gonna be a sharp bound.

There are some conditions, but it's gonna be a sharp bound, and it's gonna be very interesting, because it's gonna give you alternative methods for solving that problem. It's gonna give you all sorts of interesting things. It's also gonna be extremely interesting when the original problem is non-convex, because the Lagrange dual is still gonna be a convex problem; we can solve it, and you're gonna get bounds on that original problem. We'll see a few cases where that actually happens. So that's a case where a problem that looked hard is, by forming a Lagrange dual and then doing some extra math to show the bound is sharp, actually solved by convex optimization. By the way, there are gonna be lots of those. Any time someone says a problem is easy and it's not posed as a convex problem, it's likely that it's because the Lagrange dual is sharp, and that's true for many [inaudible] problems and all sorts of things. The other thing that can happen is you can get a bound on a problem not known to be hard, but suspected to be hard, and there are amazing things that people have recently proved. There will be a semi-definite programming Lagrange dual of a hard [inaudible] problem, and they prove things for specific ones; for MAXCUT, for example, when you solve this Lagrange dual, the bound is never more than about 13 percent off. Crazy things like this. That gets the complexity theorists all excited, and indeed that result got them very, very excited, as you can imagine. But also, as a practical method for generating suboptimal solutions of that problem, these things can be fantastically useful. So that's our next topic. It's gonna be essentially the last core theoretical topic in the class, and I guess we'll start in on it next time.
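For the curious, the MAXCUT relaxation alluded to here is, as a sketch (not covered in this lecture): with W the matrix of edge weights, the SDP

    maximize    (1/4) * sum_ij W_ij (1 - X_ij)
    subject to  X >= 0 (positive semi-definite),   X_ii = 1

upper-bounds the maximum cut weight, and the Goemans-Williamson analysis bounds how far off it can be, which is the result being quoted.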

[End of Audio]

Duration: 77 minutes