Instructor (Stephen Boyd):Okay. Welcome to 364a, which is convex optimization one. Unfortunately, I have to start class by apologizing. If youíre watching this, I guess youíd say Ė if youíre watching this at the scheduled time of the class, youíll notice that Iím not there. Thatís because Iím in India. You can actually get very confused doing this.
Thereís actually one more confusing thing than this, and thatís when you tape ahead of class before it actually Ė it goes after a class you havenít given yet. That gets very complicated. Iím sorry to miss the first day of the class. As I said, Iím in India or will be, depending on where the time reference is. Thatís very tacky. Iíve never done it before, so this is a first.
Actually, weíre going to make a lot of firsts today. I believe we may be making SCPD history because weíre taping ahead this lecture not only the quarter before but the year before. A lot of people are very excited about this. If this works out well, who knows where this is going to go. I could teach all of my spring quarter class the last few weeks of winter quarter. Weíll see how this works. Weíre possibly making tape ahead history here.
A couple things I should say about the class, other than apologizing for not being present or, for that matter, in the country for the first day of it. Iíll say a couple of things, and then weíll get started. For those of you watching this, since youíre not going to get to see the audience and know that I sent out a pitiful plea to come if youíre around for the tape ahead, you should know the room is not empty.
Let me say a little bit about the class. I think some of it is kind of obvious. We barely got the website up and running. We change it every day. Let me say a few things about it. The first is the requirements Ė there is going to be weekly homework, which if you are interpreting this relative to January 8, I think the correct statement would be ďhas already been assigned.Ē If youíre interpreting it in real time, itís actually fall, so we havenít even thought about homework one, but we will. Itíll all be ready to go.
Thereíll be weekly assignments, and there will be a take home final exam, which will be the standard 24-hour take home exam Ė the usual format. Youíll do serious work. The other requirement for the class Ė itís not really a requirement, but there will be minimal Mat-Lab programming. Itís not a big deal, but there will be Mat-Lab programming, and that will be done using a system called CVX, which if you go to the website, you can find out about.
Youíre welcome to install it and read the user guide now. Thereís no reason not to. Youíll be using it throughout the quarter. You might as well look at the user guide, install it, see if it works. Not everything in the user guide will make sense until you get to week four of the class, but thereís no reason you canít start poking around with it now. When we get to it, you should be extremely grateful for the existence of this stuff, because it trivializes the ďprogrammingĒ that weíll do in the class. Programming is in quotes, and the reason is weíre talking things like ten lines. Itís not a big deal.
By the way, on that topic, I would say this Ė if thereís anyone that Ė CVX is based on Mat-Lab. Thatís the way it is. If thereís anyone who wants to try to use a real language, like Python, I will be very happy to support you in that. The TAs are Yung Lung, who I guess you canít see, and Jacob Mattingly, whoís in New Zealand, but will not be in New Zealand when this is played back. Jacob would be very happy to Ė if thereís one person who does all the numerical stuff in Python. Iíll just say that, and feel free to come forward and let us know.
The only other thing I would say is something about the prerequisites. The prerequisites are basically a working knowledge of linear algebra. 263 absolutely Ė itís clearly the perfect background. If not, what we mean is a working knowledge, so not the ability to know about these things but actually how you use them in modeling and things like that. The other one is the mathematical sophistication level is going to be a jump up from 263. This is a 300 level class, so the mathematical sophistication is going to jump up.
The cool part is the mathematical sophistication jumps up but actually is still in some sense a very applied course in the sense that weíre not just going to talk about optimality conditions. You will do portfolio optimization. You will do optimal control. You will do machine learning. I mean actually do numerical stuff. It wonít just be stupid programming. Itís going to be figuring out whatís going on. It wonít be obvious. Itíll be good stuff.
Iím trying to think what else to say before we start in on the class. Iíll just set it on context how itís different from 263. 263, which is a linear dynamical systems class Ė I would call that basic material. Itís useful in tons of fields. Itís very widely used. A lot of people know it. Itís used in economics, navigation Ė all of these areas. Control and signal processing is basically all done this way. Communications Ė a lot of stuff is done this way. Itís a great first look at how these mathematically sophisticated methods actually end up getting used.
364 revisits it. Itís quite different in a couple ways. The first way is you might even just say this is 263 beyond least squares. In other courses similar to that, so in your first look at statistics and your first course on statistical signal processing or whatever Ė itís the same sort. Everything is Galcean. All distributions are Galcean. All objective and constraints are quadratic. You do this because thereís analytical formulas for it. Behind the analytical formulas, we have very good computational tools to give computational teeth to the theory.
Here, it is going to be way beyond that. Weíre going to have inequalities. Weíre going to have all sorts of other interesting norms and other functions. Itís a much richer class of problems. These are problems much closer to a lot of real problems, ones where Ė you donít have the concept of an inequality in 263. You have no way to deal with noise with a [inaudible] distribution in your first class on statistical estimation. Weíre going to look at things like that.
I should also say that the number of people who know this material is relative to 263 very small. Itís probably a couple thousand. Itís growing rapidly, but itís just a couple thousand. That means youíre going to move from material Ė a lot of the 263 material is kind of from the 60s and 70s. Itís not stuff thatís new. This class, in contrast Ė youíre at the boundary of knowledge, which makes it fun. Maybe youíll even push the boundary. I hope you do, since thatís the point of the class to train you to find your corner of the boundary and start prospecting.
I canít think of anything generic to say, so maybe weíll actually start the class, unless there are any questions.
Student:If you havenít taken 263, is it a big problem if you have a working knowledge of linear algebra?
Instructor (Stephen Boyd):No, no problem. Whatíd you take instead? No problem. I should also say this Ė something about the background. The class is listed in electrical engineering. Iím not actually sure why. The last I checked, thatís the department Iím in, so it seemed like a good idea all around. You donít need to know anything about electrical engineering Ė in fact, weíll talk about lots of applications throughout the quarter, and honestly, probably for half of them, I donít even know what Iím talking about. That will happen. Youíll be in some field. Iíll super oversimplify it. Youíre welcome to point it out.
The point is, Iíll talk about circuit design. Weíll talk about machine learning. On all of these topics, there will be people in the class who know more than I do about it, and there will be a lot of people who donít know anything about it. If youíre one of those latter, donít worry about it. Theyíre just examples. There is one prerequisite, although I donít know what would happen if you didnít have it. That is that I will assume that you know about basic probability estimation and things like that.
Do I have to say anything about the textbook? Go to the website. Thatís all I have to say. Thereís a textbook. Youíll find it at the website. Iíll start by covering broad basics. Itís not in any way representative of what the class is going to be like, what Iím going to talk about today. Donít think it is. Weíre going to cover the basics, and maybe Iíll get into something thatís real. This is going to be the very highest level. Weíll talk about mathematical optimization. Weíll talk about least squares and linear programming, convex optimization. Weíll look at a very simple example, and Iíll say a little bit about the course goals and style.
Letís start with this. Optimization. Depending on the company youíre in, you may have to put a qualifier Ė mathematical. If you say just optimization, it goes to something on complier optimization or something in business practices. If youíre in a broad crowd like that, you need to put mathematical in front to make it clear.
Hereís what it is. You want to minimize an objective function. You have a variable X with N components. These are loosely called the optimization variable. Colloquially, you would say that X1 through XN are the variables. By the way, there are other names for these. People would call them decision variables. Thatís another name for it. In a different context, they would have other names. Decision variables is a nice thing because it tells you these are things you have to decide on.
When you make a choice X, youíll be scored on an objective. Thatís F0 of X. Weíll choose to minimize that. In fact, you ask what if you wanted to maximize it? Then you would just minimize the negative of that function, for example. Itís convenient to have a canonical form where you minimize instead of maximize. This will be subject to some constraints. In the simplest case, weíll just take a bunch of inequalities Ė thatís FI of FX is less than BI. Here, without any lost generality, the BIs could be zero, because I could subtract BI from FI and call that a new FI. Thereíd be no harm.
This is a nice way to write it because it suggests what really these are. In a lot of cases, these are budgets. Itís a budget on power. It often has an interpretation as a budget. There are other types of constraints weíll look at. What does it mean to be optimal for this problem? By the way, this is redundant. You should really just say an optimal point or the solution. In fact, you shouldnít trust anyone who says optimal solution, because that would be like people who say the following is a true theorem. Itís kind of weird. Itís redundant. It doesnít hurt logically, but it kind of Ė what about the types when you forgot to add the attribute true? What were those theorems?
Iíd say an optimal point or a solution is a point that satisfies these inequalities here and has smallest objective value among all vectors that satisfy the constraints. Thatís what you mean by a solution. Of course, you can have everything possible. You can have a problem where there is no solution. You can have a problem where thereís one, where thereís multiple, and you can have other things, which weíll get to more formally. This is just our first pass.
Letís quickly look at examples. Here are some absolutely classic examples from all over. The first is portfolio optimization. In portfolio optimization, the decision variables X1 through XN Ė these might represent the amount you invest in N assets. For example, if XI is negative, it means youíre shorting that asset, unless that asset is cash, in which case youíre borrowing money. The X1 through XN is actually a portfolio allocation.
Youíll have constraints. Obviously, there will be a budget constraint. That tells you the amount that you have to invest. There might be maximum and minimum investment per asset. For example, the minimum could be zero. Youíre not allowed to actually purchase a negative quantity of an asset, which would be shorting that asset. You can have no shorting.
You might have a maximum amount youíre allowed to invest in any asset. You might have a minimum return that you expect from this portfolio. Your objective might be something like the overall risk or variance in the portfolio. Youíd say hereís 500 stocks you can invest in. My mean return absolutely must exceed 11 percent. Among all portfolios that meet Ė Iím going to have no shorting, and the sum of the XI is going to equal one because Iím going to invest one million dollars.
The question is among all portfolios, they guarantee a mean return of 11 percent. I want the one with the smallest variance in return. That would be the one with the least risk. Weíll look at these in detail later. This is one where when you solved the optimization problem, it would probably be advisory in the sense that youíd look at what came back and say well, how interesting. Maybe Iíll execute the trades to get that portfolio or maybe not. This could also be real time if youíre a hedge fund and youíre doing fast stuff. This could be programmed [inaudible]. There are lots of different things this could be used for.
Letís look at [inaudible] electron circuit. Here, you have a circuit. The variables, for example, might be the device widths and lengths. It could be wire widths and lengths, something like that. You have constraints. Some of these are manufacturing limits. For example, depending on what fabrication facility is making your circuit, you have a limit on how small the length of a gate and a transistor can be. It canít be any smaller than 65 nanometers. Thatís the smallest it can be.
Thatís a manufacturing limit. You would also have performance requirements. Performance requirements would be things like timing requirements. You could say this circuit has to clock at 400 megahertz, for example. That tells you that the delay in the circuit has to be less than some number because it has to be done doing whatever itís doing before the next clock cycle comes up. Thatís a timing requirement.
You might have an area requirement. You could say if I size it big, Iím going to take up more area. This portion of the circuit canít be any more than one millimeter squared. The objective in this case might be power consumption. Youíd say among all designs that can be manufactured in the sense that they donít reject your design instantly and meet the timing requirements among all of those, you want the one or a one with least power. That would be minimized power consumption.
Our last one is just from statistics. Itís a generic estimation model. Generic estimation looks like this. Whatís interesting here is these are from engineering and finance and stuff like that. In these cases, the Xs are things youíre going to do. In this case, theyíre actually portfolios youíre going to hold, and in this case, they will translate into polygons that get sent off to the fabrication facility. The [inaudible] are actions. Here, itís very interesting, because the XIs are not actions. Theyíre actually parameters that youíre estimating.
Here, you have a model. You can take any kind of data or something like that and youíll have parameters in your model. You want to estimate those parameters. These XI are not things youíre going to do. These are numbers you want to estimate. Thatís what it is.
You have constraints. For example, letís say that youíre going to estimate a mean and a covariance of something. It can be much more sophisticated, but letís suppose thatís the case. We have a constraint here. There might be some constraints. Hereís a constraint. The covariance matrix has to be positive semi definite. Thatís a constraint, because if you created a covariance matrix that wasnít, youíd look pretty silly. Thatís a nonnegotiable constraint.
You could also have things like this. You could say these Xs must be between this limit and that limit. For example, suppose youíre estimating some diffusion coefficient or some parameters known to be positive. Then you should make it positive. These are parameter limits. The objective is, generally speaking, dependent on how you want to describe the problem. If you describe the problem in an informal way, itís a measure of misfit with the observed data. For example, if I choose a bunch of parameters, I then propagate it through my forward model and find out what I would have had, had this been the real parameter.
The question is, does it fit the observations well? Another way to say it Ė itís a measure of implausibility. Thatís really what it is. Itís a measure of implausibility. In this case, weíre minimizing it. In many contexts, youíll see it turned around so itís maximized. If youíre a statistician, you would reject the idea of a prior distribution on your parameters, and your objective would be to maximize the likelihood function or the log likelihood function. Thatís what youíd be maximizing. Thatís essentially a statistical measure of plausibility. Youíd minimize the negative log likelihood function, which I think they call loss in some of these things.
Implausibility in a [inaudible] framework, a measure of implausibility would be something like the negative log posterior probability. That would be a measure of implausibility. If you minimize that, youíre actually doing MAP, which is maximum a posteriori probability estimation. By the way, weíll cover all those things again, so this is just a very broad brush. If you donít know what these things are, you will, if you take the class.
These are examples of optimization. Everyone in their intellectual life goes through a stage. It can happen in early graduate school, mid graduate school. It can also happen in later life, which is bad. Itís not good to have it when youíre an adult. Let me describe this stage of intellectual development. You read a couple of books and you wake up at 3:00 in the morning and say oh my god, everything is an optimization problem. Actually, a lot of books start this way. My answer to that is Ė you have to go through this stage, so thatís fine. But get over it quickly, please. Of course, everything is an optimization problem.
What youíll find out quickly is it doesnít mean anything to say that. It says nothing. What matters is what optimization problem it is, because most optimization problems you canít solve. They donít tell you that. Typically, they donít tell you this. Or, what they do is they do tell you, but they distribute it through the whole quarter. It turns out if you just say a little bit of that message every day, at the end of the quarter, no one can accuse you of not having admitted that we donít know how to solve most optimization problems.
However, because it was below the threshold of hearing in each lecture, as a result, all these students went through and the big picture didnít come out, which is basically Ė the way you cover it up is by going over 57 different types of algorithm for solving things, which basically is a cover up for the fact that you canít solve anything. Weíll get to that.
This is related to the following Ė solving optimization problems. To say that everything is an optimization problem is a stupid tautology. It all comes down to this. How do you solve them? It turns out itís really hard, and basically in general, I think itís fair to say it canít be done. I can write down shockingly simple optimization problems and you canít solve it, and itís very likely no one else can. I can write down very innocent looking optimization problems and theyíll be NP hard.
What do people do about it? Well, there are a couple of ways to do it. What do I mean by NP hard? Nondeterministic polynomial ties. You take at least quarters on this in a computer science class. Iím going to give you the 15-second version. Iíll tell you about it from a working perspective. In the 70s, people identified a bunch of problems that so far no one had figured out a good polynomial time algorithm for solving.
Polynomial time means thereís an algorithm where the problem comes in, you measure the size by the number of bits required to describe Ė you measure it by the number of bits required to describe the data, and then you look at the number of operations you have to do. If you can bound that by a polynomial, then youíd say thatís polynomial time. Bear in mind, Iím compressing a ten week course to about 14 seconds. Iím going to gloss over some of the details.
There were a bunch of problems where people Ė the famous one would be traveling salesmen problem. No one had found a polynomial tying method. A guy named Cook, maybe, did the following. He catalogued a bunch of problems and said if you can solve any of these, then if you make a method for solving that, I can make a reduction. I can take your problem and map it to this, and with that, I can solve the traveling salesmen problem. Then you had an intelligent way of saying of two problems, one is just as hard as the other.
NP hard means, and this is really Ė people are going to cringe if they have a background in this Ė it means itís at least as hard as a catalog of problems thought to be very hard. Thatís your prototype. Basically, what Iím saying is not true, but as a working definition of what it means Ė for your purposes, this is going to be fine. It means the problem is at least as hard as traveling salesman.
Let me tell you what that means. It is not known that these things cannot be solved from the polynomial tie. Thatís not known. It is thought that thatís the case, and it may be at some point, somebody will show that you canít solve these. Right now, theyíre thought to be harder. I think thereís an insurance policy. Let me tell you why itís an insurance policy. A ton of super smart people have worked on all these problems, and now all of these things are banded together as you would do in insurance. You band a whole bunch of things together.
What would happen is if tomorrow, somebody, probably a Russian, proves P equals NP, meaning you can solve all of these problems, it would indeed be embarrassing for a lot of people. However, the embarrassment is amortized across Ė people could say you went around and made a big deal about convex problems and polynomial time. I say look at all the other people. The embarrassment is spread across a great swath of people. Itís an insurance policy. It is thought that theyíre really hard. If theyíre not, youíre in very good company with the people who also thought they were hard.
Student:Itís just in its own field. Itís not proven to be certain. Itís not mathematically proven to be non-polynomial.
Instructor (Stephen Boyd):Thatís right. Thatís why itís still an open conjecture. If, in fact, it turns out that these are theoretically not hard, the [inaudible] could end up being huge, and that would also blunt the embarrassment. In any case, the embarrassment is spread across a lot of people. Weíll come back to that problem several times in the class.
Methods to solve a general optimization problem Ė they always involve a compromise. The two main categories of compromise are as follows. The first one is to not back down on the meaning of solve. Solve means solve. It means find the point that has least objective and satisfies the constraints. Thatís the definition of solve. You leave that intact, but you leave open the possibility that the computation time might involve thousands of years or something like that. People call that global optimization, and itís a big field. It is almost entirely based on convex optimization.
The other one, which is the one most widely done by people who do optimization Ė they do the following. Itís very clever what they do. They go and solve Ė they put a footnote, and then way down at the bottom, they write this Ė they write and now Iíll have to ask you to zoom in on that. There it is. It says not really. What that means is they make a big story about this, and they say itís a local solution. What happens is they donít Ė it gets tiresome to say solve locally, plus, it doesnít look good.
What happens is you drop it after awhile, and then you say I solved that problem, and people give talks and say I minimized that. The point is they didnít. What they did was they got a local solution. Weíll talk more about that as well. There are exceptions. If you were going to do that, I would maintain, although this is not widely held Ė youíll find that many of my opinions are not widely held. If you were going to do that, I would do it via convex optimization.
Student:But you will be doing it in the spring?
Instructor (Stephen Boyd):Yes. Itís scheduled to be in the spring. Who knows when Iíll do it. What are the exceptions? There are cases where you can solve these problems with no asterisk. The most famous one in the world by far is least squares, least norm. No asterisk there. Itís not like oh, yeah, well, I transpose A inverse A transpose B. Yeah, that often works super well in practice. The status of that is that is the global solution. There are a couple others that you might not know about, and that would be things like linear programming problems. Weíll get to those quite quickly. Those are ones where you get the answer.
There are asterisks in these, but theyíre much smaller fine print. Iíll get to them later in the class. When thereís an admission that has to be made, I will make it. The parent of these, and a considerable generation, is this convex optimization problem. This is really the exception. The rough idea, to put it all together, is something like this. Most optimization problems Ė letís review what weíve learned so far. A Ė everything is an optimization problem. B Ė we canít really solve most optimization problems. The good news is here, that there are huge exceptions. This entire class is about one of the exceptions.
When you do convex optimization, there are no asterisks. You solve the problem and there are no apologies. Itís the global solution. Is life convex? I would have to say no, itís not. I hope itís not. Itís not sad. If we get to that later today, youíll know why itís not. To check convexity, you have to take two instances and then form weighted averages in between. What would the average of yours and your life look like? The other thing that has to happen is that life has to be better than the average of the qualities of your lives. Letís keep that as a running topic that weíll revisit periodically. For the moment, my position is itís not.
Student:Are there other exceptions besides convex optimization?
Instructor (Stephen Boyd):Yes, there are. Singular value decomposition. Thatís one where our can compute the singular value Ė I can write it out as an optimization problem pretty easily. I could say find me the best rank two approximation of this matrix. Thatís way non-convex. Yet, we can solve it and we get the exact global solution. Thatís an example. There are some combinatorial problems. So if youíve taken things on combinatorial algorithms in computer science Ė combinatorial algorithms on their face would appear to be non-convex. It turns out a lot of them are convex. Itís quite strange.
You take something thatís a combinatorial optimization problem that on its face would not be. It turns out if you let the variables vary between zero and one, you can prove that thereís always a solution, which is on the vertices, and so thereís actually a lot of problems that we can solve but are not convex. Some of them can be secretly turned into convex problems. Getting a rank two approximation of a matrix is an excellent example. We can definitely do that and it is definitely not convex.
We have least squares or, if youíre in statistics, regression. It may have other names. I donít know. Hereís the problem. You want to choose X to minimize the [inaudible] norm squared of AX minus B. If A is full rank or skinny, you get an analytical solution, which you know if you know about linear algebra. Itís just A transpose A inverse A transpose B. In this case, itís a unique solution. In fact, we have a formula for it, which is overrated. In this class, weíre going to look at tons of problems. There will be analytical formulas to almost none of them. Youíll have to wean yourself away from analytical formulas.
The sociology of that is very interesting. Youíve been trained for 12, 14 years on 19th century math, which was all based on analytical formulas, but weíre going to move beyond that. If you look most deeply into what it means to have an analytical formula, it turns out Ė weíre going to solve AX minus B with an infinity norm there or a one norm. Thereís no analytical formula for it now. But it turns out we can calculate that in the same time it takes for you to calculate this. Having a formula is something that will mean less to you by the end of this class.
Student:So not all optimization problems have that subject.
Instructor (Stephen Boyd):When we go over this in hideous detail later, thatís the case. I should mention you should be reading chapter one and chapter two. Chapter one will have a lot of this vague stuff, and chapter two will start in on the real stuff. Youíre absolutely right. An optimization problem Ė you do not have to have any constraints, in which case itís called unconstrained, and you donít even have to have an objective. If you have neither, thatís called a stupid problem. Itís minimized. The universal solution Ė X equals zero.
If you donít have an objective, itís called a feasibility problem. In some fields, they call it a satisfaction problem. You have a bunch of inequalities and you want to find a point that satisfies all of them. Thatís what that is. Back to least squares. Much more important than the formula Ė it turns out you can write down a formula for a lot of stuff, and it doesnít do you any good if you actually want to calculate it. Here, there are reliable and efficient algorithms that will actually carry out this computation for you.
By the way, thatís why least squares is so widely used because it has computational teeth. Instead of just talking about it, which is what a formula would allow you to do, you can actually use it. You can actually compute. Weíre not going to go into too much detail in the numerics. At one point in the class, we will. Weíll talk about how you exploit structure and things like that. Here, just so you know, the computation time to solve a least squares problem goes N squared K. N is the number of variables, and thatís the small dimension. K is the number of rows in A. Itís a good thing to know. Itís the small squared times the big.
You can do this as I said not long ago in 263 with modern computing. Itís amazing what you can solve. Then, we did a couple thousand row least squares problem. You can call it a regression problem in 2,000 dimensions with 1,000 regressors. It was three seconds. By the way, if A is sparse or has special structure Ė suppose part of A has an [inaudible] embedded in it. That would come up in medical imaging. You can do that faster. In image processing, youíd have transformations. You can solve least squares that are way bigger.
I would say that least squares is a mature technology. When I do this, people who worked on all of this Ė itís a huge, active field in lots of areas Ė they get extremely angry when I use the word technology. I said by the way, I mean technology here. This is the highest praise. This is not an insult. What it means is that other people know enough about the theory, the algorithms, and the implementations are so good and so reliable that the rest of us can just type A backslash B.
Of course, if youíre building some real time system or the sizes get to a million or ten million, youíre not going to be using backslash. But thatís it. Thatís a boundary thatís growing with time. Thatís the wonderful thing about anything that has to do with computers. Just take a vacation for a year. My colleagues at the other end of the [inaudible] who actually do real things, they and all their friends around the world will make stuff twice as fast. You just go away. You go to the Greek Islands for three weeks. You come back and computers are faster. Itís great.
Of course, Iím not telling people to just use A backslash B. Everyone here has done A backslash B. Probably only a few know what it actually did. Nothing terrible has happened. Iíll come back to them and when theyíre yelling at me, Iíll say back off. Do you use TCP/IP? Sometimes, they wonít even know what Iím asking. Then Iíll say are you using TCP/IP as a black box? You donít even know whatís inside it?
Some of them will say yeah, I do. Even communications on your laptop between different components Ė Iíll say do you know what it does? Most of the time, theyíll say no. Hereís what you need to know. It is not trivial by any means to make a reliable bit pipe to transfer bits from one place to another with unreliable medium insight. Itís no more trivial than it is to solve this problem numerically. It is not trivial. All you need to know is this. Very intelligent people have thought about this problem very deeply. Theyíve thought about what can go wrong, deadlocks, all sorts of crazy stuff, and they have come up with algorithms about which they know an incredible amount.
Part two Ė other people have implemented them, and these are very reliable implementations. The result is for most of us, we can just consider certain things to be reliable bit pipes. We donít have to care about how many packets were retransmitted or how the flow control went and all that kind of stuff. Like least squares Ė if youíre doing real time control or something like that or if youíre doing some computing that requires extreme reliability, then you canít treat TCP/IP as a black box.
You might ask does this calm them down? The answer is no. This makes them more angry. I mean technology here in praise of this. When you use least squares Ė if youíve just come from 263, youíve used least squares. Thatís just the beginning. The way you use least squares is this. Itís easy to recognize a least squares problem. Sometimes, it comes to you on a platter. Somebody walks up to you and says I took these measurements. Itís a linear measurement model. Help me guess these parameters. I received a signal and went to this channel Ė this kind of thing.
If you learn a few tricks in least squares, you can do well. If you learn how to do weight twiddling and you learn about regularization, those two alone Ė you are now trained and ready to do battle with using least squares as a tool. You will do really well. Weights is basically you go into a problem and someone says thereís no limit on the sum of the squares. The limit is on this. You say no problem. I have these weights here. You look at the least squares solution. You donít like what you see. You change the weights and you do it again.
In engineering, we do this all the time. Itís called weight twiddling. Itís a slightly derogatory term, but not bad. Iím sure that you do it in statistics, but I donít know that they admit it. Good. When Iím making fun of a department, I like to have a face to look at. They like to stick to Ė if youíre doing real statistics, you go back in and change the prior distribution. I have to warn you, all of these lectures are being put on the web. Itís weird and fun. Weíll fuzz out your voice, and if the camera focused on you, weíll put the squares on it. Donít worry about it. If youíd like, we can obscure your department. We can beep it out.
Iím just going to guess that in statistics, they make a big story about the prior distribution. I bet if they donít like the way it comes out, they go back and change that prior distribution. They cover up their tracks. We do it in engineering, and weíre not embarrassed. I bet they do it. Next topic is linear programming. Some of you have seen this. How many people have seen linear programming somewhere? A bunch. Okay.
Linear programming Ė in my opinion, itís what people should learn immediately after least squares, singular value decomposition, linear programming. If youíre going to stop somewhere, thatís a very good place. Iím taking about if you really want to go out and field algorithms, do real stuff Ė thatís a great place to stop. Everybody should know about it. If you take this class, youíre going to know a lot about it. Linear programming is a problem that looks like this. Minimize a linear function subject to a bunch of linear inequalities.
Weíre going to talk about it in horrible detail later in the class, so Iím not going to go into too much detail now. I want to talk about the attributes of linear programming. The first is in general, except for very special cases, thereís no analytical formula for the solution. There is no A transpose A inverse A transpose B. By the way, donít confuse that with our inability Ė you can say a huge amount of qualitative, intelligent things about a linear program. What you canít do is write down a formula for the solution.
There are reliable and efficient algorithms and software. In fact, as a homework exercise, you will write something in the top class linear Ė it will be 50 lines of Mat-Lab or Python, and youíll write something that will solve huge linear programs quite well. Itís no asterisk on solve Ė you get the solution. The computation, by the time, is proportional to M squared N. Thatís exactly the same as least squares. If you equate rows of the least squares objective with constraints, itís identical. Thatís not an accident.
M is a number of constraints here, and K is the height of A.
Student:So this M Ė thereís still that many rows in A.
Instructor (Stephen Boyd):Yes, if I write that as a matrix inequality, AX less than B, yes, that would be Ė this M would be that K. We spent hours discussing whether this should be M or K, and we finally Ė it probably should be K.
Student:What about C?
Instructor (Stephen Boyd):X is an RN, so C is an RN, too. Actually, linear programming Ė itís a very old topic. Fourier wrote a paper about it. The modern era starts in the 30s, where else but Moscow. The modern era traces back to Stanford and George Dantzig. LP was something that you just talked about until you had computers, at which point LP looked a lot more interesting. That was 1948. I think a lot of people knew about linear programming. Something like this coupled with computers Ė thatís a mature technology.
Linear programming Ė it looks like it would be easy to recognize, and in some cases, a problem really comes to you on a platter like this. Someone comes to you and says help me solve this problem. I want to minimize this weighted sum of the variables and I have some budget constraints. There are three problems that have exactly this form. Hereís the really cool part about linear programming. You will be stunned later in the class when you see the kind of problems that we can reduce to linear programming. Things that do not look like linear programming at all we will reduce to linear programming.
What that means is theyíre solved. Unless your problem is huge or you have some super real time thing like in communications, then thereís a sense in which youíre kind of done at that point. If you do medical imaging, thatís a mistake, because the problems are too big. You canít say itís a linear program and walk away. You canít do communications. It depends on the time scale. You had to adapt to things. You canít detect the bits transmitted in a packet, because thatís coming at you way too fast. I would recommend that you go into fields that are in between in size.
Weíre going to see a bunch of tricks. Finally, Iíll say what convex optimization is, because it would seem in a class with that title, one should say what it is in the first lecture. Hereís what it is. Itís an optimization problem Ė minimize an objective subject in constraints. Hereís what has to happen. The function F0 and the FIs have to be convex. You know what convex is. It means that the graph looks like that. Thatís a convex function. The graph should have positive curvature.
Least squares problem has that form because if I look at the least squares objective and I look at the plot of it, itís a quadratic function squared, and basically, it looks like a bowl. If you take a slice at a level set, you get an ellipsoid. Itís convex. Linear programming also is a convex problem because all of the objectives are linear. Linear functions are convex. Linear functions are right on the boundary. They have zero curvature.
One way to say convex is just positive curvature. This includes least squares, and kind of the central idea at the highest level of this class is this. If you want to solve a convex optimization problem, there are no analytical solutions. There are in special cases. Weíll see tons of cases in communications and various other places where they have special analytical solutions. Youíve seen one in least squares already.
In general, there isnít an analytical solution. However, there are going to be reliable and efficient algorithms for solving these with no asterisk. You will get the solution. In fact, if someone came from another planet and landed here and asked you what youíre doing, there would be Ė it would be very difficult to make an argument that solving a convex optimization problem compared to a least squares problem was, for example, that youíd been reduced to a numerical solution, which is what a lot of people might say with a hint of Ė they donít like the idea. They say numerical method in a way that makes you want to go wash your hands.
The computation time for solving convex problems is roughly proportional to something like N cubed Ė the number of variables cubed Ė N squared M and F, where F is the cost of evaluating the functions and their first and second derivatives. We donít have to get into that, but the point is itís like least squares. You can basically solve these. You will solve them in this class. Youíll know how to write the algorithms to solve them and stuff like that. It is an exaggeration, in fact, to say itís a technology. Itís almost a technology, and every time I give this class, itís getting closer and closer.
I should say something about Ė weíll get to it later, but this is a very profound statement. It basically identifies a huge class Ė letís review what we know so far. A Ė everything is an optimization problem. B Ė we canít solve any optimization problems despite ten or twenty weeks of lectures on it. Now what Iím saying is on the positive side. Itís really cool, and itís not obvious at all. It says if the objective and the constraints all have positive curvature, then this thing is just like least squares. You can solve it exactly. You can solve it quickly. Although Iím not going to focus on it, you can say a whole lot about it theoretically.
We will say some stuff about it theoretically, but thatís not going to be the focus of the class. This is a pedantic way. You might prefer it if I wrote it this way. Four Theta N zero one. You might prefer it that way.
Student:I donít know why it has to be zero one, though. Why canít it be 99 and 100.5 or negative five?
Instructor (Stephen Boyd):There are versions where there are different constraints. If I just say Ė it said F of Alpha X plus Beta Y is less than or equal to Ė if itís Alpha F of X for all Alpha and Beta, the function would be called sub linear. Itís a different thing. This is just a definition of convex.
Student:It doesnít have any physical basis or anything. Itís not gonna turn concave if you suddenly make it more than one.
Instructor (Stephen Boyd):Weíll answer your question. This is not an accidental. For now, thatís the definition.
Student:[Inaudible] just a statement of the [inaudible] of the line between Ė
Instructor (Stephen Boyd):Thatís what it is. Itís this picture. By the way, weíre going to get to this later, so youíre not supposed to be understanding everything. First of all, Iím not saying that much. The stuff I am saying is kind of vague. Youíre not supposed to be parsing this. Youíre supposed to have your relaxed parsing mode on and let me get away with stuff. Thereíll be a time for is that really a minus sign. Thatís coming later.
You already know something you didnít know. Optimization problems where the objectives and constraints have positive curvature, roughly speaking, we can solve. The theoretical implication, I think, is extremely interesting. The practical ramifications of this are absolutely immense. It means youíre messing on some problem in your own field, and if you get it to a problem of this form Ė it wonít be obvious.
Itís much cooler when you get to it and you turn things around and you switch some variables around. All the smoke clears. There are some functions there that itís not remotely obvious theyíre convex. You show they are, and then youíre a hero, and itís fun. It means you can now solve those problems.
I give a talk about these things lots of places. People say thatís really cool. How do you learn about this? How do you know if a problem is convex or not? I go no problem. You just have to come take my class. You do ten homeworks, each of which takes 20 hours and then you do this 24-hour final. After that, for sure youíll be quite well trained. Then theyíre less enthusiastic about it.
Itís actually often difficult to recognize convex problems. Thatís going to be a theme of the class. Let me distinguish a few things there. I would distinguish problems that come on a platter convex, the ones where you have to do some work and transform them and stuff like that. Let me move on.
I want to do an example just to give a rough idea of what this looks like. For people who did 263, this will kind of tie into that. Hereís our problem. I have a surface here with patches, and I have some lamps here. Weíre going to choose the illuminating powers on the lamps. Thatís going to be our variable.
The illumination on a patch here is going to be the sum of the illumination from all the lamps here. Itís going to be proportional to the lamp power and then the proportionality constant is going to be an inverse square law. R is the distance between the patch and the lamp. Thereíll be something which is a shading effect, because obviously, if youíre coming in straight on here, then youíre going to get the full power.
If this lamp, for example, puts very little illumination on here, and this were below its horizon, if there were a lamp here, it would not illuminate this patch at all. This is a simple thing. The problem is to achieve a desired illumination which is given. You want to do this with bounded lamp powers. I have a maximum power. Powers cannot be negative. I care on a log scale, because what I care about is missing the lamp power by two percent or twelve percent. I donít care about absolute. Itís ratio compared to this one.
The question is how would our solve it? Letís take a look and see how you might solve it. If your question is do I shamelessly reuse material from one place in another, I can confirm that, yes. Are you asking have you seen that figure before? The answer could well be yes, you have.
Letís talk about how to solve it. The first thing you might do, and I would recommend this Ė the first thing you do is you say letís try some simple suboptimal schemes. One is just this Ė you set all the lamps at the same power. You vary it and you plot this objective. You do a search over it. Thereís no dishonor in it. You do that. That might work, but it might not. Thatís a good baseline design. You could say, well, I just learned least squares from 263. Youíre going to use least squares. Iím going to do this.
The objective here is not the sum of the squares. This is real life. This is the illumination engineers. Everyone uses the absolute value Ė the sums of the absolute values of the law of percentage error. Iím making it up, of course. You say well, good for you. We use the sum of the squares. You solve this problem. When you solve this problem, I guarantee you some of the Ps are going to come out negative. By the way, youíre going to do super well. Youíre going to get a nice, flat illumination pattern.
What will happen is youíll look at the powers and a whole bunch of them will be negative. It turns out you can do really well in illumination with lamps that can spray negative power out. Thatís not good. Then the heuristic. What do you do? Hereís what you do. You say if a P is bigger than the maximum power, I just set it equal to P max. If itís a negative lamp, I turn that lamp off.
Once again, you see how well you did, and you might do better than uniform Ė maybe worse. I donít know. Now, this is what someone trained in least squares would do. Theyíd say not a problem. Theyíd go over here and say I want PJ to be in the interval zero P max. Therefore, Iím going to add a regularization term which charges P for being away from the center of that interval. Everybody see what this is?
You start with all the Ws one, and you solve this problem. Then, you just start weight twiddling. Youíll do quite well here. Youíll come up with some algorithm that twiddles the weights, and how you twiddle them is totally obvious. If P comes out outside that interval, that weight needs to go up in the next cycle. If P is timid because your weight is so high, you want to decrease that weight. A couple of cycles of this, and youíre going to get a very, very good solution.
Unfortunately, you might also go on to then write an impossible to read 12-page paper with pages and pages describing your iteratively reweighted illumination thing. Hopefully, you wonít. You could also use linear programming. If you know about linear programming, you could actually solve this L one problem where you minimize an L one norm. This is closer than that. Linear programming Ė it handles the inequalities. Thereís no twiddling there. This would probably do the best of all of them.
The problem is convex, so this objective function Ė it just looks like this. Itís linear over here, and then itís an inverse over here. You look at that function and you realize a couple of things. The important part Ė thatís what youíre going to be trained on in the next ten weeks is looking for convexity. This is what we like to see.
By the way, you will see that that function is not differentiable. In a lot of other treatments of optimization, differentiability has this very high role, because a lot of things are based on gradiance and derivatives, and thereís no derivative there. So in convex optimization, differentiability is going to actually play a much lower role. In fact, itís a non-role.
This problem, even though it is non-differentiable here, it can be solved as fast as least squares if you know what youíre doing. We might even have you write from scratch a solver for it. We could also assign it. It would be four lines in a higher-level language or something like that to solve this. Thatís it. This is just an example of a problem where itís not obvious exactly how to solve this, and you can imagine a lot of people not knowing.
Letís look at a couple additional constraints.
Student:Where did that curve come from?
Instructor (Stephen Boyd):I plotted it.
Student:I mean the equation. How did you know to do [inaudible]?
Instructor (Stephen Boyd):If you go back and exponentiate my other objective, youíll find this.
Student:So you just did it on both sides.
Instructor (Stephen Boyd):Yes. If you minimize something, itís the same as minimizing the X of that thing because X is monotone increasing. Letís add some additional constraints here. Hereís one Ė no more than half the total power is in any collection of ten lamps. That would be one. Another one is no more than half the lamps are on, but you get to choose which half. I wonít go into all the gory details, because for one thing, Iím running out of time, but itís not Ė both of these sound complicated or easy or something like that. If you donít know about convexity, you wouldnít know the second one is an unbelievably difficult problem.
In fact, youíd have to check all N [inaudible] N over two sets of half the lamps and for each one, solve the problem. Basically, everything would come down to that to actually get the real answer. You could get very good heuristics that would guess an answer, but they would not be certified as the actual correct one. This one Ė no more than half the total power is in any ten lamps Ė that actually is convex, and itís not obvious. By week three of this class, you will know things like that.
This is actually cool, because these are things that look very similar. If I said them the right way, I could probably make you think that they are kind of the same question. Iíll often do that in talks, except I donít give the answer, and then I pick some poor person. The point is these are just not obvious at all, these things.
Why is it famous? Itís not famous.
Student:I mean, people have [inaudible] papers and have investigated [inaudible].
Instructor (Stephen Boyd):You mean the illumination problem? I think I probably made it up one day and then actually [inaudible] Ė I can say this because heís a colleague of mine at MIT. The next thing I knew, it was in his book, the famous illumination problem. Youíve been subjected to it in 263. That would be the 263 version of the illumination problem.
Notice that we didnít ask you to find any powers, because you would have had this annoying problem of negative powers coming out. Thatís selective, though. I donít think the problem is any more famous than itís been here. I donít know.
No, because this would subsume all Ė you have to choose which half of them are actually going to be possibly on, not which are actually on. Itís exponential in the end. Let me just say a little bit and then Iíll quit. The course goals and topics Ė the goal is to allow you to recognize and formulate problems. You should know enough to develop code for these things. Youíll see itís very simple. You might not think that about seven weeks into the quarter, but youíd be shocked at how simple some of these things are.
Weíll focus a bit on the theory of this, and do want to skip forward to something at the very end here. I know Iím over, so Iíll just take a minute or two to do this. Iíve already mentioned this. First of all, the theory of convex optimization is more than 100 years old. The first paper on this is from 1890 or something like that.
In fact, the idea is traced earlier. By 1970, it was very well known, the theory. There are some points here that I think I donít have time to go into, but you can read about this in the first chapter. What I do want to say is something about why itís interesting now, if this is 100 years old as a mathematical subfield. What makes it interesting now?
What I can say is since about 15 years ago, people have been finding applications in lots and lots of areas, and it just sort of starts coming up. It may not particularly help you in some area to know that a problem is convex, but it sure doesn't hurt. It might in some cases allow you to solve it. This was sort of not the case, with the exception of least squares and linear programming.
These have been widely used since the 50s and widely applied since the 1950s. Basically since the 1990s, there were a lot of things found in areas like machine learning and statistics and control and signal processing and stuff like that. You get a nice, positive feedback loop because once more people start discovering these problems and start writing papers on, for example, the famous illumination problem Ė once they start appearing, what happens is people who write the algorithms see two of these and say somebody should write a code for the illumination problem.
Then, they write a beautiful code, well tested, numerically stable, hopefully open source and all that, and then everyone can now solve the famous illumination problem. Only then do people realize that there never was such a problem, hopefully. I quit here, and Iíll see you, in theory in two days, but more like two months or something like that.
[End of Audio]
Duration: 80 minutes