Convex Optimization II - Lecture 04

Instructor (Stephen Boyd):Okay. I guess we're - we've started. Well, we get the video-induced drop-off in attendance. Oh, my first announcement is this afternoon at 12:50 - that's not this afternoon. I guess technically it is. Later today at 12:50, believe it or not, we're going to make history by having an actual tape-behind, where we're going to go back and do a dramatic reenactment of the events that occurred on the first lecture. That's - so, I don't know, so people on the web can see it or something like that. That's 12:50 today. It's here. Obviously not all of you are going to come. But those who do come will get a gold star and extra help with their projects or who knows what. We'll figure something out. So please come because, although it's never happened to me in a tape-ahead or tape-behind, you know it's every professor's worst nightmare to go to a tape-ahead and have no one there. So, in fact, it's not even clear. Just some philosophical questions. And practical ones, like can you actually give a lecture if there is no one there? I'm pretty sure the answer is no. But okay. So we'll start in on continuing the constrained subgradient method, the projected subgradient method. Oh, let me make one more announcement. Homework 1 is due today, which has a couple things on it. We pipelined, so homework two is currently in process. And then we're going to put out homework three later tonight, something like that. So, hey, just listen, it's harder to make those problems up than it is to do them. Come on. We can switch. You want to make up the problems and I'll do them? We can do that if you want. Fine with me. I can do - well, of course, your problems have to be well-posed and they have to actually kinda mostly be correct and kinda work. So anyway, we can switch if you want. Just let me know.

Okay. So let's look at projected subgradient. So the projected subgradient method, let me just remind you what it is. It's really quite stupid. Here it is. It's amazing. Goes like this. You call f dot get subgrad at x to get a subgradient. So you need a weak subgradient calculus method implemented. So you get a subgradient of f. You then take a step in the negative subgradient direction with a traditional step size. Of course, this in no way takes into account the constraint. And then you project onto the constraint. Now this is going to be most useful - projected subgradient is meant to be most useful - when this projection is easy to compute. And we talked about that last time. There are several cases where projections are easy. That's it - projection onto the unit simplex, that was it.
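(A minimal sketch of this update in Python. The subgradient and projection oracles are hypothetical placeholders passed in as functions; the concrete instance below, an l1 problem over the non-negative orthant, is just an illustration and not the lecture's example.)

```python
import numpy as np

def projected_subgradient(get_subgrad, proj, x0, alphas, num_steps=1000):
    """Projected subgradient method: x := proj(x - alpha_k * g)."""
    x = np.array(x0, dtype=float)
    for k in range(num_steps):
        g = get_subgrad(x)              # any subgradient of f at x
        x = proj(x - alphas(k) * g)     # step, then project onto the constraint set
    return x

# Hypothetical instance: minimize ||A x - b||_1 over the non-negative orthant,
# where the projection is just clipping at zero (one of the "easy" cases).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
get_subgrad = lambda x: A.T @ np.sign(A @ x - b)   # a subgradient of ||Ax - b||_1
proj = lambda z: np.maximum(z, 0.0)                # projection onto x >= 0
x = projected_subgradient(get_subgrad, proj, np.zeros(10), alphas=lambda k: 1.0 / (k + 1))
```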

Homework 3, coming up. Coming up. Okay. Projection onto the unit simplex. Okay. So, obvious cases: projection onto the non-negative orthant, projection onto the cone of positive semidefinite matrices. But you'd be surprised. There are probably about 15 or 20 sets that are easy to project onto - or easy in some sense. Of course, you can also project onto other sets, like, for example, polyhedra or something like that, but that would involve using quadratic programming or something like that. Okay. This is the projected subgradient method, and a big use of it is applied to the dual problem. Now this is really a glimpse at a topic we're going to do later. So later in the class, we're going to look at this idea of distributed, decentralized optimization. So far, kinda everything we've been talking about is centralized. You collect all the data in one place and calculate gradients and all this kind of stuff. In fact, we're going to see beautiful decentralized methods, and they're going to be based on this. So this is a glimpse into the future. Maybe not even too far a future. Maybe like a couple of lectures or something. But let's see.

Okay. So we have a primal problem: minimize f0 subject to fi less than or equal to zero. And we form this dual problem, which is: maximize the dual function g of lambda. These are the Lagrange multipliers lambda. They have to be non-negative because these are inequality constraints like this. And the projected subgradient method is very easy, because we're maximizing this concave function subject to the constraint that you're in the non-negative orthant. Projection onto the non-negative orthant is completely trivial. You just take the plus of the vector, component by component. So the update's going to look like this. You will find a subgradient of minus g here. So we'll find a subgradient of minus g, and then we will step in the negative subgradient direction. Actually, I suspect this is correct, but this could be a plus here. I don't know. And the rule is, for 300-level classes, I don't even care. If it's a plus then you fix it or something like that. I actually think this is right. It's confusing because we have the subgradient of minus g. Okay. So you take a subgradient step in the appropriate direction - I'm allowed to say that in a 300-level class - and then you project here. So that's it. Now, by the way, I should mention again, kinda going toward - if I solve this dual, when is it that I can actually extract the solution of the primal problem? Again, this is 364a material, which we covered way too fast, but I don't know. Does anyone remember? Anyone remember this?

Student:[Inaudible].

Instructor (Stephen Boyd):Sure, we're going to need strong duality to hold. If it were strictly feasible, we'd have Slater's condition and strong duality would hold. That gives you zero duality gap, and I guess if you don't have that, then you can't solve this at all, because the optimal values aren't even the same. So let's assume that. There's more, actually, to it than just that. The sledgehammer condition is this. What you'll need is that when you find lambda*, the Lagrangian at lambda* should have a unique minimizer in x. If it does, then that x is actually x* up here. Okay? So that's the condition. A simple condition for that - you should go back and read this, because we're going to be doing this in a couple of weeks and these things are really going to matter. We're going to do network flow methods and all sorts of other stuff, and they're actually gonna matter. Here's one sledgehammer condition: f0 here is strictly convex. Because if f0 is strictly convex, then f0 plus sum lambda_i fi, where the fi are convex, is also strictly convex - for all lambda, including lambda equal to zero; it doesn't matter. It's strictly convex. If it's strictly convex, it has a unique minimizer. So, just to go back to this, you would actually calculate the optimal lambda*. You would then go back and get the minimizer of the Lagrangian over x; call that x*. And that will actually be the optimum of this problem.

Okay. So let's work out the subgradient of the negative dual function. It's actually quite cool. Let's let x* of lambda be this: it's argmin of the Lagrangian. So it's just exactly what we were just talking about. And here's sorta the big sledgehammer assumption: f0 is strictly convex. And by the way, in this case, you might say a lot of times some of these things are silly. They're sorta things that basically only a theorist would worry about. I mean, somebody should worry about them, but they have no implications in practice. And I'm very sorry to report that this is not one of those. There are many cases in practice, with real methods, where this issue comes up. It's real. It means that methods will or will not converge, and you have to take extra effort, things like that. Okay. All right, so we'll just make this sledgehammer assumption here, the crude assumption: f0 is strictly convex. That means this is strictly convex here. And therefore, it has a unique minimizer. And we're going to call that minimizer x* of lambda. It's a function of lambda. Okay? So that's x* of lambda.

And of course, if x*(lambda) is the minimizer, then g of lambda is f0 of x* of lambda plus the sum of lambda_i times fi of x* of lambda. It's the Lagrangian evaluated at x* of lambda. Okay. So a subgradient of minus g at lambda is then given by this: h_i is minus fi of x* of lambda. Now this is actually quite interesting - first of all, let me explain that. Let me see if I can get this right. g is the infimum over z of this Lagrangian here. That's what g of lambda is. So negative g of lambda is a supremum. How do you calculate the subgradient of a supremum? No problem. You pick a point that maximizes - one of the points that maximizes. In this case there's a unique one; that's what this assumption says here. So you pick the maximizer, that's this, and then you form this thing, and then you ask yourself, what is the gradient, subgradient, of this thing with respect to lambda? That's an affine function. So the subgradient is simply this thing here, up to this thing here. And so, again, modulo minus signs. My guess is that this one's correct, but I guess we'll hear if it's not, and we'll silently update it. But I think it's actually right. So a subgradient of minus g is this.

By the way, that's a very interesting thing. Let me say what that is. If fi is negative, that means that the ith inequality constraint is satisfied. If it's positive, it means it's violated. So that means that h_i, if it's positive, is something like a slack. So h_i is the slack in the ith inequality. If h_i is positive, it means the ith inequality is satisfied. If h_i is negative, it is violated, and the magnitude of h_i is the amount by which it's violated. Okay. So here's the algorithm. Here's the projected subgradient method, just translated using this subgradient. Notice how embarrassingly simple it is. So this is the projected subgradient method for the dual. And it basically says this. It says you start with some lambdas. You can start with all lambdas equal to one. For that matter, start with all lambdas equal to zero. It doesn't matter - just start with all lambdas zero. It says, at your current lambda, minimize the Lagrangian without any consideration of feasibility for the primal problem. Now, when you minimize this thing here, the lambdas are, of course, prices or costs associated with the constraints. So this is sorta a net here, because it's sorta your cost, and then these are charges and subsidies for violations. It's a charge if it's a violation, and it's a subsidy if fi is negative, which means you have slack - then, actually, you derive income from it. Okay. So that's the meaning of this. So it's basically a price update algorithm, and you start with any prices you like. They have to be non-negative. Start with the prices. You then calculate the optimal x. There's no reason to believe this optimal x is feasible. By the way, if the optimal x is feasible, you're done. You're globally optimal. So if the fi of x* at any step are all less than or equal to zero, you're done, optimal. And not only that, you have a primal-dual pair proving it.
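(Collecting the formulas just described in one place, in the notation used above, with f(x) denoting the vector of constraint functions:)

```latex
\begin{aligned}
x^\star(\lambda) &= \operatorname*{argmin}_x \Big( f_0(x) + \textstyle\sum_{i=1}^m \lambda_i f_i(x) \Big),
\qquad
g(\lambda) = f_0\big(x^\star(\lambda)\big) + \textstyle\sum_{i=1}^m \lambda_i f_i\big(x^\star(\lambda)\big),\\[2pt]
h &\in \partial(-g)(\lambda), \quad h_i = -f_i\big(x^\star(\lambda)\big),
\qquad
\lambda^{(k+1)} = \big(\lambda^{(k)} - \alpha_k h\big)_+
               = \big(\lambda^{(k)} + \alpha_k f\big(x^\star(\lambda^{(k)})\big)\big)_+ .
\end{aligned}
```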

Okay. Otherwise, what you do is this. And this is really cool. You go over here and you look at fi of x. If fi of x, let's say, is plus one, it means that your current x is violating constraint i. Okay? That says the price is not high enough. So it says increase the charge on resource i, if fi represents a resource usage. So it says pop the price up in that case, and alpha tells you how much to pop the price up. In that case, the plus is totally irrelevant, because that was non-negative and you're adding something. You bumped the price up, and there's no way the plus could ever come into play. Okay. So - I should say, if you make a little comment symbol over here in the code, you should write on the thing "price update," because that's exactly what this is, the price update. So what you do then is this. If fi is negative, that's interesting. That means you're underutilizing resource i. It's possible that in the final solution constraint i is not tight, in which case it doesn't matter. That's fine. That's the right thing to do, but you're underutilizing it. And what this says is, in that case, when that's negative, drop the price on that resource. This says drop the price. However, now the plus comes into play. It says drop the price, but if the new price goes under zero, which messes everything up, because now it encourages you to violate inequalities, not satisfy them, then you just make the price zero. And so, for example, if the ith inequality is, at the optimal point, actually not going to be tight, then what's going to happen is that price is gonna go to zero like that. You'll be underutilized here. That'll be negative. That'll be zero for the last step. This will become negative. The plus part will restore it to zero. So this algorithm - I mean, it's actually a beautiful algorithm. Variations on this go back into the '50s and '60s, and you find them in economics. So this is just a price update, or, I think, this would be part of a bigger family of things. I guess they call this a tâtonnement process, or something like that - I don't know. Who's taken economics and knows the name for these things? Does anyone remember? Come on, someone here took an economics class and saw some price adjustment. Okay, let's just move on. No problem.
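(A minimal sketch of that price-update loop in Python. `min_lagrangian`, `f0`, and `fis` are hypothetical oracles, not anything from the course code: `min_lagrangian(lam)` returns x*(lambda), the unique Lagrangian minimizer, which is the only real work per step; `fis(x)` returns the vector of constraint values. The loop also records the best dual lower bound g(lambda) along the way.)

```python
import numpy as np

def dual_subgradient(min_lagrangian, f0, fis, lam0, alphas, num_steps=500):
    """Projected subgradient method on the dual: a price-update loop."""
    lam = np.array(lam0, dtype=float)
    best_lower_bound = -np.inf
    for k in range(num_steps):
        x = min_lagrangian(lam)                         # minimize Lagrangian at current prices
        fx = fis(x)                                     # constraint values; fx <= 0 means feasible
        if np.all(fx <= 0):
            return x, lam, best_lower_bound             # primal feasible => primal-dual optimal
        g_lam = f0(x) + lam @ fx                        # dual function value: a global lower bound
        best_lower_bound = max(best_lower_bound, g_lam)
        lam = np.maximum(lam + alphas(k) * fx, 0.0)     # price update: raise price where violated
    return x, lam, best_lower_bound
```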

All right. So in that method, the primal iterates are not feasible. Actually, if you ever hit an iteration where the primal iterate is feasible, you are then primal-dual optimal: quit. So what it means is that in a method like this, projected subgradient applied to the base problem, after each step you're feasible, because you projected onto the feasible set. So that's a feasible method, and all that's happening is your function value is going down - I might add, not monotonically, right? These are not descent methods. But your function value is coming down to the optimal one, non-monotonically. In a dual subgradient method, what's happening is that the primal iterates are not feasible. What happens is you're approaching feasibility. That's what happens. In fact, you'll typically never hit feasibility; if you hit feasibility, you terminate. And in this case, the dual function values are all lower bounds. So the one nice part about this dual subgradient method is at each step you have a global lower bound on the original problem. You evaluate g at each step, so you have a lower bound. By the way, there are a couple of tricks. I think these are in the notes, so if you read the notes: this becomes especially cool when you have some special method for taking x(k) and from it constructing - I want to say projecting, but it doesn't have to be projection - constructing a feasible point x tilde. It could be by projection. So if you can construct a feasible point, then this algorithm will actually produce, at each step, two things: a lower bound, from the dual feasible point - you'll know g of lambda, that's the lower bound on your problem, and it will go up non-monotonically - and you'll also have a feasible point called x tilde of k. Tilde is the operation of constructing a feasible point from x(k). So you also get primal points whose function value is going down non-monotonically. Then you actually get a duality gap and all that kind of stuff. Okay. So I think we've talked about all this. That's the interpretation of the thing. It's really quite beautiful. The coolest part about it you haven't seen yet. The coolest part we're going to see later in the class - not much later - but it's that this is going to yield a decentralized algorithm.

For example, you can do network flow control, you can do all sorts of crazy stuff with this, and it's quite cool. But that's later. Okay. So we'll just do an example, just to see how this method works, or that it works, or whatever. Oh, I should mention something here. You might want to think about when this algorithm would be a good algorithm. When would it look attractive? And let me show you, actually, one case right now, immediately. This is a trivial calculation; the only actual work is here. And what this means is that, at each step, the work is in minimizing this Lagrangian. So basically, at each step there will be prices, and then you minimize the Lagrangian. That's going to be the work. Therefore, any time you have an efficient method for minimizing this function, you are up and running. Okay? So, for example, I can name a couple of things. Suppose you have a control problem and these are quadratic functions. Then this weighted sum is also going to be one of these convex control problems, and that means you can apply your LQR or your Riccati recursion, or whatever you want to call it, to that. If this is image processing, and somehow this involves something with 2D DFTs and all sorts of other stuff, the grad student before you has spent an entire dissertation writing a super-fancy, multigrid blah, blah, blah that solves this thing well. If it solves least squares problems there, and if this is a least squares problem, you're up and running. And you wrap another loop around that where you just update weights and then repeatedly solve this problem. So it's good to be on the lookout. Any time you know a problem where you have an efficient method for solving - actually, just a way of minimizing a weighted sum - this is what you want to do.

Okay. Let's look at an example. This is going to be quadratic minimization - it's not a big deal - and we'll make P strictly positive definite, over the unit box. So notice here we can do all sorts of things with this. We could do projected subgradient on the primal; that's easy here. Projected subgradient on the primal goes like this: you take x at each step and then you do x minus-equals alpha times P x minus q. Everybody follow that? That was x minus-equals alpha g. And then I take that quantity and I apply sat, saturation, because saturation is how you project onto the unit box. Everybody got that? That's the method. Okay? So that would be the primal subgradient method applied to this.

By the way, I should mention something here: these are not endorsements of these methods, and in fact, these methods only make sense to actually use in very special circumstances. If you just want to solve a box-constrained QP like this, and x is only 2,000 dimensions, or it could be all sorts of other things, you are way better off using the methods of 364a. That's just an interior point method. So actually, if someone said, "Oh, I'm using primal decomposition or dual decomposition to solve this," I would actually really need to understand why. There are some good reasons. One of them is not, I don't know, just because it's cool or something. I mean, here, for example, this would be so fast if you made a primal barrier method for it. It would be insane. So there are only special reasons why you'd want to do this. One would be that when you write down this dual subgradient method it turns out to be decentralized. That works as a compelling argument. But just to remind you, these methods are slow. They might be two lines, like that. I guess if you put a semicolon here, it's one line. They might be two lines. They might be simple. But they're not the recommended methods - I just want to make that clear.

Okay. So here's the Lagrangian. And, indeed, it is a positive definite quadratic function in x for each value of lambda; you don't even need the lambda part, it's positive definite already. And so here's x*: it's P plus 2 diag of lambda, inverse, times q. And the projected subgradient method for the dual looks like that. In fact, it makes perfect sense. It even goes really back to 263, and it goes back to regularization.
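(A minimal sketch of both updates for this box-constrained QP, assuming the objective is (1/2) x'Px - q'x, so the gradient is Px - q and the Lagrangian minimizer is (P + 2 diag(lambda))^{-1} q as above, and assuming the box constraint is written as x_i^2 <= 1, which is what produces the 2 diag(lambda) term. The random instance at the bottom is a placeholder, not the lecture's data.)

```python
import numpy as np

def primal_projected_subgradient(P, q, alpha, num_steps=200):
    """x := sat(x - alpha * (P x - q)); sat = projection onto the unit box."""
    x = np.zeros(len(q))
    for _ in range(num_steps):
        x = np.clip(x - alpha * (P @ x - q), -1.0, 1.0)
    return x

def dual_projected_subgradient(P, q, alpha, num_steps=200):
    """Price update: x*(lam) = (P + 2 diag(lam))^{-1} q, lam := (lam + alpha*(x*^2 - 1))_+."""
    lam = np.zeros(len(q))
    x = np.zeros(len(q))
    for _ in range(num_steps):
        x = np.linalg.solve(P + 2.0 * np.diag(lam), q)        # minimize the Lagrangian
        lam = np.maximum(lam + alpha * (x**2 - 1.0), 0.0)      # f_i(x) = x_i^2 - 1
    return np.clip(x, -1.0, 1.0), lam                          # clip x for a feasible point

# Hypothetical instance: a well-conditioned positive definite P and random q.
rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
P = B @ B.T / n + np.eye(n)
q = rng.standard_normal(n)
x_primal = primal_projected_subgradient(P, q, alpha=0.1)
x_dual, lam = dual_projected_subgradient(P, q, alpha=0.1)
```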

If you didn't know anything about convex optimization, but you knew about least squares - and that describes a lot of people, by the way, who do stuff. And by the way, people who do stuff actually get stuff done and working, so don't make fun of them. Don't ever make fun of those people. So how would a person handle this if they hadn't taken 364? Well, it'd be 263. You'd look at it and you'd say, "Well, I know how to minimize that." That's no problem. That's that, without the lambda there. That's P inverse q. Something like that. And then you'd look at it and you'd go, "Yeah, but, I mean, this is a problem." So here's how a human being would do it. They'd do this. They calculate P inverse q. That's x. If that x is inside the unit box, they'd have the sense to say, "I'm done." Otherwise, they'd say, "Oh, x7 is like way big. Ouch. That's no good. So I will regularize and I will add a number, some number, times x7 squared." Everybody cool on that? You're adding a penalty for making x7 big. Okay? And you'd look at this and see that x12 is also big, and you'd add something there. Remember, don't make fun of these people. I don't. You shouldn't. So then you'd solve it again. And now, except it would be smaller - now x7 is too small. Now x7 has turned out to be plus or minus 0.8. And you go, oh sorry, my weight was too big. So you back off on x7, and now other things are coming up and over. And you adjust these weights until you get tired and you announce, "That's good enough." Okay. I mean, listen, don't laugh. This is exactly how engineering is done. Least squares with weight twiddling. Period. That's how it's done. If you're in some field like machine learning or something, you think, oh, how unsophisticated; people in my field are much more sophisticated. This is false. All fields do this. This is the way it really works in those little cubicles down in Santa Clara. This is the way it's done. You're doing imaging. You don't like it. Too smoothed out.

You go back and you tweak a parameter and you do it again. So no shame. All right. So, actually, if you think about what this method is, this is weight twiddling. That's what this says. It's weight twiddling. It says pick some regularization weights, because that's what these are, and then it says update the regularization weights in a very organized way. You just update them this way. So this is, in fact, weight twiddling - an economist would call this a price update algorithm, and maybe an engineer would call it a weight twiddling algorithm. There are probably people who invented this and didn't know it. Anyway, everyone see what I'm saying here? Okay. Let me ask you a couple of questions about it, just for fun, because I've noticed that the 364a material has soaked in a bit - not fully. If P is banded, how fast can you do this? Let's say it's got a bandwidth around k. n is the size of x. How fast can you do that?

Student:[Inaudible].

Instructor (Stephen Boyd):With n squared k? That's your opening bid? That's better than n cubed, right. If P is full, that's n cubed. That's a Cholesky factorization and a forward and backward substitution, right? Let's make P banded. You said n squared k, that was your opening?

Student:[Inaudible].

Instructor (Stephen Boyd):Oh, even better. So n k squared. You're right. That's the answer. Okay. So just to make it - let me just make a point here. If this is full, you probably don't want to do this for more than a couple of thousand. Three thousand, 4,000, you start getting swapping and stuff like that on something like that. If you have a bunch of machines, all your friends' machines, and you run MPI and all that stuff, whatever, you can go up to 5,000, 10,000, something like that. But things are getting pretty hairy, and they're getting pretty serious at that point. If this thing is banded - if P is block-banded or something like that, it's got a bandwidth of ten - how big do you think I could go, for example, on my laptop, and solve that? Remember, the limit before would be 1,000, maybe 2,000. It's growing like the cube, so every time you double, it goes up by a factor of ten or eight or whatever. So what's a rough number? Well, put it this way: we wouldn't have to worry at all on my laptop about a million. I want to make a point here that knowing all this stuff about structure, and recognizing it in the context of problems, puts you in a very, very good position. By the way, where would banded structure come up in a least squares problem? Does it ever come up?

Student:[Inaudible].

Instructor (Stephen Boyd):Structures that are, yeah, actually banded - what does banded mean? Banded means that x(i) only interacts with x(j) for some bound on i minus j. So if you had a sort of truss or some mechanical thing that went like this, and bars never went too far from one to the other, that would be a perfect example. Let me give you some others. Control, dynamic systems. So control is one, because there, it's time. And, for example, if you have a linear dynamical system or something like that, the third state at time 12 interacts, roughly, with states one step before and one step after. So that's banded. How about this? How about all of signal processing? There's an example for you. All of signal processing works that way, more or less, because there the band structure comes from time. Signal processing means that each x depends only on a few others - you know, a bounded memory for how much it matters. Now, the whole problem is coupled, right? Okay. This is just for fun, but it's good to go over this stuff, so I just use it as an excuse to go over that.

Okay. So here's a problem instance. Here's a problem instance where I guess we have 50 dimensions, and we took a step size of point one. Oh, I can ask a question here about this. In this case, it turns out g is actually differentiable. So if g is differentiable, that actually justifies, theoretically, using a fixed step size. Actually, in practice as well, because if you have a differentiable function and you apply a fixed step size, and the step size is small enough, then you will converge to the true solution. So this shows g of lambda. These are lower bounds on the optimal value, like that. They converge. And this is the upper bound, found by finding a nearby feasible point. And then let me just ask you - I don't even know, because I didn't look at the code this morning, how I did this, but why don't you guess it?

At each step of this algorithm, here, when you calculate this thing - by the way, if this thing is inside the unit box, you quit and you're done. You're globally optimal because you're both primal and dual feasible. End of story. Zero duality gap. Everything's done. So at each step, at least one component of this pops outside the unit box. Please give me some guesses - give me just a heuristic for taking an x and producing from it something that's feasible for this problem.

Student:[Inaudible].

Instructor (Stephen Boyd):Simple? What do you do?

Student:If the [inaudible] is greater than one, you make it one. If it's less than negative one, you make it negative one.

Instructor (Stephen Boyd):There you go. You just project. So in this case it's easy to calculate the projection. You just calculate the projection, and so, in fact, this - whoops - this thing here, I'm sure that's what this is: x tilde is simply the projection of x(k) onto the unit box. So that's what that is. Okay, so that's that. We're going to come back and see a lot more about projected subgradient methods applied to the dual later in the class.

Okay. So let's look at a more general case: the subgradient method for constrained optimization. So here, instead of describing the constraint as just a constraint set, we'll write it out explicitly as some convex inequalities. So this goes like this. Here's the update. I mean, it's really dumb. I'll tell you what it is. You simply do a subgradient step. And here's what you do. If the point is feasible, you do an objective subgradient step. If it's not feasible, then you find any violated constraint and you use a subgradient of that. Okay? Does this make sense? So it's really quite strange. In fact, what's kinda wild about it is that it actually works. So, I mean, you realize how myopic this is. It's very, very silly. Basically, the algorithm goes like this: you're given x and you start walking through the list of constraints. So you evaluate f1, f2, f3. If those are less than or equal to zero, you go to the next one. That's just a valid method. The first time you hit a violated constraint - that is, some fj is positive - you simply call fj dot get subgrad, or something like that, to generate a g, and you take a step in that direction. Does that reduce fj? No. The subgradient method is not a descent method. There's no reason - so basically you go down, you find the 14th inequality is violated, you take a subgradient step, and that could, and often does, make the violation of the 14th inequality worse. I mean, the whole thing is like - these algorithms are just highly implausible.

I mean, these are the kinda things where you need the proof, because the algorithms themselves are so ludicrous. Okay. Now here we have to change a few things. f(k) best is the best objective value we have over all the points that are feasible, and this can actually be plus infinity if you haven't found any feasible points yet. So f(k) best is initialized as plus infinity. Okay, so the convergence is basically the same. I won't go into the details. It just works. I mean, that's the power of these very slow, very crude methods. In fact, that's going to come up in our next topic. What you are going to say about subgradient methods is that they're very unsophisticated and very slow, but one of the things you get in return is that they are very rugged. In fact, in the next lecture, which we'll get to very soon, you'll see exactly how rugged they are. There it kinda makes sense. Anyway, so there it is. That's a typical result. And I think the proof of this is in the notes. You can look at it.

But let's just do an inequality form LP, so let's minimize c transpose x subject to Ax less than or equal to b. It's a problem with 20 variables and 200 inequalities. Let's see, the optimal value for that instance turns out to be minus 3.4. We have a one over k step size here. Oh, by the way, when we do the step on a violated constraint, you can use a Polyak step size, because if you're doing a step on fj, which is a violated inequality, what you're interested in is fj equals zero. You are specifically interested in that. So your step size can be the Polyak one. And this would be an example of the convergence, f minus f*. I guess if f* is minus 3.4, then - well, this is not bad, right? I mean, this is - I don't know. Let's find out where 10 percent is. There's 10 percent.
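(A minimal sketch of this alternating method for the inequality-form LP, in Python. The Polyak step on a violated constraint a_j'x <= b_j uses alpha = (a_j'x - b_j)/||a_j||^2, which drives that affine constraint exactly to zero; the objective step uses 1/k. The random instance is a hypothetical stand-in for the lecture's data.)

```python
import numpy as np

def lp_subgradient(c, A, b, x0, num_steps=3000):
    """Subgradient method for: minimize c^T x  subject to  A x <= b."""
    x = np.array(x0, dtype=float)
    f_best, x_best = np.inf, None
    for k in range(1, num_steps + 1):
        r = A @ x - b                                  # all constraint residuals (cost: m*n)
        if np.all(r <= 0):                             # feasible: objective step, 1/k size
            if c @ x < f_best:
                f_best, x_best = c @ x, x.copy()
            x = x - (1.0 / k) * c
        else:                                          # infeasible: Polyak step on the most
            j = int(np.argmax(r))                      # violated constraint (any one would do)
            a_j = A[j]
            x = x - (r[j] / (a_j @ a_j)) * a_j
    return x_best, f_best

# Hypothetical random instance with the lecture's dimensions (n = 20, m = 200).
rng = np.random.default_rng(0)
n, m = 20, 200
A, b, c = rng.standard_normal((m, n)), rng.uniform(0.1, 1.0, m), rng.standard_normal(n)
x_best, f_best = lp_subgradient(c, A, b, np.zeros(n))
```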

So it took you about 500 steps to get to 10 percent or something. So each step here costs what? What's the cost of a step here, assuming dense, no structure? What's the cost of a step in solving this? Let's write that down. So we're gonna solve this problem. We're gonna minimize c transpose x subject to Ax less than or equal to b. What's the cost of a subgradient method step here? You're exempted because you can't see it. Is that true, you can't see it? No, you can't see it. You can see the constraints. That's actually the really important part. What's the cost?

Student:[Inaudible].

Instructor (Stephen Boyd):How did you do it? What's the method? If you were in matlab, how long would it be? For that matter, it's just as easy to write it in LAPACK, but let's write it in matlab. What would it be? How do you implement this method here? By the way, of course all the source code for this is online, but there's a method, right? So what's the method? Somebody tell me the method. Homework three - you'll be doing this shortly enough. Well, here's the lazy way. You just evaluate Ax and you compare it to b. If Ax is less than or equal to b, what's your update on x? It's x minus-equals alpha c, right? Otherwise, if Ax is not less than or equal to b, you find - for example, you might as well find the maximum violated one, or - I mean, it doesn't matter. That's the point. You can take any violated one. But if you evaluate all of them - and that's just from laziness - then what's the cost, actually, of evaluating this? There's your cost right there. It's just multiplying Ax. What's that?

Student:[Inaudible].

Instructor (Stephen Boyd):Are you saying mn? Thank you, good. Okay, mn. This is - it's irritating, I know - the thing is, you should know these things. These should not be abstract facts from three days of 364a. You should just know these things. You should know what the numbers are on modern processors and things like that. Just for fun - after a while we quit, and then you go back, and it's just Ax and stuff like that. So the cost is mn per step. So what does that say - whereas, how about - what's the cost of an interior point method on this guy? What's an interior point step? In fact, what's the interior point method complexity, period, end of story, on this guy? Just minimize c transpose x subject to Ax less than or equal to b. At each step you have to solve something that looks like A D A transpose, something or other, right? And that's going to be n cubed - I mean, unless that's [inaudible], that's n cubed. But forming A D A transpose, that's the joke on you. That's m n squared. m is bigger than n, so that's the dominant term. So it's m n squared. How many interior point steps does it take to solve this problem?

Student:[Inaudible].

Instructor (Stephen Boyd):Thank you. Twenty. So it's 20. So what's the overall complexity of solving this problem? m n squared. And remember, that follows my mnemonic: it's the big dimension times the little dimension squared.

This assumes you know what you're doing. If you do it the wrong way, it's the big dimension squared times the small dimension. Always ask yourself that. So it's m n squared versus mn. So basically, a subgradient step costs a factor of n less. n is 20. Is that right? Okay, so that says here you really should divide these by 20. And you said 20 interior point steps, so this is about 25. So with an interior point method you would actually have solved the problem here, whereas with this subgradient-type method you've solved it to roughly 10 percent accuracy, maybe a little bit better. But with an interior point method, in roughly this amount of effort, you'd have an accuracy of ten to the minus ten. Okay. Everything cool? All right. Is that going to work? I don't know what I've done.

Okay. So, our next topic is the stochastic subgradient method. And we're gonna get to some of the first things we can actually do with subgradient-type methods that don't really have an analog in interior point methods. So far they're just cool because they're three lines, and the proofs of convergence are four lines, and so on. We're gonna see some very cool things about subgradient methods later, but now we're gonna see something that's actually different and isn't 364a compatible. So we're gonna do the stochastic subgradient method. And let me just give you the rough background on it.

The rough background is that these methods, these subgradient methods, are slow, but they're completely robust. They just don't break down. They're three lines of code - one, really, or two, something like that. One or two lines of code. They're very slow, and boy, are they robust. They just cannot be messed up. And we're gonna see a very specific example of that. In fact, what's gonna turn out is you can add noise, not small noise, to the subgradient calculator. Everyone would guess - and look, if you're doing stuff in double precision floating point, you're always adding noise every time you calculate anything. In an interior point method you say, get me the gradient, and that comes back with noise. I guess in EE we would say with something like a 200 decibel signal-to-noise ratio, because that's what IEEE floating point gives you. So it basically comes back already with noise, but it's on the order of 1e-8 or 1e-10 times the size of the thing you're calculating. That's standard. It's gonna turn out very differently here. So no one would be surprised if a barrier method continued to work if there was noise in the sixth figure of your gradient calculation. That would hardly be surprising. Fifth figure, again. Fourth figure, you can start imagining having some trouble now. The subgradient methods, they work this way. Here's how stupid and robust they are: you can actually have a signal-to-noise ratio that's quite negative. So notice you can have basically a subgradient where the signal-to-noise ratio is one. In other words, that means basically when the person says the subgradient is that direction, the true subgradient can actually be back there. It's just, sorta, if you ask them 50 times or 100 times or something, it should kinda average out to vaguely the right direction. Everybody got this?

So it went to school, actually. It has lots of applications. So okay, so let me define a noisy unbiased subgradient. Here's what it is. I have a fixed - this is a deterministic point x, and then a noisy unbiased subgradient for f at x is this: it is a random vector g tilde that satisfies this - on average, its expected value is a subgradient. Now, by the way, this means, of course, that for any particular g tilde, this inequality, obviously, need not hold, right? However, on average - so basically think of your f dot get subgrad as being random; it's not deterministic. When you call it, it gives you different g's. If you call it a zillion times and average, that would give you something close to the mean, and that's a subgradient. Everybody got it? So we'll see lots of practical examples where you get things like that. Okay. Another way to say it is this: what comes back is a true subgradient plus a noise, which is zero mean. That's a stochastic subgradient. Now this error here, it can represent all sorts of things. First of all, it can just be computation error - basically, when you calculate the subgradient you're sloppy, or you do it in fixed point, or I don't know, anything like that. But it can also be measurement noise.
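(In symbols, for a fixed, deterministic point x, in the notation used above:)

```latex
\tilde g = g + v, \quad \mathbf{E}\, v = 0, \qquad
\mathbf{E}\,\tilde g \in \partial f(x)
\;\;\Longleftrightarrow\;\;
f(z) \ge f(x) + (\mathbf{E}\,\tilde g)^T (z - x) \;\; \text{for all } z .
```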

We're going to see it's going to be Monte Carlo sampling error. So if, in fact, the function itself is an expected value of something, and you estimate that expected value by Monte Carlo, then it's an unbiased estimate, and v is a random variable: it's the difference between what you actually get and the true value - your Monte Carlo sampling error. Okay. Now if x is also random, then you say g tilde is a noisy unbiased subgradient if the following is true: for all z, this holds almost surely, where this is the conditional expectation of g tilde given x. That's the noisy subgradient condition when x is random. Now that's a random variable, so this right-hand side is a random variable. That's not a random variable. And it's also a random variable because x is a random variable. So the whole thing on the right is a random variable. And if this inequality holds almost surely, then you call it a noisy unbiased subgradient. So that's what it is.

Okay. And that's the same as saying the following: the conditional expectation of g tilde given x is a subgradient of f at x, almost surely. So that's what it means. For the non-random case, if x is not random, I don't need the conditioning on x and I can erase that. So, let's see - I don't know what this means. Anyway. This is a random vector. That's a random vector, and that's actually a random set. And so it basically says that that inequality holds almost surely. So okay.

Now here's the stochastic subgradient method. Ready? Here. In other words, it's the subgradient method. It says, you've got a noisy subgradient? I'll just use it. Nothing else. You basically update like that, and that's it. Now I want to point something out. This is now a stochastic process, because even if x(0), your initial x, was deterministic, g(0) is already a random variable, and therefore x(1), the first update, is a random variable, because it depends on g(0). And so this is now a stochastic process, this thing. Okay. So we now have the stochastic process, which is the trajectory of the stochastic subgradient method. And here you just use any noisy unbiased subgradient. And then we'll take the step sizes the same as always, and then f(k) best is going to be the min of these things. By the way, that's a random variable there, because that's now a stochastic process. So that's a stochastic process and f(k) best is a random variable. Okay. So here are some assumptions. The first is we'll assume that the problem is bounded below. These are much stronger assumptions than you need, but that's good enough. We'll make this global Lipschitz condition here. With more sophisticated methods you can relax these, but that's okay. And we'll take the expected value of x(1) minus x* - x(1), by the way, could be just a fixed point, in which case you don't even need this expected value.
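(A minimal sketch of the method in Python. `get_noisy_subgrad` is a hypothetical stand-in for whatever returns g plus zero-mean noise; f itself is assumed to be evaluable exactly so f(k) best can be tracked.)

```python
import numpy as np

def stochastic_subgradient(f, get_noisy_subgrad, x0, num_steps=5000):
    """Stochastic subgradient method with square-summable 1/k step sizes."""
    x = np.array(x0, dtype=float)
    f_best, x_best = np.inf, x.copy()
    for k in range(1, num_steps + 1):
        g = get_noisy_subgrad(x)          # unbiased but noisy subgradient
        x = x - (1.0 / k) * g             # exactly the ordinary subgradient update
        if f(x) < f_best:
            f_best, x_best = f(x), x.copy()
    return x_best, f_best
```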

It's the same as before. Now the step sizes are going to be square-summable but not summable. So, for example, one over k would do the trick. So you're going to take a sequence of steps that's little l2 but not little l1. One over k is fine. Okay. Here are the convergence results. Okay, I'll summarize this one: it works. This says that it converges in probability. And, in fact, you have almost sure convergence. We're not going to prove that one, although it's not that hard to do. We will show this one - actually, this will follow immediately from that, since f(k) is bigger than f*. So that'll follow immediately. So before we go on and look at all this, I just want to point out how ridiculous this is. First of all, the subgradient method by itself I think is ridiculous enough. It basically says, you want to do a minimax problem? No problem. At each step, go around and find out which of the functions is the maximum. If there's more than one, arbitrarily break ties. Return, let's say, the gradient of that one, and take a step in that direction. That's totally stupid, because if you're doing minimax, the whole point is that when you finish, a lot of these things are going to be tied, and the whole point is you don't want to just step greedily to improve one of these functions when you have a bunch of them. It just says do it, and the one over k step sizes are going to take care of everything. What's wild about this is that that method is so robust that your get-subgradient algorithm can be really bad, and as long as, on average, it's returning valid subgradients, it's gonna work. The signal-to-noise ratio can be minus 20 decibels. Whenever you get a subgradient, you could be adding to it a noise ten times bigger than the actual subgradient. Everybody see this? The whole thing is completely ridiculous.

Now, will the convergence be fast? No. It can't be. I mean, it can hardly be fast if someone's only giving you a subgradient, which is kind of a crappy direction anyway for where to go. But now if they give you a subgradient with a negative 20 decibel signal-to-noise ratio - in other words, basically it says that you can't even trust the subgradient within a factor of ten. You'd actually have to ask for a subgradient direction - you'd call it ten or 100 times and average the answers, and that's the only time you could start getting some moderately sensible direction to go in. Everybody see what I'm saying here? The whole thing's quite ridiculous. And the summary is: it just works. These are kinda cool. What? It's also - this is known and used in a lot of different things, signal processing and all sorts of other areas. Actually, there's a big resurgence of interest in this right now in what people call online algorithms, being done by people in CS and machine learning and stuff like that.

So let's look at the convergence proof. It's tricky. You won't get the subtleties here, but you can look at the notes, too. It's not simple. I got very deeply confused, and you have to go over it very carefully. That it's subtle you won't get from this, but let's look at it. It goes like this. You're going to look at the conditional expectation of the next distance to an optimal point, given x(k). Now this thing is nothing but that, so we just plug that in. And we do the same thing we did with the subgradient method. We split it out, and take this minus this - that's one term - and we get this term. Now, this is conditioned on x(k). So x(k) conditioned on x(k) is x(k). So this loses the conditional expectation conditioned on x(k). That's a random variable, of course, but you lose the conditional expectation. It's the same.

Then you get this. You get two alpha times the conditional expectation of - now it's the cross term between this minus this and that term - conditioned on x(k). And the last term is alpha squared times the conditional expectation of the norm of the subgradient squared, given x(k). And we're just going to leave that term alone. Now, this term in here we're going to break up into two things. We're going to write it this way. Here x* is a constant, so conditioned on x(k), that's just x*. And then this term, g tilde k transpose x*, is linear in this, and conditional expectation commutes with linear operators, so that comes around and you get this thing. Now, the definition of being a noisy stochastic subgradient - or a stochastic subgradient, if you like - is that this thing here should be, I guess, bigger than or equal to, whatever the correct inequality is, to make this thing true. So that's how that works. And so you end up with this. Now, if you go back and look at the proof of the subgradient method, it looks the same, except there aren't the conditional expectations around. And there are a few extra lines in here because of the conditional expectation. So let's look at this. And this inequality here holds almost surely. Everything here is a random variable. That's a random variable. This entire thing over here is a random variable. So this inequality holds almost surely: this thing is less than that. And now what you can do is the following: we can actually telescope stuff, the same as before. If we take the expectation of this now, then the expectation of this is just the same as the expected value of that - that's a number - and that's less than the expected value of that, minus the expected value of that.

The expected value of that just drops the conditional part. And so here's what you get. If you take expectations of the left- and right-hand sides of the inequality above, which was an inequality that holds almost surely, you get this: the expected distance to the optimal point at the next step is less than the current expected distance, minus two alpha k times the expected value of your suboptimality, plus alpha squared times the expected value of the norm of the subgradient squared. This last thing we'll replace with just the number G squared, and we'll apply this recursively, and you end up with this: the expected value of the norm of x(k+1) minus x* squared is less than the expected value of the norm of x(1) minus x* squared, which is going to be less than R squared - that's our assumption. And then again we get the good guy and the bad guy. That's bad. This is good, because this thing is always bigger than that by definition. So whatever this is here, it's a number, and it's non-negative - a positive number, with a negative sign in front, so this is on our side. It's actually making this distance smaller. And the nice part is that this goes as alpha and that goes as alpha squared, and so the bad guy, at least for small enough alpha, is gonna lose. And then you just simply turn this around and you get the following: the minimum over i equals 1 to k of the expected value of your suboptimality is less than or equal to R squared plus G squared times the sum of the alpha squared, over twice the sum of the alphas. It's the same thing as before. So it's exactly what we had before. Okay. Except now it's the minimum of the expected value of the suboptimality. That's actually a random variable here. So that's the difference.
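(The two inequalities just described, written out, with R and G the bounds assumed earlier:)

```latex
\begin{aligned}
\mathbf{E}\,\|x^{(k+1)}-x^\star\|_2^2
  &\le \mathbf{E}\,\|x^{(k)}-x^\star\|_2^2
     - 2\alpha_k\,\mathbf{E}\big(f(x^{(k)})-f^\star\big) + \alpha_k^2 G^2,\\[2pt]
\min_{i=1,\dots,k}\ \mathbf{E}\big(f(x^{(i)})-f^\star\big)
  &\le \frac{R^2 + G^2\sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i}.
\end{aligned}
```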

Okay. Now this tells us immediately that the min of these expected values converges to f*. That's what it tells us. Actually, I believe the fact is you don't even need the min here; this converges. That's a stronger result, but we don't need it. We'll get to that. Now we're going to apply Jensen's inequality. So Jensen's inequality says the following - I'm gonna commute expectation and min. Min is a concave function, so here I have the expectation of the min and here I have the min of the expectations, and I guess the inequality goes this way. But this thing here is a random variable, and it's the random variable f(k) best. And so the expected value of f(k) best is less than or equal to this. This thing goes to zero. So we're done. By the way, I never remember which way Jensen's inequality goes. So I'm not ashamed to admit it. I usually have to go to a quiet place and draw a little picture with some lines, because every time you do it, it's something different - it's either concave or whatever. It's important which way it goes, so I just go somewhere quiet and I draw a picture and then come back. I'm trusting myself that it's correct. I think it is.

But once you know this, you're done. You can get convergence in probability very easily, because these random variables are non-negative. So the probability that a non-negative random variable is bigger than epsilon is less than its expectation over epsilon, and we already know the numerator goes to zero, so for any epsilon, this goes to zero. And so you get convergence in probability. So, by the way, this is not simple stuff. I mean, it's not complicated - it did fit on two slides with a giant font size - but trust me, it's not totally straightforward. And I think the notes have this in more detail.

Okay. Let's do an example. So here's an example: piecewise-linear minimization. This is what we're going to do. We're going to use the stochastic subgradient method. How do you get a subgradient of this thing? What do you do? Yeah, you evaluate all of these - you have to evaluate all of them, because otherwise you don't know what the maximum is. You evaluate all of them, find the maximum value, then go back and find one of them that had the maximum value. Break ties arbitrarily - it could be the last one that had the maximum value. And then you return a_i. So here's what we're going to do. We will artificially add zero mean random noise to it. We'll just add noise. So basically, we'll add noise in our f dot get subgrad method. That's what we're gonna do. And here's just a problem instance with 20 variables and m equals 100, so you have 100 terms. You've seen this before - f* is, it's the same example. One over k step size. And the noises have about 25 percent of the size of the subgradients, because the average size of the subgradient is about four, and the average size of the noise is, let's see, each of these is about 0.7 or something like that. So that's about 25 percent, something like that.
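(A minimal sketch of that experiment in Python. The random data and the noise scale are placeholders matching the rough dimensions described, not the lecture's actual instance.)

```python
import numpy as np

def noisy_pwl_subgradient(A, b, noise_std, num_steps=5000, seed=0):
    """Minimize f(x) = max_i (a_i^T x + b_i) using a noisy subgradient oracle:
    the true subgradient (the a_i achieving the max) plus zero-mean noise."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.zeros(n)
    f_best = np.inf
    for k in range(1, num_steps + 1):
        vals = A @ x + b
        f_best = min(f_best, vals.max())                      # f is evaluated exactly
        g = A[int(np.argmax(vals))]                           # true subgradient, ties broken arbitrarily
        g_tilde = g + noise_std * rng.standard_normal(n)      # add zero-mean noise
        x = x - (1.0 / k) * g_tilde
    return x, f_best

# Hypothetical instance with 20 variables and 100 terms, as in the lecture's setup.
rng = np.random.default_rng(1)
m, n = 100, 20
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
x, f_best = noisy_pwl_subgradient(A, b, noise_std=0.25)
```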

Okay. And here's what happens. Here's the noise-free case. So you get this. So, indeed - I should say, what this means is you're getting the subgradient at about how many bits of accuracy? When you call get subgrad, how many bits of accuracy are you getting here? I mean, if your noise is on the order of a quarter of the signal, I just want a rough number. How many bits of accuracy are we talking here? Is it 20? You have a signal-to-noise ratio of one to four - no, four to one. Roughly how many bits? It's not a hard question, it's not a complicated question, so people are probably way over-computing or something like that.

Student:[Inaudible].

Instructor (Stephen Boyd):It's two, roughly. What? You said two and a half. You believe two?

Student:[Inaudible].

Instructor (Stephen Boyd):You think it's 12? It's two, right? It basically means if I tell you the subgradient is this, you can be off by as much as 25 percent. I mean, this is all hand waving; it roughly means it's about two bits of accuracy. It's not a whole lot of accuracy, right? So that's the point. These are really quite crude. You can see what happens is actually interesting. There's one realization, and here's another. Actually, in this one we're really lucky: the errors in the subgradient were such that they didn't mess us up very much. And that's another one where they did, although it can't be stopped. These are big numbers here. The interesting thing is, to get to 10 percent accuracy you probably multiplied the number of steps by four or something like that. The really cool part is, what would happen if the signal-to-noise ratio were inverted, so the noise were four times as big as the signal - suppose when you got a subgradient, the signal-to-noise ratio were reversed and it was 0.25, not four, roughly? What do you think would happen here? Well, first of all, we know the theory says it's going to work. But think about how ridiculous that is, basically. You calculate the worst g - that says, go in that direction - and to that vector you add a Gaussian which is four times bigger. Which means, basically, it would be very difficult to distinguish between your get subgrad method and something completely random, like, "Which way should I go?" And you're like, "Oh, that way." You know? And it's like, "Really? Can you verify that?" And you go, "That way." It's just totally random. You'd have to do that like 1,000 times and average the answers to even see faintly that there's some signal there. Everybody see what I'm saying? What would happen, of course, is that the noise would now mess it up much more.

These would look like that. That's what would happen. So this shows what happens over 100 realizations: you generate 100 stochastic processes, which is the stochastic subgradient method running forward in time. And this shows you the average here, and this shows you the standard deviation. That's a log scale, so that's why these things look weirdly asymmetric. On a linear scale this is plus/minus one standard deviation. This is also plus/minus one standard deviation, but it's on a log scale. But that's what it is. By the way, it's actually kind of interesting: these points down here correspond to cases where the noise was kinda bad - sorry, where the noise accidentally pointed you in the right direction, and as a result, you did actually quite well. And of course, these are cases where the noise was kinda hurting you as much as possible. Makes sense? So I guess the summary of this is that the subgradient method - you can make fun of it, it's very slow, and all that kinda stuff, but the wild part is that zero mean noise added to it doesn't break it. And we're not talking noise in the fifth digit. We're talking, if you like, noise in the minus fifth digit, if you want. Which is quite ridiculous if you think about it. Don't try a Newton method when you're calculating your gradients or your Hessians with 25 percent noise.

For that matter, don't try it if your signal to noise ratio is one to four, so it's off the other way around. Okay. These are empirical distributions of your suboptimality at 250, 1,000, and 5,000 steps. They look like this, and you can actually see these would be the ones at the top of that plot, those error bars, and then these would be the ones at the bottom. But you can sort of see that the distribution is very slowly going down like that. So that's the picture. Let me ask one question about this problem. How would you deal with this in a 364a context? Suppose I told you that you need to minimize a piecewise-linear function, but unfortunately the only method, whose source code I won't let you look at, the only thing that calculates this, only does it to two bits of accuracy. Or, another way to say it, every time you call it you're going to get a subgradient plus a noise which is as big as a quarter the size of the actual subgradient. How would you deal with that in 364a? I mean, we didn't really have a method to deal with this, but just tell me: what would you do?

Student:Call that function 10,000 times.

Instructor (Stephen Boyd):Right. Right. So 10,000. Good. Ten thousand was a good choice of number, by the way. So you call the function 10,000 times, and then you'd average those subgradients. And roughly how much error is in the one averaged over 10,000 calls? I was hoping for you to say it was going to go down by the square root of 10,000. That's why I was complimenting your choice of 10,000, because it has a nice square root, 100. So instead of the error being 25 percent, it would be 0.25 percent. So that might be enough to actually run a gradient method; it probably would work okay. What would happen is that at the end game it would start being erratic or something, but you'd get a pretty good answer pretty quickly. By the way, if you evaluate it 10,000 times per step, I should point something out: this method is beating you. So it's not clear, anyway, you're right, that's what you'd do. Okay. So this is actually maybe a good time to talk about stochastic programming. Actually, at some point I want to make a whole lecture on this, because it's quite cool; everybody should know about it. And it's this: in stochastic programming, you explicitly take into account some uncertainty in the objective and the constraints. There's also something called robust programming, where you have uncertainty but you pose the problem in a different way and you look for worst-case type things. But stochastic programming is a very common, very old method, something like that. I should mention, it's kind of obvious that this comes up in practice all the time. Anytime anybody is solving an optimization problem, you can point to any of the data. Oh, by the way, I should mention this: if you were not at the problem session yesterday, you should find someone who was and ask them what I said. I don't remember what I said, but some of it is probably useful.
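A sketch of the averaging idea the student is describing, reusing the noisy_subgrad oracle and problem data from the earlier sketch (again, all illustrative): averaging N calls shrinks the noise by roughly a factor of the square root of N, so 10,000 calls turn a 25 percent error into roughly a 0.25 percent error.

# Averaging the noisy oracle at a fixed point: the error in the averaged
# subgradient drops roughly like 1/sqrt(N).
x0 = rng.standard_normal(n)
g_true = A[np.argmax(A @ x0 + b)]   # exact subgradient at x0
for N in (1, 100, 10_000):
    g_avg = np.mean([noisy_subgrad(x0) for _ in range(N)], axis=0)
    rel_err = np.linalg.norm(g_avg - g_true) / np.linalg.norm(g_true)
    print(f"N = {N:6d}: relative error about {rel_err:.4f}")

Of course, as noted above, spending 10,000 oracle calls per gradient step is exactly the cost the plain stochastic subgradient method avoids.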

So you take any problem, like a linear program, and what you do is ask the person solving the linear program about a coefficient. You point to a coefficient, not a zero, because zeros are often really zero. Ones are also not good choices, because a one is often really one. But you point to any other coefficient that's not zero or one and you ask them what that coefficient is. And that coefficient has a provenance; it traces back to various things. If it's a robot, it traces back to a length and a motor constant and a moment of inertia, or whatever. If it's a finance problem, it traces back to a mean return, a correlation between two asset returns, or who knows what. If it's signal processing, it goes back to a noise or a signal statistics parameter, for example. Everybody see what I'm saying? And then you look at them and you say, "Do you really know that thing to all those significant figures?" And if they're honest they'll say, "Of course not." And the truth is they really only know it to, it depends on the application, but it could be three significant figures. In a really lucky case it could be four. It could be two, could be one. And, actually, if you get some economist after a couple glasses of wine, they'll look up and say, "If we get the sign right, we're happy." But until then, they won't admit that.

So okay. The point of all this is that it's kind of obvious that if you're solving a problem and you point to a data value, it has a provenance and it traces back to things that you probably don't know better than, like, a percent. It depends on the application, but let's just say a percent. By the way, if you don't know any of the data, if you barely know the sign of the data, my comment with respect to optimization is real simple: why bother? If it's really true that you don't know anything about the model, then you might as well do your investments or whatever by intuition and guess, because if you don't know anything, using smart methods is not going to really help. So typically you'll know one significant figure, maybe two, maybe three or something like that. And then, by the way, all the stuff from this whole year starts paying off a lot. And there are weird, sick cases where you know things to high accuracy. GPS is one, for example, where you point to some number and you go, "You really know that to 14 decimal places?" And they're like, "Yes." I just find it weird. But anyway, for normal stuff: zero significant figures is the why-bother case; usually it's one, two, or three; and I guess in some signal processing things you can talk about 15 or 20 bits or something like that, but rarely more than that. Okay. So there are a lot of ways of dealing with uncertainty. The main one is to do a posterior analysis; that's very common. Do people know what a posterior analysis is? A posterior analysis goes like this: you're making, it doesn't really matter, let's make it concrete, you're making a control system for a robot, I don't care, something like that.

So you sit around and you work out a control system, and when you work out the control system you can trace your data back, and there's a stiffness in there, and there's the length of a link, and there's an angle, and there's a motor constant; there's all sorts of junk in there. And they have nominal values. So you get some controller, and now, before you implement it on the robot, that'd be the simplest way, the first thing you do is something called a posterior analysis. A posterior analysis goes like this: you take the controller, or the optimization variable, whatever it is, designed on the basis of one particular value, like a nominal value, of all those parameters. You take that and you resimulate it with multiple instances of those values, generated according to plausible distributions. Everybody see what I'm saying? And by the way, if you don't do this, then it's called stupid, actually. This is just absolutely standard engineering practice. Unfortunately, it's done in, I'd say, about 15 percent of cases. Don't ask me why. So in other words, you design a robot controller, you optimize a portfolio, you do machine learning, anything you do. Actually, in statistics this is absolutely ingrained in people from when they're small children: you do this. It's the validation set or something like that. So here's what you do. You design that controller on the basis of a length and a motor constant, and the motor constant depends on temperature and all sorts of other stuff. You ask somebody who knows about motors, "How well do you know that motor constant?" And they'd say, "I don't know, plus/minus 5 percent, something like that."
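In code, this kind of posterior analysis is just a Monte Carlo loop over perturbed parameter values. The sketch below is hypothetical: evaluate_design, the parameter names, the nominal values, and the plus/minus 5 percent perturbation model are all placeholders for whatever simulation the application actually calls for.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nominal parameter values the design was computed for.
nominal = {"motor_constant": 0.12, "inertia": 3.4e-3, "link_length": 0.25}
design = {"gain": 10.0}   # the design found using the nominal values

def evaluate_design(design, params):
    # Stand-in for the real simulation (robot controller, portfolio
    # back-test, ...); here just a made-up cost in the parameters.
    return design["gain"] * params["inertia"] / (
        params["motor_constant"] * params["link_length"])

costs = []
for _ in range(1000):
    # Perturb each parameter by up to +/- 5 percent, then re-simulate.
    perturbed = {k: v * (1 + 0.05 * rng.uniform(-1, 1)) for k, v in nominal.items()}
    costs.append(evaluate_design(design, perturbed))

costs = np.array(costs)
print(f"nominal cost {evaluate_design(design, nominal):.3f}, "
      f"mean {costs.mean():.3f}, std {costs.std():.3f}, worst {costs.max():.3f}")

A tight spread of simulated costs suggests the design is insensitive to the parameter uncertainty; a wide spread says go back and redesign before deploying anything.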

You go to someone in finance and you say, "You really believe that these two assets are 57.366 percent correlated?" And they'd go, "No, but it's between 20 and 30 percent correlation, maybe." And you go, "Thank you." Then what you do is take that portfolio allocation and simulate the risk and return with lots of new data, which are randomly and plausibly chosen. Everybody see what I'm saying? You change the motor constant plus 5 percent, minus 5 percent. The moment of inertia, change it. The load you're picking up, you don't know that to within more than a few percent, so you vary it. And then you simply simulate, and you see what happens. If you get nice, tight curves, in other words, if your design is relatively insensitive, everything's cool, and now you download it to the actual real-time controller, or you drop it over to the real trading engine, or whatever you want to do. So that's how that works. That's a standard method; that's posterior analysis. Stochastic optimization is actually going to deal with the uncertainty directly and explicitly, and I guess we'll continue this next time. Let me repeat, for those who came in late, my plea, grovel, I'm not sure what the word is. From 12:50 to 2:05 today, here, we're having the world's first, I believe, I haven't been notified by SCPD that it's not true, the world's first tape-behind. We'll have a dramatic reenactment of lecture one. So come if you can.

[End of Audio]

Duration: 79 minutes