Convex Optimization II - Lecture 04

Student:[Inaudible].

Instructor (Stephen Boyd):Sure, we're going to need strong duality to hold. If the problem were strictly feasible, we'd have Slater's condition and strong duality would hold. That gives you zero duality gap, and if you don't have that, then you can't solve this at all, because the optimal values aren't even the same. So let's assume that. There's more to it than just that, actually. The sledgehammer condition is this. What you'll need is that when you find lambda*, the Lagrangian at lambda* should have a unique minimizer in x. If it does, then that x is actually x* up here. Okay? So that's the condition. A simple condition for that – you should go back and read this, because we're going to be doing this in a couple of weeks, and these things are really going to matter. We're going to do network flow methods and all sorts of other stuff, and they're actually gonna matter. Here's one sledgehammer condition: f0 here is strictly convex. Because if f0 is strictly convex, then f0 plus the sum of lambda_i f_i, where the f_i are convex, is also strictly convex for any lambda with lambda_i ≥ 0, including lambda = 0. And if it's strictly convex, it has a unique minimizer. So just to go back to this: you would actually calculate the optimal lambda*. You would then get the minimizer of the Lagrangian with respect to x, call that x*. And that will actually be the optimum of this problem. Okay. So let's work out the subgradient of the negative dual function. It's actually quite cool. Let's let x* of lambda be this: it's the arg min of the Lagrangian. So it's exactly what we were just talking about. And here's the big sledgehammer assumption: f0 is strictly convex. And by the way, you might say a lot of times some of these things are silly – sorta things that basically only a theorist would worry about.
I mean, somebody should worry about them, but they have no implications in practice. And I’m very sorry to report that this is not one of those. This actually – there are many cases in practice with real methods where this issue comes up. It’s real. It means that methods will or will not converge and you have to take extra effort in things like that. Okay. All right, so we’ll just make this sledgehammer assumption here, the crude assumption, f0 is strictly convex. That means this is strictly convex here. And therefore, it has a unique minimizer. And we’re going to call that minimizer x* of lambda. It’s a function of lambda. Okay? So that’s x* of lambda.

Okay. Otherwise what you do is this. And this is really cool. You go over here and you look at fi of x. If fi of x is, let's say, plus one, it means that your current x is violating constraint i. Okay? It says you're violating constraint i. That says the price is not high enough. So it says increase the charge on resource i, if fi represents the usage of resource i. It says pop the price up, and alpha tells you how much to pop the price up. In that case, the plus part is totally irrelevant, because the price was nonnegative and you're adding something to it. You bumped the price up, and there's no way the plus could ever come into play. Okay. I should say, if you make a little comment symbol over here in the code, you should write on that line "price update," because that's exactly what this is: the price update. So what you do then is this. If fi is negative, that's interesting. That means you're underutilizing resource i. It's possible that in the final solution constraint i is not tight, in which case it doesn't matter – that's fine, that's the right thing to do – but you're underutilizing it. And what this says is, in that case, drop the price on that resource. However, now the plus comes into play. It says drop the price, but if the new price goes under zero – which messes everything up, because now it encourages you to violate the inequalities, not satisfy them – then you just make the price zero. And so, for example, if the ith inequality is, at the optimal point, gonna be not tight, then what's going to happen is that price is gonna go to zero like that. You'll be underutilized here. That'll be negative. The price will be zero for the last step. This will become negative. The plus part will restore it to zero.
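The loop being described – minimize the Lagrangian at the current prices, then do the clipped price update – can be sketched in a few lines. This is a hedged illustration, not the lecture's code: the quadratic objective (chosen so x*(lambda) has a closed form), the random data, and the step size are all assumptions made here.

```python
import numpy as np

# Made-up instance of: minimize x'x + q'x  subject to  Ax <= b.
# f0(x) = x'x + q'x is strictly convex, so the Lagrangian has a
# unique minimizer x*(lambda) for every lambda.
rng = np.random.default_rng(0)
n, m = 10, 20
A = rng.standard_normal((m, n))
b = rng.random(m) + 0.1          # x = 0 is strictly feasible (Slater holds)
q = rng.standard_normal(n)

lam = np.zeros(m)                # prices (dual variables) start at zero
alpha = 0.01                     # fixed step size
for _ in range(5000):
    # grad_x L = 2x + q + A'lambda = 0  =>  the unique minimizer:
    x = -(q + A.T @ lam) / 2
    # price update: raise the price where f_i(x) > 0 (violated), lower it
    # where f_i(x) < 0 (underutilized), clipping at zero (the "plus")
    lam = np.maximum(lam + alpha * (A @ x - b), 0)
```

At convergence x is (nearly) feasible, and prices on slack constraints have been driven to zero, exactly the behavior described above.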
So this algorithm – I mean, it's actually a beautiful algorithm. Variations on this go back to the '50s and '60s, and you find them in economics. So it's just a price update – I think this would be part of a bigger family of things. I guess they call this a tâtonnement process, or something like that, where – I don't know. Who's taken economics and knows the name for these things? Does anyone remember? Come on, someone here took an economics class and saw some price adjustment. Okay, let's just move on. No problem.

For example, you can do network flow control, you can do all sorts of crazy stuff with this, and it's quite cool. But that's later. Okay. So we'll just do an example just to see how this method works, or that it works, or whatever. Oh, I should mention something here. You might want to think about when this algorithm would be a good algorithm – when would it look attractive? And let me show you one case right now. This is a trivial calculation; the only actual work is here. What this means is that at each step the work is in minimizing this Lagrangian. So basically, at each step there will be prices, and then you minimize the Lagrangian. That's going to be the work. Therefore, if you have an efficient method for minimizing this function, you are up and running. Okay? So, for example, I can name a couple of things. Suppose you have a control problem and these are quadratic functions. Then this weighted sum is also going to be one of these convex control problems, and it means you can apply your LQR or your Riccati recursion or whatever you want to call it to that. If this is image processing, and this involves something like 2D FFTs and all sorts of other stuff, the grad student before you has spent an entire dissertation writing a super-fancy multigrid blah, blah, blah that solves this thing well. If it solves least squares problems, and this is a least squares problem, you're up and running. And you wrap another loop around that where you just update weights and then repeatedly solve this problem. So it's good to be on the lookout. Any time you know a problem where you have an efficient method for minimizing a weighted sum – this is what you want to do.

And by the way, these are people who do stuff and actually get stuff done and working. So don't make fun of them. Don't ever make fun of those people. So how would a person handle that if they hadn't taken 364? Well, it'd be 263. You'd look at it and you'd say, "Well, I know how to minimize that." That's no problem. That's that, without the lambda there. That's P inverse q. Something like that. And then you'd look at it and you'd go, "Yeah, but, I mean, this is a problem." So here's how a human being would do it. They'd calculate P inverse q. That's x. If that x is inside the unit box, they'd have the sense to say, "I'm done." Otherwise, they'd say, "Oh, x7 is way big. Ouch. That's no good. So I will regularize: I will add some number times x7 squared." Everybody cool on that? You're adding a penalty for making x7 big. Okay? And you'd look at this and see that x12 is also big, and you'd add something there. Remember, don't make fun of these people. I don't. You shouldn't. So then you'd solve it again, and now – except now x7 is too small. Now x7 has turned out to be 0.8, and you go, oh sorry, my weight was too big. So you back off on x7, and now other things are coming up and over. And you adjust these weights until you get tired, and you announce, "That's good enough." Okay. I mean, listen, don't laugh. This is exactly how engineering is done. Least squares with weight twiddling. Period. That's how it's done. If you're in some field like machine learning, you think, oh, how unsophisticated – people in my field are much more sophisticated. This is false. All fields do this. This is the way it really works in those little cubicles down in Santa Clara. This is the way it's done. You're doing imaging. You don't like the result. Too smoothed out.
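Here's what an organized version of that weight-twiddling loop might look like on a tiny made-up box-constrained quadratic. The instance, the 0.1 weight-update gain, and the stopping tolerance are mine, purely for illustration:

```python
import numpy as np

# Made-up instance: minimize (1/2) x' P x - q' x  subject to  |x_i| <= 1.
P = np.diag([2.0, 1.0])
q = np.array([4.0, 3.0])

w = np.zeros(2)                       # per-coordinate regularization weights
for _ in range(500):
    # minimize (1/2) x' P x - q' x + sum_i w_i x_i^2  (a linear solve)
    x = np.linalg.solve(P + 2 * np.diag(w), q)
    if np.abs(x).max() <= 1 + 1e-6:   # inside the unit box: good enough
        break
    # bump the weight on violated coordinates, back off on slack ones
    w = np.maximum(w + 0.1 * (x**2 - 1), 0)
```

For this instance the unconstrained minimizer is (2, 3), well outside the box, and the loop drives the solution to the true optimum (1, 1). The weight update is exactly the clipped price update, which is the point made in the lecture.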

You go back and you tweak a parameter and you do it again. So no shame. All right. So if you think about what this method is, it's weight twiddling. That's what this says. It says pick some regularization weights, because that's what these are, and then update the regularization weights in a very organized way. You just update them this way. So this is, in fact, weight twiddling – an economist would call it a price update algorithm, and an engineer might call it a weight twiddling algorithm. There are probably people who invented this and didn't know it. Anyway, everyone see what I'm saying here? Okay. Let me ask you a couple of questions about it, just for fun, because I've noticed that the 364a material has soaked in a bit – not fully. If P is banded, how fast can you do this? Let's say it's got a bandwidth around k, and n is the size of x. How fast can you do that?

Student:[Inaudible].

Instructor (Stephen Boyd):With n squared k? That's your opening bid? That's better than n cubed, right. If P is full, that's n cubed – that's a Cholesky factorization and a forward and backward substitution, right? Let's make P banded. You said n squared k, that was your opening?

Student:[Inaudible].

Instructor (Stephen Boyd):Oh, even better. So nk squared. You're right. That's the answer. Okay. Let me just make a point here. If this is full, you probably don't want to do this for more than a couple of thousand variables. Three thousand, 4,000, you start getting swapping and stuff like that. If you have a bunch of machines, all your friends' machines, and you run MPI and all that stuff, you can go up to 5,000, 10,000, something like that. But things are getting pretty hairy, and pretty serious, at that point. If P is block-banded or something like that, with a bandwidth of ten, how big do you think I could go on, say, my laptop? Remember, the full limit would be a couple thousand, and it's growing like the cube, so every time you double the size the cost goes up by a factor of eight. So what's a rough number? Well, put it this way: we wouldn't have to worry at all on my laptop about a million. I want to make a point here that knowing all this stuff about structure, and recognizing it in the context of problems, puts you in a very, very good position. By the way, where would banded structure come up in a least squares problem? Does it ever come up?
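To see why the banded case is so cheap, here's a sketch of the bandwidth-one (tridiagonal) case: forward elimination and back substitution in O(n) flops, the k = 1 instance of the O(nk²) count, versus O(n³) for a dense solve. The function name and test data are mine, not from the lecture:

```python
import numpy as np

def tridiag_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) flops (bandwidth-1 case of O(n k^2))."""
    n = len(diag)
    b = diag.astype(float).copy()
    d = rhs.astype(float).copy()
    for i in range(1, n):                  # forward elimination
        m = lower[i - 1] / b[i - 1]
        b[i] -= m * upper[i - 1]
        d[i] -= m * d[i - 1]
    x = np.empty(n)
    x[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):         # back substitution
        x[i] = (d[i] - upper[i] * x[i + 1]) / b[i]
    return x

# quick check on a classic second-difference (SPD, banded) matrix
n = 1000
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
rhs = np.ones(n)
x = tridiag_solve(off, main, off, rhs)
```

At n = 1,000 this is a few thousand flops; the same idea with bandwidth k is how a banded Cholesky solve reaches a million variables on a laptop.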

Student:[Inaudible].

Instructor (Stephen Boyd):Structures that are – yeah, actually banded. What does banded mean? Banded means that x(i) only interacts with x(j) for some bound on i minus j. So if you had a truss or some mechanical thing that went like this, and bars never went too far from one to the other, that would be a perfect example. Let me give you some others. Control of a dynamic system is one, because there it's time. If you have a linear dynamical system or something like that, the third state at time 12 interacts, roughly, with the states one step before and one after. So that's banded. How about this? How about all of signal processing? There's a small example for you. All of signal processing works that way, more or less, because there the band structure comes from time. Signal processing means that each x depends only on a few others – there's a bounded memory for how much it matters. Now the whole problem is coupled, right? Okay. This is just for fun, but it's good to go over this stuff, and I just used it as an excuse to do that. Okay. So here's a problem instance, where I guess we have 50 dimensions, with a fixed step size of 0.1. Oh, I can ask a question here about this. In this case, it turns out g is actually differentiable. And if g is differentiable, that actually justifies, theoretically, using a fixed step size – in practice as well, because if you have a differentiable function and you apply a fixed step size, and the step size is small enough, then you will converge to the true solution. So this shows g of lambda. These are lower bounds on the optimal value, like that. They converge. And this is the upper bound, found by finding a nearby feasible point. And then let me just ask you – I don't even know, because I didn't look at the code this morning, how I did this, but why don't you guess?

At each step of this algorithm, when you calculate this thing – by the way, if this thing is inside the unit box, you quit and you're done. You're globally optimal, because you're both primal and dual feasible. End of story. Zero duality gap. Everything's done. So at each step, at least one component of this pops outside the unit box. Please give me some guesses – just give me a heuristic for taking an x and producing from it something that's feasible for this problem.

Student:[Inaudible].

Instructor (Stephen Boyd):Simple? What do you do?

Student:If a component is greater than one, you make it one. If it's less than negative one, you make it negative one.

I mean, these are the kinda things where you need the proof, because the algorithms themselves are so ludicrous. Okay. Now here we have to change a few things. f(k) best is the best objective value we have over all the points that are feasible, and this can actually be plus infinity if you haven't found any feasible points yet. So f(k) best is initialized as plus infinity. Okay, so the convergence is basically the same. I won't go into the details. It just works. I mean, that's the power of these very slow, very crude methods. In fact, that's going to come up in our next topic. What you're going to say about subgradient methods is that they're very unsophisticated and very slow, but one of the things you get in return is that they are very rugged. In the next lecture, which we'll get to very soon, you'll see exactly how rugged they are. Anyway, so there it is. That's a typical result. And I think the proof of this is in the notes; you can look at it. But let's just do an inequality form LP: minimize c transpose x subject to Ax less than b. It's a problem with 20 variables and 200 inequalities. The optimal value for that instance turns out to be minus 3.4. We have a one over k step size here. Oh, by the way, when we take a step on a violated constraint, we can use a Polyak step size, because if you're doing a step on fj, which is a violated inequality, what you're interested in is fj equals zero. You are specifically interested in that. So your step size can be the Polyak one. And this would be an example of the convergence of f minus f*. If f* is minus 3.4, then – well, this is not bad, right? I don't know. Let's find out where 10 percent is. There's 10 percent.
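A sketch of that alternating scheme on a made-up inequality-form LP (random data; this is not the lecture's instance, whose optimal value was minus 3.4): a 1/k step on the objective when the point is feasible, a Polyak step on the most violated constraint fj(x) = aj'x − bj (whose target value, zero, is known) otherwise, with f(k) best tracked as described.

```python
import numpy as np

# Made-up instance of: minimize c'x  subject to  Ax <= b.
rng = np.random.default_rng(2)
n, m = 20, 200
A = rng.standard_normal((m, n))
b = rng.random(m) + 0.1           # x = 0 is feasible
c = rng.standard_normal(n)

x = np.zeros(n)
f_best = np.inf                   # plus infinity until a feasible point is found
for k in range(1, 5001):
    r = A @ x - b
    j = int(np.argmax(r))         # most violated constraint, if any
    if r[j] <= 0:                 # feasible: record the value, step on the objective
        f_best = min(f_best, c @ x)
        x = x - (1.0 / k) * c
    else:                         # infeasible: Polyak step toward f_j(x) = 0
        x = x - (r[j] / (A[j] @ A[j])) * A[j]
```

Each iteration costs one matrix-vector multiply plus O(n) work, which is the mn-per-step figure discussed next.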

So it took you about 500 steps to get to 10 percent or something. So each step here costs what? What's the cost of the step here, assuming dense, no structure? What's the cost of a step in solving? Let's write that down. So we're gonna solve this problem: minimize c transpose x subject to Ax less than b. What's the cost of a subgradient method step here? You're exempted because you can't see it. Is that true, you can't see it? No, you can't see it. You can see the constraints. That's actually the really important part. What's the cost?

Student:[Inaudible].

Instructor (Stephen Boyd):How did you do it? What's the method? If you were in matlab, how long would it be? For that matter, it's just as easy to write it in LAPACK, but let's write it in matlab. What would it be? How do you implement this method here? By the way, of course all the source code for this is online, but there's a method, right? So what's the method? Somebody tell me the method. Homework three – you'll be doing this soon enough. Well, here's the lazy way. You just evaluate Ax and you compare it to b. If Ax is less than or equal to b, what's your update on x? It's x minus equals alpha times c, right? Otherwise, if Ax is not less than or equal to b, you find, for example, the maximum violated one – I mean, it doesn't matter, that's the point, you can take any violated one. But if you evaluate all of them – and that's just from laziness – then what's the cost of evaluating this? There's your cost right there. It's just multiplying Ax. What's that?

Student:[Inaudible].

Instructor (Stephen Boyd):Are you saying mn? Thank you, good. Okay, mn. I know it's irritating, but the thing is, you should know these things. These should not be abstract facts from three days of 364a. You should just know them. You should know what the numbers are on modern processors and things like that. So the cost is mn per step – it's dominated by evaluating Ax and stuff like that. So what that says – whereas, how about the cost of an interior point method on this guy? What's an interior point step? In fact, what's the interior point method complexity, period, end of story, on this guy? Just minimize c transpose x subject to Ax less than b. At each step you have to solve something that looks like A D A transpose, something or other, right? And that's going to be n cubed – I mean, unless it's structured, that's n cubed. But forming A D A transpose, that's the joke on you. That's mn squared. m is bigger than n, so that's the dominant term. So it's mn squared per step. How many interior point steps does it take to solve this problem?

Student:[Inaudible].

Instructor (Stephen Boyd):Thank you. Twenty. So it's 20. So what's the overall complexity of solving this problem? mn squared. And remember, that follows my mnemonic: it's the big dimension times the little dimension squared.

So this stuff goes way back, actually, and it has lots of applications. Okay, so let me define a noisy unbiased subgradient. Here's what it is. I have a fixed, deterministic point x, and a noisy unbiased subgradient for f at x is this: it is a random vector g tilde that satisfies this – on average, its expected value is a subgradient. Now by the way, this means that for any particular g tilde, this inequality need not hold, obviously. However, on average it does. So basically, think of your f.get_subgrad as being random – it's not deterministic. When you call it, it gives you different g's. If you call it a zillion times and average, that would give you something close to the mean, which is a subgradient. Everybody got it? We'll see lots of practical examples where you get things like that. Okay. Another way to say it is this: what comes back is a true subgradient plus a noise, which is zero mean. That's a stochastic subgradient. Now this error here can represent all sorts of things. First of all, it can just be computation error – when you calculate the subgradient you're sloppy, or you do it in fixed point, or I don't know. Anything like that. But it can also be measurement noise.

We're going to see it can be Monte Carlo sampling error. If the function itself is an expected value of something, and you estimate that expected value by Monte Carlo, then the resulting subgradient estimate is unbiased – the expected value of the average is the right thing – and v is the difference between what you actually get and a true subgradient. That's your Monte Carlo sampling error. Okay. Now if x is also random, then you say g tilde is a noisy unbiased subgradient if the following is true: for all z, this holds almost surely, where this is the conditional expectation of g tilde given x. Now that's a random variable, so this right-hand side is a random variable. That one is not a random variable, but this one is, because x is a random variable. So the whole thing on the right is a random variable. And if this inequality holds almost surely, then you call g tilde a noisy unbiased subgradient. So that's what it is.

Okay. And that's the same as saying the following: the conditional expectation of g tilde given x is a subgradient of f at x almost surely. So that's what it means. If x is not random, I don't need the condition on x, and I can erase that. So this is a random vector, that's a random vector, and that's actually a random set, and it basically says that that inequality holds almost surely. Okay. Now here's the stochastic subgradient method. Ready? Here. In other words, it's the subgradient method. It says: you've got a noisy subgradient, I'll just use it. Nothing else. You basically update like that, and that's it. Now I want to point something out. This is now a stochastic process, because even if x(0), your initial x, was deterministic, g(0) is already a random variable, and therefore x(1), the first update, is a random variable, because it depends on g(0). So we now have a stochastic process: the trajectory of the stochastic subgradient method. And here you just have any noisy unbiased subgradient. And then we'll take the step sizes the same as always, and f(k) best is going to be the min of these things. That, by the way, is a random variable, because this is a stochastic process. Okay. So here are some assumptions. The first is we'll assume that the problem is bounded below. These assumptions are much stronger than you need, but they're good enough. We'll make this global Lipschitz condition here – with more sophisticated methods you can relax these, but that's okay. And we'll assume the expected squared distance from the initial point to x* is at most R squared – the initial point, by the way, could be just a fixed point, in which case you don't even need the expected value.
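Here's a hedged sketch of the method just stated, on a made-up piecewise-linear problem. The noisy oracle returns a true subgradient (an active row of A) plus zero-mean Gaussian noise, and the iteration is exactly x := x − alpha_k g-tilde with a diminishing step, tracking f(k) best; the problem data and noise level are mine, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up piecewise-linear objective f(x) = max_i (a_i' x + b_i).
n, m = 5, 30
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    return np.max(A @ x + b)

def noisy_subgrad(x):
    """A true subgradient of f at x plus zero-mean noise (unbiased)."""
    j = int(np.argmax(A @ x + b))       # an active piece; its row is a subgradient
    g = A[j]
    return g + 0.25 * np.linalg.norm(g) * rng.standard_normal(n)

x = np.zeros(n)
f_best = f(x)
for k in range(1, 3001):
    x = x - (1.0 / k) * noisy_subgrad(x)   # diminishing, nonsummable steps
    f_best = min(f_best, f(x))             # f(k) best, itself a random variable
```

Every run of this is one realization of the stochastic process; rerunning with a different seed gives another trajectory, which is exactly what the plots discussed later show.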

Now, will the convergence be fast? No. It can't be. It can hardly be fast if someone's only giving you a subgradient, which is kind of a crappy direction anyway. But now suppose they give you a subgradient with a minus 20 decibel signal-to-noise ratio – in other words, you can't even trust the subgradient within a factor of ten. You'd actually have to ask for a subgradient direction 100 times and average the answers, and only then could you start getting some moderately sensible direction to go in. Everybody see what I'm saying? The whole thing's quite ridiculous. And the summary is: it just works. These are kinda cool. This is also known and used in a lot of different areas – signal processing and all sorts of others. Actually, there's a big resurgence of interest in this right now in what people call online algorithms, being done by people in CS and machine learning and stuff like that. So let's look at the convergence proof. It's tricky, and you won't get the subtleties here, but you can look at the notes too. It's not simple. I got very deeply confused, and you have to go over it very carefully. That it's subtle you won't get from this, but let's look at it. It goes like this. You're going to look at the conditional expectation of the squared distance to an optimal point at the next step, given x(k). Now this thing is nothing but that, so we just plug that in. And we do the same thing we did with the subgradient method: we split it out. This minus this, that's one term, and we get this term. And this is conditioned on x(k). Now, x(k) conditioned on x(k) is x(k), so this term loses the conditional expectation. It's a random variable, of course, but you lose the conditional expectation. It's the same.

Then you get this: two alpha times the conditional expectation of the cross term – this minus this, times that term – conditioned on x(k). And the last term is alpha squared times the conditional expectation of the norm of the subgradient squared, given x(k). We're just going to leave that term alone. Now, this term in here we're going to break up into two things. We're going to write it this way. Here x* is a constant, so conditioned on x(k) it's just x*. And this term, g tilde (k) transpose x*, is linear, and conditional expectation commutes with linear operators, so that comes around and you get this thing. Now, the definition of a noisy unbiased subgradient – a stochastic subgradient, if you like – is exactly that this thing here satisfies the right inequality to make this true. So that's how that works. And so you end up with this. Now, if you go back and look at the proof of the subgradient method, it looks the same, except without the conditional expectations; there are a few extra lines here because of them. So let's look at this. This inequality here holds almost surely. Everything here is a random variable. That's a random variable; this entire thing over here is a random variable. So this inequality – this thing is less than that – holds almost surely. And now what you can do is the following: we can telescope, the same as before. If we take the expectation of this, then the expectation of this is just the expected value of that. That's a number, and it's less than the expected value of that minus the expected value of that.

Taking the expected value of that just drops the conditional part, by iterated expectation. And so here's what you get if you take expectations of the left- and right-hand sides of the inequality above, which holds almost surely: the expected squared distance to the optimal point at the next step is less than the expected current squared distance, minus two alpha k times the expected value of your suboptimality, plus alpha k squared times the expected value of the norm of the subgradient squared. We replace this last expectation with the number G squared, we apply this recursively, and you end up with this: the expected squared distance from x(k+1) to x* is less than the expected squared distance from x(1) to x*, which is less than R squared by assumption. And then again we get the good guy and the bad guy. That term is bad. This one is good, because this thing is always bigger than that by definition: whatever this is, it's a nonnegative number, and it comes in with a negative sign, so it's on our side – it's actually making the distance smaller. And the nice part is that the good guy goes like alpha and the bad guy goes like alpha squared, so for small enough alpha the bad guy is gonna lose. And then you simply turn this around and you get the following: the minimum over i equals 1 to k of the expected value of your suboptimality is less than or equal to R squared plus G squared times the sum of the alpha i squared, all divided by twice the sum of the alpha i. It's exactly what we had before – except now it's the minimum of the expected value of the suboptimality. That's the difference.
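For reference, the two relations being described on the board can be written out. This uses the standard notation of the subgradient-method notes: step sizes α_k, Lipschitz-type bound E‖g̃‖² ≤ G², and E‖x⁽¹⁾ − x*‖² ≤ R², matching the assumptions stated above.

```latex
% basic inequality, conditioned on x^{(k)} (holds almost surely):
\mathbf{E}\!\left[\|x^{(k+1)}-x^\star\|_2^2 \,\middle|\, x^{(k)}\right]
\le \|x^{(k)}-x^\star\|_2^2
- 2\alpha_k\!\left(f(x^{(k)})-f^\star\right)
+ \alpha_k^2\,\mathbf{E}\!\left[\|\tilde g^{(k)}\|_2^2 \,\middle|\, x^{(k)}\right]

% taking expectations and telescoping gives the bound:
\min_{i=1,\dots,k}\,\mathbf{E}\!\left[f(x^{(i)})\right]-f^\star
\le \frac{R^2 + G^2\sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i}
\end{aligned}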

Okay. Now this tells us immediately that the min of these expected values converges to f*. That's what it tells us. Actually, I believe you don't even need the min here – this converges – but that's a stronger result, and we don't need it. We'll get to that. Now we're going to apply Jensen's inequality. So Jensen's inequality says the following – I'm gonna commute expectation and min. Min is a concave function, so here I have the expectation of the min, and here I have the min of the expectations, and the inequality goes this way. But this thing here is a random variable: it's f(k) best. And so the expected value of f(k) best is less than or equal to this, and this thing goes to zero. So we're done. By the way, I never remember which way Jensen's inequality goes, and I'm not ashamed to admit it. I usually have to go to a quiet place and draw a little picture with some lines, because every time you do it it's something different – it's either concave or convex or whatever. It's important which way it goes, so I just go somewhere quiet, draw a picture, and then come back. I'm trusting myself that it's correct. I think it is.
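If you'd rather check the direction numerically than draw the picture: since min is concave, the expectation of the min is at most the min of the expectations. A throwaway sanity check with made-up Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(4)
# three random sequences with means 1, 2, 3
X = rng.standard_normal((100000, 3)) + np.array([1.0, 2.0, 3.0])

e_of_min = np.mean(np.min(X, axis=1))    # estimates E[min_i X_i]
min_of_e = np.min(np.mean(X, axis=0))    # estimates min_i E[X_i]
# Jensen with concave min:  E[min] <= min of E
```

Here min_of_e is close to 1, while e_of_min is noticeably smaller, since the min occasionally picks off a low draw from the other sequences. That's exactly the direction used in the proof: E[f(k) best] = E[min] ≤ min E.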

Okay. And here's what happens. Here's the noise-free case, so you get this. And so – what this means is, about how many bits of accuracy are you getting in your subgradient? When you call subgradient, how many bits of accuracy are you getting here? If your noise is on the order of a quarter of the signal, I just want a rough number: how many bits of accuracy are we talking? Is it 20? The signal-to-noise ratio is four to one. Roughly how many bits? It's not a complicated question, so people are probably way over-computing or something like that.

Student:[Inaudible].

Instructor (Stephen Boyd):It’s two, roughly. What? You said two and a half. You believe two?

Student:[Inaudible].

Instructor (Stephen Boyd):You think it's 12? It's two, right? It basically means that if I tell you a component of the subgradient, you can be off by as much as 25 percent. This is all hand waving, but it roughly means about two bits of accuracy. That's not a whole lot of accuracy, right? So that's the point: these are really quite crude. You can see what happens, and it's actually interesting. There's one realization, and here's another. In this one we were really lucky: the errors in the subgradient were such that they didn't mess us up very much. And that's another one where they did, although it can't be stopped. These are big numbers here. The interesting thing is that to get to 10 percent accuracy you probably multiplied your number of steps by four or something like that. The really cool part is this: what would happen if the signal-to-noise ratio were inverted, so the noise were four times as big as the signal – the signal-to-noise ratio were .25, not four, roughly? What do you think would happen here? Well, first of all, we know the theory says it's going to work. But think about how ridiculous that is. You calculate the true g, which says go in that direction, and to that vector you add a Gaussian which is four times bigger. Which means, basically, it would be very difficult to distinguish your get subgrad method from a completely random one. It's like, "Which way should I go?" And you're like, "Oh, that way." And it's like, "Really? Can you verify that?" And you go, "That way." It's just about totally random. You'd have to call it like 1,000 times and average the answers to even see faintly that there's some signal there. Everybody see what I'm saying? What would happen, of course, is that the noise would now mess things up much more.

These would be that. That's what would happen. So this shows you what happens with 100 realizations: you generate 100 realizations of the stochastic process, which is the stochastic subgradient method running forward in time. And this shows you the average here. And this shows you the standard deviation. That's a log scale, so that's why these things look weirdly asymmetric. On a linear scale this is plus/minus one standard deviation. This is also plus/minus one standard deviation, but it's on a log scale. But that's what it is. By the way, it's actually kind of interesting: these points down here correspond to cases where the noise – sorry, the noise accidentally pointed you in the right direction, and as a result, you did actually quite well. And of course, these are cases where the noise was hurting you as much as possible. Makes sense? So I guess the summary of this is that you can make fun of the subgradient method, it's very slow, and all that kind of stuff, but the wild part is that any zero mean noise added to it doesn't actually stop it from converging. And we're not talking noise in the fifth digit. We're talking, if you like, noise in the minus fifth digit, if you want. Which is quite ridiculous if you think about it. Don't try a Newton method when you're calculating your gradients or your Hessians with 25 percent noise.
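The experiment being described can be sketched in a few lines. This is a minimal illustration, not the exact setup from the plots: the piecewise-linear problem data are random, the 25 percent noise level matches the discussion, and the 1/k step sizes are one standard diminishing-step choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem data: minimize the piecewise-linear function
# f(x) = max_i (a_i^T x + b_i). The specific A, b are made up.
m, n = 20, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    return np.max(A @ x + b)

def noisy_subgradient(x, noise_ratio=0.25):
    """An exact subgradient of f at x, plus zero-mean Gaussian noise
    whose standard deviation is noise_ratio times the subgradient's norm."""
    i = np.argmax(A @ x + b)
    g = A[i]
    return g + noise_ratio * np.linalg.norm(g) * rng.standard_normal(n)

def run(steps=5000):
    """One realization of the stochastic subgradient method;
    records the best function value found so far."""
    x = np.zeros(n)
    best = f(x)
    hist = []
    for k in range(1, steps + 1):
        x = x - (1.0 / k) * noisy_subgradient(x)  # diminishing steps
        best = min(best, f(x))
        hist.append(best)
    return np.array(hist)

# Average the best-value trajectory over many realizations,
# as in the 100-realization plots discussed above (20 here, for speed).
runs = np.stack([run() for _ in range(20)])
avg = runs.mean(axis=0)
print(f"avg best value: start {avg[0]:.3f}, end {avg[-1]:.3f}")
```

Despite the 25 percent noise in every subgradient call, the averaged best-value curve drifts steadily downward, which is the point of the plots.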

For that matter, don't try it if your signal to noise ratio is one to four, so it's off the other way around. Okay. So these are empirical distributions of your suboptimality at 250, 1,000, and 5,000 steps here. And they look like this, and you can actually see these would be the ones at the top of that plot, those error bars. And then these would be the ones at the bottom. But you can sort of see that the distribution is very slowly going down like that. So that's the picture. Let me ask one question about this problem. How would you deal with this in a 364a context? Suppose I told you, you need to minimize a piecewise-linear function, but unfortunately – the source code I won't let you look at – the only thing that calculates this thing only does it to two bits of accuracy. Or another way to say it is every time you call it, you're going to get a subgradient plus a noise, which is as big as a quarter the size of the actual subgradient. How would you deal with that in 364a? I mean, we didn't really have a method to deal with this, but just tell me: what would you do?

Student:Call that function 10,000 times.

Instructor (Stephen Boyd):Right. Right. So 10,000. And, good. Ten thousand was a good choice of number, by the way. So you call the function 10,000 times, and then you'd average those subgradients, and roughly how much error is in the average of those 10,000? I was hoping you'd say it goes down by the square root of 10,000. That's why I was complimenting your choice of 10,000, because it has a nice square root, 100. So instead of the error being 25 percent, it would be .25 percent. So that might be enough to actually run a gradient method. It probably would work okay. So what would happen is at the end game it would kinda start being erratic or something, but you'd get a pretty good answer pretty quickly. By the way, if you evaluate it 10,000 times per step, I should point something out: the plain stochastic subgradient method is beating you. So it's not clear – anyway, you're right, that's what you'd do. Okay. So this is actually maybe a good time to talk about stochastic programming. Actually, at some point I want to make a whole lecture on this, because it's quite cool. Everybody should know about it. And it's this. In stochastic programming, you're going to explicitly take into account some uncertainty in the objective and the constraints. There's also something called robust programming, where you have uncertainty, but you pose the problem in a different way and you look for worst case type things. But stochastic programming is a very common, very old method, something like that. I should mention, it's kinda obvious that this comes up in practice all the time. So anytime anybody's solving an optimization problem, you just point to any data as – oh, by the way, I should mention this. If you were not at the problem session yesterday, you should find someone who was and ask them what I said. I don't remember what I said, but some of it is probably useful.
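The averaging trick the student proposed can be checked numerically. A minimal sketch: the "exact" subgradient below is a made-up vector, the 25 percent noise level and the 10,000-call count are the numbers from the discussion, and the square-root-of-N error reduction is the whole point.

```python
import numpy as np

rng = np.random.default_rng(1)

g_true = np.array([3.0, -1.0, 2.0])       # hypothetical exact subgradient
sigma = 0.25 * np.linalg.norm(g_true)     # noise std is 25% of its size

def noisy_call():
    # each call returns the subgradient corrupted by zero-mean Gaussian noise
    return g_true + sigma * rng.standard_normal(g_true.shape)

# Average N independent calls: the noise std shrinks by sqrt(N) = 100,
# so the roughly 25% error drops to roughly 0.25%.
N = 10_000
g_avg = sum(noisy_call() for _ in range(N)) / N

rel_err = np.linalg.norm(g_avg - g_true) / np.linalg.norm(g_true)
print(f"relative error after averaging {N} calls: {rel_err:.4f}")
```

Note what this costs: 10,000 function evaluations for one accurate subgradient, while the stochastic subgradient method above got a decent answer in about 5,000 noisy evaluations total.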

So you take any problem, like a linear program, and what you do is you then ask the person solving the linear program – you point to a coefficient, not a zero, because zeros are often really zero. Also ones; those are also not good choices, because a one is often really one. But you point to any other coefficient that's not zero or one and you ask them what that coefficient is. And that coefficient has a provenance. It traces back to various things. If it's a robot, it traces back to a length and a motor constant and a moment of inertia, or whatever. If it's a finance problem, it traces back to a mean return, a correlation between two asset returns or – who knows what? If it's signal processing, it goes back to a noise or a signal statistics parameter, for example. Everybody see what I'm saying? And then you look at them and you say, "Do you really know that thing to that many significant figures?" And now if they're honest they'll say, "Of course not." And the truth is they really only know it to – it depends on the application, but it could be three significant figures. In a really lucky case it could be four. It could be two, could be one. And, actually, if you get some economist after a couple glasses of wine, they'll look up and say, "We get the sign right, we're happy." But until then, they won't admit that.

So okay. So the point of all this is it's kinda obvious that if you're solving a problem and you point to a data value, it has a provenance, and it traces back to things that probably you don't know better than like a percent. I mean, it depends on the application, but let's just say a percent. By the way, if you don't know any of the data, if you barely know the sign of the data, my recommendation with respect to optimization – well, my comment is real simple. It's: why bother? So if it's really true that you don't know anything about the model, then you might as well just do your investments or whatever by intuition and guess. Because if you don't know anything, using smart methods is not going to really help. So typically you'll know one significant figure, maybe two, maybe three or something like that. And then, by the way, all the stuff from this whole year starts paying off a lot. And there are weird, sick cases where you know things to high accuracy. I mean, GPS is one, for example, where you point to some number and they go, "You really know that to 14 decimal places?" And they're like, "Yes." I mean, I just find it weird. But anyway, normal stuff: zero significant figures is like, why bother; usually it's one, two, three – I guess in some signal processing things, you can talk about 15 or 20 bits, but rarely more than that. Okay. So there's a lot of ways of dealing with uncertainty. The main one is to do a posterior analysis. That's very common. Do people know what a posterior analysis is? So posterior analysis goes like this: you're making – it doesn't really matter, let's make it concrete – you're making a control system for a robot, I don't care, something like that.

So you sit around and you work out a control system, and when you work out the control system, you can trace your data back, and there is a stiffness in there and there's the length of link two and there's an angle and there's all this stuff and there's a motor constant. There's all sorts of junk in there. And they all have nominal values. And you have a robot controller, and before you implement it on the robot – that'd be the simplest way – the first thing you do is something called a posterior analysis. Posterior analysis goes like this: you take the controller, or the optimization variable, whatever it is, designed on the basis of one particular value, like a nominal value of all those parameters. You take that and you resimulate it with multiple instances of those values, generated according to plausible distributions. Everybody see what I'm saying? And by the way, if you don't do this, then it's called stupid, actually. This is just absolutely standard engineering practice. Unfortunately, it's done in I'd say about 15 percent of cases. Don't ask me why. So in other words, you design a robot controller, you optimize a portfolio, anything you do, you do machine learning – actually, in statistics this is absolutely ingrained in people from when they're small children. You do this. It's the validation set or something like that. So here's what you do. You design that controller on the basis of a length and a motor constant – the motor constant depends on temperature and all sorts of other crap. You ask somebody who knows about motors and you say, "How well do you know that motor constant?" And they'd say, "I don't know, plus/minus 5 percent, something like that."

You go to someone in finance and you say, "You really believe that these two assets are 57.366 percent correlated?" And they'd go, "No, but it's between 20 and 30 percent correlation, maybe." And you go, "Thank you." Then what you do is you take that portfolio allocation and you simulate the risk and return with lots of new data, which are randomly and plausibly chosen. Everybody see what I'm saying? You change the motor constant plus 5 percent, minus 5 percent. Moment of inertia, change it. The load you're picking up, you don't know that to within more than a few percent. You vary those. And then you simply simulate. And you see what happens. If you get nice tight curves – in other words, your design is relatively insensitive – everything's cool, and now you download it to the actual real time controller. Or you drop it over to the real trading engine, or whatever you want to do. So that's how that works. So that's a standard method. That's posterior analysis. Stochastic optimization is actually going to be dealing with the uncertainty directly and explicitly, and I guess we'll continue this next time. Let me repeat for those who came in late: plea, grovel, and I'm not sure what the word is. At 12:50 to 2:05 today, here, we're having the world's first – I believe; I haven't been notified by SCPD that it's not true – the world's first tape-behind. We'll have a dramatic reenactment of lecture one. So come if you can.
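The posterior analysis recipe above can be sketched for the finance example. Everything here is illustrative: the nominal mean returns, the nominal covariance, the fixed portfolio weights `w`, and the 5 percent perturbation level standing in for "you only know this to a few percent" are all made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical nominal data the design was optimized against.
mu_nom = np.array([0.08, 0.10, 0.12])
Sigma_nom = np.array([[0.04, 0.01, 0.00],
                      [0.01, 0.09, 0.02],
                      [0.00, 0.02, 0.16]])
w = np.array([0.5, 0.3, 0.2])   # fixed design from the nominal model

def perturb(M, rel=0.05):
    """Each entry of M perturbed by up to +/- rel (uniform), mimicking
    'you only know this to about 5 percent'."""
    return M * (1 + rel * rng.uniform(-1, 1, size=M.shape))

# Posterior analysis: hold the design fixed, resimulate it under many
# plausible instances of the data, and look at the spread of outcomes.
returns, risks = [], []
for _ in range(1000):
    mu = perturb(mu_nom)
    Sigma = perturb(Sigma_nom)
    Sigma = (Sigma + Sigma.T) / 2   # keep the perturbed covariance symmetric
    returns.append(mu @ w)
    risks.append(np.sqrt(w @ Sigma @ w))

# Tight distributions mean the design is insensitive to the data
# uncertainty; wide ones mean you should worry.
print(f"return: {np.mean(returns):.4f} +/- {np.std(returns):.4f}")
print(f"risk:   {np.mean(risks):.4f} +/- {np.std(risks):.4f}")
```

The same loop works for the robot example: swap the perturbed asset statistics for a perturbed motor constant, stiffness, and load, and resimulate the fixed controller.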

[End of Audio]

Duration: 79 minutes