ConvexOptimizationII-Lecture05

Instructor (Stephen Boyd):All right. We’ll get started. I guess we’ll announce, probably by Thursday, when the actual sorta preliminary project proposal is due. You should be thinking about projects, and you probably shouldn’t have completed it by now. But you should certainly have thought about it a fair amount and tossed around some ideas. And you should probably be in communication with us if you’re not, just to get a little bit of feedback on what you’re doing. And we’ll probably make this preliminary project proposal due maybe early next week or something like that. And before we make it due, we will put something on the web, or we’ll announce something that’s going to be very specific as to the format, because we – it should be two pages and I think we might just even iterate on that this week. We’ll just reject it until it’s clear enough for us. Clear and simple. So that’s what we’re going to do. Okay. So let me continue on the material on stochastic programming. So this is going to be one of the examples for stochastic subgradient. So stochastic programming is a – by the way, it’s a huge field. You can teach an entire course on it. There are multiple books on this one topic. And a stochastic programming problem looks like this. You have an objective and constraints that depend on some random variable. And in fact, this can – any optimization problem, any one that actually is for some application, is going to look like this to some extent, no matter what. Period. In other words, that any number in the data in your problem will trace back to something, which is not going to be known that well. So in stochastic programming, you assume a probability distribution on these unknown parameters, and you minimize an expected value. And constraints are on expected values. Now expected value might not be what you care about. You might care about things like the variance and things like that.

Actually that’s gonna be also handled in the same way. You’ll see how in just a minute. So the exact same formulation. So don’t think it just means if one of these is – if that’s a cost, that’s expected cost. You actually care about the variance of the cost. We’ll see immediately that this can be incorporated into a framework like this. Okay. Now an expectation, of course, is something like an integral. And obviously, an integral of convex functions is convex. So this preserves convexity. So these are convex – that’s a convex function. In fact, I should say something a little bit more. When you integrate like this, generally speaking, these expected values – it depends on the distribution on omega, of course, but if that distribution is sorta continuous, or for that matter, has a distribution that’s analytic or something like this, these things actually inherit. They become much smoother. So even if f0 or fi is piecewise linear, for example, or non-differentiable, these are often quite smooth. So that’s not relevant. Doesn’t mean you can use Newton’s method, because, well, you can’t calculate any of these numbers, except in very special cases. Okay. Now one sorta very obvious thing to do is to ignore the uncertainty. And you ignore the uncertainty by simply taking for this parameter value its expected value. And that gives you this problem here. That’s the – some people call this the certainty equivalent problem. And I’ve heard, although I don’t quite understand it, other people refer to this as the mean field problem. Don’t ask me where that comes from or something like that. It has something to do with statistical mechanics or machine learning and stuff like that. But anyway, so this is the certainty equivalent problem. And what it means is you simply replace this parameter with this expected value and you solve this problem. Now by Jensen’s inequality, this is actually favorable compared to that. So when you solve this problem, you’ll actually get a lower bound on the optimal value of this one. So that’s actually – in fact, that’s the way I remember Jensen’s inequality. Jensen’s inequality says for a convex problem, variation messes you up. Which makes perfect sense. If something curves upward, and you vary around something, when it goes down and it can go up, but the positive curvature says that, on average, you get messed up, which is to say the average cost goes up.

And by the way, when you’re solving – if you do anything involving stochastic programming, you should solve this immediately, and then this gives you a lower bound. So now it’s not, of course, the solution to this when you calculate the x for this. Although there’s a quick way to see how much you’ve been messed up. I should mention something. This is what’s normally done, although not explicitly. So basically when you solve an optimization problem, when you know some parameter’s varying and you just ignore the variation, you’re just doing this. What you can always do at that point is take the x you find here – this problem, the optimal value gives you the optimal value, gives you a lower bound on the optimal value here. You then take that x and estimate these numbers by Monte Carlo. And if you find that with the x found from here, when you estimate these numbers by Monte Carlo, this number is not much larger than that. And these numbers are not much larger than that. Then you can actually safely presume that you have solved this problem within some reasonable accuracy. By the way, people do that typically informally that’s called a posterior analysis. That’s a standard method in engineering for dealing with uncertainty is step one, ignore it. Step two, pretend the uncertainty isn’t there, then step two is take the design obtained – assuming that all the numbers, the masses, moments of inertia, Young’s moduli, all these things, they’re all accurate out to full doubles, 12 significant figures – you take that design, and then you subject it to a posterior analysis, which is when you estimate these. Now if this is ten times as big as that, then you back off and you go, “I had a very nice design, but it’s not robust.” And it means you’re going to have to deal with this somehow head-on. Okay. Now there are lots of variations on this. You can actually put in between the expectation and the function any increasing convex function you like, non-decreasing convex function. That’s obvious because if you take a non-decreasing convex function of a convex function, that’s convex. And this allows you, actually, to shape all sorts of things that you might like. So for example, in a constraint, a very common one is to take the expected value of the plus part, non-negative part of this, and the non-negative part of a constraint is a violation. But it’s the amount of violation. So this is the expected violation.

By the way, if fi of x omega is less than or equal to some – if that constraint represents something like a resource usage, it means this is the expected over-utilization. Something like that. It can be all sorts of things. If it is a timing constraint, saying something has to be done by a certain amount, this is called the expected tardiness. If your basic inequality said this job has to be finished by this time, that’s the expected tardiness. That works because the plus function is non-decreasing and convex and so on. Now another one would be this, you could take the expected value of the maximum of all the constraints, and this is the expected worst violation. That’ll work, as well. So these are common measures here. And I should mention a few others here. But let me – I’m going to mention a few others in just a minute here, which actually should be on this slide. I’ll add them. But the one that’s most obvious is this, is a chance constraint. These are called chance constraints. And it looks like this. It’s the probability that you satisfy the ith constraint should be bigger than eta. And you should think of eta as, you know, 95 percent, two nines, three nines, this kind of thing. I mean, that’s how you’d imagine this. So basically it would say something like this: optimize this thing for me, but what I want to do is I want the probability that my link goes down or has inadequate signal to noise ratio should be – I want the probability that I have inadequate signal to noise ratio to be something like 99 percent or something. At least 99 percent. And that would be something like, by the way, in lots of contexts this would be called – more specific contexts – this would be called something like an availability constraint. If, for example, this represented a communication link having adequate signal to noise ratio or something like that, this would be an availability constraint. You’re saying that link has to be available at least 99 percent of the time. Or it’s an average constraint. Just flip it around. Here’s the bad news. Bad news is that these are very nice constraints. You can write whole books on them. But generally speaking, they’re intractable. And they’re not convex. However, you know one case where this is. I don’t know if anybody remembers it?

Student:[Inaudible].

Instructor (Stephen Boyd):What’s that?

Student:[Inaudible].

Instructor (Stephen Boyd):Precisely. So when f is affine, it’s actually biaffine, and it is affine in x and in omega. In other words, it’s a linear constraint with uncertain coefficients. That’s the most basic one. Then this constraint right here is a second-order cone constraint. And that’s actually when you think of a second-order cone constraint, you should almost think if – I mean, when you think of second-order cone programming, you can almost think of it as chance constrained linear programming. That’s what it is. So that’s one case, but in almost all cases, these are not convex. Now turns out there’s all sorts of good uristics for doing this, anyway. Let’s talk about a few more things up here. There’s actually one more I should add here. How about adding an almost sure constraint? How would you do that? Let me write down what that is. An almost sure constraint – let’s write down this one. Suppose what I want is fi. Can I handle that and is it convex? That’s the question. Of course, obviously, most cases it’s going to be infeasible. So it’s gonna have, you know, if your distributions have infinite or galceon have – so infinitely bad things can happen with small probability, constraints like this will just make the problem infeasible instantly.

But in some cases, this would make sense. And the question is, can you handle it? Can you use this mechanism, and if so, how? How do you do that? Can you write that as the expected value of something? What do you think? What would do this? You could write it this way. It’s just the expected value of the indicator function. So this is, in fact, less than or equal to zero. And the indicator function here, this function is the following function: it’s zero, if you’re negative. And if you’re positive, it’s plus infinity. That’s increasing in convex. Okay? Everyone agree? I mean, it’s rather extreme. It goes from zero to plus infinity, but the point is that’s increasing in convex. This function is, in fact, if you apply this to f, you can pose this function with f, all that happens is you redefine f to be plus infinity if you violate the constraint. If you say the expected value of that is less than zero, it’s pretty much identical to saying that this is true almost surely. Okay? So you can handle almost sure constraints. And I should mention one more that’s kinda widely used. Let me see if I can get it right. I think I can get it right. I might – I shouldn’t do this because I’m going off script, or whatever, but that’s okay. Here it is. A very common version of this is the following: you take x of something like that, so that’s clearly increasing – gamma’s a parameter, a shape parameter – x, of course, goes up very steeply, and so this gives you an increasing function. And then you simply put an expected value here. Okay? Now, in fact, you can also do one more thing. That’s convex, but then so is that. And I’ll just leave it this way. And we’ll see why in a minute. So if you do this, and let’s just look at the objective for this. So if you look at this function here, first of all, let me go over a couple of things.

I’ll just say this briefly because, in fact, it’s likely I’m going to make a terrible mistake and regret that I’ve gone off on this tangent. But anyway, let me try it. Everyone knows that log sum x is convex and non-decreasing. This is like a generalization of log sum x, because it’s log of expected value of x. So this is, in fact, a convex function here of x. Now the question is – oh, I’m sorry, you put a gamma in front. That’s just for the normalization. Minimize – actually, you don’t need the log, you don’t need the gamma, because that’s convex right there. This just makes it prettier. Now it turns out if you work out what this is, if you do an expansion on what this is, the first term is actually gonna be something like the expected value of f0. That’s as if you just minimized expected value of f0. The next term in the expansion of this – I guess this is the moment – not the moment – which one?

Student:[Inaudible].

Instructor (Stephen Boyd):Yeah, it’s the – there’s the moments and then there’s the central moments, or whatever, the centralized ones. This is whatever –

Student:Cumulate?

Instructor (Stephen Boyd):It is gonna be something – let’s call it the moment generating function, except it’s not the moment generating function. Maybe it is the cumulate generating function. Good. The first term in this is expected value of f0. The second term is the variance, and that scales with gamma. So when you see this, this is just a general measure for risk aversion. It’s a general construction for handling risk aversion. So let me explain that. Make a note. We’re gonna add this. By the way, although what I’m gonna say now is gonna be wrong, possibly wrong and all that, I don’t guarantee any of it, by time we stick it onto the slides here, it will be correct or closer, anyway. So let me give an example of this. If f0 is what you want to minimize, for example, it’s negative revenue in some finance application, so it’s a negative revenue, then you’d say, well, if you just do nothing here, if you just do the basic thing like this, you’re asking please maximize my revenue. This is just – if you just take the expected value it says maximize my expected revenue. But then the variance is irrelevant, so you might actually find some point which maximizes the expected revenue, but as huge variance. And you might care about that. And you might say, well, no but I want to trade off the expected revenue and the variance, as well. You would do something like that and then gamma is called the risk aversion parameter. And as you tune gamma, you would trade off expected value versus variance in the first order. So okay, that’s just – like I said, you can make a whole – we could easily make an entire lecture on it. You could make a course on it. And there’s multiple books on it. Okay. So let’s look at how this works with stochastic subgradient. Suppose you have a function that’s convex in x for each omega. And I have a subgradient that’s g of x in omega. That’s a selection of a particular subgradient, subgradient with respect to x of the function f. Little f is going to be the expected value of this thing.

You average out over the omegas. And that’s gonna be convex. Then a subgradient of f is nothing but this. A subgradient is the expected value of that gradient selection function over there. And now comes the interesting part. Suppose you generate M samples from the distribution and you evaluate capital F of x omega i. Now if you average capital F of x omega i, if you just average that, that’s an unbiased estimate of little f of x. And in fact, you’re gonna do this anyway, because you’re gonna want to estimate this number. So this is something you’re gonna have to do whether you like it or not, because when you say here’s my number, you wanna test what’s happening, you wanna test your yield or whatever your value is, you’re gonna actually do this anyway with capital F here. But it says that if at each point you’ve calculated subgradient and then averaged the subgradient, then it turns out this thing is actually a noisy unbiased subgradient. Okay. So by the way, this fits my meta-assertion that if you can evaluate a function, you can evaluate a subgradient. And by the way, this is an example where technically you cannot evaluate the function, because basically, you can’t evaluate this exactly except in a handful of silly cases.

I mean, they come up and they’re used, right? You know all the things you can – I mean, basically how many integrals can you calculate exactly? The answer is 15 or 17 or whatever the number is. But the point is, in general, you can’t do it and you’re gonna have just by Monte Carlo. So here’s a case where what you can do is by Monte Carlo you can compute this number approximately, and as accurately as you like, and at the same time, with very little effort, other than calling f dot subgrad here or something to get subgrad here, you can actually also calculate an unbiased subgradient. Okay. So let’s do an example and see how this works. We’re gonna minimize the expected value of a piecewise linear function. Now what’s interesting here is that the coefficients ai and bi are going to be random. So a piecewise linear function. I want to minimize it, but the expected value – but ai and bi are random and you’re gonna commit to an x. If you wanna visualize what goes on here. You commit to an x and then you’ll pull random samples of the ai and bi. You’ll evaluate this function and then you’ll take the expected value of that. And that’s gonna be your value. I mean, if the ai and bi are over the map, it’s not that interesting, this thing. If the variation in the ai and bi is in the eighth significant figures, it’s also probably not that interesting. It’s interesting somewhere where they’re in the 10 percent range, 5 percent range, 20 percent range, something like that. That’s where the value of being smart – there’s actually a positive value to being smart. If the variation is just all over the map, there’s nothing you can do except, I don’t know, take x equals zero or something like that. If the variation’s tiny, you just ignore the uncertainty altogether. Somewhere in the middle you can actually do a good job.

Okay. So what we’ll do is I’ll explain this. We’ll do a couple things. We’re going to compare this certainty equivalent. In certainty equivalent, what you do is you simply minimize one piecewise linear function. So that you do by linear programming. You just plug in the expected value of a, expected value of b. This will give you a lower bound on the optimal value here. And there’s a lot of uristics you could do. But here’s one. One is this. Let’s see – we’ll plug in the expected value of b and what’s interesting here is when you take the expected value, let’s ask about variation in a. Variation in a gets multiplied by x here. By the way, this one – oh no, this does matter, because the i’s have different bounds. Variation in a gets multiplied by x. So here you might, basically you want smaller x is gonna be less sensitive to variation in a. So you just add this. It’s a uristic. There’s lot of other uristics, by the way. Okay. So here’s a problem instance. And let’s see how these things look. So the ai, those are the row vectors, and then I guess they have that there – how wide are they? They’re 20 wide. I mean, each ai is – there are 20 variable.

I guess each entry has a standard deviation of square root of five. And b has also very substantial – sorry, let me see if I’ve got this right. That’s right. Then ai bar – so these are – actually, I think these are ai bar here, I’m guessing, although it’s probably about the same either way. Okay. So here is the certainty equivalent just ignores everything and you get, let’s see, you get – where am I? Here we go. This is the certainty equivalent problem. And then this shows you the distribution of values when you just do Monte Carlo here. So this is Monte Carlo here. And you can see that there’s all sorts of values here – these are very unfortunate things when you committed to an x and then the a and the b were wrong for it. This is the uristic. And you can see that this is the distribution here. And you can see that the uristic is already doing a lot better. I guess this is the mean of this. So it’s an estimate, well, modulo 100 Monte Carlo samples. That’s an estimate of the expected value of the objective value here. The uristic is done well, because it’s penalized large x and so you’ve moved away from that or whatever, and you get a tighter distribution here. And then this is f stochastic and this is computed using the stochastic subgradient, something like that. So this is presumed to be something close to the global optimum. And these dashed lines I am gonna guess the following: they’re a mystery. We can guess that that’s the optimal value. If that is an optimal value of this, then that is a lower bound on this problem. The stochastic subgradient gives you something there, and so if I’m on the stand and I’m being deposed or there’s lawyers present, I could only say something like this. This, again you have to believe in Monte Carlo, but modular Monte Carlo, that’s an upper bound on the optimal value and this one here is the lower bound over here. It’s a mystery as to what these are. I’ll have to find out what they are.

Okay. By the way, if you’re doing stochastic optimization, this is how you judge one x or another, is you – you’re gonna want to run histograms anyway, and so you’re gonna want to look at things like this. Okay. This is 1.34. There we go. All right. So let’s see, if you run the stochastic subgradient for a very long time, this is sorta what happens. You get, I guess the theory says it was gonna converge to the optimum always. It’s just a question of how long it’s gonna take. And you can see here in this case you’re taking 100 and you’re taking 1,000 Monte Carlo samples to produce the expected subgradient. Actually, in those cases, what’s kinda – you can almost argue in these cases that you’re actually computing the gradient of the function almost exactly. So these are the gradients. And in fact, this is what sorta the gradient stuff does. It goes like that. And it’s gonna keep going down. And I guess it’s estimated to be something like 1.34. And this is how close you get to the optimal value. So that’s that. By the way, this is not a – depends on here the effort in each of these is running the Monte Carlos. So in fact, these things, the green one here, for example, is 1,000 times more costly than the other one. So if you actually write these in terms of – if you actually count these things, then you’d find probably that either this one or this one is gonna be by far the best one here. In terms of if you rescale the axis not to be in iterations, but actually in total number of Monte Carlo simulations, then actually this one, somewhere in here is gonna end up being the best one. Almost certainly this one here, something like that. Okay. So that’s Monte Carlo estimation. Is that on your current homework? It’s on your next homework. Will be. Homework 4. Good. Okay. So we’ll look at another area where this stuff has been discovered and so on. It’s called online learning and adaptive signal processing. So that’s something. Some of you have probably seen this in some contexts. So let’s see what this is. This is the simplest possible problem. It goes like this. We have x and y where x is a vector and y is a scaler, and they come from some joint distribution. And the simplest possible problem is something like this, is to find a weight vector in our end for which w transpose x is a good estimate of y. So that’s what you want.

These come from a joint distribution, so you don’t know – well, we’ll talk about that in a minute. So one way to do this is you choose w, minimize the expected value of some convex loss function like this. Now, by the way, if the loss is a square, then this is a mean square error, and all you need to know are the second-order statistics of x and y to trivially write out what the estimator is. And it would be linear and it would be – sorry, actually in this case, over all estimators, in fact, the optimal would be linear or affine or something like that. Okay. But for example, if you take the loss to be absolute value, there’s going to be no simple analytic description for this. It’s gonna depend on the distribution. It’s certainly gonna depend on more than the second moments of the distribution. Okay. And the simplest context for this comes up in adaptor signal processing. What happens is you’re given a sample at each time step. That’s the right way to do it. So you get a vector x, you get a vector y. That’s all. And you want to come up with a vector w. That’s all. That’s what you want to do. That minimizes some loss function here. Okay. Now let’s see how this works. We’ll assume l is differentiable or whatever, if you take a subgradient otherwise. It doesn’t matter.

And they’re all based on the idea of localization and cutting-planes and actually, these are beautiful, the ideas are beautiful, they have lots of applications and distributed optimizations and distributed design and it’s good to know about these things. Okay. So what is the idea in a localization method? In a localization method what happens is – you know what these are? Let me explain it. These are basically generalizations of bisection in R. So bisection on R, how does that work? Well, it works like this. You have an interval and you know the solution is in this interval. So you maintain an l and a u. This is bisection. Then you do something like you check the middle, and then by analyzing something in the middle, like the derivative or something like that, what you know is now – after you know that, you say that actually the solution is either on the left or the right. This is the essential idea of bisection. And so at the next step you have a new interval, which is exactly half the size of the interval before. So that’s bisection. We all know about that. Okay. Now the cool thing about bisection is that your ignorance, as measured by the width of the interval, goes down exactly by a factor of two at each step.

Student: [Inaudible].

Instructor (Stephen Boyd):Okay, then what do you have to do? What computations do you have to do?

Student:[Inaudible] vector product and then add two vectors.

Instructor (Stephen Boyd):Okay. How do you say that in blast?

Student:Axby?

Instructor (Stephen Boyd):Yeah, axby. Daxby, I guess. It depends on what you want to do. So in other words, it’s about the most pathetically small amount of – it’s an order n update. n, by the way, is the number of variables. So if you merely – that means basically it’s the effort of just sorta writing to the variables and so on. So it’s nothing. Just accessing the variables, doing one or two flops per variable. That’s all. So a good approximation of the computational effort per step of the subgradient method is nothing. All the effort is in calculating the subgradient. And the other one will be free. Okay. So these are gonna require – we’re actually gonna have real calculations here. These can be much more efficient. In theory, for some of them, and in practice for some, and they don’t completely coincide. So we’ll see ones that are efficient in theory but not practice. We’ll see – no, that’s not true. All the ones that actually work well in practice actually, I’m pretty sure, are also efficient in theory. But the sets are not gonna be the same. Okay. So it’s all based on the following idea. You have an oracle, and it’s actually better – well, one way to do this is to abstract this away from subgradient. Subgradients will come in later. But you have an oracle, and your oracles gonna go like this: what we’re gonna do is we just want to find a point in the convex set, or determine that the set’s empty. And the only access to x is gonna be through a cutting-plane oracle. And what the cutting-plane oracle is, it’s gonna go like this: you’re gonna query or call the cutting-plane oracle at a point x. And what’s gonna come back is either a token or something that tells you you’re in the target set, in which case, by the way, you’re done. Or it’s gonna come back and give you a separating hyperplane between little x and big X. This is called, actually, cutting-plane or a cut, since it eliminates all half space of x’s from consideration. So let me show you what this looks like. Actually, let me just draw a picture, just to show you how this works or what it looks like. In fact, this one you have to draw pictures for everything. So the way it works is this. There’s some set here. We don’t know this set. This is known only – so we don’t have access to this set. This is capital X. And then what happens is all you can do is actually query an oracle at a point. So if you query an oracle at this point, the oracle simply returns a token that says you’re done, or it’s in it. However, if you query the oracle here, the oracle has to do the following, it has to return a cutting-plane, and a cutting-plane is anything that separates that point from this set.

That’s a cutting-plane. And the idea is basically when you know that – what you know is that this entire half space cannot contain any point in here. And therefore, there’s no reason to look in this half space anymore. When you get this half space the exact semantics is this. And by the way, it does not tell you that this set – this set could be empty, in which case, by the way, you’re allowed to return the semantics, the agreement there, just to conform to the semantics. If the set capital X is empty, you can return any cutting-plane you like, because it is true that there is no point in here that is also in your set, because actually there’s no point in your set. Okay. So that’s how this works. So this is the idea, and by the way, this is something like bisection. Because in bisection, what happens is you query at a point on the line. This is bisection. And in fact, what bisection tells you is it says either the solution is there or there. It actually returns – that’s a cutting-plane. And cutting-plans in R are a little bit boring. They’re basically sorta a point saying it’s either to the left or to the right or something like that of some specific point. So that’s a cutting-plane. And if you want to think in your head that this is something like you’re getting one bit of information, which is not quite right, you’re welcome to, because it kinda gives you the idea. Now this is an idea of a neutral – by the way, if this is the point you query at, you can actually return lots of cutting-planes. Obviously any plane, there’s tons of planes that fit in between this point and this set. Any one of them is an absolutely valid cutting-plane to be returned. I mean, you could return this, for example. That would be fine. You could return something like this. It just doesn’t matter. By the way, if this cutting-plane goes through the point, the query point, then it’s called a neutral cutting-plane. So it’s called a neutral cutting-plane, and that looks like this.

So here is your point x where you query, and what comes back is a neutral cutting-plane, so it basically says do not look to the right here. If this set even exists, if it’s even non-empty, it’s entirely contained on the left here. So a neutral cutting-plane goes through that point like this. A deep cut is one where you actually slice off – actually, not just x, but like a little ball around x is what it means. So it basically means not only is your point not optimal, but in fact, it makes a little ball around it and says it’s not optimal, too, because if this is deep, it says that there’s a little ball here that is not optimal – sorry, not in capital X. By the way, there’s also an idea of a shallow cut. A shallow cut is one where you get something over here that’s not really a cut. Thos come up, too, although we won’t look at them. Okay. So let’s do unconstrained minimization. So let’s do this. We want to minimize a convex function. We’ll let capital X be the set of optimal points, so any minimizer. And what we’re gonna do is this: we take the basic inequality for a subgradient, then that says, of course, that if you evaluate a subgradient at x, you get a cut, you get a cutting-plane. And that’s actually sorta the point, is that if you evaluate a subgradient, you immediately get a neutral cutting-plane at that point.

Basically the way that works is this, you have the sub-level sets. Let me draw a function here. So here’s some non-differentiable function. We query at this point here, and you get a subgradient that looks like that. There’s many subgradients, but they all give you a cutting-plane for the minimizer. Because when you get a subgradient, what you’re told is that for this entire half space defined by the subgradient passing through that point, this half space, there’s no reason to look anywhere here, because every point here has a function value larger than or equal to the function value there. There’s no reason ever to look there, because you’ll only find worse points than this one. So that’s a neutral cutting-plane. Actually, the fact is you can also get a – this is the picture which I just drew anyway – looks like this. Basically what you should think of is this: when you evaluate a subgradient, you are actually getting a cutting-plane with respect to what the solution is. Now you can actually get a deep cut, and deep cut goes like this. This is much more interesting. It goes like this. Suppose you know a number, f bar, that is bigger than f*, but lower than the current value? And one common choice of f bar here is actually the following: this f bar can be the best point found so far. That’s what we called before f best k. And there’s no reason – if the last point was your best point, then of course, these are equal and none of this holds. But if the current point is worse than the best point you found so far, it means, actually, it tells you that your current value is bigger than that, and then this says that if f(z) is bigger than f(x) plus this thing, you must have this. If this holds, then you have f(z) is bigger than f bar, which is bigger than f*, and that says that you’re not on the optimal point. So you have a deep cut, and the deep cut is this. So and there’s lots of other examples of this. If you knew f*, for example, that would be another example. You just put f* right here and you’d get a deep cut immediately. So these are – that’s a deep cut. Okay. In a feasibility problem, you’ll take x to be the set of feasible points, and if x is not feasible, so here’s how you generate a cutting-plane. Someone gives you x, you cycle through all the fi’s. If all of them are less than or equal to zero, you have to return a short token saying done, x isn’t in capital X. If not, it means that one of these fails. So in fact, you could even stop at the first one will generate a cutting-plane.

So the first time you get a violation, what you do is you evaluate a subgradient and you get a – here you can actually do a deep cut, and this gives you a deep cut. So that gives you a deep cut. Here’s the deep cut is right there. So that’s how this works. So the idea is if you want to solve a feasibility problem you cycle through the constraint functions. If you get to the end of the list, everybody’s satisfied, done. You’re feasible, it’s all over. Otherwise, you take any violated constraint. The first one you encounter, for example, call get subgradient on it and return a deep cut plane at that point. Okay? And if you have an inequality constraint problem – so actually what I’m going now is I’m explaining how to create a cutting-plane oracle, how to wrap an oracle method around various problems, like feasibility, unconstrained optimization, constrained optimization and so on. So that’s what we’re doing here. How do you do an inequality constraint problem? That’s easy. If x is not feasible, then you get a feasibility deep cut. And what you do is if someone gives you an x, you cycle through all the constraints. If you reach the end of the list, you set a flag saying that x is feasible, and you evaluate f0.

By the way, when you evaluate f0, you possibly will update the best value found so far. Best feasible value found so far. However, if you don’t make it through your list of constraints, you stop at the first violated constraint here, you get a subgradient for the violated constraint, and you return this deep cut here. And this deep cut is very simple. You could actually not only return the deep cut, but return an annotation or a comment to the caller, to the oracle caller, although it’s not required. The semantics doesn’t require you to give a comment back. You can give a comment, which is basically like here’s a deep cut, and if someone asks you why is that a deep cut, you can say, “Well, because anybody who doesn’t satisfy this inequality here, I guarantee is gonna violate the 13th constraint.” That’s what it means. You’re not required to do this, but you could do this. Okay. Now if all of these are correct, if x is feasible you just have a neutral objective cut, or a deep cut if you have an objective value like that. So now you know how to generate a cutting-plane oracle for pretty much any problem you would encounter, constraints, inequality constraints, all those sorts of things. And now I can give you a completely generic localization method. It goes like this – it’s conceptual because that means something in here is vague. I’ll get to that. It works like this: you start with an initial polyhedron known to contain x. It doesn’t have to be a polyhedron. It can be any convex set known to contain – it can be a ball or something like that – known to contain initial point. Sorry, to contain this, which is no guarantee that this is not answered. Okay. And it works like this: you choose a point inside this polyhedron. Now this part is vague and we’re gonna get to that in great detail.

So you choose a point in this polyhedron and you call the oracle at that point. If the oracle returns to token, the success token, done. Algorithm’s over. You’re done. If it doesn’t, it has to add a new – it has to return a cutting-plane, and you simply update your polyhedron. So the idea of the polyhedron is it summarizes what you know at each point. Or another way to say it if you want to think of it from the negative point of view is it summarizes your ignorance, is what the polyhedron is. Because basically, the true semantics is this: Pk at step k or whatever, if something like it’s the set – that’s all you can really say about where capital X is. All I can say is capital X is a subset of Pk, nothing else. Now if that’s empty you can certainly quit and then you can certainly say the problem is infeasible. And you said k equals k plus one. So here’s the picture. And the picture is draw here for a neutral cut. So here’s the way it works. Pk – and by the way this is nothing but a generalization of bisection. In bisection you have a polyhedron in R. Polyhedra in R – we can have a very short discussion of the theory of polyhedra in R – polyhedra in R are intervals. So it’s not that interesting. Okay. So here you have a polyhedron, you query at this point here, and a, in this case, a neutral cutting-plane comes back and it says that it points in this direction. And that says you have just ruled out all of this stuff here. This is impossible. And so that’s your old set of where capital X might be, if it exists. That’s our new set. Oh, and in fact, you could talk about all sorts of measures, but a very good measure would be – and absolutely defensible, I think, in information theory – would be something like this: the log of the volume of Pk is a very good measure. If it’s log based two, it’s a very good measure of sorta the actual number of bits of information you’ve gained here. And so, actually, I’m gonna ask you for this particular iteration that went from this to this, how many bits of information did we obtain?

Our ignorance went down by a certain number of bits. What’s the number? Twelve? 0.1? What is it? Just eyeball it.

Student:[Inaudible].

Instructor (Stephen Boyd):Yeah. If it was one bit, it means that volume went down by a factor of two. Or the area, it’s the area. Does the area look like it went down by a factor of two?

Student:[Inaudible].

Instructor (Stephen Boyd):More than two. Did it go down by a factor of four? No, probably not. So one and a half was a perfectly good estimate. So our ignorance just went down 1.5 bits, roughly, here. And actually, let me ask something for this problem. What’s really interesting about this now is when we queried at this point, we had no control over what cutting-plane was gonna come back. And so let’s actually draw some pictures about what could have – what’s the best thing that could have happened, what’s the – well, let’s think about that. I don’t have to have the same set. Oh, here it is. Okay. So there is it. And I query at this point. And let’s do – first, let’s do deep cuts. All right. So I queried the oracle here. And what would be the best thing that could possibly happen with a deep cut? My lucky day. I have tipped or bribed the oracle and what happens?

Student:[Inaudible].

Instructor (Stephen Boyd):Yeah. So the oracle comes back saying if there are any feasible points or any points in capital X, they’ve got to be – I guess it points that way. That’s our protocol. It’s that way. So basically it eliminates all of this and leaves only, quite conveniently, this one vertex. This could happen. At which point, by the way, because really literally, just that single point there, we can now say with certainty, we’re done. Well, I’m sorry, I have to call the oracle at that one point, and then have it return to me the token saying you’re in. Right? So that’s the – so this is the best thing that could happen. And in fact, let’s suppose it wasn’t quite that good, but it let a little tiny baby triangle here. But in this case, what’s our ignorance reduction ratio? Huge. Absolutely huge, right. I mean, it went down by – I guess if it went from the whole thing to a point, then our ignorance reduction ratio was infinite, because we went from something with some area to zero area. So arbitrarily good things can happen. Okay. Well, it’s silly to base an algorithm on that. Not only silly, but it won’t work. So let’s talk about what’s the worst thing that can possibly happen? Well, the worst thing that could happen is you’re gonna get a neutral cut. And let’s talk about the direction of the neutral cut. If the direction of the neutral cut in this case looked like that, how many bits did you get? Just eyeball it.

Student:Two.

Instructor (Stephen Boyd):Two, three, something like that. You got a couple of bits. You were lucky because you’re getting a full two bits. By the way, this is sorta like – you’re now beating bisection. In bisection, if you query at the middle and you go down by a factor of two, you get exactly one bit per iteration, in ignorance reduction you get one bit. Here you were lucky. But suppose, in fact, it came back like that. So your prior knowledge – although really probably the correct name for this thing is your prior ignorance – your prior ignorance is this polyhedron here. And your posterior is this thing with this part cut off. And so what’s your ignorance reduction now? Way under a bit, right. In fact, if you think that this was, I don’t know, one-eighth the size of that or, I don’t know, one-sixth or something, so it’s between two and three bits, then it basically says the area went down only by a factor of seven-eighths, so you did quite poorly.

Student:[Inaudible] if you cut doesn’t go through the polyhedron?

Instructor (Stephen Boyd):No, no. You have to – you’re always gonna gain something. That’s your guarantee. Well, no, no, hang on. Let’s look at some query points. Let’s now discuss query points. So let me ask you this, suppose I query here. What can’t happen if you query there? It can’t come back and say, oh, you got it. Otherwise, something is wrong. Because basically, according to all the contract, I’m supposed to be able to certify that any point that’s actually in capital X is in this polyhedron, and I called the oracle there and it said, yep, that’s it. So something is wrong. However, could something good happen calling here? Could I call the oracle there and have something good happen? Yeah, absolutely. What could come back could be this plane here, in which case I say, great, I win. I got it in one step. However, bad things could happen, and I’ll show you one of them. You could simply call at this point and get this thing back, and how much ignorance have you reduced now? None. So okay, so basically querying outside is, at least in the worst case, as dumb as you can get, because if you query outside the localization region, there’s a chance that your ignorance won’t go down at all. So how about querying here? And let’s do neutral cuts. If you query there – by the way, if you’re lucky, what’s gonna happen? Neutral cut through that point, what happens? It’s gonna go way down. What’s gonna happen is it’s your lucky day, you point in that direction, you have a huge reduction in ignorance. Okay. On the other hand, it could just as well have worked this way. And if it worked that way, your reduction in ignorance, though positive, is extremely small. So now you begin to think, if you wanted to do something that’s gonna work well, even if the worst case, you obviously want to query this somewhere near the center. I mean, that’s clear.