Convex Optimization I, Lecture 17

Instructor (Stephen Boyd): Well, today we'll finish up Newton's method; probably won't take up much of the day, though. But then we'll move on to our absolute last topic, which is interior-point methods for inequality constrained problems. So we're studying methods, Newton's method, for solving the following problem. You want to minimize f of x, which is smooth, subject to a x = b. So we're assuming here that a x = b is feasible; otherwise the whole problem is infeasible. And we assume we have a starting point x zero that satisfies a x zero = b. So we assume we have a feasible point.

And the first method we're gonna look at, which is the basic Newton's method, is a feasible method. That means that every iterate will be feasible; they'll all satisfy a x = b. The reason for that is simple. The Newton step at a point x is found by solving this set of equations here; we looked at what it means last time. Now, the second block equation says a delta x Newton = zero. So it basically says that this search direction is in the null space of a.

Now, what that means is very simple. It says that if you have a point that satisfies a x = b, and you add any multiple, that is, the step-length times an element of the null space of a, obviously you continue to satisfy a x = b. So feasibility is guaranteed; all iterates are going to be feasible no matter the step-lengths, and so on. This is the basic Newton method. Okay. So Newton's method looks like this, with equality constraints. I guess you'll be implementing one; some of you may already have. In fact, we know some of you already have because you've written with questions. But all of you will shortly be writing up a Newton's method with equality constraints.

So it works like this. You're given a point in the domain of f, of course, with a x = b, and you compute the Newton step by solving this set of equations. This set of equations has lots of names. It's not quite the KKT system; I guess this is called the linearized KKT system, and that's called the KKT matrix because that's where it comes up. A lot of people would just call it the Newton system. So you solve it; you compute the Newton step. If the Newton decrement is small, you quit. Otherwise, you do a line search, a standard backtracking line search.

So you try t = 1; if that gives you enough decrease, fine. Otherwise, you choose beta times t, then beta squared times t, and so on; since t started at one, that's beta, beta squared, and so on. And then you finally update. Now, this method has the following attributes. The first is, of course, that the point x is always feasible, because every search direction is in the null space of a. And if the original x satisfies a x = b, that's preserved because of this. So it's a feasible method. And of course it's a descent method, so f goes down with each step; that's ensured by the line search.
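The method just described can be sketched in a few lines of NumPy. This is only an illustrative sketch under the lecture's assumptions (f smooth and convex, x0 feasible); the function names, tolerances, and line-search constants are my own choices, not from the course materials. Note that b never appears in the function: since x0 is feasible and every step lies in the null space of a, feasibility is preserved automatically.

```python
import numpy as np

def newton_eq(f, grad, hess, A, x0, tol=1e-10, alpha=0.1, beta=0.5, max_iter=50):
    """Feasible-start Newton for: minimize f(x) subject to A x = b.

    Assumes A x0 = b; then every iterate stays feasible, since each
    search direction dx satisfies A dx = 0 (second block of the KKT system).
    f should return +inf outside dom f so the backtracking rejects such points.
    """
    x = np.asarray(x0, dtype=float).copy()
    m, n = A.shape
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        # Linearized KKT ("Newton") system: [H A^T; A 0] [dx; w] = [-g; 0]
        K = np.block([[H, A.T], [A, np.zeros((m, m))]])
        dx = np.linalg.solve(K, np.concatenate([-g, np.zeros(m)]))[:n]
        lam2 = -g @ dx                      # Newton decrement squared
        if lam2 / 2.0 <= tol:               # stopping criterion
            break
        t = 1.0                             # backtracking line search on f
        while f(x + t * dx) > f(x) - alpha * t * lam2:
            t *= beta
        x = x + t * dx
    return x
```

On the homework-style analytic centering problem (minimize minus sum log x subject to a x = b), starting from a feasible point, this converges in a handful of steps.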

Okay. Now, this method is affine invariant; that's easy to show. If you change coordinates, you generate exactly the same iterates, so you get a commutative diagram there. It has all the features you'd expect of Newton's method, in particular affine invariance. What's interesting is that this is, in fact, identical to applying unconstrained Newton's method after you use linear elimination of variables; it's absolutely identical. That means we don't need a new convergence theory or anything like that. You don't have to see a new proof because you've already seen it.

It also means that all the practical stuff you've seen about Newton's method, like the fast quadratic convergence, all that is absolutely preserved here. Okay. So you might ask: if it's the same, why go to all the trouble of formulating it this way instead of by elimination of variables? It's a good question, and I'll answer it right now. A lot of people would think eliminating variables is better, and there are actually cases where it is better. And I'll tell you, right now, when and why it is not.

If this matrix here has structure, then eliminating variables can destroy the structure. So actually, if there's one thing you should take away, a couple of things; we'll make a little wallet card of the things you should remember when you leave this class. But one of them should be: it is not always better to solve a smaller set of equations. In fact, in many, many cases it's far better to solve a much larger set of equations, provided (a) those equations are structured and (b) you intelligently exploit the structure.

Okay. I mean, these are kind of obvious, but you'd be shocked how many people don't know these things, or know them but don't apply them to their everyday life, or something like that. Okay. So, all right. This is Newton's method, and you'll be implementing one that works, and so on and so forth. It has all of the attributes you've seen; the self-concordance analysis is all identical, so you get an absolute complexity bound on the number of steps required, and so on.

There's a variation on Newton's method that's very interesting. It won't play a big role for us, but it's actually important to point out. So here it is. It's called the infeasible start Newton method, and it's actually very, very interesting how it works. You want to minimize f of x subject to a x = b, something like that. In the feasible Newton method, the original x is feasible, and then every iterate after that is feasible. So you will always satisfy a x = b.

There's another version of this, which works like this. You're actually allowed, for intermediate steps, to not be feasible. So I can start with any x I like. So here's an example, one you're solving, or will solve shortly, on your homework. You minimize minus sum log x i, I think; yeah, something like that, subject to a x = b. Okay?

So here, in an infeasible method, you're allowed the following. By the way, even in an infeasible method, you cannot be outside the domain of f. So you must have x positive at each step, as with a feasible method, obviously. However, you are absolutely allowed to have a x = b violated. So in other words, you can start with x zero equal to all ones. Okay? Then you're nicely in the domain here, but, for example, unless you're very lucky, you don't satisfy a x = b.

Now, if you have a method like that, there's a couple of things that get trickier. The convergence gets a little bit trickier, not much. But the first thing is, you cannot have a descent method. In other words, in this problem, you cannot expect the objective to go down, because the objective is actually meaningless until you're feasible. What should go down is some combination of the objective, or some measure of suboptimality, and the residual. By the way, once x is feasible, then it makes sense to talk about the objective.

But until then, it hardly matters. And indeed, it's clearly not a descent method, because if you start with an x for which f is very small, smaller than the optimal value, f is gonna have to increase as the iterations go. Okay? So watching this algorithm, you're gonna want to look at two things. You're gonna want to look at a x minus b; that's the so-called primal residual. And the other thing you look at is the dual residual, which comes from the optimality conditions. So here, the way it works is this.

In this case, it's no different: you simply linearize the optimality conditions and you get a set of equations that look like this. By the way, if you're feasible, this is identical to the Newton direction at a feasible point, because in that case the primal residual is zero. Okay? So whatever this is, this Newton step at an infeasible point, it is a generalization of the Newton step when you're feasible. However, it allows a non-zero primal residual here, and so we can actually work out a couple of interesting things.

We can see the following: it is always the case that x plus delta x Newton is feasible, and you can see that directly from this bottom row. This bottom row says a delta x = minus the quantity a x minus b. So in other words, if I multiply out a times x plus delta x Newton, I get a x plus a delta x Newton; and from this bottom equation, a delta x Newton is minus a x plus b, and I'm done: a times x plus delta x Newton is b. So if you take a full Newton step, you will be feasible. Okay? By the way, in this method, if you ever take a full Newton step, the following iterate will be feasible. And once it's feasible, the Newton step you calculate with the infeasible start method is identical to the Newton step for the feasible Newton method.

Okay? So, all right, that's actually a good thing to know. I'll say a bunch more about that. So in other words, what happens is, and we haven't even gotten to the infeasible-start Newton method yet, if you ever take a step of size 1, a full undamped step people call that, then the next iterate will be feasible, and all future iterates after that will be feasible. So you'll achieve feasibility. Okay. And you can interpret this many, many ways.

You could say, simply, that you're linearizing the nonlinear optimality conditions; maybe that's the simplest way to look at it. Some people would also call this a primal-dual Newton method, because you're updating both the primal variable and this dual variable at the same time. Of course, that's true of Newton's method as well, but still, that's the idea. So this infeasible start Newton method looks like this. You start with a point; it has to be in the domain of the function, but it does not have to satisfy a x = b.

And you start with some initial Lagrange multiplier; actually, it turns out the initial Lagrange multiplier is totally irrelevant, but that doesn't matter. You compute the primal and dual Newton steps; you compute these. Then, here's the important part. You cannot do a line search on the function value; that makes absolutely no sense. In fact, in the infeasible start Newton method, the objective can, and will, and in some cases indeed must, increase. And the reason is very simple. Let's just do a really, really simple example.

Here. Here's your equality constraint; that's the feasible set. And here are the sub-level sets of your objective. So here, I'll just draw the optimal point; there's the optimal point. Okay? What if you start right there? Well, f is now at its unconstrained minimum. So f has to go up, going from here to here. So obviously, a descent method is not gonna do the trick at all. Right. By the way, in a problem like this, if the function were quadratic, what would the Newton step be here? The infeasible start Newton step, what would it be from here?

Student:[Inaudible]

Instructor (Stephen Boyd): What would it be? The step you'd take would be 1, and the Newton step, which is actually the increment, would point exactly to the optimal point, because what you're doing is making a quadratic approximation; if the function's quadratic, it's not an approximation. Then you're solving the quadratic problem via the linear equations, and you will get this. You will take a step of one, and you'll converge to the global solution in one step. If it's non-quadratic, it might not point there; it might point, you know, over here. By the way, the infeasible start Newton step will always point to a point on this line, because if you make a single step of size 1, you are feasible, and therefore you're on this line.

So it might even point, I mean, I don't know, seems unlikely, but it could point over here. I could make some sick function that would do that. But if you made a full step, you are then feasible, and all subsequent steps will stay on this line. If you make a half-step, you might be here, and so on. So that's the picture. The line search is actually done on the norm of the residual, and let me tell you what that is. The residual is r dual and r primal, stacked. It's this norm. Okay?

So this is, I guess people would call it, a merit function, or a Lyapunov function for this process. That's a function which proves convergence of something; in this case, it proves that r dual and r primal go to zero. It certifies r p and r d going to zero because it's a function that is non-negative, zero only when the residual is zero, and goes down at each step. So this is what goes down. It's not very different. And then r primal and r dual, those are just these. No, I'm missing, that's not quite it. Sorry. You have to have plus, there it is. Sorry, it's right there.

That's r dual and that's r primal. Okay? So when you run one of these, you'll want to look at r primal and r dual. Here's what you want to see. You want to see r primal go to zero and stay zero. Actually, if it goes to zero and later becomes non-zero, you have a bug. And then r dual will, very rapidly, go to zero; in fact, quadratically in the end-game. So you have the norm. By the way, it's the norm, not the norm squared, here.

So just to make a point here. If you work out d by d t of the norm of the residual as a function of the step-length, at t = zero, in the Newton direction, you actually get minus the norm of the residual. There you go, it's that. This is worked out in the book, but you can work it out; it's this thing. What this says is that if you take a step of t, the predicted decrease simply scales the residual down by a factor one minus t; that's your predicted decrease. You have to degrade that, because this is a lower bound by convexity, so you scale it back by alpha, and alpha can be any of the constants you like: point 1, point 0 1. Actually, it appears to make no difference what value of alpha you use, over a huge range.

So this, the norm of the residual, is what goes down; that's your line search. By the way, sometimes once you get feasible, you can switch your line search back to f, because then the method coincides precisely with the original Newton method, and so on.
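Putting the pieces together, here is a rough NumPy sketch of the infeasible-start method with the residual-norm backtracking just described. Function names and parameters are my own, and the domain check (x > 0) is specialized to the minus-sum-log example from the homework, so treat this as a sketch under those assumptions rather than a general implementation.

```python
import numpy as np

def newton_infeasible(grad, hess, A, b, x0, nu0,
                      tol=1e-10, alpha=0.1, beta=0.5, max_iter=100):
    """Infeasible-start Newton: x0 must lie in dom f but need not satisfy A x = b.

    The line search backtracks on ||r||, r = (r_dual, r_primal); the predicted
    value after a step of length t is (1 - t)||r||, degraded by alpha.
    Domain check (x > 0) is hard-coded for f(x) = -sum(log(x)).
    """
    x, nu = np.asarray(x0, float).copy(), np.asarray(nu0, float).copy()
    m, n = A.shape

    def residual(x, nu):
        r_dual = grad(x) + A.T @ nu      # gradient of the Lagrangian
        r_primal = A @ x - b             # equality-constraint violation
        return np.concatenate([r_dual, r_primal])

    for _ in range(max_iter):
        r = residual(x, nu)
        if np.linalg.norm(r) <= tol:
            break
        K = np.block([[hess(x), A.T], [A, np.zeros((m, m))]])
        step = np.linalg.solve(K, -r)
        dx, dnu = step[:n], step[n:]
        t = 1.0
        while (np.any(x + t * dx <= 0) or
               np.linalg.norm(residual(x + t * dx, nu + t * dnu))
                   > (1 - alpha * t) * np.linalg.norm(r)):
            t *= beta
        x, nu = x + t * dx, nu + t * dnu
    return x, nu
```

Consistent with the lecture: once a full step t = 1 is taken, r primal becomes zero and stays zero, and from then on the iterates coincide with the feasible method's.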

Okay. This is what I was trying to recover earlier: that's the descent condition; the directional derivative of this at t = zero plus is this thing. Okay. Actually, you don't have to worry about all these things; this is just to get a rough idea of how the method works.

This is actually a very, very good method choice in some cases: if you have a point that's not quite feasible, for example, if you're doing a so-called warm start. Warm start means you've solved a problem just like the last one, except now it changes a little bit. It doesn't really matter what you're doing, but you just solved a problem just like the last one, and the data changes a little: one more piece of information comes in, in an estimation problem; one more data point becomes available; one more time-step has evolved, something like that. Then you can use this with warm start, and in fact, if you're lucky, just one or two Newton steps would get you right back where you want, incorporating the new data. So if you want to do online estimation, updating as new information becomes available, for example in maximum likelihood, this would work very, very well.

Okay. Let's talk about solving KKT systems. This is very important because, well, I guess a lot of people don't seem to know it; people who do these things know it, but they don't get around much or something like that. So to solve a KKT system, for the infeasible start Newton method or for the standard Newton method with equality constraints, you have to solve a system of equations that looks like that. And that's called a KKT system just because this KKT matrix comes up. By the way, it doesn't only come up in optimization; it comes up in mechanics; it comes up in tons of places.
I've seen it; I guess I opened some book on simulating flows in porous media and, like on page two, I saw this. They didn't call it A; they didn't call it H; but it was, basically, this system. So this system of equations is probably worth knowing something about.

So here's how you could solve it. The first thing is, just treat the whole thing as dense; it's m plus n square, right? Because that's m by n and that's m high. So in the worst case, you just commit to this and solve it using absolutely standard methods. In other words, in MATLAB that would be something like: H and a transpose on top, a and zeros of the right size below, backslash minus g stacked on minus h. Okay? That's your default basic method. It will simply treat this matrix as dense. It is true there's a block of zeros here, and this is fine; it'll do whatever pivoting it needs to be stable, and it may even do the right thing.

One thing it cannot possibly do is a Cholesky factorization. This matrix is symmetric, but it is for sure not positive definite, because of the zeros on the diagonal here. And in fact, your homework, which, I forgot to announce, we've backed off and said you don't have to do, is it 10.1b, I think? Yeah. Some people have already done it, and we heard from them last night. Anyway. Were you one of them?

Student:[Inaudible]

Instructor (Stephen Boyd): No, okay. So, all right. It's certainly not positive definite, so you can't use a Cholesky factorization, because this has negative eigenvalues. Okay, so probably the simplest correct method, if you're just gonna solve the equations, is an LDL transpose factorization. That's a generic factorization for a matrix which is symmetric but not positive definite. If it were positive definite, of course, you'd use a Cholesky factorization; here you'd use an LDL transpose. Now, in fact, there's a whole sub-field now on KKT solvers. You can go to Google and type "KKT solver," and you'll get maybe 30 papers on it, and codes if you want.
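To make the indefiniteness point concrete, here is a small NumPy check with random data (the sizes and names are arbitrary choices of mine): the generic dense solve, the analogue of the MATLAB backslash expression above, works fine, but Cholesky necessarily fails, since with H positive definite and A full rank the KKT matrix has exactly n positive and m negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
G = rng.standard_normal((n, n))
H = G @ G.T + n * np.eye(n)            # positive definite Hessian block
A = rng.standard_normal((m, n))        # full-rank equality constraint matrix
g = rng.standard_normal(n)
h = rng.standard_normal(m)

# The KKT matrix: symmetric, but indefinite (zero block on the diagonal)
K = np.block([[H, A.T], [A, np.zeros((m, m))]])

# Default dense solve -- the "K \ [-g; -h]" analogue
sol = np.linalg.solve(K, -np.concatenate([g, h]))
v, w = sol[:n], sol[n:]

# Cholesky must fail: K has negative eigenvalues
try:
    np.linalg.cholesky(K)
    cholesky_succeeded = True
except np.linalg.LinAlgError:
    cholesky_succeeded = False
```

A real sparse code would use a symmetric-indefinite LDL transpose factorization here instead of a dense solve.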

So this comes up enough that it's studied carefully. But if you wanted to use a generic method, you'd use something like an LDL transpose factorization. By the way, this might not be a bad idea if this is sparse like crazy here. So if your a is huge and sparse, if h is sparse; and what does h sparse mean about f? Actually, it's not at all uncommon for h to be diagonal in these systems. In fact, in your homework, h will be diagonal. Okay? So then this thing has just got tons of sparsity; the structure is just screaming at you. That little block of zeros, ignoring that is maybe okay.

But a giant diagonal matrix just sitting there, if you ignore that structure, that's kind of irresponsible. That's over the line, I think. Anyway, an LDL transpose factorization will probably work quite reliably here and would take you up to some very, very large systems. You can also do it by elimination, and the elimination is interesting. It works like this: you eliminate the first block and you get this. So the first thing you have to solve is this equation. One of these, you'll see, is the Schur complement; it was going to come up somewhere. That's the Schur complement.

It's zero minus this times the inverse of that times that; so that is the Schur complement. Actually, it's the negative Schur complement, because there's a minus sign there. Here's what's interesting about this method: it's interesting in cases where h is easily inverted. If h is dense, and there's 1,000 variables and 100 equality constraints, this is actually not at all a good method to use; you might just use LDL transpose, which would be roughly equivalent. However, if h is, for example, diagonal, meaning that f is separable, then h inverse is a non-issue. Same if h is banded.

If h is sparse enough, h inverse is something very, very easily computed. Actually, I should say that correctly: that's what people would say, but they don't mean it. You don't actually compute h inverse, obviously. What you'd do is compute a Cholesky factor of h, a sparse Cholesky factor hopefully, and you would back-substitute against the columns of a transpose. But you know that, so I won't belabor it. So people would still say "form this"; you do form this matrix here. What you don't form, for sure, is h inverse.

Okay. I mean, if h is diagonal, then you would form h inverse; but if h is banded, you would not, because h inverse is full. Okay. So you'd form this. The interesting thing is that when you solve the reduced system, you get a negative definite system of equations. I flipped the sign on it, because if it's negative definite, flipping the sign makes it positive definite. And this is the Schur complement; assuming h is invertible, this can be solved by Cholesky. So in fact, if you solve this by the elimination method, you will have two calls to a Cholesky factorizer.

The first one is on h itself. By the way, that might not even be a real call; if h is diagonal, there's no call because diagonal is trivial. But in other cases, you'd do a Cholesky on h. Then you'd form this matrix, that's the Schur complement, and you'd solve this by another Cholesky factorization. So that's another way; people would call that elimination plus Cholesky. You may be making up to two calls to a Cholesky factorization.
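The two-Cholesky elimination just described might look like this in NumPy. A sketch only: it assumes H is positive definite and A has full row rank, and for clarity it uses dense triangular solves where a real code would exploit sparsity (and, as the lecture stresses, would never form H inverse explicitly).

```python
import numpy as np

def kkt_solve_elimination(H, A, g, h):
    """Solve [H A^T; A 0][v; w] = [-g; -h] by block elimination.

    Call 1: Cholesky of H (trivial if H is diagonal).
    Call 2: Cholesky of the (sign-flipped) Schur complement S = A H^-1 A^T,
    which is positive definite when H is and A has full row rank.
    """
    L = np.linalg.cholesky(H)
    # Apply H^{-1} via two triangular solves, never an explicit inverse
    Hi = lambda z: np.linalg.solve(L.T, np.linalg.solve(L, z))
    S = A @ Hi(A.T)                     # m x m Schur complement
    Ls = np.linalg.cholesky(S)
    # Eliminating v from the first block equation gives: S w = h - A H^{-1} g
    w = np.linalg.solve(Ls.T, np.linalg.solve(Ls, h - A @ Hi(g)))
    v = -Hi(g + A.T @ w)                # back-substitute for the primal block
    return v, w
```

When H is diagonal the first "Cholesky" is elementwise square roots, and the whole cost is dominated by forming and factoring the small m by m Schur complement.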

Student:[Inaudible]

Instructor (Stephen Boyd): One. Right. So the real quadratic phase is characterized by unit steps and quadratic convergence. That would be out here. What would you expect the step-lengths to be here?

Student:[Inaudible]

Instructor (Stephen Boyd): What? Well, they certainly could be less than 1, but they can also be 1. The answer is, you don't know. But you wouldn't be surprised if you saw lots of steps out here where the step was less than 1. Okay? So this is just four different starting points here, four different trajectories. Okay, that's the first method; this is the method you'll implement. Now, you can also solve the dual here, which is a totally different size. The dual has fewer variables, right? Well, not far fewer; it's got m, and I'm assuming here that m is less than n, but it's got fewer variables.

So it's got m variables. Okay. Oh, by the way, here's an extreme case; well, let's go all the way to the caricature. Ready? Here's the extreme case: find the analytic center with one equality constraint. So that's just a transpose x = b, x positive. Okay? Here, I guess there's a couple of things you could say. Let's see how to solve it. The dual has how many variables?

Student:[Inaudible]

Instructor (Stephen Boyd):One. How do you solve a one-dimensional convex optimization problem?

Student:[Inaudible]

Instructor (Stephen Boyd):How?

Student:[Inaudible]

Instructor (Stephen Boyd): No, you could do a line, I mean, you could do it by bisection, if you like. You could do anything you like at that point. Okay. And you'd think, "Oh, wait a minute. I've got a 1,000,000-variable problem with one equality constraint." So this might be an image; that's an equality constraint. By the way, it's a stupid problem, but anyway, you have an image with an equality constraint and you want to calculate the analytic center. And you could wrap some big story around it, like, "It's the maximum entropy image," or something, who knows, satisfying one equality constraint.

Now, you could do it by bisection here; your dual has one variable. And you'd say, "That's ridiculous. How could the primal, which has 1,000,000 variables, be as easy to solve as the dual?" It seems suspicious, doesn't it? So we'll get to it; actually, it's pretty straightforward how that works out. The KKT system, the Newton system, would then look like this, where that's diagonal. Right? So it would look exactly like that. Indeed, it is 1,000,000 by 1,000,000. I would not recommend storing it as a full matrix, just so you know.

But if you stored this exactly in this form, you should look at that and say, "Well, that might be 1,000,000 equations with 1,000,000 variables." How fast can you solve this, roughly?

Student:[Inaudible]

Instructor (Stephen Boyd): It's linear in n, okay? So actually, what are we talking about, for 1,000,000?

Student:[Inaudible]

Instructor (Stephen Boyd): Milliseconds, at most. Okay? So anyway, these things are very important to know and to recognize. I guess there are a lot of people who know you could solve that fast, although a lot of them haven't totally internalized it. But what they also haven't done is propagate this knowledge into the rest of their life. So they're doing signal processing or statistics, or whatever, and they haven't put it all together. They don't know what a statistics problem looks like that happens to be one you can solve unbelievably efficiently, like that.
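For the caricature above (diagonal Hessian block, a single equality constraint), the block elimination collapses to a scalar Schur complement, which is why the million-by-million Newton system costs only order-n flops. A sketch, with hypothetical names, assuming the diagonal entries are positive:

```python
import numpy as np

def kkt_diag_one_eq(d, a, g, h):
    """Solve [diag(d) a; a^T 0][dx; w] = [-g; -h] in O(n) flops.

    Eliminating dx gives the scalar Schur complement s = a^T diag(d)^-1 a:
        s * w = h - a^T diag(d)^-1 g,   dx = -(g + w * a) / d.
    """
    a_over_d = a / d                 # diag(d)^-1 a, elementwise
    s = a @ a_over_d                 # scalar Schur complement
    w = (h - a_over_d @ g) / s       # dual (Lagrange multiplier) step
    dx = -(g + w * a) / d            # primal step, elementwise
    return dx, w
```

Every operation is an elementwise vector operation or an inner product, so for n equal to a million this is a few million flops: well under a millisecond of arithmetic on current hardware.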

So, okay. All right, let's look over here. So this is Newton applied to the dual. By the way, this is interesting: a totally different initialization requirement. Here, you need to find a vector nu for which a transpose nu is a positive vector, because that's how you need to initialize this problem; beyond that, it's completely unconstrained. So you need this totally different initialization requirement. And this one takes, you know, nine steps to converge, and again you see this thing roll over. That gives you that good Newton feeling.

Okay, now if you apply the infeasible start Newton method to this problem, I initialize it with, let's say, x equal to all ones. Okay? You're very happily in the domain here, but you don't satisfy a x = b. And here are the four runs. Oh, what I haven't marked here is where primal feasibility is achieved. So let's make some guesses. If I told you that primal feasibility was first achieved at this point on this run here, what would you tell me?

Student:[Inaudible]

Student:[Inaudible] apart from history of solving [inaudible] problem, you don’t have any constant. Is it a benefit because it’s easier to find the starting point?

Instructor (Stephen Boyd):No, no, no. Not at all, I mean, you could use the infeasible start Newton method. [Crosstalk]

Instructor (Stephen Boyd):You don’t have a feasible point.

Student:Because you could have a feasible start point then you’d have more [inaudible] conditions.

Instructor (Stephen Boyd): No, I know what you're saying. I also had this urge and, in the past, did not know these things. You would like to say one of these methods is better, right? Well, we've just discussed the computational complexity issue. If you know what you're doing, and that's a big "if," it's identical. So the cost per Newton step is the same in all cases; not because you're solving the same set of equations, you're not, but because you're solving a set of equations with exactly the same structure. That's why it's comparable. Okay?

Then you could ask, "Does one take more Newton steps than another?" Well, not really; within a factor of 2, they're all the same. Okay? In this example, solving Newton on the dual looks better; I can make another example where it's worse. Okay? So then everything comes down to initialization, and there I can think of practical situations where each of the three methods is the best one, or where there's just no difference between them.

So I hate to say it, but I think, at the moment, they all have equal status. The main thing is the initialization; which one is better depends on the situation. If you have an obvious dual feasible point, fine. If there's an obvious primal feasible starting point, fine. If you have a guess of a point which is nearly feasible, but not quite, you might want to use the infeasible start Newton method, and so on. So, okay.

So now we'll look at some other examples. They're fun, and they all have the same theme: you look at a Newton method, and we'll talk about forming it, but what you're really doing is looking for structure to exploit in the solve. So let's look at network flow optimization. This is a single commodity flow problem; it's a beautiful problem, and you should all know about it. So you have a network, a directed graph like this; the details don't really matter. And you have flow on the edges, and these arrows actually only tell you the orientation of the flow.

So it's like circuit analysis: you make an arrow here, and that tells you that if current flows that way, you regard it as positive; if current flows this way, you regard it as negative. Now, it could be that, in fact, things can only flow one way; there are many cases where that's the case. But generally, when I mark these, this is what you have. So this could be a single commodity flow network. Then you have external inputs to the network.

Oh, and you have one very important requirement: in circuit theory you call it KCL, Kirchhoff's Current Law, and it basically says that at each node, flow is conserved. The sum of the flows at any node is zero. Okay? Some people would say the sum of the in-flow is equal to the sum of the out-flow; it's the same thing. And you get this in lots of things. You get it in physics; you get it all over the place. It's mass conservation, for example; it just says the flow is conserved at nodes. And if it's a differential equation, I guess you'd say the flow is divergence-free, or something like that.

Okay. So you could think of this any way you like. These could be flows of packets on a network, and the fact that it's conserved says there's either no buffer or the buffer's bounded and not growing, and you're in steady state. Or it could be actual current; these could be amperes, and it says you're in steady state, something like that. Okay, so that's flow conservation. You have some external sources and sinks: generally, if you're pumping stuff in, you call it a source; if you're pulling stuff off, it's called a sink.
Actually, I’m just drawing these as if all the variables were positive here, but of course this could be a source if this flow is negative. Now, in this case, you have to have the sum of the sources equal to zero. That’s very easy to work out; otherwise it doesn’t make any sense. So that’s a network. Now, there’ll be lots of feasible flows if you fix the external flows. So if this is a distribution network or a current network, or something like that – we’ll make it a product distribution network – and this one is pumping in something, this is pumping in something, this is pumping in something, then what you’re pulling off here at the sink has to equal the sum of these. But there are lots of flows in here that satisfy the flow conservation equations. So you can actually route the flow any way you’d like. Routing the flow basically means: what comes in here, some you ship there, some you ship there, some you ship there, and so on. Okay? So that’s the picture. Anyway, this is just written as A x = b. Here b is the source vector at each node and A is the incidence matrix. That’s the matrix – you’ve seen it, or should have – that looks like this, where the rows are nodes and the columns are edges, and you have a plus 1 or a minus 1 to indicate the from and to nodes of each edge. So that’s just A x = b. Now, among all the possible feasible flows – by the way, there are all sorts of interesting things you can say. Elements of the null space of A, for example, are called circulations, because if I add something in the null space here – so a circulation might look like that – notice that it does not affect the external flows. It adds to this flow, adds to this one, subtracts from this one and subtracts from that one. It subtracts because my circulation is going against those arrows. So that’s an element of the null space. You might ask why on earth you’d want to have anything to do with one of these. Well, we’ll see why.
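As a quick numeric aside: the incidence-matrix setup and the circulation claim are easy to check. A minimal sketch, where the graph, edge orientations, and sign convention are my own made-up example rather than the one drawn in lecture:

```python
# Build the node-arc incidence matrix A of a small directed graph and check
# that a circulation around a cycle lies in null(A), so A(x + t*c) = A x
# for any step t -- adding circulations never changes the external flows.
import numpy as np

# 4 nodes, 5 directed edges given as (from, to) pairs (a made-up graph)
edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
n_nodes = 4
A = np.zeros((n_nodes, len(edges)))
for j, (u, v) in enumerate(edges):
    A[u, j] = 1.0   # edge j leaves node u
    A[v, j] = -1.0  # edge j enters node v

# Circulation: push one unit around the cycle 0 -> 1 -> 2 -> back to 0.
# Edge 3 is oriented 0 -> 2, so traversing 2 -> 0 uses it with sign -1.
c = np.array([1.0, 1.0, 0.0, -1.0, 0.0])
print(np.allclose(A @ c, 0))  # True: circulations are in null(A)
```

Any feasible flow plus any combination of cycle circulations is again feasible, which is exactly the routing freedom described above.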
What you want to do is this: you have a cost on every edge, and you want to satisfy the flow equations while minimizing the total cost. So by the way, let me just quickly ask something. Right here, there’s something that should pop out if you’re fully trained and all the neural pathways are hooked up correctly. You should be able to look at that and, without even thinking, 50 milliseconds later, a picture of the KKT structure should show up in your head. And what is it?

Student:[Inaudible]

Instructor (Stephen Boyd):Okay. When you see that sum of scalar functions, what is the hessian?

Student:[Inaudible]

Instructor (Stephen Boyd):Thank you. So you should just look at that and see this in your head, without anybody saying anything else. That’s what you should see when you see this. Okay? I mean, it should be just an autonomous response. Okay. All right. So anyway, we’ll go through this; this is the reduced incidence matrix. And let’s look at the KKT system for this. The KKT system looks like that. So H is diagonal. That’s the important part. Of course, you should always look for this when you see an equality constrained minimization problem, but in this case it’s very important: the hessian is diagonal.

You can solve this via elimination. You form A H inverse A transpose, and you solve A H inverse A transpose times w = this right-hand side. Now, of course, when you actually go to write the code, you have to get all the right-hand sides correct, and all that stuff, or it’s not going to work at all. But in your first pass over a problem, you want to think about the structure. And in this problem, the inverse shouldn’t bother you at all, because H is a diagonal matrix. Normally an inverse would raise a red flag – you need to look at it and ask: is that hard or easy to do? In this case, it just goes away. Yes, of course, you have to carry it out correctly in your code. But it’s a non-issue.
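The elimination step above can be sketched in a few lines. This is my own paraphrase of the procedure, with made-up random data for the check; it assumes the KKT system H dx + A^T w = -g, A dx = 0 with H diagonal and positive:

```python
# Solve the Newton KKT system
#   [H  A^T] [dx]   [-g]
#   [A   0 ] [w ] = [ 0]
# by block elimination when H is diagonal: forming A H^{-1} A^T is cheap,
# since inverting a diagonal is an elementwise division.
import numpy as np

def newton_step_diag(h, A, g):
    """h: diagonal of the hessian (positive), A: m x n, g: gradient."""
    Hinv = 1.0 / h                        # diagonal inverse, elementwise
    S = (A * Hinv) @ A.T                  # A H^{-1} A^T -- the "A D A^T" system
    w = np.linalg.solve(S, -(A * Hinv) @ g)
    dx = -Hinv * (g + A.T @ w)            # back-substitute: dx = -H^{-1}(g + A^T w)
    return dx, w

# Quick check on random data: the step should satisfy both KKT equations.
rng = np.random.default_rng(0)
n, m = 6, 2
h = rng.uniform(1, 2, n)
A = rng.standard_normal((m, n))
g = rng.standard_normal(n)
dx, w = newton_step_diag(h, A, g)
print(np.allclose(h * dx + A.T @ w, -g), np.allclose(A @ dx, 0))
```

The only linear solve is with the m-by-m matrix A H inverse A transpose, which is exactly the structure the rest of the lecture keeps returning to.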

All right. So basically, what it says is, if you want to solve a network flow problem, each step of Newton’s method is going to require you to solve a system of the form A diagonal A transpose. By the way, that’s a theme; you might have noticed it. You will hear about A diagonal A transpose actually just everywhere. You might ask, “What does SPICE do?” You know what SPICE is, right? The circuit simulator program? You might ask, “What does SPICE do?” Would you like to know what it does if you profile it? It’s solving systems that look like A D A transpose. That’s the only thing it does.

Everything. And I’m talking about all the calculations it does. Everything, you know, [inaudible] – actually, how many people know what I’m talking about? Anyway, or know about SPICE? Okay. So you get the same thing if you do a thermal simulation: you’re solving A D A transpose, with a finite element method. If you solve a Poisson equation, what do you think you’re doing? You’re solving systems of the form A D A transpose. So there’s no reason – I mean, you might as well accept the fact that it’s everywhere. You have seen it; you will see it; you will use it. You’ll use it many times when you don’t even know you’re using it.

Okay. So I’ll have to think about where you’re using it in machine learning, or something like that, but I’m sure you are somewhere. I just don’t know where exactly right now, but I’ll think of it and tell you. I’m sure there’s somewhere where you do this, some Bayes network calculation or something. But anyway. Okay. So you have to solve that. By the way, the sparsity pattern of A diagonal A transpose is extremely easy to work out when A is a graph incidence matrix. It’s very, very easy. It turns out that the i, j entry of A diagonal A transpose – that’s a nodes-by-nodes matrix that you’re solving –

– is nonzero, for i not equal to j, if and only if nodes i and j are connected by an arc. So that even tells you about the sparsity pattern you’re going to see. In fact, I can ask you some questions now and we’ll see how well you’ve internalized this – I just want you to guess, and you tell me. Suppose I have a network and I want to do network flow optimization on it. The fan-out or, sorry, the degree of each node on the network is small, like 3 or 5, but it’s got 10,000 nodes. Do you think we can do optimal flow fast on it? Yeah, probably, because if you form A D A transpose, the degrees tell you the number of nonzeros per row in this thing, right? If each node is only connected to, like, three others, that’s the number of nonzeros per row or column. Okay? So how fast can you do it? I don’t know; it depends on the gods of heuristic sparsity-ordering methods and whether they’re smiling on you that day or not. But the point is, it’s likely you can solve that problem very, very fast. Now, let me ask you this. Let’s have a single node that’s connected to a huge number of others. What does that do to the sparsity pattern of A H inverse A transpose? Everybody see what I’m saying? So let me draw my network. I have a network here which is kind of sparse; everybody is connected to just a few others. Okay? But I’ve got some super node here, and it’s actually connected to a whole bunch of them. What does it do?
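The sparsity claim is easy to verify on a toy graph. A small sketch, with a made-up 4-cycle and arbitrary edge weights of my own choosing:

```python
# Check that the sparsity pattern of A D A^T matches the graph:
# (A D A^T)_{ij} is nonzero, for i != j, exactly when nodes i and j share
# an edge -- so node degrees control how dense each row and column is.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
n_nodes = 4
A = np.zeros((n_nodes, len(edges)))
for j, (u, v) in enumerate(edges):
    A[u, j], A[v, j] = 1.0, -1.0

D = np.diag([1.0, 2.0, 3.0, 4.0])          # arbitrary positive edge weights
S = A @ D @ A.T
pattern = (np.abs(S) > 1e-12).astype(int)
print(pattern)
# Off-diagonal nonzeros sit exactly where the cycle's edges are;
# the (0, 2) and (1, 3) entries are zero because those nodes share no edge.
```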

Student:[Inaudible] it to the bottom, won’t it just preserve your arrow structure?

Instructor (Stephen Boyd):Okay, as you permute it. Okay, very good. So your proposal is that this node should be ordered last. Okay? And what is the sparsity pattern of A H inverse A transpose if you order it last? In this case, what’s the sparsity pattern of this thing? It’s arrow – yeah, well, it’s like a sparse arrow or something like that. You get a sprinkling of entries from these guys, but this one guy attached to all of them is going to give you a dense row and column in the matrix. Okay? Now, by the way, you could order it to be at the end yourself, but in fact, any decent ordering heuristic is going to put that guy at the end anyway.

So you’ll know it’s going to work. In fact, you’ll know that if you had a network with 10,000 nodes, and the degree of most nodes is 3, but 100 of them are connected to giant piles of nodes – we’ll call those “super nodes” or whatever; it doesn’t matter what you call them – because you’re maybe optimizing the flow on a peer-to-peer network or something like that, you would know what the structure is. The structure will be sparse, with 100 much denser rows and columns, and you’ll know that it can actually be solved very, very well. Okay. So.
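Here is a tiny fill-in experiment of my own (not from the lecture) illustrating why ordering the dense “super node” last matters. It uses a dense Cholesky factorization purely to count where nonzeros appear; a real sparse solver would exploit the ordering directly:

```python
# An "arrow" matrix with one dense row/column. Factoring with the dense
# node first fills in the whole Cholesky factor; ordering it last keeps
# the factor sparse -- which is why ordering heuristics push such nodes
# to the end.
import numpy as np

n = 8
M = np.eye(n) * n          # diagonally dominant, hence positive definite
M[0, :] = M[:, 0] = 1.0    # node 0 is the dense "super node"
M[0, 0] = n

def chol_nnz(X):
    """Count nonzeros in the Cholesky factor of X."""
    L = np.linalg.cholesky(X)
    return int(np.sum(np.abs(L) > 1e-12))

P = np.arange(n)[::-1]     # permutation moving the dense node to the end
print("dense node first:", chol_nnz(M))
print("dense node last: ", chol_nnz(M[np.ix_(P, P)]))
```

With the dense node first, the factor is completely filled in; reordered last, the factor keeps the arrow shape, so the nonzero count stays near linear in n rather than quadratic.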

I didn’t expect everyone to follow all of this, but this is – actually, this is the kind of thinking you want, I think you want to do for this stuff. You want to just be on your toes, connecting sparsity and structure to the actual problem at hand. And the problem at hand, it’ll be things like, you know, bottlenecks – you know, bottlenecks, super nodes, that’s for this one. But for every problem, there’s gonna be something like that.

Okay, we’ll look at one more. It’s an even more advanced one, but it comes up in a lot of things. You want to calculate the analytic center of a linear matrix inequality. Let me just first motivate it a little bit. This comes up all the time, but I can tell you what this is. It’s the following problem. I have a covariance matrix, and I tell you some things about – well, let’s see. I have a random variable z, let’s say, and let’s make it Gaussian; let’s just say it’s Gaussian. Okay? So z is N(0, X) – I’m going to call the covariance matrix X. Okay?

And here’s what I’m going to tell you: I’m going to tell you the variances of some linear functionals of z. So for example, I’ve got a vector c i, and I’ll tell you the variance of c i transpose z. Now this is nothing but c i transpose X c i. Everyone agree with that? It’s nothing but that. Therefore, that’s the same as the trace of X times c i c i transpose. So the variance of this linear functional of z is a linear function of X, the covariance matrix. Everybody agree? Okay.
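The identity above can be confirmed numerically in a couple of lines; the covariance matrix and vector here are random data of my own, just for the check:

```python
# Numeric check of the identity: for z ~ N(0, X),
#   var(c^T z) = c^T X c = trace(X c c^T),
# i.e. the variance of a linear functional of z is linear in X.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
X = B @ B.T + np.eye(3)          # a random positive definite covariance
c = rng.standard_normal(3)

v1 = c @ X @ c
v2 = np.trace(X @ np.outer(c, c))
print(np.allclose(v1, v2))  # True
```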

Student:[Inaudible]

Instructor (Stephen Boyd):Well, the gradient. What’s the gradient? Like, minus 1. What’s the hessian there? Oh, by the way, please don’t calculate anything, just use your eyeball. What’s the hessian?

Student:[Inaudible]

Instructor (Stephen Boyd):What is it? Is it zero? I don’t think so. The hessian of the absolute value function itself is zero away from the kink. The hessian of this one is not zero, but you’re very close. What is the hessian?

Student:[Inaudible]

Instructor (Stephen Boyd):Almost zero. Thank you. It’s almost zero. What’s the Newton step at that point, then? It’s some insanely huge, long step, right? Okay. But more than that, how well is Newton’s method going to work on a sanded-down absolute value function? Well, what do you have to ask yourself? You have to ask yourself this question: if you form a quadratic model of this function at a point, how good an approximation of the function is it? Well, what do you think? Do you think Newton’s method is going to work on this? Well, of course, our theorists tell us yes it will. And indeed there’s a proof; of course it will. But how well do you think it’s going to work in practice, Newton’s method on a sanded-down absolute value function? You get two guesses and then we’ll quit. The first is: super well. And the second is the opposite: it doesn’t work, like, at all in practice. So which do you think it is?

Student:[Inaudible]

Instructor (Stephen Boyd):The second one, you’re right. Okay.

Student:Why?

Instructor (Stephen Boyd):Because it’s obvious – and then you say, “Well, how does that square with the theory?” Right down near the kink, the hessian – and the third derivative – goes insane. The hessian goes from super low, way up to super high, and then back down low again. You have to get down to the scale of 10 to the minus 5 to actually see this curvature. So the point is, you’re right: Newton’s method on a sanded-down thing isn’t going to work at all. I’m just saying that when people do this business, where they go, “Oh, it’s non-differentiable. Not a problem, give me my sandpaper,” you shouldn’t – generally speaking, you shouldn’t believe stuff like that. This one’s going to work out, but this one does not. Yeah?
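To make this concrete, here is a sketch with one common smoothing of the absolute value, phi(x) = sqrt(x^2 + eps^2). The choice of smoothing and of eps = 1e-5 are my own assumptions, matching the 10^-5 scale mentioned above, not necessarily the exact function from the slides:

```python
# Why Newton's method struggles on a "sanded-down" absolute value:
# away from the kink the hessian is nearly zero, so the pure Newton
# step is enormous and the quadratic model is a terrible fit until
# you are within about eps of the kink.
import numpy as np

eps = 1e-5
phi  = lambda x: np.sqrt(x**2 + eps**2)   # smoothed |x|
grad = lambda x: x / phi(x)
hess = lambda x: eps**2 / phi(x)**3

x = 1.0
print(hess(x))               # ~1e-10: almost zero curvature at x = 1
print(-grad(x) / hess(x))    # pure Newton step: on the order of -1e10
```

Damped Newton with a line search still converges in theory, but each backtracking search has to shrink these absurd steps, and progress is very slow until the iterates reach the eps-scale region where the curvature lives.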

Student:You were saying before that if we’re solving a problem like that, we have to do something to make it differentiable before we even start this process.

Instructor (Stephen Boyd):Right.

Student:So if that’s not how you do it, then what would you do to make it differentiable?

Instructor (Stephen Boyd):No, that is what you would do, and that’s the correct way to treat L1 constraints in most cases. You do exactly what I said: you add new variables and new constraints, and then use these types of methods. So okay, we’ll quit here.

[End of Audio]

Duration: 79 minutes