Convex Optimization II - Lecture 12

Instructor (Stephen Boyd):The first is your revised proposals are due tomorrow. And we would actually like those stapled, or maybe paper-clipped, to the original one, so we can see what the original one looked like and so on. As I've said a couple of times, you're more than welcome to grab any of us, me or the TAs, between now and then to just glance over what you have, or something like that. The way you know it's right is that you can explain it to somebody in way under one minute. So if you can explain it in under a minute to another person not working with you on the project, then that's a sign that you can articulate it completely clearly and all that sort of stuff. Okay. So those are due tomorrow.

Let's see, a couple of other things. Oh, you should watch for email, and maybe the web site, because we may tape next Tuesday's lecture ahead of time, tomorrow some time. You'll hear about that. Okay. I can't think of any other administrative things. I guess we posted a new homework, but I guess most of you know that. Any questions about last time? Because if not, what we're gonna do is finish up this topic of heuristics based on convex optimization for solving non-convex problems. So that's the first thing  we're just gonna finish that up, and then we'll move on to another little coherent section of the course, which is basically how to solve absolutely huge problems. So for some of you  I mean if you've already chosen a field, and it's too late, this will be relevant. If you've made the mistake of choosing a field where you have to solve huge problems, then these next two or three lectures will be for you. For those of you who haven't chosen a field yet, this will be encouragement to not choose a field where vast and huge problems are required. So that will be my recommendation. Okay. So let's do difference of convex programming. Let's finish up this topic. I think we looked at this last time. This is a method where you write the objective and inequality constraint functions  you can do it with equality constraint functions, too, but I'm not gonna go into it  as differences of convex functions. All right, so a convex plus a concave function. And then the obvious convexification is this: you linearize the concave functions. So only the concave portion  that's minus a convex one  gets linearized.
And this is, roughly speaking, based on the idea  I mean you can make this precise, but it doesn't need to be  that the best convex approximation of a concave function is an affine one. Okay, so you do this, and the result is now convex, because the convex part you left exactly as it was, and the only thing you've done is linearize the concave portion  that's this thing. By the way, the way to do this is actually pretty straightforward. Take a quadratic: you can work out how that goes. If you have a quadratic that's neither convex nor concave, it has split eigenvalues  some positive, some negative. You split it into the positive part, which you keep  that's gonna be this thing  and the negative part, which you linearize. And linearizing the negative part is just dropping its quadratic term. That's only for the concave part. And actually, you'll find it's quite interesting. If you have a function written as this difference of convex and concave, it'll actually have some negative curvature, and basically what you're doing is just flattening the negative curvature, while retaining, exactly, all the positive curvature portions. Okay. So an example we'll look at is from the book. It's estimation of a covariance matrix from samples. So you form the sample mean and the sample covariance Y, and the negative log likelihood function is log det sigma plus trace of sigma inverse Y.
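The quadratic split just described can be sketched in a few lines. This is an illustrative example with made-up data, not code from the lecture: an indefinite quadratic x'Px is split by eigendecomposition into a convex part (kept) and a concave part (linearized at the current iterate), and the result is a convex upper bound that is tight at the linearization point.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
P = (M + M.T) / 2                 # symmetric, generally indefinite

# split P into positive and negative eigenvalue parts: P = P_pos + P_neg
w, V = np.linalg.eigh(P)
P_pos = V @ np.diag(np.maximum(w, 0)) @ V.T   # convex part, kept exactly
P_neg = V @ np.diag(np.minimum(w, 0)) @ V.T   # concave part, to be linearized

xk = rng.standard_normal(n)       # current iterate (arbitrary here)

def f(x):
    return x @ P @ x

def f_hat(x):
    # convex part kept; concave part replaced by its linearization at xk
    return x @ P_pos @ x + xk @ P_neg @ xk + 2 * (P_neg @ xk) @ (x - xk)

# the convexified function is a global upper bound, tight at xk
assert np.isclose(f_hat(xk), f(xk))
x_test = rng.standard_normal(n)
assert f_hat(x_test) >= f(x_test) - 1e-9
```

Because P_neg is negative semidefinite, its linearization sits above it everywhere, which is exactly why each convexified subproblem upper-bounds the original.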

Now, I mentioned this last time: the general approach now is not to estimate sigma, but sigma inverse  that's a one-to-one transformation, so you're perfectly welcome to do that. In which case this function is convex in sigma inverse, because the trace term is linear, and the other term is convex because it's log det of sigma inverse, inverse  which is convex in sigma inverse. So that's the standard method. Now, that works provided the constraints you want to impose on sigma are convex in the inverse of the covariance matrix. Actually, there are a lot of interesting constraints that are convex in the inverse of a covariance matrix  many. But some are not, and since we just want to illustrate the convex-concave procedure, we picked one that's not. So suppose I told you that all the variables are non-negatively correlated. There are no negative correlations. That's this. And this is not preserved under inversion: the set of matrices whose inverses are non-negative is not convex. Okay. So that leads us to this problem here, and we have a convex part and we have a concave part. And we're gonna linearize the concave part  that's this. I mean this term doesn't matter because it's a constant. And we're gonna preserve the convex part. And the result  I think we looked at this last time; I just wanted to make sure I went over it  is this. And my guess is that this is actually globally optimal. And the reason is  well, there's even an exercise in the book that goes over this. It turns out that in some cases this problem is actually convex in sigma, depending on the final estimated sigma versus the sample covariance matrix. So that's this. It's just to show you how it works. Okay. We'll look at alternating convex optimization. That's our last topic.
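The split in this covariance problem can be checked numerically. In the following sketch (random illustrative data, not the book's example), the negative log-likelihood is the concave term log det S plus the convex term tr(S⁻¹Y); replacing log det S by its linearization at the current iterate Sk gives a convex surrogate that upper-bounds the true objective and agrees with it at Sk.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
G = rng.standard_normal((n, 2 * n))
Y = G @ G.T / (2 * n)                  # sample covariance (illustrative data)
H = rng.standard_normal((n, n))
Sk = H @ H.T + np.eye(n)               # current iterate, positive definite

def nll(S):
    # negative log-likelihood: concave log det term + convex trace term
    return np.linalg.slogdet(S)[1] + np.trace(np.linalg.solve(S, Y))

def surrogate(S):
    # keep tr(S^{-1} Y); replace log det S by its linearization at Sk
    # (gradient of log det S is S^{-1})
    lin = np.linalg.slogdet(Sk)[1] + np.trace(np.linalg.solve(Sk, S - Sk))
    return lin + np.trace(np.linalg.solve(S, Y))

H2 = rng.standard_normal((n, n))
S_test = H2 @ H2.T + 0.5 * np.eye(n)   # another positive definite point

assert np.isclose(surrogate(Sk), nll(Sk))        # tight at the current point
assert surrogate(S_test) >= nll(S_test) - 1e-9   # global upper bound
```

Each iteration of the procedure would minimize this surrogate subject to the non-negativity constraint on sigma, which is a convex problem.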
And we'll look at a couple of examples. So this is another idea that comes up lots of times. It's discovered periodically, and will be discovered periodically in the future by many people  maybe one of you will discover this again sometime in ten years, or even less. That's fine. There's nothing wrong with discovering it. The only thing wrong is imagining that, in fact, you're the first to discover it. That's the only actual problem here  and there are people who do that. So, okay, this works like this. You partition the variables  well, these aren't actually partitions; I have index subsets here. And you assume that for each of these index subsets, the problem is convex in that subset of the variables.

So that's the picture. Okay? And in fact, we can even, if you like, look at an example just for fun, because that's the only way this really makes any sense. I wonder if you can raise this screen up to about here. That'd be great. Okay. That's good, right there. So let's take a look at that, and actually, let's look at something extremely simple. Things don't come simpler than that. And let's imagine the variables are actually, for some reason, A, X, and B. So that's three groups of variables. And now, I want to know: is that convex in A? What I mean by that, of course, is that X and B are fixed. Is it convex in A? Of course it is. It's the norm of an affine function. So it's convex in A. Is it convex in X? Obviously. Is it convex in B? Of course it is. Okay, now, let's do pairs. Is it convex in X and B?

Student:Yes.

Instructor (Stephen Boyd):Yeah. Is it convex in A and B? Obviously, right? And the final one: is it convex in A and X? And the answer's no. Okay? So this would be a case where there are all sorts of index sets. So the index sets might be something like this. And you might as well take maximal ones, right? So AB  I guess that's about it, right? And you could have XB. Maybe these are the maximal sets  the maximal index sets over which the problem is convex. Okay. Now, the method goes like this. I mean it's really kind of dumb. For each of these index sets, it's a convex optimization problem. So you solve that, and then you kind of go back and forth, and so on, and hope that this works. So actually, let's even look at an example. Suppose I ask you to solve this problem. So here would be the problem  I should remember those sets, actually: AB and, I think, XB. So in this case, we're asked to minimize this  note my period  okay, subject to X in some set script X, A in some script A, and B in some script B, okay? These are all convex sets here. So it's obviously a non-convex problem. Why? Because of the AX there. Actually, lots of problems come up this way, incidentally. In fact, there's even a project that is literally, directly this: blind deconvolution. I think  is that yours? Yeah. It's yours. So there's plenty of things that come up looking like this. And so this method would alternate. It would say first, fix X to the current value, and optimize over A and B. That's a convex optimization problem. That would update A and B. Then it says fix A, and optimize over X and B. So notice that B is being optimized at every step. This is not a big deal, but it's just to illustrate how this works. So that's the picture. Let's see. What can you say about this?
Nothing. I mean that holds for this entire lecture.

But, I don't know. You can say trivial things, but after a while you realize: why bother? Look, obviously when you do this, every time you do it, the objective goes down. Amazing, isn't it? It's a number that goes down; therefore it converges  if it's bounded below. And then, so what? Okay, that's better than nothing, but it's so obvious that I don't even see the point of stating it. Okay. So a special case is when you partition the problem into two sets of variables, and you just alternate. But there are lots of variations on this. That's the nice part, because it's an algorithm that often works, but need not. It means, actually  and this goes for this whole lecture on non-convex optimization  that there's lots of room for personal expression, which is not the case, basically, for convex optimization. You know, a method that does badly on a convex problem is just called a non-method, right? And if any two methods disagree on the answer, one of them is simply not an acceptable method. Well, for non-convex optimization, who knows? It's very nice. Maybe your method works better on one problem instance, mine on another. So you get room for personal expression. I can mention some of the variations you can do here. One is to not optimize all the way, but to take a fractional step in between. It doesn't sound like it makes sense, but it can actually make these methods work better, and so on. But just remember: this is alchemy. Just go ahead and try it and see what happens on your problem.
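The descent property just mentioned is easy to see on a toy, unconstrained variant of the ||AX − B|| example above (dropping the set constraints, and treating B as fixed data C): each half-step is an exact least-squares minimization over one block, so the objective can never increase. All data here is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 8, 6, 2
C = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank-k target

A = rng.standard_normal((m, k))
X = rng.standard_normal((k, n))

def obj(A, X):
    return np.linalg.norm(A @ X - C, 'fro')

vals = [obj(A, X)]
for _ in range(20):
    # fix A, minimize over X: a least-squares (convex) problem
    X = np.linalg.lstsq(A, C, rcond=None)[0]
    # fix X, minimize over A: also least squares, on the transposed system
    A = np.linalg.lstsq(X.T, C.T, rcond=None)[0].T
    vals.append(obj(A, X))

# each exact block minimization can only decrease the objective
assert all(b <= a + 1e-9 for a, b in zip(vals, vals[1:]))
assert vals[-1] < vals[0]
```

As the lecture says, that monotone decrease is about all you can guarantee; where the iterates end up depends on the starting point.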
So in other words, you would optimize, for example, over A and B, but not step A and B all the way there  step them some fraction of the way, say halfway. Okay. So this is alternating convex optimization. Here's a recent and interesting application of this: non-negative matrix factorization. If you don't know about this, you probably should  although the people popularizing it are making way too big a deal out of it. But that's okay; they're welcome to do that, because it is really cool. Here it is. You might call it a sign-constrained singular value decomposition. So basically, I give you a matrix, and I want to approximate it as a low-rank matrix, as an outer product, something like this. And you should imagine that K is small here. So you might want to approximate the matrix A as a product of, I guess in this case, an M-by-five and a five-by-N matrix. So you'd have, sort of, five factors or something.

Now, without the non-negativity constraint, this is completely solved. It's just the SVD: you take the SVD, truncate the SVD expansion, and you get the answer. Oh, by the way, it's not even unique here, and that's completely obvious, because I can put a suitable K-by-K matrix and its inverse inside here, and I get a different factorization that yields exactly the same objective value. So it's not unique. But that doesn't really matter. Let's see, a couple of cases we can solve. One is K equals one. You can actually approximate a matrix  in Frobenius norm; well, any norm, it doesn't matter, Frobenius or spectral  with an outer product that is restricted to be non-negative. And that goes in two steps. You first take the positive part of A. That's the first thing you do. And the next thing you do is you take the singular value decomposition of that. Now, it turns out there's something called Perron-Frobenius theory that says that the singular vectors of a non-negative matrix  the first ones, associated with sigma one, which is what would come up in the rank-one approximation  can be chosen to be non-negative. Okay? So that's one case where you can do it.
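The two-step recipe for the K = 1 case can be sketched directly; this is an illustrative check on random data, relying on the Perron-Frobenius fact that the leading singular vectors of a non-negative matrix can be taken non-negative.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 5))        # arbitrary matrix with mixed signs

Ap = np.maximum(A, 0)                  # step 1: take the positive part
U, s, Vt = np.linalg.svd(Ap)           # step 2: SVD of the positive part

# leading singular vectors of a non-negative matrix can be chosen >= 0
# (abs() just fixes an arbitrary global sign flip from the SVD routine)
x = np.abs(U[:, 0]) * np.sqrt(s[0])
y = np.abs(Vt[0]) * np.sqrt(s[0])

assert (x >= 0).all() and (y >= 0).all()
# x y' is the best rank-one approximation of Ap in Frobenius norm,
# so the residual is sqrt of the sum of the remaining squared singular values
err = np.linalg.norm(Ap - np.outer(x, y), 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[1:] ** 2)))
```

So the non-negative rank-one problem really is solved exactly, unlike the general K case that follows.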

Okay, now in the general case  I mean, come on. You look at this and you realize, "Wow. It's certainly convex in X if Y is fixed. And it's convex in Y if X is fixed." And so you can simply alternate over these. And you can say a few things. In one of these variables it's actually separable  I forget which one; I don't think it matters much. Maybe in the columns of Y. Actually, the separability is not that exciting unless you're gonna run on many machines, unless you're gonna do MPI or something like that. Because, as we've discussed several times, if a problem is separable, it just means the computation time grows linearly with the problem size. But that's the case even if there's light coupling, by these sparse methods. So, okay, this is non-negative matrix factorization. And here is an example with a 50-by-50 matrix. And we want to approximate it as a rank-five matrix. And here it is starting from, say, five starting points. And you can see the objective going down like this, and it will converge. It need not converge to the same point  in fact, it's unlikely to; it might, in this case, but there's no guarantee and all that sort of stuff. So all of these non-convex methods  they're extremely useful. If you're doing the Netflix challenge or something like that, and this method works, I don't think you're gonna sit around and lose sleep over the fact that you're using a method that you can't prove always gets the global minimum. So you have to remember there's lots of cases where this stuff is fine. It works perfectly well. If you don't get the optimal trajectory for your robot, as long as you reliably get a really good one, that's fine.
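A small NMF sketch on random data (not the lecture's 50-by-50 example): rather than solving each non-negative least-squares subproblem exactly, this uses the well-known Lee-Seung multiplicative updates as a cheap stand-in for the alternating step; they preserve non-negativity and, like exact alternation, drive the objective down.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 20, 15, 3
A = rng.random((m, n)) + 0.1       # positive data matrix (illustrative)

X = rng.random((m, k)) + 0.1       # random positive starting point
Y = rng.random((k, n)) + 0.1
eps = 1e-12                        # guards against division by zero

def obj(X, Y):
    return np.linalg.norm(A - X @ Y, 'fro')

vals = [obj(X, Y)]
for _ in range(200):
    # multiplicative updates: a stand-in for the exact alternating
    # non-negative least-squares solves; each factor stays >= 0
    Y *= (X.T @ A) / (X.T @ X @ Y + eps)
    X *= (A @ Y.T) / (X @ Y @ Y.T + eps)
    vals.append(obj(X, Y))

assert vals[-1] < vals[0]                    # objective decreased
assert (X >= 0).all() and (Y >= 0).all()     # factors stay non-negative
```

As with all the heuristics in this lecture, different starting points can land on different factorizations with different final objective values.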
There are other cases where you can't be as casual about not finding the global solution. But this one  yeah?

Student:In this problem, did you create A to be the product of two positive matrices? [Crosstalk]

Instructor (Stephen Boyd):No, we didnt. I think it's just random. All the source codes are online. You can look at it.

Student:If you did this, would you get the answer, usually? Or 

Instructor (Stephen Boyd):I think you said it just right. Yeah. So the obvious thing to do here is to generate A as XY and see if you get back X and Y. By the way, you won't get exactly X and Y, right? Because  what do you call it?  it's not unique. But you should get something with the same ranges, or whatever. They should be unique modulo that transformation. So, in that sense, you should get the right X and Y back. And I think what happens in that case is you often do get it back. So I think you said it exactly right. You usually get back what you were looking for. But it's a heuristic: you don't have to get it back. For one thing, it depends on where you start from. But these things work pretty well, often. Okay. So that finishes this up. I should say that this was sort of out of order for the class  it was a whole lecture on non-convex methods stuck in the middle of the class. It logically goes at the end, but I thought we'd switch it around for the benefit of those doing projects that involve non-convex problems. Yeah?

Student:[Inaudible] SVD solves like a non-convex problem.

Instructor (Stephen Boyd):It does. That's right.

Student:So are there extensions of SVD that are guaranteed to find the [inaudible] 

Instructor (Stephen Boyd):Yes.

Student: somewhat similar non-convex problems.

Instructor (Stephen Boyd):Good. That's a great question. So a good question is, "What are the problems that aren't convex that you can actually solve?" And by the way, now that that lecture's over  we've just passed the right bracket, and that scope has ended  "solve" once again means solve globally. It doesn't mean this kind of maybe, possibly get a local solution sometimes, which is what it meant in the last lecture. We're back to the global definition because that block just finished. And now let's answer your question. So clearly there are non-convex problems we can solve. One is things like: what's the best rank-K approximation of this matrix? We do that by truncated SVD, and that problem is plenty non-convex. So the question is, what are the others? And there's a handful of others. We can solve any problem that involves two quadratics, convex or not  that's actually not obvious. And then there are some weird things, like some things involving quartics, but it depends on the dimensions, and these are just accidents of algebraic geometry. We're not gonna talk about it. But the point is  my opinion is there's just a handful of problems that are non-convex that you can actually solve, and I mean that in the strong form: solve as in get the solution. Right? So if that's what you mean by solve, there's just a handful. SVD is one, and you're not gonna find many others. But 

Instructor (Stephen Boyd):Yeah.

Student:So if you have a convex, concave problem 

Instructor (Stephen Boyd):Um hm.

Student:Why can't you maybe, like, approximate one of them with one quadratic, approximate the other one with another quadratic, and then inherently use that to solve 

Student:CS 205.

Instructor (Stephen Boyd):CS 205. Oh, okay. Great. That was 

Student:Mathematical methods for Vision, Robotics, and [inaudible].

Instructor (Stephen Boyd):Oh, okay. Good. Okay, good. Great. So a bunch of you have seen it. You've probably seen more of it than I have. So for you, you can just sit back and relax or something. For the rest of you, it's great stuff to see. So first, let me start with the really big picture about solving linear equations. Now, I'm sure all of you know the following  I've said it enough times, but it's important enough that I'll say it again. If you profile most convex optimization methods and ask, while they're running, what are they doing? They're solving linear equations. That's it. There's a bunch of other stuff  the little line searches, evaluating logs  but it doesn't even show up in profiling. Zero. What really matters is computing the steps, and that's solving linear equations. By the way, that's true for a lot of problems: it's just solving linear equations. So let's talk, at the very big picture level, about how you solve AX equals B. It's kind of a crude classification  it's actually wrong, because the boundaries between the classes are gray  but it's helpful to classify methods into three huge classes. And here they are. One is dense direct. These are factor-solve methods. We covered this in 364A; you've probably seen it other places. The software involved here is almost universally BLAS and LAPACK. If you use Matlab, that's what you're using. Actually, if you use anything reasonable, that's what you're using. So these are factor-solve methods: you factorize A into a product of matrices, each of which is easy to solve with, and then you do back and forward substitutions or something like that. Now, the high-level attributes of these methods  you could take an entire class on just this, right? No problem. It's like CS 237 or something like that.
But  and now we're going through all the horrible details of how do you do the factorizations? What are the round off error implications and all that kind of stuff? But at the highest level, here's what you need to know about these methods. The run time actually depends only on size. That's not completely true, except for positive definite systems, in which case it is true. So for a indefinite systems, you do dynamic pivoting, which means that you actually  there actually are conditionals in the code. You look and you look for a new thing, and you reallocate  and unless you've allocated memory correctly ahead of time and stuff  you know, if you  well, no, I guess in this case that wouldn't happen. Sorry, scratch that. That's coming up.

So there are conditionals, and that means that solving AX equals B for a general matrix A is actually not independent of the data. If I give you one A or another A, it can take slightly different amounts of time. It's absolutely second order, so you won't notice it, but it's not deterministic. If A is positive definite, it is absolutely deterministic, because you do not pivot in a Cholesky factorization. A compiler will unroll the loops, and it's just, you know, N cubed over three operations, all pipelined, and it just shoots right through. It can never be different; it will be repeatable down to the microsecond or lower. Okay? So that's it. Well, there are a few other things that can happen. It can fail, of course  it can throw an exception if you pass it a singular A. But roughly speaking, other than that, it's deterministic. It has nothing to do with the data, except of course near the boundary where A is singular. And it doesn't have anything to do with the structure of A: if you toss in a Toeplitz A or a sparse A or a banded A, this thing doesn't care. Okay. And this will work well for N up to a few thousand, immediately. You can do the arithmetic to find out; it has to do with how much will fit into your RAM and stuff like that. So this will work up to a couple thousand. If you want to use a bunch of your friends' machines and MPI or something, you can probably go bigger than that. This will actually scale up to much bigger problems, but not on a laptop or a little machine. Okay. So the advantage of these methods is that it's very nice  it's just technology. This is just A backslash B, at the highest level, just done. Right? Extremely reliable. When it fails, it fails generally because you gave it a stupid problem to solve.
That is to say, A was near-singular for all practical purposes.
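The factor-solve pattern for the positive definite case can be sketched in a few lines; this is an illustrative example with random data, factoring once and then reusing the factor for multiple cheap solves.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
G = rng.standard_normal((n, n))
Apd = G @ G.T + n * np.eye(n)        # symmetric positive definite test matrix

L = np.linalg.cholesky(Apd)          # factor once: Apd = L L^T, ~n^3/3 flops

def chol_solve(L, b):
    # two triangular solves; np.linalg.solve is used here for brevity,
    # a real implementation would call a dedicated triangular solver
    y = np.linalg.solve(L, b)        # forward substitution: L y = b
    return np.linalg.solve(L.T, y)   # back substitution: L^T x = y

# once factored, each additional right-hand side is cheap
b1, b2 = rng.standard_normal(n), rng.standard_normal(n)
x1, x2 = chol_solve(L, b1), chol_solve(L, b2)

assert np.allclose(Apd @ x1, b1)
assert np.allclose(Apd @ x2, b2)
```

And, as noted in the lecture, the Cholesky factorization involves no pivoting, so its control flow never depends on the data.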

Okay. So this is the first level. The next huge group is called sparse direct. Direct means you're still factoring A, but this time you're taking into account sparsity in A. Sometimes there are well-known, named structures  A could be banded; that's a case where it's named. Or it could just be a sparsity pattern, in the sense that there are lots and lots of zeros all around. Those are sort of different cases, but I would call them both sparse; I'd put them both in this part. Now here, the run time is going to depend on the size  of course, that's always the case  and the sparsity pattern. And it's almost independent of the data. It's actually a little bit less independent of the data than dense direct methods are. The reason is that in general, unless it's a Cholesky factorization, you're doing dynamic pivoting, and in this case, dynamic pivoting has fill-in effects. If you end up calling malloc in the middle of one of these things, you're gonna be sorry. So, in fact, the way these typically work is, before you even go in, you'll allocate a block of memory hopefully big enough to contain all the fill-in that is gonna happen  because as you're dynamically going through it, you're generating fill-in, because you didn't like the pivot you would otherwise have taken, or something like that. Okay? So that's what this is. And what you can handle depends on the sparsity pattern. This'll take you up to ten to the four pretty easily, ten to the five if it's special. And if it's banded or block banded or has some arrow structure or something like that, you can easily take this up to a million. If the sparsity pattern is, let's say, banded, you can go to a million like that. My laptop will easily go up to a million. No problem, none.
You can just work it out: you figure out how much DRAM it takes. It's totally straightforward. By the way, this is described, at least at a very high level  more than I'm gonna say about it here  in the appendix of the book on convex optimization. Of course, there are entire books on these subjects, too  many, and a lot of them are really good. So it depends on whether you want that much detail. But it's all there.

By the way, the boundaries between these classes are actually a bit gray. So for example, people will often use a few steps of an indirect, or iterative, method after one of the direct methods; that's called iterative refinement. People will also use a pre-conditioner  we'll get to that  which is, in fact, based on an incomplete factorization or something like that. And at that point, who knows which class these things are in? But it's useful to think of these three huge classes of methods for solving linear equations. Okay, so now we'll get into the most famous one. Well, a lot of them are the same; after a while, you get it. We're gonna look at symmetric positive definite linear systems, because mostly that's what we're interested in. We solve Newton systems  the Newton step is minus H inverse G. Or, if we're solving with equality constraints, we have a big KKT system  H, A transpose; A, zero  which is indefinite. But if you do block elimination, you end up solving something that looks like A H inverse A transpose, or something like that. So a lot of it comes down to solving this. Okay, so where does this come up? Well, a Newton step is that. It's also the least squares equations, regularized least squares, or minimizing a convex quadratic function. And of course this is also solving an elliptic PDE, like a Poisson equation, if you discretize. As another example, here's the analysis of a resistor circuit  this comes up in circuit stuff. Here you have a conductance matrix, and I is a given source current, and you want to calculate all the node potentials. By the way, the same thing comes up in the analysis of reversible Markov chains.
That would be another one that I didn't put here, but it would be the same thing  actually, it's related to this. So if I give you a reversible Markov chain, and I say, "If I start the thing at this state, I want to know the probability that it hits this state before that one," you're basically solving a problem like that as well. Or if you want to calculate hitting times and all this, it's the same sort of thing. Okay, so now we get to CG. Actually, there's a bit of confusion over the name. It's sometimes called conjugate gradient, and sometimes conjugate gradients, with the "s" there. And I may switch between them.

I think it also depends on your first exposure to this  I think one is the British style, and one's the non-British style. Anyway, just to warn you: sometimes you'll see the "s," sometimes not. Okay. As far as anyone knows, it was first proposed in 1952, by Hestenes and Stiefel. And it was proposed as a direct method. Now, instead of getting into the details of the method  in fact, we won't get to the details until later; actually, the details are almost irrelevant. That's probably not how you guys learned it. There are actually many different methods that will generate the same sequence of points. But we'll get to that later. Here's what it does; here are the high-level attributes. It solves AX equals B, where A is symmetric positive definite. Now, in theory, if you were to do exact arithmetic, it takes N iterations, where N is the dimension. Each iteration requires a few inner products in R^N; that's fine. And it requires a matrix-vector multiply: you have to implement a method that, given Z, will compute AZ. That's basically all you have to do. Okay, now, if A is dense, you work out the arithmetic: computing AZ means multiplying an N-by-N matrix by a vector, which is N squared flops. And congratulations: you run CG, it takes N steps, each step is order N squared, and you're back to N cubed. What you have now is an order-N-cubed solver that is much less reliable than the standard ones, and actually much slower, because you're using BLAS level two  matrix-vector multiply  whereas the real method for solving would use BLAS level three, which would be blocked, and that would be optimal.
Okay, so there's actually no reason to use a CG-type method in that case. I guess, to tell you the truth, CG is really a particular method, and honestly these really should be called Krylov subspace methods. I should even change this slide. I feel weird about it because CG is a specific method, but they're all kind of lumped together. Okay, now, if a matrix-vector multiply is cheaper than N squared, you can do better. So here's an example: if A is sparse, multiplying Z by A is fast, because you only multiply by the non-zeros.
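The matvec-only interface just described is the whole point of CG: the solver never sees A itself, only a function z -> Az. Here's a minimal sketch of plain CG (textbook recurrences, no preconditioning or the safeguards the lecture warns are needed in floating point), tested on a small random SPD system.

```python
import numpy as np

def cg(matvec, b, tol=1e-8, maxiter=None):
    """Plain conjugate gradient: only needs z -> A z, never A itself."""
    x = np.zeros_like(b)
    r = b - matvec(x)              # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(maxiter or len(b)):
        Ap = matvec(p)             # the one matvec per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # stop once the residual is small
            break
        p = r + (rs_new / rs) * p  # A-conjugate direction update
        rs = rs_new
    return x

rng = np.random.default_rng(6)
n = 30
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)        # SPD, reasonably well-conditioned
b = rng.standard_normal(n)

x = cg(lambda z: A @ z, b)         # pass only the matvec
assert np.allclose(A @ x, b, atol=1e-5)
```

Swapping the lambda for a sparse, structured, or FFT-based matvec is exactly what makes the method interesting on huge problems.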

Now, if it's sparse and not absolutely huge, you're probably better off using a sparse direct method, okay? But someone give me an example of an N-by-N matrix you can multiply a vector by fast  fast means less than N squared  but which is full, not sparse. What's an example? All of you have seen these.

Student:[Inaudible].

Student:An outer-product matrix? Like if it's 

Instructor (Stephen Boyd):Perfect. Actually, that's perfect. Yeah. So a matrix that's, like, diagonal plus low rank. Yeah. Diagonal plus low rank. Oh, I mean, you have to represent it that way.

Student:Yeah.

Instructor (Stephen Boyd):Perfect example. Diagonal plus low rank is clearly full  I mean, generally it's gonna be dense  and yet, if you represent it that way, you can multiply real fast. Anything else? This is classic, undergrad EE.

Student:FFT?

Instructor (Stephen Boyd):FFT. So let's say it right. A would be the DFT matrix. It's N-by-N, and you would carry out the multiplication using the FFT algorithm. That would be N log N, as opposed to N squared. So that's an example, and there's a whole bunch of others; we'll probably mention some of them. Now we get to the bad news about CG, and this was recognized already in the '50s: it works really badly with round-off error. With some methods, round-off error doesn't really matter  it has an effect, but the errors don't build up as you go. With others, it's just a disaster, and this one is, almost universally, a disaster  I mean, except for the baby examples I'll show, with, like, ten variables, where CG would be highly, highly inappropriate to use anyway. So, out of the box, it basically works terribly; we're gonna see some tricks that will make it work reasonably. Okay, but here's a very important attribute of this method. It's this: depending on A, B, you know, some luck, how the graduate student put it all together  what happens is kind of cool. In a very small number of iterations, you can get a rough approximation of the solution quickly. That's a very different thing. Let's solve a circuit. I have a circuit with a million nodes. I pump in some currents  this is one time step in running SPICE, let's say  and all I want to know is the potentials at the nodes. If I use a sparse direct method, and I send an interrupt signal to it, and I just say, "Nope. Sorry, sorry, sorry. Give me the voltages," it says, "I can't. I haven't even finished factorizing, so I can't give you anything."
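Both fast multiplies mentioned above  diagonal plus low rank, and the DFT matrix applied via the FFT  are easy to sketch numerically. This is my own illustration; the function name `dplr_matvec` is mine, and numpy's `fft` happens to use the same sign convention as the DFT matrix written here:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Diagonal plus low rank: (diag(d) + u v^T) z in O(n), if you keep the pieces.
def dplr_matvec(d, u, v, z):
    return d * z + u * (v @ z)          # elementwise product plus one inner product

d, u, v, z = (rng.standard_normal(5) for _ in range(4))
dense = np.diag(d) + np.outer(u, v)     # the full matrix, formed only to check against
assert np.allclose(dplr_matvec(d, u, v, z), dense @ z)

# 2) The DFT matrix: applying it via the FFT costs order N log N, not N squared.
n = 256
k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n)   # dense N-by-N DFT matrix
w = rng.standard_normal(n)
assert np.allclose(F @ w, np.fft.fft(w))       # same product, far fewer operations
```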

So  and this is not like  this is not for the reason that you see in sub-gradient method, where it's not a descent method, where ahead of time you know, this is not a descent method. This is supposed to be a descent method. It's only not a descent method because of round off things and things behaving poorly. Okay. All right. So let's look at how this works. Well, the solution, we'll call it X star, that's A inverse B. And because A is positive definite, does it  when you  actually, there's a very useful thing. When you're solving AX equals B, where A is positive definite, a very good way to think of it is this. You are actually minimizing a convex quadratic function. And that fits, actually, in lots of applications. For example, if you're solving X equals B, and someone says, "Why are you solving that?" You say, "I'm going Newton's method." Someone says, "Explain to me Newton's method." Then you really realize, in fact, what you're really doing is you're minimizing this because Newton's method goes like this. You  someone says, "What's Newton's method?" And you say, "Well, I'm minimizing this weird function. I'm  I have a current estimate of X, and I go to the function and I get from a convex quadratic approximation. And I'm minimizing that." So in fact, this connection is not abstract. It comes up in the applications. Okay. Now, the gradient of this function is X minus B, but this is also  that's a residual. So a residual in gradient of this function over here is gonna be  they're kind of the same. Okay. Now, if you take F star is F of X star. That's this  I guess if you're  that's this thing plugged into that and  I forget what it is, but there's like an A inverse there, or something like that. You get an X  you get a B  it's minus B transposing inverse  anyway, it doesn't matter. But you take F of X minus F star, here's what you get. That is  here's F of X, here, and then minus X star. And you can rewrite that this way. 
And so it's actually X minus X star  this is the square of  and this subscript here means the A-norm; people call that the A-norm, and it's a norm because A is positive definite. This is the distance in that metric. So F minus F star is half the squared A-norm of the error. Yeah.
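That identity is easy to verify numerically. Here is a quick sketch (sizes, seeds, and names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)     # positive definite
b = rng.standard_normal(8)
f = lambda x: 0.5 * x @ A @ x - b @ x
xstar = np.linalg.solve(A, b)   # the minimizer of f
x = rng.standard_normal(8)      # any trial point
lhs = f(x) - f(xstar)
rhs = 0.5 * (x - xstar) @ A @ (x - xstar)   # half the squared A-norm of the error
assert np.isclose(lhs, rhs)
# and f(xstar) itself is -(1/2) b^T A^{-1} b
assert np.isclose(f(xstar), -0.5 * b @ xstar)
```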

Student:[Inaudible].

Instructor (Stephen Boyd):What's that?

Student:The B transpose X stars?

Instructor (Stephen Boyd):Yeah? You mean 

Student:It just 

Instructor (Stephen Boyd):Here?

Student:No.

Instructor (Stephen Boyd):Here?

Student:Can it be a catalytic FX minus F star?

Instructor (Stephen Boyd):Yeah.

Student:Since X and X star aren't likely to be different.

Instructor (Stephen Boyd):Yeah.

Student:B transpose X terms 

Instructor (Stephen Boyd):You mean these don't cancel away?

Student:Yeah.

Instructor (Stephen Boyd):Is that what you're saying?

Student:Yes.

Instructor (Stephen Boyd):No. I don't think  I mean, unless you're really good, I don't think you can go from this line to this one just by looking at it. There are a whole bunch of lines there commented out. No, no. I mean  yeah, look. X star is this thing, so X star itself has a bunch of Bs in it. So this term here, when it all expands out, is gonna have a giant pile of Bs as well.

Student:Can you transpose X star X transpose AX, or like 

Instructor (Stephen Boyd):Hey, that's a good point.

Student:Yeah.

Instructor (Stephen Boyd):So let me respond to your question by saying these are not meant to be self-evident. There are many lines commented out, and in fact, they may or may not be correct. I'd go 85 percent on the reliability, maybe 90, you know? But I think they're right.

Student:Are you assuming A is positive definite then 

Instructor (Stephen Boyd):I am.

Student:[Inaudible].

Instructor (Stephen Boyd):In fact, A is positive definite until I tell you otherwise. Okay? Otherwise, a lot of this doesn't make any sense at all. Okay. And now, a relative measure would be something like this: F of X minus F star, divided by F of zero minus F star. That's tau. And it's the fraction of the maximum possible reduction in F, compared to X equals zero. And actually, these things are very interesting. In a Newton method, this would have a beautiful interpretation; it makes perfect sense. It goes like this: you have a convex function, you're at a point, and you form a quadratic approximation of that function, and that goes down. If you were to find the minimizer of that and take that step, that's the Newton step. And that height difference is lambda squared over two  that's the Newton decrement, if you remember that. So, if the quadratic approximation is right, it predicts how much you're gonna go down. If tau is a half, it says that you don't have the Newton direction, but you have a direction good enough that you will achieve one half of the reduction you would have gotten had you taken the exact Newton step. So it would be nice if you could actually calculate tau as you run these methods  which, sadly, you cannot.

Student:[Inaudible].

Instructor (Stephen Boyd):They're all just singletons  their only element is zero. But that's okay for the Krylov definition  they are subspaces, right? Everything's cool. And if you want to minimize over them, this sequence is just zero, zero, zero, zero. This is the case B equals zero. But here's the good news: the first one is a hit, because the solution of AX equals zero is X equals zero. So it's okay. A more extreme example would be something like this: what if B were an eigenvector of A? Then what does the Krylov subspace look like? You get B, and then you get a multiple of B. So actually, all the Ks are the same after K1: K1 is K2, and so on. But again, no problem, because the first point in the Krylov sequence actually solves it  it turns out, in that case, if AX equals B and B is an eigenvector, then X equals A inverse B, which is along the direction of B. But we'll get to that  and that's obvious anyway. Okay. All right, now, the Kth point in the Krylov sequence is a particular polynomial in A, of degree less than K, times B. Now, so far, everything's actually fairly obvious. And now we get to the cool part, and it's not obvious. Here it is: it turns out that to generate XK plus one, you can generate it from the previous two, like this, with these coefficients alpha K. In fact, it's just a linear recursion; these are numbers, and their particular values are gonna come up. This is not obvious at all, and it relies on A being positive definite. That's it. So in general, if I just said, here's the Krylov subspace, and I asked you to calculate XK plus one, you would actually have to look back at all of them  and in fact, if A is not symmetric, you really do have to look back over all of them. Okay?
What that means is that the effort to produce the next point in the Krylov sequence is not going to grow with K.
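As a numerical aside (my own illustration, not the lecture's), the Krylov iterate can be defined directly as the minimizer of f over the subspace, and it reaches the exact solution by step N. This is for understanding only; it forms matrix powers explicitly, which no practical method would do:

```python
import numpy as np

def krylov_iterate(A, b, k):
    """Minimizer of f(x) = 0.5 x^T A x - b^T x over span{b, A b, ..., A^(k-1) b}.
    For understanding only -- it builds the Krylov basis explicitly."""
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(k)])
    Q, _ = np.linalg.qr(K)                     # orthonormal basis for the subspace
    y = np.linalg.solve(Q.T @ A @ Q, Q.T @ b)  # minimize the restricted quadratic
    return Q @ y

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 10 * np.eye(6)   # symmetric positive definite
b = rng.standard_normal(6)
# in exact arithmetic, the Nth Krylov point is the exact solution
assert np.allclose(krylov_iterate(A, b, 6), np.linalg.solve(A, b), atol=1e-6)
```

The three-term recurrence in the lecture produces this same sequence of points without ever storing the whole basis.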

Student:So are you just taking the residual and getting the best approximation and the last?

Instructor (Stephen Boyd):No, you're not.

Student:You're not?

Instructor (Stephen Boyd):You will see what it is.

Student:Okay.

Instructor (Stephen Boyd):Yeah. It's not quite that. You'll see  these things are not totally obvious, and we'll get to what they are. So if you wanna know why this is interesting and all that sort of stuff, it all comes down to this one statement. That's where the non-triviality is. Okay? And just  oh, I think I mentioned this already, but the Cayley-Hamilton theorem says that if you form the characteristic polynomial of the matrix, like this, then plugging the matrix in gives zero. And we'll take A to be non-singular  A positive definite. So you take all these terms  not that last one  and you pull an A out, for example, on the left. Then you put this on the other side, and you get this: an explicit formula. I think if you took 263, we saw this, and I probably mentioned it  in fact, my forward reference was to this moment. Because most of the things I'd say in a class like that, like "we'll get to that later," we never do. I don't know if you noticed that. But in this case, we actually did, if you stuck it out this long. So this says that the inverse of a matrix is actually a linear combination of powers of the matrix, up to the N minus first power. And in particular, it says that A inverse B is a polynomial of degree less than N in A, times B  so it's in the Nth Krylov subspace. Okay? So that's the idea. And that's true even if the Krylov subspace doesn't fill out all of RN  which, of course, in general it does; if you pick a random B and a random A or whatever, it's gonna fill out everything. Okay. So now we're gonna look at the spectral analysis of the Krylov sequence. So what we'll do is this: we'll take A and write its eigen-expansion, or spectral decomposition, and lambda's gonna be this diagonal matrix with the eigenvalues.
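The Cayley-Hamilton remark above can be checked in a few lines (my own sketch; `np.poly` computes the characteristic polynomial's coefficients from the eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)    # positive definite, hence non-singular
n = A.shape[0]
c = np.poly(A)                 # characteristic polynomial coefficients, c[0] = 1
# Cayley-Hamilton: A^n + c[1] A^(n-1) + ... + c[n] I = 0, so multiplying by
# A^(-1) and rearranging expresses the inverse as a degree-(n-1) polynomial in A:
Ainv = -sum(c[j] * np.linalg.matrix_power(A, n - 1 - j) for j in range(n)) / c[n]
assert np.allclose(Ainv @ A, np.eye(n), atol=1e-6)
```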
And we'll define, you know, Y as Q transpose X  that's X expanded in the Q basis. B bar is gonna be the same thing, Q transpose B, and Y star is gonna be Q transpose X star.

And basically, this means that the whole problem has now been reduced to  everything's diagonal. Now, this is not a method; this is a way of understanding the method, just so you know. So it says this: F of X is F bar of Y, where I just plug in this thing  this is gonna be my Y, and that's Y transpose. And so you end up with this. But now this thing is completely separable, and it looks like that. Okay? I mean, this is a bit silly, right? It says that if you want to solve AX equals B, and someone were kind enough to provide the eigen-system, you could solve AX equals B very easily, because you'd transform it, it would convert to a diagonal system, and solving AX equals B where A is diagonal  that's easy, no problem. So you'd be asked to minimize this, and that, I presume, you can do. And so on. You just get  actually, that's the formula. That's basically capital lambda inverse B bar. But capital lambda is diagonal, so you get this. Okay? So far, nothing interesting here, but now let's look at the Krylov sequence. Well, what you're doing here is this: YK is the argmin of this thing over this subspace here. But this subspace here is actually really simple. It starts with the vector B bar, and you keep multiplying by these diagonal matrices. So these are just different diagonal matrices, and what that says is that the Ith component of Y at the Kth Krylov point is a polynomial, PK, evaluated at the Ith eigenvalue, times BI bar. Okay? So that's it. And therefore, we now have a beautiful formula  not for practice, but for understanding. If you understand this, you will understand all of it, of how it works. It says this: the Kth polynomial  that's the polynomial that generates the Kth iterate of the Krylov sequence when applied to A and multiplied by B  is the argmin over all polynomials of degree less than K of this thing. Okay?
That's a positive number. It's a weight. And now you look at this, and you want to know what polynomials you can use here. Oh, this thing, of course, can be negative  that's actually the whole point, because of this term. And so here, it's lambda I times P of lambda I squared, and then minus P of lambda I  well, we'll get to it; this term is P of lambda I squared.

We'll get to what this means in just a second here. Okay? So you can write this another way. It says that if you look at how close the Kth iterate is to the maximum reduction you're gonna get, that's F bar of YK minus F star, and that's the minimum of this. And these are just different ways of writing the same thing, so you can start figuring out how to understand what happens. But don't worry, we're gonna go over this next Tuesday, which, as I mentioned earlier, will probably be tomorrow. So if you simply work out what this is, you take out this thing here, and you end up with this. This is a generic polynomial here, whose degree is less than or equal to K, but it has the property that if you plug zero into that polynomial, you get one. That's gonna be this thing; that's the negative of this. So I can rewrite this this way, and you finally get to something like this. And it's quite beautiful. It goes like this: it says that if you wanna know how close you are to the solution after K steps, you are exactly  there's no inequalities here  you're exactly this. It says, take BI bar squared  in fact, BI bar divided by lambda I is basically the solution, or something like that. And what you want to do is look at all polynomials of degree K, and then take sort of a weighted positive sum of Q evaluated at those points. Okay, now, by the way, you're not supposed to be following all the details; it's just the basic idea here. And so what it comes down to is this: it says that if there is a polynomial of degree K  monic  I'm sorry, not monic, but one whose value at zero is one, so constant coefficient one  that's small on the eigenvalues of A, then you've got a good approximation. Okay?
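Since there are no inequalities in that statement, it can be checked exactly in a small experiment. This is my own construction: the left side is the actual error of the second Krylov iterate, and the right side minimizes the weighted sum over polynomials with constant coefficient one via weighted least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 2.0, 4.0, 7.0, 11.0])          # eigenvalues of A
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag(lam) @ Q.T                          # A = Q diag(lam) Q^T
b = rng.standard_normal(5)
bbar = Q.T @ b
xstar = np.linalg.solve(A, b)
f = lambda x: 0.5 * x @ A @ x - b @ x

k = 2
# left side: f(x_k) - f*, with x_k the minimizer of f over span{b, Ab}
K = np.column_stack([b, A @ b])
y = np.linalg.solve(K.T @ A @ K, K.T @ b)
lhs = f(K @ y) - f(xstar)

# right side: min over q with deg q <= k, q(0) = 1, of
#   sum_i (bbar_i^2 / (2 lam_i)) * q(lam_i)^2
w = bbar**2 / (2 * lam)                    # the positive weights
V = lam[:, None] ** np.arange(1, k + 1)    # columns lam, lam^2
# q(lam_i) = 1 + c1 lam_i + c2 lam_i^2: a linear least-squares problem in c
c, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * V, -np.sqrt(w), rcond=None)
rhs = np.sum(w * (1 + V @ c) ** 2)
assert np.isclose(lhs, rhs)
```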
And this tells  now, this tells you everything. So it works like this. It says, for example, if the eigenvalues of A are sort of all over the place, but let's just see some pictures about how this works. And then we'll quit for today. Let's  here's zero. And let's draw some eigenvalues. So if the eigenvalues look like that, and you have to have a polynomial that starts at one. And if I asked you sort of what's the best degree one polynomial  it means, best, means it should be small near these points. It's gonna be something that looks like this, right? Okay? But if I ask what's the best degree two polynomial, you're gonna make something that kind of  I could  let me just try. It's gonna be a quadratic.

And it's gonna kind of look like that. Everybody got that? And look at this: on here, the polynomial is pretty small. So this thing predicts that if that were your spectrum, the second step in the Krylov sequence is gonna have really  I mean really low error, right? Everybody see this? And by the way, if the eigenvalues were exactly clustered  in other words, if A had exactly two eigenvalues  X2 would be the solution. Okay? That's not that interesting, but it's a fact. Everybody see why? Because then I can find a degree-two polynomial that goes right through both of them and gives you zero, and you get the answer. Does this make sense? Okay. So this is just your first exposure to it. We're gonna quit here, and in the next lecture, which, like I said, will probably be taped ahead sometime tomorrow, we'll go on and look at all the implications.
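The two-eigenvalue claim at the end can be sketched numerically (my own illustration): build A with exactly two distinct eigenvalues and check that minimizing f over span{b, Ab} already gives the exact solution.

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((12, 12)))
lam = np.array([2.0] * 6 + [9.0] * 6)     # exactly two distinct eigenvalues
A = Q @ np.diag(lam) @ Q.T                # symmetric positive definite
b = rng.standard_normal(12)
# minimize f over the second Krylov subspace, span{b, A b}
K = np.column_stack([b, A @ b])
y = np.linalg.solve(K.T @ A @ K, K.T @ b)
x2 = K @ y
assert np.allclose(A @ x2, b, atol=1e-8)  # x2 is already the exact solution
```

This works because the degree-two polynomial vanishing at both eigenvalues, scaled so its value at zero is one, drives the error formula to zero.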

[End of Audio]

Duration: 77 minutes