Convex Optimization I - Lecture 14

Instructor (Stephen Boyd):Just a quick poll: I just wanna make sure - how many people have tried the image interpolation problem? And how many actually had trouble with it? One, two - okay. Unspecified trouble, I think, is what they're -

Student:[Inaudible].

Student:Yeah. I think it was - had something wrong with the norm function.

Instructor (Stephen Boyd):Okay. No, there's nothing wrong with the norm function.

Student:That was an exaggerated one.

Instructor (Stephen Boyd):Oh.

Student:Like, I was trying to do -

Student:Yeah, it gets a slightly different response than if you just do the sum of squares.

Student:Right.

Instructor (Stephen Boyd):Oh, you do?

Student:Yeah.

Student:Yeah.

Instructor (Stephen Boyd):How different?

Student:[Inaudible]. [Crosstalk]

Student:[Inaudible].

Instructor (Stephen Boyd):Really?

Student:Yeah. I mean, norm looks good, but it's not -

Instructor (Stephen Boyd):Wow. Okay. You know what you should do if you don't mind?

Student:What?

Instructor (Stephen Boyd):Send your code and your build number to the staff website. I mean, the full code and the CVX build number - which reminds me, I should probably mention that you're supposed to vote early and often or whatever. You may wanna download - redownload this CVX stuff every couple of weeks. Not that anything really has changed or anything. No, no. It's things like there's a comment in the user guide that was wrong, or there's something minor - tiny, tiny things. But just every two weeks, something like that - you can think of it when you clear your cache or something on your browser. Just download another CVX just for fun. So I -

Student:How do I know when to do it?

Instructor (Stephen Boyd):How do you what?

Student:Know when to do it?

Instructor (Stephen Boyd):It's easy enough. You just look and see what the build number is. It's fine.

And you don't have to, right? I mean, some of these - most of these builds are just, like, fixing commas. Okay? So it's not -

Student:Could we just rerun a [inaudible]?

Instructor (Stephen Boyd):You'll figure it out. Yeah. It'll work. It'll work fine.

Okay. But actually, please, if you actually do find what you believe to be an error - and some people have found a couple, actually, mostly in the documentation - actually, e-mail us. And if you're finding inconsistencies in numbers coming out, we actually wanna know about it. So don't just passively go, "Okay. Now it's working," and move on. Please let us know. Some cannot be explained, or there's nothing anyone can do about it. Others probably need to be addressed.

Okay. Back to the numerical linear algebra whirlwind tour. So last time, we got as far as the LU factorization, which is the same as Gaussian elimination. So this is Gaussian elimination, and it basically says this: any non-singular matrix you can factor as a permutation times a lower times an upper triangular matrix. Actually, this, of course, is highly non-unique. There are zillions of factorizations. But for the moment, we're not gonna even cover why you'd pick one over another or something like that. That's a topic you would go into in detail in whole classes on this subject.

But the point is that once you've carried out this factorization, you solve AX equals B by, essentially, you factor, and then you do forward and backward substitutions. And a very important point about that is this: the cost of the factorization - if the matrix is dense and you're not gonna exploit any structure - is N cubed. The back solves are N squared. And actually, now you know something that's not obvious at first sight. I mean, you just have to know how this is all done to really know this, because it's kinda shocking. It says, basically, the cost of solving two sets of linear equations with the same matrix is equal - more or less, for all practical purposes, identical - to the cost of solving one. Okay? And that's not obvious, I don't think, because if someone walked up to you on the street and said, "How much effort does it take to solve two sets of linear equations compared to how much it costs to solve one set of linear equations?" I think a normal person would probably answer, "Twice." That's wrong. The answer is the same. Okay? And that's with a factor-solve method - there are other methods where the number is twice. It couldn't be any worse than twice, I suppose. Right? But the point is it's the same.
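Roughly, in Matlab, the factor-solve idea looks something like this - a minimal sketch, with the variable names purely illustrative:

[L,U,P] = lu(A);          % factorization: order N cubed, done once
x1 = U\(L\(P*b1));        % forward/backward substitution: order N squared
x2 = U\(L\(P*b2));        % second right-hand side: another N squared, essentially free

The two back solves are negligible next to the factorization, which is the point being made here.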

Student:Could you just put the two in one system? Is that what you're saying? Like, just to stack the -

Instructor (Stephen Boyd):Nope, nope. That would make it much worse. If you stacked it like that, you'd have a 2N by 2N -

Student:Yeah.

Instructor (Stephen Boyd):And 2N cubed - you'd be off by a factor of eight. So you'd be eight times more.

Student:So what was your argument for it being two - less than two?

Instructor (Stephen Boyd):That's all. One factorization, two back solves.

Okay. Now we need to get to something very important. A lot of you probably haven't seen it. It's one of the most important topics, and I believe it's basically not covered because it falls between the cracks. It's covered somewhere deep into some class on the horrible fine details of numerical computing or something like that, I guess. I don't think it's well enough covered - at least, of the people I hang out with, not enough of them know about it. And it has to do with exploiting sparsity in numerical linear algebra.

So if a matrix A is sparse, you can factor it as P1 L U P2. Now here, you're doing both row and column permutations, and you might ask - I mean, you don't need the P2 for the existence of the factorization. All you need, in fact, is P1. You need P1 because there are matrices that don't have an LU factorization. But the P2 here really means that there are lots and lots of possible factorizations. The question is why you would choose one over another. And I'll mention just two reasons - again, I'm compressing six weeks of a class into 60 seconds. One reason is this: you would choose the row and column permutations in your factorization so that when you actually carry out the factorization, it's less sensitive to numerical round-off errors, because you're doing all this calculation in floating point arithmetic. Right? Actually, by the way, if you're doing this symbolically, this first issue doesn't hold. But of course, 99.99999 percent of all numerical computing is in doubles - in floating point.

Okay. Now, there's another reason, and this is very, very important. It turns out that when you change the row and column permutations, the sparsity of L and U is affected. And in fact, it can be dramatically affected. So if you actually get the row and column orderings right - that's these two permutations - the L and U that you get will have a small number of non-zeros. By the way, I should add something like this: once P1 and P2 are fixed, and you normalize by saying, for example, that the diagonal entries of either L or U are one, then the factorization becomes unique. Otherwise, there are some stupid things you could do, like multiply L by two and divide U by two. But once you normalize either L or U, they become a canonical form - they're fixed, and there's only one choice for L and U. So in fact, the sparsity pattern is determined by P1 and P2 here. And if these are chosen correctly, you will have a number of non-zeros in L and U that is small. If you choose them poorly, it'll just be enormous.

And this is the key to solving, for example, a 100,000-variable, 100,000-equation problem: AX equals B. Obviously, you can't remotely do that if it were dense. You can't even store such a matrix. Okay? Period. And if you could store it - well, theoretically you could, but it's a big deal, and you'd have to use a whole lot of your friends' machines and all sorts of other stuff to do this. It's a big, big deal to do something like that.

However, if that 100,000 by 100,000 matrix is sparse - let's say, for example, it has ten million non-zero entries - that means, roughly, that you have, on average, 100 non-zeros per row and 100 non-zeros per column. Well, I guess that's not rough; that's exact, when you say average. So what that means is something like this: each of your 100,000 variables enters into, on average, 100 equations, and each equation involves, on average, 100 variables. And by the way, if you think that's an easy problem to solve, it's not. However, it can be solved extremely fast if P1 and P2 can be found so that these factors actually have a reasonable number of non-zeros. If you start with ten million non-zeros - if you're lucky, L and U will have on the order of ten million, 20 million on a bad day. Thirty million would be - you're still not even close to a problem. If you can't find P1 and P2 that end up with a sparse L and U, then it's all over. But in many, many, many cases, there are shockingly simple and naïve algorithms that will find permutations that give you a good sparse factorization. Yeah?

Student:How do you know, when youíre finding P1 and P2, whether it can be made less sparse?

Instructor (Stephen Boyd):You don't. You don't. You don't know. So what you do is - it depends on what you're doing. You'll actually attempt to solve it, and then you'll be told - I mean, it depends, but actually, it'll be really simple. If it just doesn't respond - and you can look at your memory allocation and watch it go up and up and up and up - then you'll know it's not working.

But it's easy enough to know ahead of time. There are actually two passes - I'll talk about that in a minute. The first pass is called a symbolic factorization, and in the symbolic factorization, you only look at the sparsity pattern of A, and you actually attempt to find P1 and P2.

In real systems, what would happen is this: the symbolic factorization would run, and by the time it's gotten to some upper limit, like 100 million or a billion, let's say - it depends, of course, on what machine you're doing this on - but after it gets to a billion, it might quit and just say, "There's no point in moving forward because the L and U I'm finding already have 100 million entries."

Something like that. So that's how you do it in the real world. It'd be two phases for that. That's how it would really work.
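In Matlab, a rough sketch of that two-pass idea for a sparse symmetric matrix might look like this, using symamd (a fill-reducing ordering) and symbfact (a symbolic Cholesky analysis); the threshold is purely illustrative:

p     = symamd(A);            % fill-reducing ordering (approximate minimum degree)
count = symbfact(A(p,p));     % symbolic factorization: per-column non-zero counts of the factor
nnzL  = sum(count);           % predicted fill in the factor
if nnzL > 1e8                 % bail out before paying for the numerical factorization
    disp('factor would be too dense; give up or try a better ordering');
end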

The way that would work in Matlab - which, again, let me emphasize, Matlab has nothing to do with the numerics; it's all done by stuff written by other people - is that it would just, I think, stop. I mean, you just write A\b, and it just wouldn't respond. And that would be its way of telling you that it's not finding a very good permutation.

I'll say more about this in a minute.

One thing that's gonna come up a lot for us is the Cholesky factorization, not surprisingly. So that's when you factor a positive-definite matrix. If you factor a positive-definite matrix, you can write it as L L transpose - and obviously, what this means is it's an LU factorization.

Two things about it - you do not need a permutation. Every positive-definite matrix has an LU factorization. In fact, it's got an LU factorization with U equal to L transpose.

By the way, just to warn you, there are different styles for this. This one is lower triangular; other people do it with an upper triangular factor. There are different conventions, so I guess you just have to check. In fact, who has recently used - what does chol in LAPACK return? That would be the standard. What does it give you?

Student:U.

Instructor (Stephen Boyd):The upper triangular one? Oh, okay. So sorry - we're backwards from the world standard. The world standard would be upper triangular, but everything works this way.

So this is the Cholesky factorization. It's unique. And by the way, there's no mystery to these factorizations. I'm not gonna talk about how they're done, but there's absolutely no mystery. You basically just write out the equations and equate coefficients. I don't want you to think there's some mysterious thing and you have to take a ten-week course to learn what a Cholesky factorization is - it's really quite simple.

The formula for it is extremely simple to write down. What you should know, after our discussion last time about multiplying two matrices, is that the little formula we write down is not how it's done. It's done with blocking and hierarchies and all that kind of stuff, which exploits memory and data flow and things like that.

So let me just show you what I mean here. If you write the matrix in blocks A11, A12 - you know what? That's good enough - and A22, something like that, you simply write out that this is L11, L21, L22 times the transpose of that, which has L11, L21 and L22. And now this is the Cholesky factorization, and you just wanna find out what these entries are. Well, okay. Let's start with the 11 entry. We notice that this times that gives us L11 squared equals A11. Everybody following this? So obviously, L11 is the square root of A11. Okay? That's the first one. Now that you know what L11 is, you just do the same thing. You look at the next entry - this one times this one gives L21 L11 equals A21 - and you divide out. Do I have to do this? I don't think I have to. This is totally elementary for you. You just write it all out. And so the existence and formulas - in fact, lots of formulas, obviously all equivalent, for the Cholesky factorization - you can derive this way.

And then you could write, by the way, a little chol in C, and it would be about five lines. Again, like matrix multiply, your chol would be beaten like crazy by the real ones, which block it up into little blocks optimized for your register sizes and caches and things like that.

Okay. All right. So that's the Cholesky factorization. The way you solve - and here you get one third, and this is kinda predictable - the factor of two is totally irrelevant in flop counts. As I mentioned last time, you can be off by a factor of 100 comparing real computation time versus flop count time. So two things with an equal number of flops - like, for example, your amateur chol or your amateur matrix multiply - could be beaten by a factor of 100 by one that does the right thing, and they have exactly equal numbers of flops. Actually, it could be more than 100. You could arrange it very nastily - you arrange N to be just big enough to have a lot of cache misses and things like that - and we could make that factor almost as big as we like.

Okay. So when you do a Cholesky factorization, not surprisingly, A is symmetric. A basically has about half the number of entries of a non-symmetric matrix, so the fact that the work is half is hardly surprising. But again, the factor of two is totally irrelevant. Nothing could be less relevant.

Okay. So this is how you do a Cholesky factorization. And there is one thing I should say about this, which is good to know, and that is this: when you do a dense, non-symmetric LU factorization, the P is chosen primarily so that numerical round-off does not end up killing you as you carry out the factorization - so you don't end up factoring something where the product of the factors is not equal to the original matrix, and your answer is totally wrong. That means the P is often chosen at solve time. You have to look at the data in the matrix A to know what to do - that's called pivoting, by the way. You don't have to follow everything I'm saying, but it's important to know some of these things. By the way, that slows a code down. Right? Because it's got branches and things like that, and if it's got to actually pivot, you have data movement. On a modern system, this can be very expensive. Okay? So -
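As a rough illustration of "just equate coefficients," here is a completely naive, unblocked Cholesky in Matlab - a sketch only (the name naive_chol is made up), and real implementations block this for registers and caches:

function L = naive_chol(A)
% naive Cholesky: A assumed symmetric positive definite, A = L*L'
n = size(A,1);
L = zeros(n);
for j = 1:n
    v = A(j:n,j) - L(j:n,1:j-1)*L(j,1:j-1)';   % subtract what earlier columns account for
    L(j,j)     = sqrt(v(1));                   % sqrt of a negative here means A is not positive definite
    L(j+1:n,j) = v(2:end)/L(j,j);
end
end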

Student:Why did you say there are fewer elements in the positive-definite case? It seems like they're the same size.

Instructor (Stephen Boyd):Yeah, they're the same size. But the other one is a general matrix, so you need N squared entries to tell you what it is. This one is symmetric, so you really only have about half the entries.

Student:[Inaudible].

Instructor (Stephen Boyd):Okay. Now - so there, you would choose the P dynamically, when you factor a specific matrix. Here, you don't need it. In fact, it's well known that straight Cholesky factorization is quite fast, and it's stable numerically. In other words, it's known that this is not gonna lead to any horrible errors.

This means, for example, if this is small, like 100 by 100, you get something that's very, very fast. And also, it's got a bounded run time. It will just be very fast. Right? Because there are no conditionals in it, for example. Okay.

Now we're talking about sparse Cholesky factorization. That looks like this. Here, of course, you preserve sparsity by permuting rows and columns. Okay?

So for any permutation you pick - well, I can write it this way: P transpose A P equals L L transpose. So the other way to think of it is this: you take A, you carry out a permutation of its rows and columns - the same one - and then you carry out a Cholesky factorization. By the way, the Cholesky factorization is absolutely unique, provided you normalize by saying that the diagonals are positive. Otherwise, there's the same stupid non-uniqueness as before.

Question? Okay.

Okay. So since I can permute the rows and columns of A any way I like and then carry out a Cholesky factorization, when A is sparse, this choice of P is absolutely critical. If you give me a sparse matrix and I pick one P, there's just one Cholesky factorization, period, and that Cholesky factorization will have some number of non-zeros in L. In fact, if you choose P randomly, L will basically be dense for all practical purposes. Okay?

And then it's all over. Especially if your matrix is 10,000 by 10,000, it's pretty much over at that point.
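A minimal Matlab sketch of how much the permutation matters - symamd here is just one of the simple fill-reducing orderings, and the comparison is illustrative:

L1 = chol(A, 'lower');         % Cholesky in the natural ordering
p  = symamd(A);                % a fill-reducing ordering
L2 = chol(A(p,p), 'lower');    % Cholesky of the permuted matrix
[nnz(L1) nnz(L2)]              % the fill can differ by orders of magnitude

Same matrix, same math; only the ordering changes, and with it the number of non-zeros in L.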

Yeah?

Student:Could you [inaudible] solve the equation using Cholesky? You need to know the matrix is positive-definite. Right?

Instructor (Stephen Boyd):That's correct. Right. So -

Student:[Inaudible]. [Crosstalk]

Instructor (Stephen Boyd):I can tell you exactly how that works. So in some cases, it's positive-definite because, for example, it's a set of normal equations. That's very common. Or it's gonna be something we'll talk about later today - maybe we'll get to it - Newton equations. Right.

So a lot of times, it's known. Now, the question is: what happens? If the matrix is positive-definite, it obviously has a Cholesky factorization. What happens if it's not? That's a good question. I didn't get very far in my Cholesky factorization here, but in fact, what happens is this: at each step of a Cholesky factorization, there's a square root. Right? We really did the first one. The next one would be the square root of - well, that's basically a determinant. The square root of a determinant would be the next one.

What would happen is this: if the arguments passed to all those square roots are positive, then you will have a Cholesky factorization. If the matrix is not positive-definite, then at one of these steps you're gonna take the square root of a negative number. Okay?

And then - there are actually different modified Cholesky factorizations. I forget the name of it, but there's a type of Cholesky factorization which you might want, and it works like this: right before it takes a square root, it has a test, and if this thing is negative, it throws an exception.

By the way, it can even be arranged to actually calculate a vector Z for which Z transpose A Z is negative - or less than or equal to zero, whatever - which of course is a certificate proving A is not positive-definite. So it would do that.

That's one way. Unfortunately, a lot of the other ones - because these are built for speed - will just throw some horrible exception from the square root or something like that.

So there's a modified Cholesky - in fact, Cholesky is the best way to check if a matrix is positive-definite. You don't calculate the eigenvalues or anything like that. You just run a Cholesky. You don't want a basic Cholesky, which will dump core, and then that will be your - then you have a system called - anyway. No. I'm joking.

But the way it would work is this: you'd run with this modified Cholesky that, when it encounters the square root of a negative number, does the graceful thing, which is to return telling you, "Cholesky didn't succeed," which is basically saying, "Not positive-definite."

So that's the best way to check.
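In Matlab, chol actually has exactly this kind of graceful interface, so a positive-definiteness check is a two-liner - a sketch, with A assumed symmetric:

[R, p] = chol(A);     % never throws: p == 0 means A is positive definite and A = R'*R
if p > 0
    disp('not positive definite');   % chol stopped when it hit a non-positive pivot
end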

Okay. Let's go back to this one - this permutation. Here's the way this actually works, and I'll explain it. What I'm about to say is not quite true, but it's very close; it's a zeroth-order statement, and it's good enough: in a sparse Cholesky factorization, you don't choose the permutation to enhance its numerical properties.

Now technically, that's a lie, but to first order it's pretty close to the truth. It's a perfectly good zeroth-order statement.

You choose the P to get a sparse L. That's what you do. Okay? And what happens then is this: there are simple methods developed in the '70s for choosing these Ps, with names like approximate minimum degree. Then they get more exotic - there are ones like reverse Cuthill-McKee ordering - and then you get to very exotic orderings.

So, that's all fine. But now I'll tell you, since most of you will be consumers of sparse matrix methods - in fact, you don't know it, but you have already been very extensive consumers of sparse matrix methods. Okay? Because when you take your little baby problem with 300 variables and CVX expands it - I don't know if you've ever looked, but if you looked, you'd find some innocent little problem you're solving with ten assets or something like that, and all of a sudden it says, "Solving this problem with 4400 variables."

Well, trust me. When it says 27 iterations, I promise you equations of size 4400 by 4400 were solved each iteration. Okay? You would have waited a whole long time if this wasn't being done by sparse matrix methods. So you're already consumers - big-time consumers - of sparse matrix methods. And by the way, all the things that you did - although it's in the solver, whose details I don't know - the orderings were very simple ones. They were things like approximate minimum degree. Just really simple, and for most sparsity patterns, these things work fine.

Now, the exact cost really depends in a complicated way on the size of the problem, the number of non-zeros and the sparsity pattern. So it's interesting to distinguish dense versus sparse; let's actually talk about that. I'll just talk about a laptop, a little PC, something like that.

Dense is, like, no problem. You can go up to, like, 3,000 - just no problem. It scales something like N cubed; you already know that. And it'll just work. It'll start allocating more memory and start taking a macroscopic time, but it's gonna work. There's essentially no variance in the compute time. It's extremely reliable. It's gonna fail if the matrix you pass in is singular or nearly singular, but then you shouldn't have been solving that equation anyway, so you can hardly blame the algorithm for it. For all practical purposes, the time taken to solve a 3,000 by 3,000 system is just deterministic. By the way, at the microscopic level, that's a big lie, but it's good enough. And the reason it's a lie is very simple: depending on the data in that matrix, different orderings and pivoting will happen - with no pivoting it might take less time, with a lot of pivoting it might take more. But it's absolutely second order - just not there. Okay? So there's no stress in solving a 3,000 by 3,000 system. Zero. Absolutely zero. It just works. You can say exactly how long it's gonna take ahead of time, and that's how long it's gonna take. Period.

Now you go to sparse. Now - well, it depends. There are actually two passes, and it depends on the sparsity pattern. So let's go to a typical example: a 50,000 by 50,000 matrix. You wanna solve AX equals B, 50,000 by 50,000. Now, some people say a small offering to the gods of sparse matrix orderings is effective. But the point is, there's no reason to believe that you can always solve 50,000 by 50,000 equations. Okay? In fact, there are ones that you can't. It just means that whatever ordering method your sparse solver is using, typically on your behalf, didn't find a good ordering. If you do large problems, you have to know all of this material - every detail. If you don't work on huge problems, then this is stuff you just have to be vaguely aware of on the periphery.

Okay. So in this case, 50,000 by 50,000 - there's a little bit of stress. You type A\b, and it might come back like that. You'll know examples where it'll come back like that soon. Okay?

Or it might just grind to a halt, and you can go and watch the memory being allocated and allocated and allocated, until at some point you put it out of its misery. However, the interesting thing is this: if you do it in two steps - if you do a symbolic factorization first - you're gonna know. And after that, for matrices with that sparsity pattern, the cost is absolutely fixed. Right?

This has a lot of implications in applications. If you write general-purpose software that has to solve AX equals B where A is sparse, you don't know; basically, you deal with every matrix fresh. If I pass you an A matrix, the first thing you do is a symbolic factorization. You look at it and you go, "Well, it's sparse. All right. Let's see what we can do with ordering." Then you do the ordering and the symbolic factorization, and you know how many non-zeros there are gonna be. At that point, if you wanted to, you could announce how long it's gonna take to solve that problem, because once you know the sparsity pattern of L, it's all over. You know everything. In fact, just the number of non-zeros is a good - question?

Student:[Inaudible] symbolic factorization?

Instructor (Stephen Boyd):Well, first of all, that's in the appendix, which you should read. So let's start with that.

Student:Appendix B?

Instructor (Stephen Boyd):Well, yes. Okay. Some people are saying it's appendix B. It's one of the appendices. There aren't too many, so a linear search through the appendices is feasible, I think. So that's where this is covered in more detail, but you have to understand even that is like condensing a two-quarter sequence into 15 pages or something.

So - okay. So the interesting thing is, if you're gonna solve a set of equations lots of times - like, suppose you wanna do medical imaging, or control, or estimation, or some machine learning problem, but you have to keep solving the same problem with the same sparsity structure - at that point, you could invest a giant pile of time into getting a good ordering. There are whole communities that do nothing but calculate orderings. You could actually calculate your ordering for three days, and by the way, if you get an ordering that works and gives a sparse Cholesky factorization, it's no one's business how you got it. Right? Like a lot of things, it's just fine. So you could do that, and then it can be fast.

Okay. Let me just summarize this. For dense matrices, it's totally predictable how much time it takes. Sparse depends on the sparsity pattern and on the method that you, or whatever is solving this equation on your behalf, is using to do the ordering. By the way, if it fails in Matlab or something like that because it's using the simple approximate minimum degree, symmetric minimum degree, blah, blah, blah ordering, it could easily be that if you were to use a more sophisticated ordering, it would actually solve it - just no problem. Right? On the one hand, it is too bad that the sparsity pattern plays a role in these methods. On the other hand, hey - you wanna solve 100,000 by 100,000 equations? You can't be too picky. You can't say, "Wow. That's really irritating. Sometimes I can do it, and sometimes I can't." So anyway. All right. Enough on that.

And finally, one more factorization we'll look at is the LDL transpose. This is for a general symmetric matrix. Here, the factorization is this: there's a permutation, there's an L and an L transpose - they can have ones on the diagonal - and D is block diagonal, where the blocks are either one by one or two by two. Such a matrix is easily inverted, by the way. If I give you a block diagonal matrix where the blocks are one by one or two by two, can you solve DZ equals Y? Well, of course you can. Right? Because you're either dividing or solving a two by two system. And it's sort of the same story, so I think I won't go into that.

Now, everything we've looked at so far is just so you can get a very crude understanding of what happens. There's nothing for you to do so far. Right? In most systems, this is done for you. You call a sparse solver. You can do that in Matlab just by making sure A is sparse and typing A\b, for example. Or the real thing would be to directly call something like UMFPACK or SPOOLES - there are all sorts of open source packages for doing this. And you're probably using them whether you know it or not. You're certainly using them in Matlab. If you use R, you're using them. Actually, I don't know - does anyone know? Do you know what the sparse solver is in R?

Student:It doesn't have one.

Instructor (Stephen Boyd):No, come on.

Student:It's a separate package.

Instructor (Stephen Boyd):Then - okay. That's fine. But it has one. Come on.

Student:Yeah, but it's not like a private [inaudible] language.

Instructor (Stephen Boyd):Well, that's okay. That's fine. I mean, come on - it's an open source thing. It doesn't matter then, right? So that's fine. Okay.

Now, we're gonna talk about something that is gonna involve your action. Everything so far has been me explaining what is done on your behalf by some of these systems. Now, we're gonna talk about something you actually need to know, because you're gonna need to do it. And these are things, by the way, that will not be done for you. They cannot be done for you. You'll have to do them yourself. So now is the time - up till now, it's just been, "Here's how things work." Now you'd better listen, because now there's gonna be stuff you have to understand and actually do.

So we'll just start with blocking out a set of equations. Really simple - we just divide it up like this. You can divide it up any way you like, and we're just gonna do elimination. So it's gonna work like this: we'll take the first block row. It says A11 X1 plus A12 X2 equals B1. I solve that equation by multiplying on the left by A11 inverse, which I'm assuming exists. Obviously, a leading sub-block of a non-singular matrix does not have to be non-singular. Right? That's obvious - the matrix [0 1; 1 1] is obviously non-singular, and its leading one by one block, its A11, certainly is not.

But assuming it is, you can eliminate X1, and you just get this. That's X1. You substitute X1 into the second equation, and in the second equation you get A21 X1 - that's A21 times A11 inverse times this junk over here - and then that plus A22 X2 equals B2, and that equation looks like this.

Now, this guy you have seen before, although you probably don't remember it. Or maybe you do. That's the Schur complement. The Schur complement comes up all the time, so you might as well get used to it.

By the way, if you have an operator called the Schur complement, the Cholesky factorization is, like, two lines, because you just keep calculating the Schur complement of what's below you, or something like that.

So that's the Schur complement, times X2, equal to this right-hand side.

Now, this suggests a method for solving this set of equations, and you do it this way. You first form A11 inverse A12 and A11 inverse B1 - and by the way, I have to say something about this, because this is already high-level language. When you say "form this," how do you do it? Do you actually calculate the inverse of A11? It depends on what A11 is. What this really means is the following: you factor A11 once, and you back solve for B1 and then for each column of A12. Okay?

So it's understood, when you see this in pseudo-code, that you know what you're doing and that you don't form this inverse and all that kinda stuff. It's just understood. That's what it means.

Student:[Inaudible] in Matlab because we can wait to do that?

Instructor (Stephen Boyd):Yeah.

Student:Can Matlab store that [inaudible]?

Instructor (Stephen Boyd):Yeah. So let me - this is coming up on your homework, which we'll put on the web server. That is, if it's up. Sorry about the AFS outage yesterday. I don't know how many people noticed it. Anyone notice it yesterday? I think it raises our blood pressure a lot more than yours. So I gotta learn, actually, to treat it like a power outage and just deal with it. But it makes you very nervous. That's my, like, number six worst nightmare: final exam, AFS outage. Or something like that. I don't know.

Okay. What were we talking about? Factor-solve. Yeah, yeah. Matlab. All right. Yes. Of course.

Okay. Let's suppose that you needed to do a Cholesky factorization - okay. So the way you do it is this: you're gonna factor a matrix A. Here's what you do, and this will work. First, I have to tell you a little bit about what backslash does. Okay? Which you could figure out anyway.

It does a lot of things. I'll just give you what's essential for what we're talking about now. Here's what it does. First of all, it checks if A is sparse or something. But even if A is full, it looks to see if A is upper or lower triangular. Okay? If A is upper or lower triangular - actually, more than that: if it can be permuted to upper or lower triangular, but let's just say if it's upper or lower triangular - it will actually do a back solve. Okay?

So for example, that means if A is 3,000 by 3,000 - that's a big enough matrix that you will see a macroscopic difference - and I type A\b with A full, it's gonna take order N cubed. I don't know how long it'll take - 20 seconds, 15, 12 - something like that. Okay?

I mean, obviously, it depends on the machine you're on. But it's gonna take a macroscopic amount of time. You'll hit carriage return, and you will wait a little bit before you get the answer. Okay?

If A is 3,000 by 3,000 and lower triangular, what will happen when I type A\b?

Student:N squared?

Instructor (Stephen Boyd):Yes. It's N squared, and I won't even see it. I'll hit carriage return, and it'll take more time to interpret the command and to instruct X, or whatever is drawing your window, to type into that terminal. That's gonna take way more time than solving it.

So it'll go from, like, ten seconds to nothing. In fact, we can guess how much faster it'll be. It's gonna be on the order of 3,000 times faster - that's just the order. But in fact, the constants are different by a factor of two, so it'll be about 1,000 times faster. That's my guess.

Okay. So this is the first thing you need to know. Okay.

Now, suppose you wanna solve A xi equals bi in Matlab, where you have a bunch of these. Okay? So this would be intensely stupid - I mean, it would work, but anyway, that doesn't matter. If you did this - I'm just writing something that's in between; I don't even know what it is.

But anyway, if you did this here, what do you think it'll do? It'll look at A fresh each time and go, "Great. Let's factor that guy." Right?

It'll even recalculate the permutations fresh every time. Okay? So this is not the way to do it. Okay?

If you do this, though, it will actually do the right thing. If you do that, it'll do it. This is nothing but a batch solve. That's all it is, because the columns of A inverse B, where B is this thing, are nothing but A inverse b1, A inverse b2, and so on. So you can call this a batch solve or whatever you want.

So if you write A\B, it'll do the right thing. It'll factorize A once, and it'll go through the columns and back solve.

Actually, to tell you the truth, even that is not exactly what it does. It again blocks it out and calls an LAPACK code. But these are fine details.

This'll do the right thing here. Okay?

So for example, what is the run time of this versus this? What's the run time of those two?

Student:[Inaudible].

Instructor (Stephen Boyd):Identical. Basically identical. It's not exactly identical, but it's basically the same for all practical purposes. Okay? So that's an important thing to know.
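A minimal Matlab sketch of the two versions being compared - the loop re-factors A for every right-hand side, while the batch solve factors once (B here is just the right-hand sides stacked as columns):

% slow: k factorizations
X = zeros(size(A,1), k);
for i = 1:k
    X(:,i) = A \ B(:,i);
end

% fast: one factorization, k back solves; essentially the cost of one solve
X = A \ B;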

Now, we can put all these together. Suppose you need to factor A once, but you don't have all the things you wanna back solve at the same time. Then you can't do this trick - it's not a trick, I mean - okay. Here's what you do. Now we'll get through it, and I'll show you how to do it.

You would do something like this. I'm just gonna do this approximately. By the way, I think this will give you the upper triangular version of L, but it hardly matters.

So you do this once. Okay? And the cost of this is N cubed - is that the count? I think it is. Okay, there. That's an N cubed thing. Okay?

So for 3,000 by 3,000, there'll be a noticeable delay when you do this. Okay?

Then - well, let's see how I can do this. You have L L transpose equals A. So A inverse b is L transpose inverse times L inverse times b. Is that right? Yeah. So here's my - ready for my code now? If I do this and store L, then anywhere else from now on, if I wanna solve AX equals B, I can do it super fast like this. Let's see if I can get this right.

There. There you go. I can execute this anywhere I like with different Bs, and it will execute super fast - N squared. Do you see why? I mean, there's a lot of interpretive overhead, so the whole thing - this is a bunch of stupid overhead. Basically, it looks at this - and this is just the formula for that. Right?

So it looks at this, and it says, "Okay. Let's look at L." And it looks at L, and it pokes around for a while. And then after a while, it says, "Say, L is lower triangular. Oh. Okay. I'll call the back solve instead of factoring it. It's already factored," or something like that.

So we'll do this. Then this will be passed to this thing, and it'll look at that, and it'll analyze the matrix as if it's never seen it before and go, "Hey. Wait a minute. That thing is upper triangular. Oh. Okay. I'll just go ahead and -."

By the way, all of this thinking takes way longer than actually doing it, unless the matrix is a couple thousand by a couple thousand. Right?

So - and then it's fine. But the point is that this will now execute very fast. So I think I just answered your question.
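In Matlab, the factor-once idea above looks roughly like this - a sketch, where chol returns the upper triangular factor R with A = R'*R:

R = chol(A);            % order N cubed factorization, done once and stored
% ... later, as many times as you like, with different right-hand sides:
x = R \ (R' \ b);       % two triangular solves, order N squared; backslash detects the triangular structure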

Okay. All right. So let's get back to block elimination, which is what this is called. So what you do is you form A11 inverse A12 - and actually, the comment I just made is very important, because it's about to come up right now.

These two, you would compute by factorizing A11 once and then back solving on A12 concatenated with B1. So this line would translate into a very compact thing. Okay.

Then you form the Schur complement. Now, to form the Schur complement, the point is that you've already calculated these columns. So this is just a matrix multiply here and a subtraction. So you form the Schur complement, and you form B tilde - that's this right-hand-side guy here. This guy. The inverse appears again here, but you've already calculated the sub-expressions, so this is fine.

Then you determine X2 by solving this equation. That's the Schur complement

times X2 equals B tilde, and then you get X1. Once you get X2, you go back and plug it into the formula for X1 - wherever our formula was - here it is. Right here. From this guy. Okay?

Now, the interesting thing here is you have to do the following: you have to solve A11 X1 equals this. But guess what? A11 you already factored in step one, so this last step - step four - is actually a back solve. It is not a factorization; it's a back solve. Right?

So these are very important things.
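Putting the four steps together in Matlab - a minimal dense sketch (in practice A11 is where you'd exploit structure, and the block names are just illustrative):

% solve [A11 A12; A21 A22]*[x1; x2] = [b1; b2] by block elimination
[L,U,P] = lu(A11);                      % step 1: factor A11 once
Z = U\(L\(P*[A12, b1]));                % back solves: Z = A11^{-1}*[A12, b1]
A11iA12 = Z(:,1:end-1);  A11ib1 = Z(:,end);
S      = A22 - A21*A11iA12;             % step 2: Schur complement (a matrix multiply)
btilde = b2  - A21*A11ib1;
x2 = S \ btilde;                        % step 3: solve the smaller Schur system
x1 = U\(L\(P*(b1 - A12*x2)));           % step 4: just a back solve; A11 is already factored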

Student:I have a question.

Instructor (Stephen Boyd):Yeah.

Student:Why does the Schur complement arise? Like, what's the common thread?

Instructor (Stephen Boyd):Oh.

Student:Is it just coincidence? Or is there, like [inaudible]. [Crosstalk]

Instructor (Stephen Boyd):No, it's not coincidence at all. There's a reason for it, and it has lots of interpretations - I think you did a homework problem on it, anyway. It comes up whenever you have a block set of equations and you solve for one set of variables in terms of the other. That's basically where it comes up. And it comes up in a lot of physical things, too, like in networks - it just comes up all the time in all sorts of things. PDEs - it comes up everywhere, basically. Whenever you have some linear system and you impose some boundary conditions on some of the variables, a Schur complement is gonna come along.

It comes up in statistics, too. This happens to be the error covariance of estimating X1 given X2, or the other way around - do you remember which it is? One or the other. If I have jointly Gaussian variables and I look at the conditional expectation of X1 given X2, then the conditional covariance is this thing. So it just comes up everywhere.

So - okay. So if we work out this block elimination and see what happens - you have a big drum roll, you work it out, and you think it's all very clever. When you total it up, you get something like this. And we're gonna do it in terms of factorizations and back solves. You get something that looks like that. Okay? And we're going to assume, by the way, that the smaller system - the Schur complement one - we're gonna solve by dense methods. So to pay for the dense method, it's gonna be N2 cubed.

So you get something that looks like this - N2 cubed - you get this. Now, if you just plug in a general matrix A, it comes out identical to what we had before - not just to leading order but, as it has to be, down to the flop. So there are no savings. There's absolutely no reason to do this - well, sorry, in terms of flop count for a dense matrix.

Now, the truth is this is exactly how it's done when you do blocking and stuff like that. This is how you get fast, because you arrange for the Schur complement block size to be just right for your registers and your L1 cache and all that kinda stuff. You get that size just right, and this thing will give you a big speedup. But that's a secondary point.

Now, this is useful - block elimination is useful - exactly when A11 has extra structure that can be exploited. Okay? When it does, this method is gonna speed you up like crazy. So let's look at some of those cases.

Here's an example. We'll do a couple of examples, actually, because they're very, very important.

Suppose A11 is diagonal. That means that your matrix looks like this - I'm just gonna draw a pattern like that. Okay? If it looks like that, then you're gonna do unbelievably well, and here's why. Let's just put some numbers on here. Let's put 10,000 here, and let's put 1,000 here. Something like that.

You're gonna do really, really well. So I have 11,000 equations, 11,000 variables. But it's blocked in such a way that A11 is diagonal. Okay?

And by the way, this will have meaning in any application in which this arises. It basically says there are 1,000 common variables, and everybody depends on them - if you look at every equation, every variable, they all depend on those 1,000 common variables.

But then there are 10,000 variables that don't depend on each other. And all I'm doing is providing the English description of this sparsity pattern. That's all I'm doing. Okay?

So if you see this - actually, from this class onward, if you see this, some big green lights should light up. This is what you want to see. If you see something like this, you will know this will scale to unimaginably large systems, and it'll scale very gracefully.

Now, the reason is this. You just go back and look at what we're gonna save - and you can actually even determine - I lost my - oh, here it is right here. All right. So let's actually see what happens in this case. A11 is diagonal. Well, this is, like, super cheap right here. That's a non-issue. Don't even think about it.

And - by the way, here's the joke - forming S is now the dominating thing. So in fact, if you were to profile the code that solves this, you would find that most of the time it's doing matrix multiplication. It would actually be forming this. Okay? That's gonna dominate completely, and in fact, it's gonna be a 1,000 by 10K times a 10K by 1,000 matrix multiply - that's it.

In the end, you'll have to solve a 1,000 by 1,000 system, but that's gonna be 1,000 cubed, and forming S was 1,000 squared times 10,000, which is ten times more. So we know exactly the number of flops it's gonna take. It's gonna be multiplying this matrix by that one, like that. That's it. And it will be seriously fast.

If you were to take a matrix like this and just pass it to backslash stored dense, it's not gonna work. I mean, basically, it'll just sit there - actually, it'll very happily start factoring this. Very happily. This structure will not be recognized, and it'll start factoring things, and most of the time, of course, it's multiplying zero by zero, adding it to zero, storing it very carefully and then accessing it again later - pulling it back in. But it's very happy to do so. No problem. It'll just do it, and it will just never finish. Okay?

So - by the way, this is called arrow structure. So it's good. It's a good thing. So from now on, arrow - by the way, let's do another example. Suppose, in fact, this 10K block was ten 1K by 1K blocks. Okay? I can't fit ten in there, but you get the idea. Okay?

By the way, there's a story behind this. It basically says, "I have 11 groups of 1,000 variables each." Okay? One of those groups connects to everybody - that's this last group. The other ten only connect to that last guy. So in fact, you should even visualize the following graph. Right? It looks like that, except there are ten of these. Okay?

So each node here on the graph is a group of 1,000 variables. This group does not depend on that one, but one of them depends on all of them. How do you solve this? You do block elimination, and this'll be way, way fast, because how do you do A11 inverse?

A11 is 10,000 by 10,000. What's the effort to factor A11 if it's ten 1,000 by 1,000 blocks? Well, you don't calculate A11 inverse as a matrix of numbers. How do you calculate A11 inverse A12? How do you form this expression? What's the cost to factorize a block diagonal matrix?

Yeah, you factorize each block. So this is ten times 1,000 cubed divided by three. That beats 10,000 cubed over three like crazy - by a long shot. Okay? So that's very important.
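For the earlier case where A11 is actually diagonal, the whole block elimination collapses to a few lines - a sketch in Matlab, with A11 stored as a vector d of its diagonal entries (names illustrative; the elementwise division relies on implicit expansion):

A11iA12 = A12 ./ d;                 % A11^{-1}*A12: just scale each row, no factorization
A11ib1  = b1  ./ d;
S  = A22 - A21*A11iA12;             % small Schur complement; this matrix multiply dominates
x2 = S \ (b2 - A21*A11ib1);
x1 = A11ib1 - A11iA12*x2;           % recover x1 from quantities already computed

For the block diagonal case, you'd do the same thing, but factor each 1,000 by 1,000 block separately instead of dividing by d.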

Okay.

Student:So is the moral of the story don't use A\b and just go through this algorithm in your own script?

Instructor (Stephen Boyd):No. Now I'll tell you a little bit about this. Okay. So having said all that, for this example, here's what would happen. It would probably work. Suppose you have a matrix whose sparsity pattern looks like this. If you just did A\b with the matrix stored full - too bad. It'll never work. It's 11,000 by 11,000 - never, never. Okay?

However, suppose you gave, say, Matlab or some other system the hint that this was sparse. If you say it's sparse, all you're saying is that there are a bunch of zeros. You see this giant triangle here and this giant triangle here? These are all zeros. Okay? That's a lot of zeros. There are a lot of non-zeros, though, over here. Right? But if you do that, it is overwhelmingly likely that it won't do a bad job, because the ordering method will actually get all of these guys first. It might not get the ordering exactly right, but it's likely to do pretty well, because this arrow structure is one that a normal sparsity-handling system is probably gonna recognize. Okay? So in this case: declare it sparse, do backslash. You can do your own experiments, but they're not really very interesting experiments; they're just experiments on what some people implemented. The important part is, if you do real problems like this, you're calling these things directly, and then, of course, it's your job to do this.

But now, let me mention some other cases where it's not obvious. Okay? Here's one. Let's suppose you do medical imaging, for example - MRI or something like that - and you have to solve an equation that looks like this. Okay? Let's make this one million variables, and let's make this 1,000 here. By the way, I have no idea what the application is, but let's just imagine that such a thing is there. And this one million by one million block here - if it were, like, diagonal, you know what to do. We just did it. I'm gonna make it, though, a DFT. Okay? A discrete Fourier transform. That's a dense matrix - completely dense.

So you could attempt to form your million-plus-1,000 by million-plus-1,000 matrix and actually attempt to solve it. It's completely dense. Everyone agree? Totally dense. Okay. The gods of sparsity are not gonna help you out here - you are on your own on this one. Okay. However, how fast can you solve this system? You can't even store this, right? All you have to store is this part, right? Even this is pushing it, but that's your fault for going into medical imaging. If you do medical imaging, you're not gonna do this on something with a couple of gigs of RAM, right? Obviously, we can work out what that is. But this part is perfectly storable otherwise. You don't store this block, obviously. How do you solve this?

Student:[Inaudible].

Instructor (Stephen Boyd):Yeah. Of course. So you look at this, and you go back, and you say, "Well, what do I have to do?"

And you go, "Well, I have to apply A11 inverse a couple of times."

By the way, you do not solve a discrete Fourier transform equation by forming a factorization - if it's a million by a million, you don't even form a factorization. It doesn't matter. You actually apply the inverse DFT, which is N log N. So it's, like, 20 times a million flops. It's way, way fast. Okay? Everybody got this?
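Concretely, in the block elimination above you never form the DFT block - a rough Matlab sketch, assuming A11 acts as the DFT (so applying its inverse is just ifft, column by column), with the other block names purely illustrative:

A11iA12 = ifft(A12);                % A11^{-1}*A12: one inverse FFT per column, N log N each
A11ib1  = ifft(b1);
S  = A22 - A21*A11iA12;             % 1,000 by 1,000 Schur complement
x2 = S \ (b2 - A21*A11ib1);
x1 = ifft(b1 - A12*x2);             % another inverse FFT instead of a back solve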

By the way, these are not obvious things. They're known by lots of people, but probably, by more than an order of magnitude, more people who should know these things don't know them. So that's why I'm telling you.

In fact, I cringe whenever I hear people talking, in very unsophisticated ways, about scaling. It's like, "Well, I tried that. It's way too slow."

Well, anyway. By the way, the difference between this and just typing A\b - the technical name for that is - that's called stupid or naïve linear algebra, and this is called smart linear algebra.

Smart linear algebra means you kinda know what you're doing, and you're exploiting structure at this level. That's what it means. So -

Student:How do you [inaudible] in English?

Instructor (Stephen Boyd):How do you communicate it - well, it'll just end up being sparse. I mean, do you mean in Matlab, how do you do it?

There's a very bad way, and then there's a good way. You make it sparse in the beginning, and then it'll be preserved as you do the right operations on it.

You have to be very careful in Matlab because, like, one false move - one little character here, and boom, you have a dense matrix. You add the identity to it. Now, adding the identity to a sparse matrix, you've got a sparse matrix. Right? Not in Matlab, you don't. So you have to know: oh, you shouldn't add eye, you have to add speye - that would be the sparse identity.

Student:[Inaudible].

Instructor (Stephen Boyd):Right. Now, of course, that makes you wonder under what circumstances the identity is not sparse, and I don't know of any. Maybe the two by two case, because then you could say, "Well, only 50 percent of the entries are non-zero."

I don't know. It's not my - but anyway. Okay.
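The Matlab pitfall being described, as a tiny sketch:

A = speye(1000);           % sparse identity
B = A + speye(1000);       % sparse plus sparse: stays sparse
C = A + eye(1000);         % sparse plus full: the result is full
[issparse(B) issparse(C)]  % 1 0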

So I think that's it. Basically, you wanna look at blocks. If you look at a block and you see a block that's easily solved, exploit it.

By the way, if what's easy to solve about it has to do with its sparsity, you can just be lazy - let whatever sparse solver you're using handle it, make your offering to the gods of sparse matrix factorization orderings - the heuristics for the orderings - and hope that it works. But if it's dense or something like that, you're gonna have to do the work yourself. That comes up in a bunch of cases, one of which we're about to look at right now. But that's the idea.

Letís quickly talk for fun about some fast Ė what systems can you solve fast? I mean, weíve already looked at some obvious ones: diagonal, lower triangular Ė we just talked about one a few minutes ago: block diagonal. How about, like, block lower triangular? Can you do that fast?

Yeah, of course you can, because if it's block lower triangular, you can do forward substitution at the block level, and you'll have matrix solves to do on the diagonal blocks. Okay?
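A minimal sketch of that block forward substitution, in Python with NumPy; the dense solve on each diagonal block is an assumption, since each of those solves could itself exploit structure:

    import numpy as np

    def block_lower_solve(blocks, b_parts):
        # blocks[i][j] is the (i, j) block for j <= i; b_parts[i] is the i-th piece of b
        x_parts = []
        for i, bi in enumerate(b_parts):
            r = np.array(bi, dtype=float)
            for j in range(i):
                r = r - blocks[i][j] @ x_parts[j]              # subtract the parts already solved
            x_parts.append(np.linalg.solve(blocks[i][i], r))   # solve with the diagonal block
        return x_parts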

We talked about another one, the FFT. Sorry, that's the algorithm. The DFT, the discrete Fourier transform, is the operation; the FFT is an algorithm which does it fast.

Others would be things like the DCT, the discrete cosine transform, that you'd use in imaging. Anybody know others that are fast?

Student:Toeplitz?

Instructor (Stephen Boyd):Which one?

Student:Toeplitz.

Instructor (Stephen Boyd):Toeplitz, yeah. Toeplitz is a good one. So there are fast algorithms for Toeplitz matrices, and they all have names. And usually, people are tortured by having to learn them operation by operation; Levinson-Durbin is one of them. Did they teach this in statistics? Probably. No? Okay, good. Oh wait, it doesn't matter. For Toeplitz? No. Okay.
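For what it's worth, SciPy exposes a Levinson-type Toeplitz solver directly, so a small sketch with random data, purely illustrative, is:

    import numpy as np
    from scipy.linalg import toeplitz, solve_toeplitz

    n = 2000
    c = np.random.randn(n)                                  # first column of the Toeplitz matrix
    r = np.concatenate(([c[0]], np.random.randn(n - 1)))    # first row
    b = np.random.randn(n)

    x_fast = solve_toeplitz((c, r), b)            # Levinson-type recursion, much cheaper than n^3
    x_check = np.linalg.solve(toeplitz(c, r), b)  # dense solve, only here to check the answer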

Hankel matrices are another one; it goes like this.

Student:Circulant.

Instructor (Stephen Boyd):Circulant, yeah, circulant. How fast can you solve a circulant system?

Student:N log N.

Instructor (Stephen Boyd):N log N. Right, because you take an FFT, multiply, and FFT back. And a circulant matrix, of course, is circular convolution. Ordinary convolution, well, that's Toeplitz; we've already covered that.
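A sketch of that circulant solve, using the fact that the FFT diagonalizes circulant matrices; the data here is random and purely illustrative:

    import numpy as np

    n = 1 << 16
    c = np.random.randn(n)     # first column of the circulant matrix C
    b = np.random.randn(n)

    # C x = b is a circular convolution, so in the Fourier domain it reads
    # fft(c) * fft(x) = fft(b); the whole solve is a few FFTs, i.e. n log n work
    x = np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)).real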

There's a bunch of others, and it's fun to actually poke around and learn about them. Like, there's one called the fast Gauss transform that comes up in machine learning. That's, like, way fast.

So there's a bunch of them that are, like, order N. And there are whole fields that can be described as fast methods of solving AX equals B.

For example, if you solve an elliptic PDE, you are solving a very specific AX equals B. Okay? And those can be solved, some of them, insanely fast. It's not easy to do. It is not trivial. But for example, suppose you go to someone and say, "I have this grid. Like, I have an image," let's say. Right? For example, suppose you have an image, and all the constraints only connect a pixel and its immediate neighbors. And you say, "I have to solve a set of linear equations that looks like that."

It'll have a very characteristic look if you look at it. It'll be, like, stripes and things like that. And if you do PDEs and things, you'll just immediately say, "Ah. Okay."

But in fact, solving that equation, that's the one-one block for an image problem. Let's take the image problem you're doing, or will do, or have done, in homework eight. Seven? What are you doing now, anyway?

Student:Six.

Student:Six.

Instructor (Stephen Boyd):Oh. Well, that shows you what we're thinking about. All right. So suppose we scale it up to 1K by 1K, so you have a million variables. Then in each step of that algorithm, the 20 steps, you will solve, I promise you, a million by million equation. Definitely.

It's super duper sparse. Right? Because each row and column will only have a handful of entries, because you're connected to yourself and to your neighbors above, below, left, and right. Right?

So, super sparse. By the way, generic sparse orderings won't do so well. It does okay for your homework, or it should, anyway. It does. But it won't scale to bigger sizes very gracefully. However, if you look at it carefully, you'll realize that, in fact, solving that set of equations, that big A11 block for that problem, you know what it is? It's solving an elliptic PDE on a uniform grid.

Then you go to your friends. It's like a Poisson equation or something like that. So you go to your friends who know about PDEs, and you say, "How do you solve that?"

And they fall down laughing. And they're like, "Are you kidding? 1K by 1K? We precondition. Blah, blah, blah. Fourier transform. Multi-level."

I mean, whatever they do; there are probably some people here who do this. Okay? It's not easy. It's a whole field. Right? But the point is, they know how to do it. And then, after they calm down, you say, "Great. Could you just give me the code that solves that?"

And then wherever you have to do the A11 solve in the back substitution, you put that code there. Now you're done.
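Just to make the structure he is pointing at visible, here is a sketch that builds the 5-point Laplacian on a k-by-k grid with Kronecker products and hands it to a generic sparse direct solver. The grid size and right-hand side are made up, and a real fast Poisson solver, FFT- or multigrid-based, is what you would actually plug in at large scale:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    k = 200                                   # k-by-k pixel grid, so k^2 = 40,000 unknowns
    e = np.ones(k)
    T = sp.diags([-e[:-1], 2 * e, -e[:-1]], offsets=[-1, 0, 1])   # 1-D second difference
    I = sp.identity(k)
    A = (sp.kron(I, T) + sp.kron(T, I)).tocsc()   # 2-D 5-point Laplacian: about 5 nonzeros per row

    b = np.random.randn(k * k)
    x = spla.spsolve(A, b)    # generic sparse ordering; fine here, but it won't scale gracefully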

So, okay. Okay. That's enough on that.

Let's look at a consequence of this. It's the idea of a structured matrix plus a low-rank one. This is actually very important, and also extremely good to know. And this is something that you have to do yourself. In other words, you had better recognize it, because no automatic system can do this. Period. So you have to do it.

So here it is. Suppose you need to solve A plus BC, times X, equals B. Normally, B and C are small, so you're supposed to think of BC as low rank, like rank p. B is a skinny matrix, C is fat, and A is something that's quickly invertible. A could be the identity, or diagonal, block diagonal, lower triangular, you take your pick. It could be something you solve as a Poisson equation. It doesn't matter. It's something you can solve fast. Circulant, who cares?

Okay. And you want to solve this equation. Now, here's the problem. Generally speaking, whatever structure was good about A is almost certainly destroyed by adding the low-rank term. In fact, just add a dyad, a little B, a little C, and it'll destroy anything. It'll destroy a DFT matrix or a DCT matrix; Toeplitz structure is totally blown away. Diagonal? Forget it. You add an outer product and it sprays non-zeros everywhere. Block diagonal? Ruined. So that's kind of the idea.

Okay. By the way, once someone calculates this and hands you this matrix, it's all over. If they don't tell you it's A plus BC, once they multiply this out and hand it to you, it's all over. You'll never recover. You'll never get any speed out of it or anything. You're stuck with a general dense N by N matrix now.

Okay. So let's do this. What you do is what we did in duality: you uneliminate. You introduce new variables. It's strange. So here's what you do. You write this as AX plus B times (CX) equals B, and you introduce a new variable called Y, which is CX. You uneliminate.

Then you write this as AX plus BY equals B, and then CX minus Y equals zero. That last equation says that Y equals CX.

So these two systems are completely equivalent. If you solve one, you can solve the other. Okay?

Now when you look at this, you should have the following urge: you should look at a matrix like that, and your eye should gravitate towards the minus I. Don't ignore this. Your eye should gravitate towards this minus identity, and you should have an overwhelming urge to carry out block elimination, because that's a matrix that is easy to invert. They don't come simpler than I. Right?

So you should have an overwhelming urge to eliminate this. By the way, if you eliminate this I, what will happen? You'll take the shortcut, and you'll go, "Oh my God. This is great. This is fantastic. That's an I. I now know that if I see an easily inverted matrix appearing as a block on the diagonal, I know what to do."

You do it, and you'll form the Schur complement. That's the Schur complement. Okay? So by the way, if the I were huge and A were small, great. But of course, this is [inaudible]. So if you eliminate this block first, you're actually back where you started.

So now the trick is: you have to actually hold off on the urge to eliminate an easily inverted block. You must hold off. This is very important. Instead, you eliminate A. Okay?

So that's the idea. And actually, let me explain the reason. The reason is this: in this application, I is small. Let's say A is a million by a million, but B and C are a million by ten and ten by a million. That's what they are. In that case, this I is ten by ten. So it doesn't really help you to eliminate it; it's this million by million guy you want to go after. And A, for example, could be a DFT, a discrete Fourier transform, or something like that. That's the one you want to go after.

If you eliminate that, you get this: I plus C A inverse B, times Y, equals this. And this we can actually handle efficiently. Why? Because you have A inverse applied to B here, and that's fast because, somehow, you have a fast method for doing A inverse. You form the smaller system, and then you solve that.

This has got lots of names. It's called the matrix inversion lemma, but it's got lots of other names. If you're British, it's called Sherman-Morrison-Woodbury, or any subset of those, depending on whether you went to, like, Oxford or Cambridge or something like that. Okay. So that's one. And there's one more; I think if you came from MIT, there's some special name you have for it. It doesn't matter. You'll hear another name for it. What's it called? Anyway, I'll think of it. It hardly matters. The point is you'll hear lots of people talk about it, and sometimes it's just written out as a formula: the inverse of A plus BC is equal to this, assuming A is invertible and A plus BC is invertible. You get this formula. And here are the types of things you should know: it's a staple in signal processing. In fact, essentially most signal processing relies on this. This is basically the one non-trivial fact of all signal processing up to about 1975, I'd say. It's just this. By the way, this is also the trick for optimization.

So let me explain what it is. The most classic case is something like this: diagonal plus low rank. So here, well, I can't use a lowercase b, so P Q transpose, there you go. Suppose you need to solve that, something like that. Diagonal plus low rank is a beautiful example of a matrix where, if you know what you're doing, you can solve the equations so fast it's scary, but if you don't know what you're doing, it's a total disaster. So this one you can solve. In fact, what is the order of solving that? Order N. It's just order N. There's nothing here to do; it's just order N. So if someone says, "Don't you understand? I have to solve a million by million equation," you go, "Yeah. I understand." And they go, "It's totally dense. Dense! A million by a million! That's like 10 to the 12 entries in my matrix." And you go, "Yeah, I know. But you can solve it in about half a second. Maybe less." Okay? But you have to know what you're doing to do this. Okay? Nothing will recognize this automatically.

By the way, you might ask, "Is it easy to recognize a matrix which is diagonal plus low rank?" I guess you know the answer, but I'm asking. If it were easy to recognize, you could imagine some higher-level processing that would look at the matrix; it could think a while, and if it's a big matrix, that might be worth doing. So it could look at it for a little while and go, "Look at that. You're lucky. It's diagonal plus low rank." So what do you think? Is it easy to determine diagonal plus low rank? No. It's a famous problem in statistics, and it comes up in lots of other areas. It's impossible. Okay? Let me just ask this, just to see if we're on the same page here. How about alpha I plus P Q transpose? Is that right? Am I asking the right thing? I think I am. All right.
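Before going on with that recognition question, here is a sketch of the order-N diagonal-plus-low-rank solve just described, written in Python with NumPy; the sizes and the random data are made up, and this is the uneliminate-then-eliminate-the-big-block recipe rather than any particular library routine:

    import numpy as np

    n, p = 200000, 10                   # huge n, tiny p
    d = np.random.rand(n) + 1.0         # D = diag(d), the easily solved part
    P = np.random.randn(n, p)
    Q = np.random.randn(n, p)           # the low-rank part is P @ Q.T
    b = np.random.randn(n)

    # Solve (D + P Q^T) x = b without ever forming the dense n-by-n matrix.
    # With y = Q^T x, eliminating the big block leaves the small p-by-p system
    #   (I + Q^T D^{-1} P) y = Q^T D^{-1} b
    Dinv_b = b / d
    Dinv_P = P / d[:, None]
    y = np.linalg.solve(np.eye(p) + Q.T @ Dinv_P, Q.T @ Dinv_b)
    x = Dinv_b - Dinv_P @ y             # back-substitute: x = D^{-1} (b - P y)

Everything here is a handful of length-n operations plus one tiny p-by-p solve, which is the order-N claim.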
By the way, I'm not sure I know the answer to this one, but let's try it anyway. What I'm about to say, my example, might be just for the symmetric case, but let's just try it anyway. How about a multiple of the identity plus rank one? Can you detect that in a matrix? How?

Student:[Inaudible]. [Crosstalk]

Instructor (Stephen Boyd):[Inaudible]. The eigenvalues will do it. Thank you. Thank you. What are the eigenvalues?

Student:[Inaudible].

Instructor (Stephen Boyd):You got it. Okay. And that's if and only if. Right. So in fact, this we can determine. This we can determine. Okay? Because one eigenvalue will be different, and all the others will be equal to alpha. Okay?

Student:[Inaudible].

Instructor (Stephen Boyd):Correct. Right. So then we say, fantastic, we have a method of recognizing a multiple of the identity plus low rank. Okay? And all we have to do is calculate the eigenvalues. And what's the cost of that?

N cubed. And actually, with the constant factor, it's about five or ten times more than just solving the stupid equation. Okay.
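For the symmetric case he is hedging about, the eigenvalue test looks like this as a sketch, with made-up sizes and a random vector:

    import numpy as np

    n = 300
    alpha = 2.5
    p = np.random.randn(n)
    A = alpha * np.eye(n) + np.outer(p, p)   # symmetric multiple of identity plus rank one

    w = np.linalg.eigvalsh(A)                # n^3 work just to *recognize* the structure
    # n - 1 eigenvalues equal alpha; exactly one equals alpha + ||p||^2
    print(np.sum(np.isclose(w, alpha)))      # prints n - 1 (up to numerical tolerance)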

So, all right. Well, the main thing here is to think. But diagonal plus low rank? Forget it. So, okay.

So, all right. That's the picture. And you get just stunning speed-ups if you know this material.

By the way, I should say I went over this kind of fast; a slightly less fast description is in the appendix, which you should definitely read. And we're going to assume, moving forward, that you've read it and got it.

A lot of this, by the way, you can experiment with. I mean, you can experiment with it in Matlab. You have to be careful, though, because you're kind of experimenting with two things at once: half of it is the method, and half is what Matlab actually does and stuff like that.

So, okay. So I guess we'll quit right here.

[End of Audio]

Duration: 72 minutes