Instructor (Stephen Boyd):Great, I guess this means weíve started. So today, weíll continue with the conjugate gradient stuff. So last time Ė let me just review sort of where we were. It was actually yesterday, but logic, I mean, logically Ė in fact. But we can pretend itís five days or whatever it would be.
So weíre looking at solving symmetric positive definite systems of equations and this would come up in Newtonís method, it comes up in, you know, interior point methods, least squares, all these sorts of things. And last time we talked about, I mean, the CG Method the basic idea is itís a method which solves Ax=b where A is positive definite. And Ė but it does so in a different way.
The ways youíve seen before are factor solve methods. In fact, in those methods what you need is you actually need the matrix. So you actually Ė you pass a pointed to an array of numbers, roughly. And then you work on that.
Whatís extremely interested about the CG method is actually the way A is described is completely different. You do not give a set of numbers. In fact, in most of the interesting applications of CG you will never form or store the matrix A, ever, because, in fact, in most cases itís huge. Itís some vast thing. Thatís kind of the point. Instead, what you need is simply a method for calculating A times a vector.
So what you really are gonna pass into a CG method is a pointer to a function that evaluates this thing. How it evaluates it is itís business, itís none of your business. Thatís sort of how Ė thatís the idea.
Okay, well, last time we started looking at a little bit about this. We looked at two different measures of the error. So one measure is this number tao and itís the amount of decrease, F is that quadratic function youíre minimizing, youíve achieved divided by Ė sorry, this is the amount of the decrease, the sub optimality and decrease divided by the original sub optimality of decrease.
Thatís probably what youíre interested in. But another one which is probably in many applications whatís actually easier to get a handle on and all that sort of stuff is the residual. And the residual is nothing but, you know, b-Ax. Itís basically how far are you from solving Ax-b.
Now the CG method, and actually many others, actually work with the idea of a Krylov subspace. And this is just to sort of rapidly review what we did last time.
The Krylov sequence of a set of vectors is defined this way. Itís actually Ė you take this Krylov subspace and thatís the stand of b/ab up to some Ak-1b. And that essentially means itís all vectors that can be written this way. Itís B times a polynomial of A and that polynomials of degree K minus one. Thatís a subspace of dimension, oh, it could be K but it actually can be less. Actually, if itís less than K then it means it youíve solved the problem.
Okay, so XK is the minimum of this Ė this is the function youíre minimizing, this quadratic function on the Krylov subspace, itís the minimizer of that. And the CG algorithm and several others generate the Krylov sequence. Thatís actually the important part.
Now, in the Krylov Ė along the Krylov sequence obviously this quadratic function that youíre minimizing decreases. Thatís obvious because, in fact, youíre minimizing over a bigger and bigger subspace in each step and that canít get worse. Now the residual can actually increase, thatís not monotone.
Now it turns out if you run N steps you get X star, that follows from Cayley-Hamilton theorem. And the Kth iterate of the Krylov sequence is actually a polynomial of degree K-1 or less multiplied by B. Now the interesting part and the reason why these methods actually work really well Ė although the Ė by the way, thereís plenty of cases where youíre gonna do this for such a small number of steps that this is actually not that relevant.
Thereís a two term recurrence and the two term recurrence is this, you can compute the next point in the Krylov sequence actually as a linear combination of not just the previous one but the previous one and the one before that and then thereís some coefficient here and weíll see what the coefficients are later. Actually, theyíre not that relevant. What matters is that they exist and are easily calculated.
Okay, so weíve seen this and what I want to do now is look at the Ė to understand how the CG method works or how well it does. Itís extremely important to get a good intuitive feel for how it works.
When is it gonna work well? By the way, notice that you would never even say anything like that when you talk about, like, you know, direct methods. You wouldnít say itís good to talk about, you know, letís say, ďWell, hereís a 1000 by 1000 matrix. Oh, yeah, hereís one thatís gonna work well.Ē It always works well. You have positive definite matrix, you take a Cholesky factorization, it doesnít depend on the data. I mean, to first and Ė to 0 and 1st order does not depend on the data.
So okay. With these though you need to know when is it gonna work well because thatís actually the key to all of these things. So the way to do this is essentially to diagonalize the system. Then when you diagonalize itís kind of stupid because if A is diagonal and I ask you, ďHow do you solve Ax=b?Ē thatís easy. Thatís just the inverse, A is diagonal.
So the solution if I diagonalize is actually just this, itís just the transform B divided by the Ith entry in the transform A which is diagonal. This lambda thing here. So thatís very simple. But this is gonna give us an idea for when this works well. The optimal value is just this. Itís nothing but this thing. Thatís that A inverse B star, A inverse B. Thatís this thing here. Okay?
Okay, now the Krylov sequence in terms of Y is the same except now we can actually look very carefully at this thing because the polynomial of a matrix is really, really simple. The matrix is diagonal so a polynomial of it is simply the polynomial itís a diagonal matrix and each entry is a polynomial of that entry in the diagonal which are the eiganvalues of A.
So you get very simple expressions, all in terms of the eiganvalues now. So it says that PK is the Ė one way to say it is that itís the polynomial of degree less than K that minimizes this sum.
And notice itís got some positive weights, none negative weights here. And then over here you can kind of see whatís going on. If you look carefully at this you really want Ė well, weíll say what P should look like in a minute. P should look like, you know, one over lambda or something like that to make this work well.
Okay, so another way to write it, letís just keep going down here is weíll look at the second expression. The second expression says that this error is the minimum over Ė these are the positive weights, and then here you can see itís lambda I times lambda IP of lambda I minus one. And in fact what that says if you can make P of lambda look like one over lambda on the eigenvalues of A youíre Ė in fact if you could have P of lambda I equals one over lambda I itís done. This is zero and then that says that in fact this would have to equal F star. Okay?
So thatís the idea. And in fact, these Ė we saw already in the Cayley-Hamilton theorem that thereís an Nth degree polynomial which in fact we saw exactly what PN is. It has to do with a characteristic polynomial. It gives Ė itís a polynomial that in all cases satisfies P. The P of lambda I equals one over lambda I and thatís why this conversion ten steps Ė sorry, in N steps.
Now whatís interesting here is this is gonna give us sort of the intuition about when this works well. There are lots of other ways to say it. I mean, one is to say Ė well, look. A polynomial Ė something that looks like that is a general polynomial of degree K with the value that if you plug in itís probably if you plug in 0 this goes away, you get one.
I mean, I switched the sign on it. So thatís another way to say it is this way. There are lots of these. I wonít go into too many of the details but the important part are the conclusions. So here are the conclusions. If you can find a polynomial of degree K that starts at one thatís small on the spectrum of A then the Kth Ė then actually no matter what the right-hand side B is youíre F of XK minus F star is gonna be small.
And in particular this says that if the eigenvalues are clustered into groups, letís say K groups, then YK is gonna be a really good approximate solution because if I had K clusters of eigenvalues I can take a Kth order polynomial and put it right through, letís say, the center of those clusters or near the center of those clusters and then that polynomial would be really small on each of those clusters and weíll get a very good fit here.
Now thereís another way to do well and thatís to have YI star small for just a handful of things. So this says if your solution is approximate linear combination of K eigenvectors then YK is a good approximate solution. So thatís another way to say it.
Notice that this statement is independent of the right-hand side and depends only on the matrix A. This one now depends on the right-hand side but doesnít depend on the clustering. It basically says if youíre a linear combination of K eigenvectors then this must be a good solution.
So, okay. Now you can do things like this, this allows you Ė these are classical bounds. Classical bounds would be things like this; you would Ė suppose the only thing I told you about A was that its condition number is kappa? So I give you Ė letís say I can scale A. So I give you lambda min and lambda max and if I put a Chebyshev polynomial on there, thatís a polynomial thatís small uniformly across it on that interval, you end up with a conclusion that says this, it says this convergence measure that this thing goes down like that.
And this is actually Ė this allows you to Ė oh, by the way, a simple gradient method would have a kappa here and a kappa here. So if you just use the gradient method to minimize that function F youíd get kappa here and kappa here. And so youíre supposed to say, ďWow, thatís much better because you get square root here,Ē or something. So thatís the idea.
It turns out Ė this is interesting and thatís fine but it turns out actually that where you want to use CG is where in fact, I mean, this is like many other things itís an upper bound and in fact, you usually get convergence. Itís sort of much, much faster. Not to high accuracy but Ė
So letís do an example here. So these show you Ė this is the function Q. They all start at 1 here. Itís a 7 by 7 matrix. Now I think it goes without saying that CG is not the method of choice for solving a 7 by 7 positive definite system. Thatís something Ė I guess the time to actually solve a 7 by 7 positive system is down in the Ė itís definitely sub microsecond, you know, obviously. So this is not the right way.
Nevertheless, this is sort of the Ė weíll just look at an example. So here are the eigenvalues of A, I donít know, itís one here, I guess itís around two, a couple down here, another cluster here and an outlier out there.
After one step of CG the optimal Q is this thing. By the way, it goes without saying that you donít ever produce these. I mean, if you want keep track of these polynomials thatís fine, but you donít need to.
So this shows you that one and you can see it. Itís kind of doing its best. Itís not so small here nor is it small here and itís pretty big there. The quadratic is this green one and you can see it kind of Ė it gets small near this cluster and it kind of splits this cluster and this cluster and goes near it so that you get things like this.
The cube term, I think that might be the purple one here, is the cubic term and you can see now cubic is nice because itís sort of Ė you can see three clusters and it does just what you think. It goes down, goes right through this thing. Itís very small here, small here, small here, over here itís very, you know, quite small, still small, still small, and then kind of goes down there and itís small here. So actually you could expect that after three steps of this method youíre gonna get a very good solution.
The quartic I think is the red, maybe, I guess maybe thatís the red. Iím not sure. I guess thatís the quartic or something. And the quartic one as you can see goes through Ė is now actually picking off bits and pieces. Itís actually doing things like hitting both sides of it so itís small on all three here. Itís small here and now itís just nailing that one.
And then the seventh one is this one and thatís actually an interpolating polynomial so itís 0 on all of them and that means that at step seven you get the exact answer. Okay? So, I mean, actually Iím anthropomorphizing this, obviously. So well, all thatís actually done is the Krylov sequence is computed. But this gives you a rough idea of how this works.
Okay, actually youíll know shortly why it is that you need to know how well CG works because itís gonna be your job to change coordinates to make CG work. Weíll see that in a minute.
Okay, so hereís the convergence and sure enough, you know, you start with thatís the full decrease, and you can see this sort of after five steps youíve done very well and I guess after four youíve done extremely well and so on. So thatís the picture. Okay, hereís the residual which in fact does go down monotonically. It didnít have to but it did in this case. Now look at Ė I mean, thatís a fake example. Letís look at a real one.
Hereís an example, itís just a resistor network with 100,000 nodes. We just made Ė itís a random graph so each node is connected to an average of ten other nodes. So, you know, some big complicated resistor network.
So, again, a million nonzeros in the matrix G. And I pick one node and I make that the ground. Okay? Then Iím gonna do the following, Iím gonna inject at each of the million nodes a uniform current, a current thatís chosen randomly uniform on 01 and the resistor values will be uniform on 01. Okay?
So thatís our Ė thatís the problem. It doesnít matter. But in this case if you want to use a sparse Cholesky factorization Ė actually, before you ever do it youíll know what happens because you can actually do the symbolic factorization on G and actually calculate the number of fill-ins. Okay?
So in this case it required Ė there was too much fill-in and I donít know Ė I donít know how big it would be but maybe, I donít know, 50 million, 100 million or something like that, nonzeros starting from 1 million. Okay?
So Ė oh, I should mention this, if the number of nonzeros goes up by a factor of a 2 you are to consider yourself lucky in a sparse matrix thing, right? And that means you must go and make an offering to the god that controls the heuristic of sparse orderings and of ordering in sparse matrix methods.
Ten, you know, thatís okay. Youíre supposed to be happy or thatís typical. You start getting to fill-in factors like 100 and stuff like that and thatís because you didnít go make an offering the last time sparse matrices went your way.
So, all right. So in this case this problem you canít solve with a sparse Cholesky factorization and I actually shouldnít say that. I should say using the approximate minimum degree ordering method produces a Cholesky factor with too much fill-in.
Now, of course, on the big machine I actually could have done it and would have gotten the exact answer but it would have been really long and taken a lot of time. It might be that thereís another heuristic ordering method that would work perfectly well here. I doubt it but anyway, thereís lots of them.
Okay, so instead weíll use CG here. Now in this case I do form the matrix G explicitly and all I have to do at each step is I have to multiply by G. But thatís just a sparse matrix vector multiply is a million nonzeros so Iím doing like a million flops per matrix vector multiply and thatís a dominant effort of a CG iteration. So I donít know, how much time does that take?
Student:About a second.
Instructor (Stephen Boyd):A second. Man, weíve got to work on you guys. This is not cool.
Instructor (Stephen Boyd):What?
Student:Less than a millisecond?
Instructor (Stephen Boyd):What?
Student:Less than a millisecond.
Instructor (Stephen Boyd):Thank you, less than a millisecond. So around a Ė letís just say a millisecond and letís just get the order of magnitude right, millisecond. Okay? So matrix vector multiplies a millisecond here, roughly. Is that right? Thousandths of Ė yeah, sounds about right, right? Yeah, sure. Weird. Isnít that strange? Man, these computers are fast. Okay, that shocks me.
All right, so itís a millisecond, matrix vector multiply. And it might be a few milliseconds because of all sorts of, you know, issues with accessing memory and stuff like that. But if itís set up right and youíre lucky itís on that order of magnitude.
Okay, so hereís how CG runs and this is a residual here. So Ė and you can see that, well Ė for ten steps the residual actually goes up by a factor by 100 Ė generally considered not good. And then it goes back down again. But the wild thing is Ė the theory tell us the following, the theory says that if you run it one Ė what did I - was 100,000? Yeah.
The theory says if you run it 100,000, you know, the millisecond doesnít sound right to me but Iíll have to think about that. I think thatís Ė I have to do that for each one or something? Anyway, maybe it is right. It doesnít matter.
The theory tells you that K here runs out to 10 to the 5; weíll get the solution exactly. But the wild part is, is if youíre not picky about super high accuracy you actually have a perfectly good solution in 50 steps.
Each step was a matrix vector multiply and if our estimate of a millisecond is right it means you just solved a very large, you know, diffusion equation, Poisson equation, whatever you want to call it. You just solved it in a quarter second. I mean, if weíre right or, you know, something like a tenth of a second. Everybody see whatís going on here? By the way, absolutely nothing in theory guaranteed that this would happen, absolutely nothing. Okay? It just did. And this is very common.
So Ė and this is kind of the cool part about CG is that in a shockingly small number of steps you often Ė what emerges is something that looks kind of like the solution.
In this case it doesnít look like the solution; itís actually, like, quite good. Okay? So thatís the idea. So if you wanted to know at what point have you learned something new, you just did. You cannot Ė if you go just type this thing, you know, G\I or whatever, and let a sparse, you know, even if you have a big computer itís gonna take a long time and a lot of RAM. And then thisíll just get a pretty good answer in 50 steps and just be way, way shorter. Okay?
Okay, so here is the CG algorithm. There are many variations on it. People seem to focus more on the algorithm itself than actually on the Ė what the algorithm produces. In my opinion itís much more important what the algorithm produces which is the Krylov sequence and as many Ė there are other methods to produce the Krylov sequence and they have different round up properties and things like that.
Let me show you what those are. So hereís one and this is Ė instead of just sort of making up my own I just followed exactly one from a very good book on the subject by Kelly. So just to make it Ė not to invent new notation or anything, itís this.
And the only important part you need to know here is something like this; you maintain the square of the residual at each step. If the residual is small enough, you quit. So this quitting criterion epsilon is on the ratio, itís what we call ada; itís on the ratio of the current residual to the norm of B. Thatís this thing.
And what you do now is something like this. Your search direction is gonna be something like P and itís a combination of both the current residual and the previous search direction. Then all of the effort in here, well, not necessarily but in most cases is right there, this one thing right here. This is where you call the mult by A method is called right here.
Everything else if you look here is actually sort of an O of N or a blass level one or however you want to call it, call. So, for example, here you update a vector thatís O of N, thatís O of N, you have to calculate an inner product.
Actually, if you want to parallelize this thatís the one that is really irritating. Everything else here can be done in a Ė is completely distributed. So the main effort here is this call to A. These other things I guess this is also has to be collected together so thatís another one that would not be distributed easily but thatís the algorithm.
By the way, these calculations can be rearranged like 50 different ways and so you get different versions of it. In exact arithmetic all of those ways will compute exactly the same sequence XK. With round off errors in there they can be different and youíll find people talking about one versus the other and this that and the other thing and youíll find all sorts of different flavors and things like that and people telling you one way is better than another and all that. Yeah.
Student:Is there anyway to know, like, if it will be faster than a theoretic convergence?
Instructor (Stephen Boyd):No, I think the theory just gives you, like, rough guidelines and basically says if youíre Ė if the eigenvalues are clustered Ė for example, if theyíre tight, if the condition number of the matrix is small that condition number [inaudible] will tell you itís gonna nail it in 50 steps. Okay?
If theyíre spread out but have, you know, clusters or something like that, itís gonna nail it, that kind of thing. So in general you donít know. Kind of the worst things you can have are ones whose spectrum is sort of uniformly spread all over the place, right? Thatís the kind of the worst thing.
Now it also depends on the right-hand side. So if the right-hand side Ė the B, actually if that one Ė the worst thing that could happen is B can be sort of a uniform mixture of all of the eigenvectors and that would be kind of the worst thing to happen or something like that.
Okay, so letís talk about Ė so as I said at the beginning this is mostly interesting. I mean thereís a fundamental difference between this and a direct method. In a direct method you give the matrix A, you give the coefficients to the algorithm literally. You pass an array or some data structure and you get the entries.
In CG you do not need that. All you need is a method that multiplies by your matrix A. Thereís nothing else you need. Okay? That actually in interesting cases that can be some, like, specialized hardware, it could actually carry out an actual experiment. I mean, it could do Ė it could solve the whole PDE, it could do insane things, right? Do a full up simulation of something, run Monte Carlo, it could be all sorts of weird stuff.
But it doesnít have to be a matrix, is the point. So hereís some examples, why canít you do an efficient matrix spectrum multiply. This is kind of obvious, if A is sparse, for example, if itís sparse plus low rank now you better know the factorization here.
And if you want some numbers here you should think of A as like a million by a million. Thatís just kind of, you know, because if A is, I donít know, 10,000 or something this is kind of not worth it or something. Yeah, so you should Ė your mental image is should be that Aís a million by million or 10 million by 10 million. Thatís a good number. A 100 million by 100 million.
These are the numbers you want to think about when you think of CG and how these things work. If you have products of easy to multiply matrices, that works. Fast transforms come up so FFT, Wavelet, DCT, these types of things, things like fast Gauss transform that actually is when A looks like this and you do this via multiple pull methods.
This is actually an extremely interesting one. Hereís a matrix that is extremely Ė that is dense, well, half dense, and that this the inverse of a lower triangular matrix, in fact, even the inverse of a sparse lower triangular matrix, thatís a great example.
So I give you a sparse lower triangular matrix, by the way, the inverse of that is generally completely dense. So what Ė and I couldnít even store Ė letís make it a million by million, in fact, letís make it a million by million banded.
The Cholesky factorization of a million by million banded, of course, I could solve that directly, but anyway. A million by million banded or something like that is actually gonna be banded. Itís gonna have Ė itís small, itís easy to do. The inverse is gonna be a full lower triangular matrix. You canít even store it and it would be completely foolish to actually calculate the inverse matrix and then multiply each time.
But if I give you a vector and I ask you to multiply by the inverse, it means you have to solve a lower triangular set of equations and that you can do by back substitution. If itís a Ė Iím assuming Ė Iím taking the case where itís sparse.
Okay, so these are just examples. Itís very good to think of examples like this. Okay, now couple things you can do. You can also shift things around. So if you have a guess X hat of the solution X star then, in fact, what you can do is actually Ė well, you can Ė this is the residual with your guess. If you solve Az equals this residual then your solution is actually gonna be just if you solve this you get the optimal Z and you add it to X hat and thatís your solution.
What happens now is that XK now looks like this. Itís actually X hat plus this and itís the argment of this quadratic over the shifted Krylov subspace so you shift by this X hat. And thereís nothing you have to do to make this happen, all you need to do is initialize not with 0 in the first step but with X hat and everything will work.
Now this is also very important because this is good for worm start. So what this means is, if you need to solve, letís say a giant circuit equation, giant resister equation, if thatís part of integrating a circuit, I mean, sorry, integrating a differential equation for a circuit you will step forward one couple of picoseconds or whatever and youíll solve a similar system. You can use the last thing you just calculated as your guess and this can often give you, like, stunningly good results. So as you will see an application of that in optimization soon.
Okay, but now we come to the real way itís used and I should say this. I should say that if you Ė in most cases if you simply run CG on a problem it wonít work.
The theory says it will work but thatís not relevant for several reasons. Itís not relevant number one because you have no intention of doing a million Ė if you have a million dimensional system you have no intention of running a million steps because you donít have that much time. Thatís the first thing.
So to say that it works in practice means that it works in some very modest number of steps. Or a lot of people I remember hearing Ė I hear this in Ė hereís something you hear on the street that youíre doing on CG when youíre doing Ė pretty well when youíre doing square root N steps. Theory says you have to do N but the word on the streets is square root N should do the trick for you.
And if you think carefully about what that means itís just a rule-of-thumb it says if you have a million variables in a thousand steps you should have a pretty good solution. Okay? These are just Ė Iím just saying these Ė it could be slow, you know, but you shouldnít be shocked if itís on the order of squared N. That should be the target. So 100 million, 10,000, that kind of thing. Okay? These are the numbers.
Okay, and the other reason itís not that relevant is the following Ė the theory is that errors in Ė when youíre doing conjugant gradients round off errors they actually add, itís unstable and the thing diverges, actually, very quickly.
So by itself CG generally Ė often will just fail. If you just have some problem and you run it, itís just gonna fail. Okay, so the trick in CG is actually changing coordinates and then running CG on the changed Ė the system in the changed coordinates. Thatís the trick.
Thatís called preconditioning and now you know why it is that you needed to know when CG works because the key idea now is to change coordinates to make the spectrum of A, for example, to make the spectrum of A clustered or something like that or live in a small interval. So thatís the key.
So hereís the basic idea, youíre gonna change coordinates, youíre gonna use Y is the coordinates of X and some T expansion and whatís gonna happen is youíre gonna solve this equation here and then in the end set X stars to inverse Y.
Then sometimes people call T the preconditioner but then sometimes they call TT transpose the preconditioner and you get all combinations of preconditioner meaning T, T transpose, M, T inverse and T minus transpose. So there may be some other possibilities like M inverse but the point is that anytime anybody says preconditioner you have to look very carefully and figure out exactly what they mean.
Okay, so thatís the preconditioner. And in a Ė basically what happens is when you Ė if you rerun Ė if you just take CG itís actually not that bad. You have to multiply by T and T transpose at each step. The other option is to multiply simply by M once and you donít really need this final solve, in fact.
So this is called PCG. And by the way, this is exactly where you Ė this is why in these iterative methods you have room for personal expression, right? So if youíre doing Ė if youíre solving dense equations there is no room for personal expression. If you do anything other than get glass and run Atlas to [inaudible] or cash sizes and things like that and then write good code then itís just not right.
There is no reason under any circumstances to do anything else. You go to sparse matrices Ė actually, in sparse matrices is there room for personal expression if youíre doing sparse matrix methods? Yeah, and what is it?
Instructor (Stephen Boyd):Ordering, yeah. So there is some room for personal expression in sparse matrices and as to selection of ordering and thatís cool. Actually you can say, ďOh, I know my problem. I know a better ordering than approximate minimum degree is findingĒ or something like that. Or thereís exotic ordering methods. You can go look at Netus is a whole giant repository at Minnesota thatís got all sorts of partitioning methods and some of those are rumored to work really well.
Okay, but we move up the scale one step higher to iterative methods. Now, boy, is there room for personal expression and it is mostly expressed through the preconditioner. So, in fact, when you get into these things youíll find everything is about finding a good preconditioner for your problem. And then you can go nuts and you can have simple preconditioners and complex ones, unbelievably complex ones and things like that. So thatís the Ė thereís a lot to do.
Okay, so the choice of preconditioner is this. Hereís what you want. You want the following. You want a matrix T for which T transposed AT or a matrix M, letís say M. I want a matrix M for which the eigenvalues of MA are clustered, for example.
And hereís one choice, how about this? The inverse. Thatíll do it because now the eigenvalues of MA are all 1; CG takes one step because the eigenvalues are all 1. Itís kind of stupid though because you actually then have to, like, sort of invert the matrix or something like that.
So that doesnít make any sense. So hereís what you really want. What you really want is a matrix M which somehow approximately is some kind of approximate Ė well, see approximate can be very crude. An approximate inverse of the matrix and yet is fast to multiply. Thatís what you want. Okay? So thatís the essence of it. And so this is the idea. And the M Ė this M can be quite Ė youíll see approximate it like very generous here. I mean, you can be way, way off.
Hereís some examples. The most famous one and often very effective is diagonal preconditioning. Itís kind of stupid but actually you should try that always and immediately first because it often just works. So thatís Ė you would not call the diagonal of the inverse of the matrix or whatever you would not call that an excellent approximation of the inverse. But you have to admit itís cheap to multiply and all that kind of stuff and it works quite well, amazing.
Hereís another huge family, itís actually really cool and it works Ė itís called incomplete or approximate Cholesky factorization. And so hereís how that works.
Now what Iím gonna do is Iím gonna compute a Cholesky factorization not of A but of some A hat. And the way you might do it is something Ė itís weird. It might work like this, you might run a Cholesky factorization, in fact, thereís a whole field on this called incomplete Ė itís also called ILU, incomplete LU factorization. These are preconditions based on this.
And the way it would work is this, you take the matrix and you start doing a Cholesky factorization on it but you might decide if an entry is too small when you do the Cholesky factorization Ė you say, ďOh, screw it. Just forget it. Iím just gonna ignore it.Ē Or if youíre doing something and youíre gonna fill in in various places which is the death of direct methods, you just say, ďForget it. Iíll just ignore it.Ē
So you end up calculating a sparse Ė well, itís a sparse Cholesky factor or something but itís definitely not the original matrix, right? So these methods can also work really, really well, these things.
And here will be an example; you can do the central KY band. Thatís a version of this. Itís a version of this, too, where you just basically say, any fill in of L below some band I refuse to even Ė I wonít even go there. So letís see, these are some obvious Ė anyway these are sort of the obvious things.
These can work really well and a good example would be something like this, if you have a problem where thereís a natural Ė where something is ordered, for example, in space or in time like in signal processing or control you have time. Then what happens is, you know, things are connected locally, you know, states, transitions, signals and things like that and that leads to this banded system.
Now banded systems you can solve super fast, we all know that. But supposing you have a dynamic system or signal processing but a few things were kind of Ė you had a main bandwidth of, like, 10 or 20 and then a light sprinkling of little things all over the place everywhere else.
So for some weird reason in your problem, you know, the state here is related to the Ė itís just, I donít know, Iím just making this up. But the point is you could easily make up an example like that. Everybody see what Iím saying? So itís kind of banded plus a little light sprinkling of nonzero entries other places, okay?
So this would work unbelievably well. You simply do a banded Ė you simply ignore all the crap off some band, you do a Cholesky factorization on that, thatís your preconditioner, and the only thing youíre working the error to fix is all that little light sprinkling of entries that were outside that band. So thatís the idea.
By the way, a lot of this stuff that Iím talking about, I mean, these are whole fields, I mean, this is the basis of all scientific computing, PDEís, all Ė so thereís tons of people that know all of this stuff backwards and forwards. A lot of this stuff though hasnít diffused to other groups for some reason. So these are actually just really good things to know about.
Okay, so hereís that same example with 100,000 nodes and a million resistor circuit and here itís just with diagonal preconditioning. I mean, itís kind of silly because it was already working unbelievably well but you can see this is what diagonal preconditioning does. Diagonal preconditioning actually is mostly useful not for this kind of speed up but for when this residual goes like this and goes like that and then it would go like this, okay.
Theory says it doesnít do that, thatís all from round off error. So that how the Ė and in fact, hereís a very common CG stopping criterion used on the street, this. You run CG until the answer starts getting worse then you stop, then you just say, ďWell, thatís it, sorry. Hereís what it is.Ē
So Ė and I know thatís done and I know people who do that in image processing and computational astronomy and all sorts of things. They just Ė they run CG until it starts Ė that usually only gets it a couple hundred steps and then basically you can keep running CG but things are getting worse because of the numerical errors youíve lost or foganailty and all sorts of things happen.
Okay, so hereís the summary of CG. Actually, the summaryís all that matters. So here it is, in theory, and that means with exact arithmetic, itís not particularly relevant. It converges to a solution and N steps, period.
Thatís not interesting or relevant because N you should think of is on the order of a million roughly and the whole point is you have no intension of doing a million steps. Your goal is to do it in a thousand steps, couple hundred, whatever. So thatís kind of your goal. I mean, if youíre really lucky, 10, 20, 30, these are the numbers you really want.
Okay, now the bad news is that if you Ė is that actual numeric round off errors actually makes this thing work much worse than you would predict. Now the good news though is that with luck that means a good spectrum of A you can get good approximate solution in much less than N steps. Right?
So now the other main difference with a CG type method or something like that is this, and this is very important, you never form or need the matrix A. Anyway, you canít store a million by million dense matrix anyway.
Well, I mean, maybe you could but the point is you donít want to. So the point is you donít need the matrix, you donít need the coefficients the way it works for direct methods where you pass in a data structure and itís got all the coefficients in it. Okay?
Here you donít need it. What you need only a method to do matrix multiplication. That means you have to rethink in your head all Ė everything you knew about linear algebra and you have to rethink to this model where what you really have is a matrix vector multiply oracle and thatís it.
What the oracle does, how itís implemented is none of your business. You can call it, you can give a vector and it will come back with AZ. And by the way, you could Ė this would be very bad, but you could actually get the coefficients in A from an oracle. How would you do that actually?
Student:Multiplying by Ė
Instructor (Stephen Boyd):Yeah, EI.
Instructor (Stephen Boyd):Exactly. So you call the oracle with E1 and what comes back as the first column of A? Then you do it for E2, second column, and actually after EN youíve got all of A. So then you could pack it in there and then call it LA pack, whatever. But thatís not the point here.
Okay, and itís very important to fix in your mind how this is different from factor solve methods, okay? Itís less reliable, itís data dependent. So, for example, that circuit I showed you that was with a random topology. I might take another circuit thatís got like a pinch point or something in it, 10,000 iterations, nothing happens or itís even worse, so it diverged.
These are data dependent, okay? And this is not the case roughly speaking for direct methods. Theyíre data dependent. And Ė but there is Ė you can either think this is the bad way or the good way, the bad way is this, these methods donít work in general unless you change coordinates with a clever preconditioner. Thatís the bad way.
The good way is to say is that CG methods have lots Ė I mean, the employment prospects are very positive because you donít just call a CG method, you need somebody to sit there, figure out the problem, try a bunch of preconditioners and tune things and find out what works. So thatís Ė you can think of that a good way. Thereís room for considerable personal expression in CG methods.
On the other hand you can hardly complain, this is really your only method for solving a problem with a million, 10 million, 100 million variables. So if you made the mistake of going into field where thatís needed then thatís your problem.
Student:What Ė so if you have any operator form itís a very, very hard to get some of the preconditioners you mentioned before. You just use easier ones?
Instructor (Stephen Boyd):Yes, so thatís Ė okay, yeah. So if A is an operator form usually you know the operate Ė I mean, typically you know the operator so itís not this thing where you donít even know what the oracle does. The theory says you could do that.
You know, to get Ė well, no. So for that you have to sneak around the oracle. Itís a good point, you know, you have your Ė itís not a pure oracle protocol. Just to get to diagonal youíd sort of sneak Ė youíd send a messenger around in back of the protocol and say, ďWould you mind telling me what your diagonal is?Ē
Normally a reasonable oracle will be willing to tell you the diagonal. So Ė but thatís a very good point. If you have to find your diagonal by calls to the oracle youíre screwed. Yeah, okay, thatís Ė any other questions about this? Because weíll go on to the next Ė the topic which is actually just a Ė itís kind of obvious itís just applying this to optimization.
Okay, so the Ė these are called truncated Newton Methods. Actually, thereís a lot of other names for it, weíll get to those in a bit. And theyíre called approximated, truncated and you can do this all the way up to interior point methods. So, letís see how this works. So hereís Newtonís method and in Newtonís method you want to minimize a smooth convex function, second derivatives.
And so what you do is you form Ė you want to calculate the Newton step by solving this Newton system here. Itís symmetric positive definite, thatís symmetric positive definite. And you might do that using a Cholesky factorization. This would be the standard method, you know, for problems that are either dense with up to a couple thousands variables or problems that go up to 10,000 or even 100,000 something like that but where the sparcity is such that the Cholesky factorization can be carried out.
So thatís what you do here. And then you do a backtracking line search on the function value. It could be on the function value or the norm, either way, and youíd stop basically based on the Newton decrement, thatís this thing, or the norm of the gradient. So that would be your typical stopping criterion.
Okay, so an approximate or inexact Newton Method goes like this, instead of solving that Newton system exactly weíre gonna solve it approximately. So Ė and the argument there is something like this, you donít really have to get the Newton direction exactly, thatís kind of silly. In fact, for sort of convergence you only need a dissent direction.
Of course, the whole point Ė the advantage of the Newton Method is these Ė especially for these smooth convex functions is that it works so unbelievably well and you get your final quadratic conversions and things like that.
So what you really want is to trade off two things. What you want to do is you want to do a fast and sort of crappy job of approximately computing the Newton steps. You want to get a fast delta X. But if itís too crappy a job then the algorithm is actually gonna make slow progress.
By the way, as to whether or not itís gonna make progress at all itíll always make progress because you will see that almost any method for approximately solving a Newton step, every single step is gonna generate a direction which is a dissent direction.
And so convergence of the methods, at least theoretically is guaranteed. What you can lose is youíll lose things like quadratic terminal convergence and things like that. Now if youíre solving gigantic problems you may not be interested in quadratic terminal convergence, youíre interested in just solving it even to modest accuracy but in, you know, the amount of time, for example, less than a human lifetime. I mean, thatís your Ė but again, this is your problem for solving such big problems. This is your fault, I should say.
Okay, so this is sort of the idea. Now here are some examples, one is this, is to simply replace the Hessian with a diagonal or another one is a band of this and use that as the Newton step because if thatís diagonal or banded this can be solved incredibly efficiently. So thatís one method.
Another method, and I think we even explored these in 364a a bit or maybe you did a homework problem on this or something. I canít remember. If you didnít, you should have. So Ė you did? Okay.
Another one is to factor, this is called the Shimansky Method is you factor the Hessian, every K iterations and then use that for a while. And, of course, that requires a method Ė a factor solve method.
Now the factor solve method Ė oh, that just reminds me, I have to say something. In a factor solve method if itís dense or in some dense factor solve methods thereís a big difference between the factor and the solve. The factor typically goes like N cubed and the solve goes like N squared. Right?
So what that says is you get weird things you can say. You can say things like this, ďI need to solve Ax=b.Ē And you say, ďNo, actually I need to solve Ax=b, Ay=, you know, Ax2=b2, Ax3.Ē You want to solve it multiple times with multiple right-hand sides.
In a dense factor solve method itís the same cost because the expensive part was the factorization. Once youíve factorized the Ė you can solve Ė you put an investment in and you can now solve multiple versions of that problem very cheaply. Thatís the key behind this. Okay?
Iterative methods, nothing like that, none. Iterative Ė want to hear the cost of solving Ax=b by an iterative method if thatís like C? Then the cost of solving Ax=b and Ax~=b~ along with the other one? 2C. Thereís no speed up. You simply call the solver again. Everybody see what Iím saying? So some things that you probably just got used to thinking about, for example, if you factor back solves are essentially free because theyíre in order less, now donít transfer to these iterative things.
Okay, so this is the Shimansky Method. It also works quite well. Okay, truncated Newton Method is very, very simple. It does this, youíre gonna run either CG or PCG and youíre gonna terminate early and in fact, itís very important to understand the key is not to terminate early, the key actually is to terminate way early. Thatís kind of the hope here.
Thatís Ė if you want to get stunning results itís gonna have to be something like that. Now, these also have lots of other names. Theyíre called Newton Iterative Methods, theyíre called Ė itís also called Limited Memory Newton. Itís also called Limited Memory BFGS. You end up with exactly the same Ė you have to go through all this horrible equations, everyone has their different notation system, but in the end you find out itís sort of the same method. So these methods have lots of names.
Okay, now in a situation like that the cost is not the cost Ė the cost per iteration is completely irrelevant. So the cost is measured by the number of CG Ė actually, itís the number of calls you make to the multiplied by the Hessian. Thatís actually Ė thatís the exact cost is that. Thatís the number of CG steps.
And to make these things work well youíre gonna need to tune the CG stopping criterion, you want to use just enough steps to get a good enough search direction. If you use too many CG steps then youíre gonna get a nicer search direction but itís gonna take you longer to get there.
If you use way too many CG steps youíre gonna basically be doing Newtonís Method now. Thatís great because now, you know, whatever it is, 12 steps, itís all over. But each of those steps is gonna involve 4000 CG iterations or something. At the other extreme if you have too few steps youíre basically doing gradient, itís gonna jam and itís gonna take a long time.
Okay, now itís, of course, less reliable than Newtonís Method which is essentially completely reliable. But, you know, again, with good tuning, good preconditioner, a fast Hessian multiply and you need some luck on your problem, you can handle very large problem.
I should say that what Ė I should say something else about these methods. Whereas one can write a general purpose convex optimization solver and youíve been using multiple ones, youíve been using SDPT3, [inaudible], all these things.
If you think about it thatís really very impressive. I mean basically they donít know what problem is gonna be thrown at them. They can be scaled horribly. I mean, you can make it too horrible itís not gonna work but they can be scaled horribly, all sorts of weird things. You can have weird sick, you know, flat feasible sets and all sorts of weird stuff people throw at these things every day and they do a pretty good job. Theyíre general purpose. Okay?
So but when you get into these huge systems itís much more ad-hoc and ad-hoc means literally it means everything is built, you know, for this.
So, although people have tried to come up with sort of general purpose iterative methods and theyíre getting close with some things, I think, in some cases. But generally speaking what this Ė youíre gonna need to tune stuff. Youíre gonna need to Ė I mean, it has to be for a particular problem.
You have to say, ďIím interested in total variation denoising.Ē Or, ďIím interested in this pet estimation problem.Ē You know, in medical imaging or something. I mean, it has to be a specific problem. ďIím interested in this flow Ė aircraft flow scheduling problem.Ē Okay?
So having Ė now the good news is this, once youíve fixed to a particular Ė itís, by the way, not a particular instance of the problem but the problem class. Once youíve fixed to one of those things I am not aware of any case where these methods cannot be made to work.
Obviously thatís not a general statement. I canít back that up by anything but every time Iíve ever looked at any problem, as long as you say itís a specific thing, like itís a network flow problem, itís a this problem, itís a gate sizing problem, itís a problem in finance or machine learning, these work always. They donít work in 15 minutes, they work in Ė it takes a while, right? Maybe not a while, so thatís kind of the good news of these methods.
So Ė and itís kind of the answer to this, maybe once a day somebody comes up to me and whines, often by email actually, because theyíre not here. But they whine, they go, ďI have my problem CVX, you know, I canít solve itĒ and blah, blah, blah. And okay, youíre like, ďIt worked fine for 10,000 variables, it doesnít work for 100,000.Ē And Iím like, ďWell, first of all itís mat lab. Itís made so you could rapidly prototype your problem. You need to solve 100,000 variables, you need to know how these things work and stuff like this.Ē
So this is the answer to people like that. Theyíre like, ďNo, I donít want to write my own software, itís too hardĒ and everything like that. Iím like, ďThen donít solve problems with 100,000 variables.Ē Just go away, donít bother me, or something.
So Ė but the bottom line is you want to solve a problem with 10 million variables, 100 million in a specific area, like a specific little area, itís gonna work basically. So itís gonna require some luck and itís not gonna happen in 15 minutes.
Okay, so hereís truncated Newton Method. Youíll do a backtracking line search on this, you can do a backtracking line search on F as well. The typical CG termination rule is gonna stop when this is your Newton System residual here, divided by the gradient, which by the way is the right-hand side. So this is exactly the ada that we had before. This is an okay one, it says Ė all of these things are kind of heuristic and stuff like that. So exactly what you use to stop maybe doesnít matter so much.
And what youíll do is youíll have simple Ė with simple rules youíll iter out at some constant number and youíll have this epsilon PCG, thatís constant. And so this might be like .1. That would be a reasonable number there. You wouldnít want to put .01 and you sure as hell would not want to put 1E-4, 1E-4 says youíre basically calculating the Newton direction and thereís no need for that.
Well, maybe to get the thing up and running and verify your code works and stuff like that, you might start with a small problem and make this 1E-4 just to make sure youíre actually calculating the Newton step and it should look then exactly like a Newton trace at that point. But you might want this to be .1.
So a more sophisticated method would adapt the maximum number or this thing is the algorithm proceeds and the argument would go something like this. It would say that, look, early on you donít need to solve the Ė the whole point of Newton is that it changes coordinates correctly in the end game so that if youíre stuck in some sort of little canyon instead of bouncing across canyon walls as you would in the gradient method youíre skewed towards a direct shot at the minimum. That is what Newtonís Method is.
So youíd argue that actually Newtonís Method at first youíd totally Ė I mean, in many cases youíre wasting your time. So then you might have actually at first this might be, like, very Ė this might start .1 and then it might kind of get Ė go smaller or something and that might modulate depending on how close you are to the solution. Thereís a lot of lore on this and you can find this in books and things like that but a lot of it is just to play with it. So thatís it.
You would find some theory on this and itís, you know, itís mildly interesting and things like that. I mean, this would Ė youíd find some Ė go find some thing from the Ď70s or Ď80s or something like that or in some book that would tell you, ďIf you did this youíd guarantee super linear convergence.Ē I guess my response to that is, you know, weíre not really trying to achieve super linear convergence; weíre trying to solve problems with 10 million variables.
Student:Right. So you [inaudible] tends to take, like, a long time. Is there Ė
Instructor (Stephen Boyd):It shouldnít. Why would it take a long time?
Student:Because essentially, I guess, it just starts, like, has to do a lot of queries to get to the, like, I guess Ė
Instructor (Stephen Boyd):It shouldnít.
Student:Ė it [inaudible] with your criteria for Ė
Instructor (Stephen Boyd):It should not.
Instructor (Stephen Boyd):Yeah, a general rule of thumb is, you know, you said beta equals a half, so your step Ė every time you query youíre divided by equals two. If you do five or six of those thatís too many. And the average number of lines or steps you should be doing is like three or something, two, no more, often one point something.
So Iím suspicious of that. And Iím just telling you sort of what people experience. Yeah, thatís not Ė and in any case you have to evaluate F. Youíre gonna evaluate it at every step here. Not every CG step, right? But every outer iteration youíre gonna evaluate the gradient of F and I havenít added that in to the cost here.
So, yeah, that would come up. But, yeah, you shouldnít be doing a lot of line search thing. Right? I mean, first of all, Newtonís best when you get into the end game youíre doing none. I mean, if itís really Newton you should be doing none.
And that will translate over here. When you get down in the region of quadratic convergence with a method like this you shouldnít be Ė even though you donít have the exact with the Newton direction which is aiming you right to the solution, itís a good enough direction that a full step should be taken.
So itís actually kind of a bad sign if youíre doing more than a couple of Newton steps. And if later in the algorithm Ė Iím sorry, backtracking a second. Later in the algorithm if youíre doing lots of backtracking steps it doesnít sound right. Is that a specific problem youíre thinking of?
Student:Itís just Ė yeah, I guess.
Instructor (Stephen Boyd):Okay, weíll talk about it later. Okay, now you also have the question of CG initialization. One way is to initialize with delta X=0. It turns out in that case the first step if you go back and look at the CG thing is you Ė well, actually itís very simple, you solve Ax=b. In the first step you minimize over the line span through B. B in this case is minus the gradient.
So the first Ė after one step of CG to solve a Newton Method if you start from 0 what pops out is a very interesting thing. Itís something Ė itís along the negative gradient direction.
So if you take CG and you said N max=1 which means do only one step then it turns out that the steps popped out by CG in fact are scaled versions of the negative gradient. So Ė and in fact, you can prove things like this, this is Ė that might even be a homework problem on that in which case weíll assign it to you or maybe Ė I forgot. Or maybe thereíre Ė if thereís not a word problem maybe there should be one or something like that.
You can prove the following, that when you run Newtonís Method, sorry, CG to do an approximate Newton Method every step of CG takes your angle of your search direction closer to the Newton direction.
Everybody see what Iím saying? So let me show you the Ė how that works. So here is Ė letís just make it quadratic. I mean, it doesnít really matter. Here you are, you donít know itís quadratic, letís say. And you are right here, okay? Now the negative gradient direction says search in that direction, right?
But the Newton direction says, ďNo, I think actually thatís a better step.Ē Thatís Ė well, thatís Newton, right? If itís quadratic it nails it in one step. And what you can show is this, that if you run CG starting from 0 your first step will be in this direction and every other step is gonna actually close Ė is simply Ė theyíre gonna simply move over here in angle towards the Newton thing. Okay?
So everybody see the idea here? So basically you bend Ė you can even think of CG as sort of, you know, modifying your step taking into account more and more of the curvature, in fact. Thatís a good model for what it is.
Okay, now another option is this, you can choose Ė you can actually use the previous one and do a warm start. Now one problem with that is that if you start with zero every step of CG it will be a dissent direction and this might not be the case here. But you can give it an advantage if youíre only gonna do ten steps, if thatís what your tuning suggestion, you can actually do really well by just using the previous one.
And the simple scheme is something like this, if the previous step is a dissent direction for the current point use it, initialize this way, otherwise you initialize from zero and these are just sort of schemes.
So letís look at an example. It is L2-regularized logistic regression. So hereís my problem. Itís gonna look like this, and if you want to know what this is, is I am fitting the coefficients in a logistic model right here and thereís L1-regularization over here.
So Ė and what that means is you have a logistic model, you have binary Boolean outcome and I have a vector of features and I have a giant pile of data that says, ďHere are the features and hereís the outcome like plus or minus 1.Ē You know, like the patient died or didnít and I give you Ė or, you know, the stock price went up or it went down.
And I give you Ė letís say, oh, to make it interesting a million features. Okay? So Ė and I give you as a thousand samples. Now obviously thatís horrendously over fit, youíre over fit by a 1000 to 1. If I give you a 1000 samples a thousand patient records and a million features.
So what the L Ė hey, wait a minute here. Sorry, Iím gonna have to go back and cut out that whole discussion. I thought that was L1. So, all right. Weíll just rewind and skip all that, sorry. We just gonna do L2-regularized. Actually, you can solve L1 Ė if you can do this one, you can do that one. So, all right. Weíll just do L2-regularized. Sorry, pardon me.
L1-regularized would allow you to do it with a million features. It would run and it would come back with 37 features and it would say, ďHere are the 37 out of the million that look interesting. These are the ones that I can make a logistic model and predict these thousand medical outcomes wellĒ or something like that. Okay?
Sorry, this is not that. This is just an example. Okay, all right. So the Hessian and gradient looked like this. I mean, they start looking very familiar, itís [inaudible] equals DA plus 2 times lambda. Lambda is diagonal here. The gradient looks very similar.
And none of the details matter but the important point is all you have to be able to do is multiply by the Hessian, multiply the vector by the Hessian. When you do that you have to do this and the key is here you donít even form this matrix because this thing itís easy to make up cases where I can store A but I canít store A transposed DA let alone can I even think about factorizing it, thatís a joke.
But this is a case where I canít even store it let alone factorize it. But anyway, I make this Ė I multiply it this way so I need two methods, I need a method Ė by the way, this often corresponds to a forward model method and thatís a reverse or adjoint call. So I have a forward model call and an adjoint call and Iím gonna make two. For each Iím gonna make a forward model call and an adjoint call per CG step.
And weíll just make an example here with 10,000 features and 20,000 samples. So here we just may have a problem where these XIís have random sparcity pattern. They have around 10 nonzero entries and then in the nonzero entries it just shows at random. None of this matters. But you end up with about, I guess half a million nonzeros in this and we made this small enough that we could actually do the symbolic factorization just to know what weíre talking here.
And so the factorization gave us 30 million nonzeros in the Cholesky factor. Oh, by the way, thatís something we can handle but as you can see this method will be like just orders of magnitude different.
Okay, so the method is Newton which weíll look at in a minute and then truncated Newton with various things for stopping. Basically weíre max iterating out because weíre asking for .01 percent error and hereís the convergence versus iteration.
So these are actually iterations and the Newton and the 250 step CG are right on top of each other. So the whole problem takes like Ė well, I mean, it depends what your accuracy is but, you know, it takes like 15 steps. This is what we expect from our friend Newton, right? This is exactly the kind of thing we expect.
Now, by the way, the cost of a Newton step versus the cost of 250 CGís is like just orders of magnitude off. The Cholesky factor had 30 million nonzeros. Okay? The CG Ė the original problem had something like 100,000 nonzeros or something like that. So youíre just way, way, way off by orders of magnitude.
The time for these Ė you can see here that if you do only 50 CG steps you start losing a little bit, but not much. Now the ten is actually Ė this is very typical. What happens in the ten is youíre making excellent progress and then down here thatís this one, you actually Ė you stall. And the theory says thatíll keep going down but we donít have time to wait for it. So thatís the idea.
Now the right way to do this is actually to look at the convergence versus cumulative CG steps and I see a totally different thing. Oh, by the way, Newton step would be like, I donít know, probably over in my office over there, the first Newton step in terms of effort would probably be, I donít know, weíd have to check but probably way, way, way, a kilometer that way. Okay? Just to put these things in perspective.
Now, letís see here, what you can see if very, very cool. This guy that does ten CG steps, which is hardly gonna win at any prizes for, like, good approximation of the Newton step is doing, like, amazingly well and then it just jams.
By the way, if youíre happy with a gradient ten to the minus five or whatever, like, this is ridiculous. Cumulative number of CG steps is 100 and so youíre solving a problem that I donít know how long it would take with the direct method but this one might be in the order of like 20-30 minutes or something like that. Youíre now back to solving it in milliseconds. So or youíre just orders and orders of magnitude faster.
So here Ė oh, hereís some time so we can actually get the times. So, letís see, so if you do 50 or 250 youíre basically the same. So hereís Newton Method with N max and it jams near this gradient 10 to the minus 6 but that often is just fine.
Oh, so here are the time, itís 1600 as opposed Ė 1600 seconds so itís not bad, itís half an hour, something like that, as opposed to four seconds, okay? So these numbers are Ė these are pretty good factors here. These are worth going after these factors. So these are just much, much, much faster methods.
Thatís Ė yeah, this one that jams or something like that. So but youíre already way off. I mean, youíre down in the sub, sub, sub minute range. If you just do diagonal PCG hereís the same picture and I think now itís simple. Now thereís absolutely no doubt what the best thing to do is and itís right here. You just do diagonally preconditioned CG to do your Newton method.
Ten steps. Now thatís ridiculous. If someone says, ďWhat are you doing?Ē You say, ďIím doing a crappy job on Newtonís Method.Ē And you say, ďReally? Howís it crappy?Ē And you go, ďIím using an iterative method to calculate the search direction.Ē You say, ďWell, whatís your iterative method?Ē
You say, ďIím doing ten steps of CG and then I quit. Then whatever I have at that point I pass it to the hieroalgram to do a line search.Ē Then you say, ďOh, do you have some theory that says that worked?Ē And you go, ďOh, yeah, if I were to do 100,000 steps in exact arithmetic I would have computed the Newton step.Ē
So, I mean no one can possibly justify why ten steps is enough here. I mean, you just canít do it, so thatís it. But these things work done now unbelievably well, and now that means you can scale. That means you can do a problem with 100 million variables like no problem.
So here are the numbers for this and theyíre kind of silly. It takes 1600 and then it goes down to 3 seconds or it would be 5 seconds as opposed to, like, 45 minutes or whatever that is. So these are real factors here.
And by the way, if you make the problem bigger this goes up much faster than Ė this will just go up linearly so youíll just Ė if you were to make it ten times bigger you could solve a problem Ė and this would be a minute and that would probably scale much worse than linearly.
So okay, and you can use this for many, many, many, many things. You can use it for barrier and primal-dual methods and I think what weíre gonna do is quit here and sort of Ė and finish up this lecture next time. So weíll quit here.
[End of Audio]
Duration: 75 minutes