ConvexOptimizationII-Lecture15

Instructor (Stephen Boyd):All right, I think this means we are on. There is no good way in this room to know if you are Ė when the lecture starts. Okay, well, we are down to a skeleton crew here, mostly because itís too hot outside. So weíll continue with L_1 methods today. So last time we saw the basic idea. The most Ė the simplest idea is this. If you want to minimize the cardinality of X, find the sparsest vector X thatís in a convex set, the simplest heuristic Ė and actually, today, weíll see lots of variations on it that are more sophisticated. But the simplest one, by far, is simply to minimize the one norm of X subject to X and Z.

By the way, all of the thousands of people working on L_1, this is all they know. So the things we are going to talk about today, basically most people donít know. All right. We looked at that. Last time we looked at polishing, and now I want to interpret this Ė I want to justify this L_1 norm heuristic. So here is one. We can turn this Ė we can interpret this as a relaxation of Ė we can make this a relaxation of a Boolean convex problem. So what we do is this. I am going to rewrite this cardinality problem this way. I am going to introduce some Boolean variables Z. And these are basically indicators that tell you whether or not each component is either zero or nonzero.

And Iíll enforce it this way. Iíll say that the absolute value is XI is less than RZI. Now R is some number that bounds, for example Ė like it could be just basically a bounding box for C, or it can be naturally part of the constraints. It really doesnít matter. The point is that any feasible point here has an infinity norm less than R. If we do this like this, we end up with this problem. This problem is a Boolean convex. And what that means is that it is Ė everything is convex, and the variables, thatís X and Z, except for one minor problem, and that is that these are 01. Okay? So this is a Boolean convex problem. And itís absolutely equivalent to this one.

It is just as hard, of course. So we are going to do the standard relaxation is if you have a 01 Ė 0, 1 variable, weíll change it into a left bracket 0,1 right bracket variable. And that means that itís a continuous variable. This is a relaxation. And here, we have simply Ė we have actually worked out, this is simply Ė well, itís obvious enough, but this is simply the convex hull of the Boolean points here. Now if you stare at this long enough, you realize something. You have seen this before. This is precisely the linear program that defines Ė this is exactly the linear program that defines the L_1 norm.

So here, for example, this is at norm X is Ė itís an upper bound on Ė ZI is an upper bound on one over RXI. And so, in fact, this problem is absolutely the same as this one. And so now you see what you have. That Boolean problem is equivalent to this. By the way, this tells you something. It says that when you solve that L_1 problem, not only do you have Ė is it a heuristic for solving the hard Boolean problem, itís says itís a relaxation, and you get a bound.

The bound is this: you have to put a one over R here, where R is an upper bound on the absolute value of any entry in C, or for that matter any entry that might Ė is a potential solution or something like that. And this tells you that when you solve this L_1 problem not only do you get Ė is it a heuristic for getting a sparse X, but in fact it gets you within a factor of R a lower bound on the sparsity. So thatís that. By the way, itís a pretty crappy lower bound in general, but nevertheless itís a lower bound. Okay. Now we can also interpret this in terms of a convex envelope. And let me explain what that is.

If you have a function, F, on a set C Ė and weíll assume C is convex, so on a convex set C Ė in fact, I should probably change this to convex set. This function is not convex. The envelope of it, it is the largest convex function that is an under-estimator of F on C. And let me just draw a picture, and weíll Ė Iíll show you how this works for the function we are interested in. The function that we are interested in looks like this: itís zero here, and then itís one over here. So thatís our indicator function. And, if you like, we can do this on the interval, plus one minus one, okay. And so our function looks like this, basically.

Thatís our cardinality function. And what we want to know is this. What is the smallest Ė sorry, the largest convex function that fits everywhere beneath this function on that interval? By the way, well, I can leave it this way. So the simplest Ė any convex functions, itís got to go through this point; itís got to go through that point; and therefore, this gives you Ė thatís it. So no one would call Ė no one would call this absolute value function a good approximation of this function, for sure. But it does happen to be the largest convex function thatís an under-estimator of it, okay. So thatís that.

And then, actually, you can go like this, if you like. Because itís on this Ė itís just on this one set. So thatís the convex envelope. Now we can relate to all sorts of interesting things. I mean one is this that the Ė one way to talk about the envelope of a function in terms of sets is this. You form the epigraph of the function, and then simply take the convex hull. And it turns out thatís the epigraph of the envelope. And you can see that over here, too. So the original function has an epigraph that looks like this. Itís all of this stuff, and then this little one tendril that sticks out down there. So itís everything up here, and one little line segment sticks out there.

Convex hull of that fills in this part and this part. And what you end up with is the absolute value restricted to plus minus one. So thatís going to be the convex envelope here. And another interesting way to say it is this, it is, in general, itís F star star. And I donít want to get into technical conditions, but Ė actually, the conditions arenít that big of deal, itís just that it should be closed or something like that. Actually, this function here is not closed so Ė but anyway, it looks like this. Itís the conjugate of the function, which is always convex, and then that starred. So itís the conjugate of the conjugate.

So now if the function F were closed and convex originally, this would cover F. Thatís kind of obvious. Okay. So for X scalar, absolute value of X is the convex envelope of card X and minus one one. And if you have a box of size R, an L infinity box here, then one over R norm X1 is the convex envelope of card X. So that gives you another interpretation. So if someone says, ďWhat are you doing?Ē You say, ďI am minimizing the L_1 norm in place of the cardinality.Ē They can say, ďWhy?Ē You would say, ďWell, itís a heuristic for getting something sparse.Ē And they go, ďWell, thatís not very good. Can you actually say anything about it?Ē

And you go, ďActually, I can. One over R times my optimal value is a lower bound on the number of nonzeroes that any solution can have.Ē Okay, by the way, if you understand this, you already know something that most of the many thousands of people working on L_1 donít know. You are actually now much more sophisticated. And Iíll show you, and itís going to be stupid, but itís actually completely correct. Suppose for some reason I told you that X1, Y is between one and two, so it lies here. What is the convex envelope of that? Well, itís this. It looks like that. And what you see is that the function is now no longer an absolute value.

Itís asymmetric, okay. So itís asymmetric. You can write out what it is. Would it be a better thing to do if you wanted to minimize the cardinality? In this case, the answer is absolutely it would be better. It would work better in practice. In theory, of course, itís better because it gives you the actual convex envelope and so on. So what that says is that when you minimize the one norm of X as a surrogate for minimizing cardinality, you are actually making an implicit assumption. The implicit assumption is that the bounding box of X Ė of your set Ė that the bounding box is kind of Ė itís like a box. Itís a uniform, right.

All of the edges are about the same, and they are centered. If you ever had a problem where you are minimizing cardinality and X is not centered, like for example, some entries lie between other numbers, you would be using a weird, skewed, weighted thing like that where you have different positive and negative values. Oh, what if I told you that X2 lies between, for example, between two and five? What can you say then? Then X2 is not a problem because X2 will never be zero. So itís just Ė itís a non Ė then itís easy. Okay. Well, we just talked about this.

If you had a Ė if you knew a bounding box like this with an LI and a UI, then Ė and by the way, you can find bounding box values very easily by simply maximizing and minimizing XI subject to over this set. So you can always calculate bounding box values. Now if the upper bound is negative or the lower bound is positive, then thatís stupid because that means that X has a certain sign, and there is no issue there. It is a non-issue. If they straddle zero, that means there is the possibility that that X is zero. And in that case the correct thing to do is to minimize this. If these things are equal here, that reduces to L_1 norm minimization.

So thatís what that is. This will also give you a lower bound on the cardinality. Okay. So letís look at some examples. I think we looked at this last time briefly. Iíll go over it a little bit better this time, or weíll go over it in a little bit more detail. This is a regressor selection problem. You want to match B with some columns of A, linear combinations of columns of A, except I am telling you, you can only use as many as K columns here. So, okay, so the heuristic would be to add, for example, a one norm here and adjust lambda until K has fewer than K nonzeroes.

And then what would happen is Ė youíd look at this value of that then. So here is the Ė here is sort of a picture of a problem. Itís got 20 variables, 2 to the 20 is around a million. And therefore, you can actually calculate the global solution. You can do it by branch and bound. We are going to cover that later in the quarter, but you can also Ė in this case, you just work out all million. So one million least squares problems, youíd check all possible patterns, and not a million. Yeah, yeah, this is a million. You just solved a million of them and for each one. So the global optimum is given here, like this.

And this one gives you this Ė the one obtained by the heuristic. And you can see a couple of things here. It looks to me like you never Ė I am not quite sure here, but I think for most of it you are never really off by Ė well, no, here you are off by two. Thatís a substantial error. You are off by one; sometimes you are exactly on and stuff like that. But the point is that this curve is obtained by the heuristic, which was one-millionth the effort there. Okay. So now weíll look at sparse signal reconstruction. Itís actually the same problem, different interpretation. You want to minimize norm AX minus Y.

Y is a received signal. X is the signal you want to estimate. The fact that there is a two norm here, this might be, for example, that you are doing maximum likelihood estimation of X with a Gaussian noise on your measurement, so you have your Y equals X plus V. Then this is prior information that the X you are looking for has no more than K nonzeroes. So thatís the Ė thatís the other. The other one, the heuristic would be to minimize this two norm subject to norm X in one less than beta. In statistics, this is called LASSO here. I canít pronounce it Ė you have to pronounce it with Trevor Hastyís charming South African accent.

I tried to learn it last time I went over this material, but I never succeeded. So Iíll just call if LASSO. Okay, so thatís this thing. And another form is simply to add this as a penalty and then sweep gamma here. And in this case itís called basis pursuit denoising or something like that, so. And I can explain why itís basis pursuit. If you are selecting columns of A, you think of that as sort of selecting a basis. So this, I guess Ė donít ask me why itís called basis pursuit, but thatís another name for it. Okay. Letís do an example. Itís actually Ė when you see these things, they are actually quite stunning.

And they are rightly making a lot of people interested in this topic now; although, as I said earlier, the ideas go way, way, way, way back. So here it is. I have a thousand long signal. And I am told that it only has 30 nonzeroes, so itís a spike signal. Now just for the record, I want to point out that a thousand choose 30 is a really big number. Okay. Just so the number of possible patterns of where the spikes occur is very, very large. For all practical purposes, itís infinitely large. And here is whatís going to happen. We are going to be given 200 noisy measurements. And weíll just generate A randomly.

And there will be Gaussian noise here, okay. And then you are asked to guess X. Now, by the way, I should point something out. If someone walks up to you on the street and says, ďI have a thousand numbers I want you to estimate. Here are 200 measurements.Ē You should simply turn around and walk the other way very quickly, okay. Get to a lighted place or something like that as soon as you can, or a place with other people. The reason is it doesnít Ė this is totally ridiculous, right.

Everyone knows you need a thousand Ė if you are going to measure a thousand signals, you need a thousand measurements, okay Ė so at least a thousand, right, and better off is 2,000 or 3,000 or something like that to get some redundancy in your measurements, especially if there is noise in it. But the idea that someone would give you one-fifth the number of measurements as you have data to estimate and expect you to estimate it is kinda ridiculous, okay. Now, by the way, the flip side is this. If someone told you which 30 were nonzero, you move from five-to-one more parameters than measurements to the other way around.

If I tell you that there is 30 numbers I want you to estimate, and I give you 200 measurements, now you are 8-to-1 in the right direction, or you are 7-to-1 in the right direction. In other words, you got seven times more measurements than Ė everyone following this? Okay. So, all right, so what happens is if you simply give this L_1 thing Ė you just Ė well, you can see in this case you just recover it like Ė I guess itís perfect. I mean itís not completely perfect, some of these. I think the sparsity pattern is maybe perfect. It looks to me like itís perfect. Yeah, if itís not perfect, itís awfully close.

By the way, what that means is if you polish your noise will go down lower, and youíll get these very close. You wonít get it exactly because you have noise. But the point is this is really quite impressive. If, in contrast, you would use an L_2 reconstruction, a [inaudible], and you would solve this problem, you can adjust gamma Ė basically, it never looks good, the reconstruction. But this would be an example of what you might reconstruct, something like that. So this is the rough idea, okay. So these are actually pretty cool methods.

I mean I donít really know any other really sort of effective method for doing something like this, for having Ė for saying, look, here is 200 measurements of a thousand things. Oh, and here is a hint, prior information. The thing you are looking for is sparse; please find it. And these work. I should also mention, unlike least squares. Least squares is kind of nice. Itís a good way to blend a bunch of measurements and get a very good Ė it can work beautifully well if you have got like 50 times more measurements than variables to estimate. You use least squares. Almost like magic, itís a 263 level, right.

All of a sudden, from all of these crazy [inaudible] with noise, out comes like a head that you are imaging or something like that. I mean itís really quite spectacular. And it kind of fails gracefully. I should add something here. These donít fail gracefully. And I bet you are not surprised at that. So if you take this Ė which you can, all of the source is on the web, everything Ė and you just start cranking up sigma. You crank it up, and up, and up. And it will work quite well up until some pretty big sigma Ė youíll give sigma one more crank up and, boom, what will be reconstructed will be just completely Ė it will go from pretty good reconstructed to just nonsense very quickly.

So I just thought Iíd mention that, so. Somehow itís not surprising, right. Okay. Let me mention some of these theoretical results. Obviously, I am going to say very little about it. They are extremely interesting. In fact, just the idea that you can say anything at all about it I find fascinating. But here it is. The problem is going to be that the set C is going to be very embarrassing. Itís going to be an affine set. So basically, suppose you have Y equals AX, where the cardinality of X is less than K. And you want to reconstruct X here. Now, obviously, you need Ė the minimum you could possibly have would be K measurement.

So in other words, if someone comes up to you and says, ďWow, thatís good. You have got my 30 non Ė my spike signal with 30 things. What if I give you 19 signals?Ē Thatís not even Ė thatís not enough to get 30. So on the other hand, if the number of measurements is bigger than the size of the number of parameters, then we are back in 263 land, and everything is easy to do. So thatís trivial. So the interesting part is where M lies between K, thatís the sparsity Ė known sparsity signal and the number of parameters. And the question is: when would the L_1 heuristic thatís minimizing norm X1 subject to AX equals Y.

I mean and notice how simple the set is, C if the set of X Ė AX equals Y. When would it be constructed exactly? Thatís the question. And actually, this Ė you can actually say things which are quite impressive. And it basically says this. It says that depending on the matrix A here Ė but there is a lot of matrices. Actually, there is a long Ė there is a lot of matrices that would actually work here. And there is actually all sorts of stuff known but what exactly what it is about the matrix that does the trick; it has to do with coherence or something like that.

And it says basically that if M is bigger than some factor times K Ė so this is sort of Ė if I gave you the hint, if I told you what the sparsity pattern is, you would need M bigger or equal to K. So the extent to which this number goes above one is how much more you need than the minimum if you had the secret information as to what the sparsity pattern was. And this is an absolute constant C times log N here. Then it says if this works, then the L_1 heuristic will actually reconstruct the exact X with overwhelming probability. What that means is that as you go above this, actually the probability of error goes down exponentially. Okay? Thatís it.

And some valid Aís would be this. If the entries in the matrix were picked randomly, that would do the trick. If A was a rows of a DFT matrix, of a discreet 48 transform, then it would work. As long as the rows are not like bunched up or something like that, then it would work. And there are lots of others. So this is it. And these are beautiful things. I would take you about four seconds to find these on the web with Google, to find references and just get the papers. So, okay, so we will go on to the second part of this lecture, and that is going to be here. There we go. Great. Okay. So we are going to look at some more advanced topics here, so Ė and just other applications, and variations, and things like that.

So one is total variation reconstruction. I should add that this predates the current fad. The current fad, L_1 fad, it depends on the field. I mean statisticians have been doing it for a long time, like 12 years or 15 years now. People in geology, I think, have been doing it for 20. Others have been doing it probably 20 or something like that. But the recent thing, actually, was spurred by these results. And thatís in the last five years, letís say. But total variation, this goes back, I believe, easily to the early Ď90s or something like that. So it works like this. You want to fit a corrupt Ė you have a corrupted signal, and you want to fit it with a piecewise constant signal with no more than K jumps.

Well, a simple way to do that is to trade off Ė X hat is going to be your estimate of your signal. You trade off your fit here Ė by the way, if this were Gaussian noise that would be something like a negative log likelihood function here. You would trade off your fit with the cardinality of DX where D is the first order difference matrix. And what you do is you would vary gamma. If you made gamma big enough, the solution X is constant. And then, of course, itís equal to the average value. As you crank gamma down from that Ė from that number, what will happen is the X hat will first have one jump, so it will be piecewise constant with one jump, then two, then three, then so on and so forth. Okay.

So DX1, by the way, is the sum of the differences of the absolute values of a signal Ė this is scalar signal for now. Thatís got a very old name. It goes back to the early 1800ís. Thatís called a total variation of a signal Ė of a function Ė a signal, thatís fine. And this is called total variation reconstruction. And there is a lot of things you can say about TVR reconstruction, but what happens is they actually are able to remove sort of high frequency noise without smoothing it out. Weíll see how that works, like an L_2 regularization will just give you a low-pass filter; it will smooth everything out.

Now these, they are very famous here because they didnít Ė these were some of the methods do things like recover. These original from the wax recordings of Caruso or something, they actually reconstructed them. They got all sorts of jazz stuff from the Ď20s and reconstructed them using these methods. And they are amazing. I mean sort of the clicks and pops just go away. I mean they are just Ė they just Ė they are just removed. Okay. So here is an example. And so here we have a signal that looks like this. Itís kind of slowly moving, but itís got these jumps as well. I mean this is just to make a visual point.

There is a signal, and the corrupted one looks like this. So there is a high frequency noise added here. Okay. And if we do total variation reconstruction, these are three values of gamma. And they are actually chosen Ė one is supposed to be like too much; one is too little; and one is not enough. But you can see something very cool. The jumps are preserved. So Ė and thatís not like Ė thatís not smoothed out at all. Thatís jump Ė thatís smoothed out perfectly. Okay. And this is sort of too much because I have actually sort of flattened out the curvature that you saw here.

This is maybe you havenít removed enough of the signal, but you still get this jump here. And this might be just enough. By the way, if you were listening to this Ė in fact, I should probably produce some like little jpeg Ė I mean some little audio files or something like this so you can hear total variation denoising. Itís very impressive. L_2 denoising, in a minute weíll see that, thatís just low pass filtering. You have a lot of high-frequency stuff with everything just muffled. You hit a drum, everything is muffled. The L_1 [inaudible] will actually cut out this high-frequency noise.

But when someone hits a snare drum, itís right there. So itís pretty impressive. We should Ė it would be fun to do that, actually. Okay. Here is the L_2 reconstruction, just to give you a rough idea of what happens. In L_2 reconstruction, to really smooth out the noise, basically, you lose your big edge here. And this is sort of, maybe, the best you can do with L_2 or the best trade-off. And this would be Ė if you still Ė by the way, you still Ė in this case, itís not Ė that has not been preserved exactly. Itís actually been smoothed a little bit. You just canít see it here. And you still are not Ė you are not getting enough noise attenuation, so just get a picture. Yeah.

Student:[Inaudible] that even this sounds very, very good.

Instructor (Stephen Boyd):This one?

Student:Yes, L_2 because thatís the one we didnít do extraneous [inaudible].

Instructor (Stephen Boyd):Yes.

Student:This one sounds very good. And why should we go the extra half if we are going for L_1? I mean [inaudible].

Instructor (Stephen Boyd):Oh, in cases Ė I mean to do things like remove clicks and pops. And then, if you started listening carefully, you would find out this did not sound good at all. I mean not at all.

Student:Okay.

Instructor (Stephen Boyd):Yeah. Because youíd either hear this noise, right, or you start muffling this. And that makes a drum sound like Ė then you are not tapping a drum; you are tapping like a pillow or something like that. And itís no longer a drum. I mean just Ė so thatís the Ė if you listen to these things, itís quite audible. We can adjust the parameters so the 263 methods work well, which of course naturally we do in 263. Wouldnít we? So, okay. So, okay. But actually, the total reconstruction variation is really done more often for images. And I believe itís also done even for Ė in 3-D. And I believe itís done even for 4-D, so for 3-D movies with space-time.

But weíll look at it in 2-D, and itís quite spectacular. So here is the idea. And this is going to be very crude. And Iíll make some comments about how this works. So what it is, you have X and RN. These are the values on a N by N grid. So our R grid has about a thousand points on it. I mean itís small, but thatís it. And the idea is I want Ė here is a prior knowledge is that X has relatively few Ė so X is sort of piecewise constant. In other words, itís like a big region. It looks like a cartoon. Itís got big regions where itís constant and with a boundary. So everybody see what Iím saying? So thatís the Ė itís cartoon looking.

It looks like a cartoon, or a line drawing, or whatever. All right, now this problem. You get 120 linear measurements; thatís, of course, a big joke. You are whatever that is, six or seven times Ė I guess you are seven times under or six, whatever it is. Youíre some big factor under here, eight maybe that is, I don't know, eight. So you are eight times under sample. In other words, I want you to estimate 960 numbers; Iíll give 120 measurements. These are exact. These are exact. So the way weíll do this is weíll say, look, among Ė this has, among all of the Xís that are consistent with our measurements, thatís a huge set.

In fact, whatís the dimension of the set of Xís that satisfy this? What do you think? Whatís the dimension on that? You donít have to get it exactly, just roughly. What 840 dimensions? Yeah. You got Ė you get 961 points. You get 120 measurements, null spaces on the order of the Ė is the difference, right. So, I don't know, you have eight. So this is a huge number of Xís are consistent with our measurements; 840 dimensional set of images are consistent with our linear measurements. But among those, what weíll do is to pick one weíll do this. We would like to minimize Ė this is the sum of the cardinalities of the differences, and let me show you what that is over here.

And Iíll explain in a minute how to make this better Ė look better, anyway. Itís this that we have our grid. And basically, we would Ė we are going to charge for the number of times two edges Ė two values are different. And thatís both this way and this way. So, for example, that big objective would be zero if the entire image were constant. Otherwise, everywhere where there is sort of a boundary, you are going to get charged, okay. So thatís the picture. Now we canít solve that problem, but we can solve this variation on it.

Now, by the way, when you do L_1 this way on an image, and you just go Ė you charge for this way and this way, what happens is you are going to tend to get images Ė or youíll get things that will Ė they actually prefer like this direction or this direction, and you get weird things. I think we had that in a maybe a homework Ė no final Ė was it final exam problem 364? Was it? I canít remember. I think it was. We made Jacobís happy face. Midterm? Final? Midterm.

Student:[Inaudible].

Instructor (Stephen Boyd):What?

Student:Homework.

Instructor (Stephen Boyd):Homework. Okay, homework, sure. Anyway, so okay, so letís see how this works. So here is the Ė here is the total variation reconstruction. And the summary is itís perfect.

Student:[Inaudible].

Instructor (Stephen Boyd):Yeah. I know.

Student:I just [inaudible].

Instructor (Stephen Boyd):Great. Okay. Good. Good, okay. I forget, too. Wait until you see the [inaudible] 364c.

Student:[Inaudible] 364c?

Instructor (Stephen Boyd):Oh, yeah. Yeah. We are starting it this summer just for you. You are already enrolled. Youíll like it. Weíre bringing the final exam back on that one, except that itís going to be every weekend, though, 24-hour. But you are learning a lot. You are going to learn a lot, though. Okay. So I Ė I mean this is Ė you get the idea. In this case, you recovered it exactly. So I mean these are kinda impressive when you see these things. Variations on this, by the way, are quite real. This is a fake toy example. You can go look at the source code yourself, which is probably like all of ten lines or something.

The plotting, needless to say, is many more lines than the actual code. These are actually quite real things. I mean there is stuff going on now where you do total variation reconstruction MRI from half of this Ė a third of the scan lines, and you get just as good an image. So, okay, and this is what happens if you do L_2. And I mean this is what you would imagine it to look like. Thatís kind of what youíd guess. And you can adjust your gamma and make the bump look higher, or less, or whatever. But itís never going to look that great. Thatís Ė this is your 263 method, so. Thatís essentially a least-normal problem. Yeah.

Actually, Iím sorry, there is no gamma in this problem. Thatís just least normal. Okay. So that finishes up some of these Ė oh, I should Ė I said I promised I was going to mention how these methods actually done. What you really want in image is you really want your estimate of Ė is to be approximately rotation and variant. So, in fact, you get a much better looking Ė visually now, if you really do this, not by just taking your differences this way, but youíll also take this difference here as well. And you will Ė and that thing, youíll divide by square root two or something like that.

And the other, the most sophisticated way, is to take your favorite multi-point approximation of the gradient and take the two norm of it and minimize the sum of the two norms of those gradients. Thatís the correct analog of total variation reconstruction in an image. And that will give you beautiful results that will be approximately rotation in variant. Okay. So let me talk about some other methods. This is also Ė this is just starting to become fashionable, the L_1. And there is other variations on it. And I think people call it beyond L_1 or something like this. And it goes like this.

So one way is to iterate, and Iíll give a Ė weíll give a very simple interpretation of this in a minute. So I want to minimize the cardinality of over X and C. So what you do is this is instead of minimizing the L_1 norm, weíll minimize like a weighted L_1 norm. Now we have already seen good reasons to do that. The weights correspond Ė one over the weights correspond to your prior about the bounding box. So thatís one way to do Ė one way to justify weights. But the idea here is this. You solve an L_1 problem, and then you update the weights. And the weight update is extremely interesting.

It works like this. If you run this L_1 thing, and one of these numbers comes out zero, this Ė then you get the biggest weight you can give, which is one over epsilon. What that means is thereafter, itís probably going to stay zero because it went to zero at first. Once you are zero, in the next iteration your weight goes way up. And then there is very strong encouragement to not become nonzero. Whatís cool about this is if X turns out to be small but not zero, so you are actually being charged for it cardinality-wise, what this does is it puts a big weight on it. And it basically makes that one look very attractive.

And it basically mops Ė cleans up small ones, just gets rid of them. On the other hand, if XI came out big, you are taking that as a heuristic to mean something like, well look, if you minimize an L_1 norm and something comes out Ė one of the entries comes out to be really big, it basically means, look, that thing is not going to be zero anyway. You are not going to drive it to zero. So therefore, relax, and basically says reduce the weight on it. So if it wants to get bigger, let it get bigger. So this is the picture. And this will typically give you modest improvement. Well, I mean it will actually give you real improvement over the basic L_1 heuristic.

And it typically converts in five or fewer steps. By the way, a more sophisticated version, actually, is not symmetric here with the weights. Weíll see what the more sophisticated version is, but anyway. So here is the interpretation. So weíll work with a case where X is bigger than equal to zero. And weíll do that by splitting X into positive and negative parts where those are both non-negative. And the cardinality of X then, itís the same as this thing, I mean provided one of those is always zero. And weíll use the following approximation. Instead of Ė let me show you this. In fact, this kind of the idea behind all of these new methods that are beyond L_1.

I mean it goes back to Ė I mean all of these things go back to stuff that is very stupid. By the way, these things are very stupid, and yet it doesnít stop people from writing fantastically complicated papers about it, right, and making it look not stupid. But they are stupid. So letís go to one and stop there. Okay. So here is the function we want. It jumps up and goes like that. Here is our first approximation, not an impressively good Ė not what you call a good approximation. And so some of the Ė this one says you replace it with a log one plus X over epsilon. And, you know, if you allow me to scale it and change various things, thatís a function that looks like this.

Iíll shift it and scale it, if you donít mind. And itís a function that looks like that. Well, let me just make it go through there. There, okay, so it looks like that. And this little curve at the bottom, thatís the epsilon like that. So it looks kinda like that. Okay? I drew it with epsilon exaggerated. I really shouldnít have. Let me redraw it, and it looks like this. And then you have got a little thing like that. Thatís sort of how you are supposed to see it. Okay? And you are supposed to say, well, yeah, sure, okay. This function here is a way better approximation of this thing than this, okay. What? You donít think it is?

Student:Itís not [inaudible].

Instructor (Stephen Boyd):Itís not convex. You are very well trained. Right. I can tell you a story. A student of Abassís went into Ė they were just talking to Abass. And Abass said, ďYeah, but then you can do this problem and maximize the energy lifetime of this thing and blah, blah, blah, like that.Ē And the student stepped back. And he said, ďAre you crazy?Ē And Abass said, ďNo. Whatís wrong with that?Ē And the student looked at him and said, ďThatís not convex.Ē Then he came and complained to me. He said, ďWhat are you Ė what are you doing?Ē So you are right. Thatís not convex. Okay. But itís okay. Now you are okay. You can handle it.

So in fact, this method is Ė in fact, not only is it not convex, itís concave. Now if you have to minimize a concave function over a convex set, when we did sequential convex programming, you saw that there is a very good way to do that. And itís really dumb. I mean itís Ė what you do is you take a certain X, you linearize this thing at that point, and you optimize. No trust region, nothing. And you just keep going. If you linearize this, basically you get this thing here. This is a constant, and itís totally irrelevant when you linearize. And you actually get this. And in fact, part of that is a constant, too, like this part. Okay.

And in fact, itís the same as minimizing XI over this. And guess what sequential convex programming applied to this non-convex problem, which is supposed to fit the cardinality function better, yields that exactly. Okay. So this is really an interactive heuristic. Sorry, it is. This is a convex-concave procedure for minimizing a non-convex function versus a smooth function, which is supposed to approximate the Ė what do you call it. Itís supposed to approximate this card function, okay. By the way, there is other Ė lots of other methods. And I can say what they are. Here is a very popular one. All of these work.

Thatís my summary of them; lots of papers coming up on all of them. Here is another one. I don't know, here is one; you donít like Ė letís just do it on the positive. You donít like X, how about X to the P for P less than one. So these functions start looking like that. And the small Ė if you make P really small, they look just like that card function. And by the way, this leads some people to refer to the cardinality as the L zero norm. Now letís just back up a little bit there. And two of you have already said that, so you can say it. I cannot bring myself to say that because there is no such thing as an LP norm with P less than one because itís not convex.

The unit balls look like this. And then I canít even say it. You see? Thatís what happens if you learn math when you are young. That is not the unit ball of anything. And you should not say that. You shouldnít even Ė you shouldnít say that. And yet, you will hear people talk about LP norm with P less than one. And itís not convex. Itís not a norm, blah, blah, blah. So the methods work like this. In fact, you tell me. Letís invent a method right now. How would you Ė how would you minimize this approximately, and heuristically, and so on? By the way, if you minimize that, you would get a very nice sparse solution, very nice.

How would you do it? You just linearize this thing. And what would you Ė and what, in fact, would you be doing at each step, and if you did the convex-concave procedure on this guy? You would be solving an iteratively re-weighted L_1 problem. Okay. And the only thing that would change is your weight update would be slightly different from this one. But your weight update would be reasonable. And I always do the same thing. What happens in a weight update is this. Entries that are big, you just say, ah screw it. Thatís probably not going to be zero, and you reduce the weight. Entries that are small, you crank the weight up.

If that thing is already zero, thatís a strong inducement to pin it at zero. If itís small, thought, thatís a Ė that makes that thing a very attractive target for being zeroed out. And thatís what drives the cardinality down. Everybody got this? So thatís the idea. Okay. Itís a very typical example is you want to minimize the cardinality of X over some polyhedron. And the cardinality drops from 50 to 44, not that impressive. And if you run this heuristic, I guess six steps it converges, actually, after a couple, it will stop out at Ė no, sorry, letís see. Here we go. L_1 gives you 44. And the iteratively re-weighted L_1 heuristic gets you 36.

The global solution, probably, found for this problem in long time, later in the class, is 32. So just emphasize, again, we are not Ė we are not actually solving these problems. These are heuristics. But they are fast, and they are good, and so on and so forth. By the way, the fact that you are not solving the problem if the problem is Ė for example, is rising in a statistical context, I think means it doesnít matter at all. Right? Because it Ė you donít get a prize for getting the exact maximum likelihood estimate. Maximum likelihood estimate is just Ė is itself, in some way, itís just a procedure for making a really good guess as to what the parameter is.

And itís backed up by a hundred years of statistics. Okay. If you miss that Ė and by the way, even if you do perfect maximum likelihood, as any statistician or anybody who knows anything about it will tell you, you are not going to be getting the exact answer anyway. Thatís just some that Ė thatís just some which asymptotically will do as well as any method could or something like that. Thatís its only Ė now by the way, for engineering design, thatís a different story, right. You find a placement of some modules on a chip that takes 1.3 millimeters as opposed to Ė whereas, opposed to 1.6, thatís real.

Thatís unlike the statistical interpretation. But still, okay. So letís look at an example of that. Itís a fun example. Itís just detecting changes in a time series model. So letís see how that works. We have a two term ARMA model, or I guess people call this Ė Iím sorry, itís not ARMA Ė itís AR Ė I think, actually, people call this AR(2). So it looks like this. Y of T plus two is equal to a coefficient times Y of T plus one, plus another coefficient times Y of T plus a noise, which is Gaussian. Now the assumption is this: these coefficients here are mostly constant. And then every now and then there is a change in the dynamics of the system.

And one or both of those numbers changes, okay. Youíll be observing merely Y. And your job is to actually estimate A and B, and in particular, to find where the changes are. So letís see how that works. By the way, well, it doesnít matter. I mean I was going to make up an application of it. And it basically says the changes could be Ė could tell you something about a failure in a system or something like that, or a shock in a financial system or economic system, something like that. Okay. So here is what weíll do is we will Ė given Y, this is a negative log likelihood term here with some constants.

Thatís a negative log likely Ė this is an implausibility term. Because, for example, if you run up a giant bill here, itís asking you to believe that the V did some very, very unlikely things. Thatís what this term is. And then here, we add in a Ė actually, itís a total variation cost. It basically says it penalizes jumps in A and B in the coefficients. Now, by the way, if I make gamma big enough, A and B will be constant. Okay. If I make them zero, then I will make this thing zero because I can adjust my A and B, in fact many ways, to get absolutely perfect fit here. Okay. So here is an example.

Here are how A and B changed, so A is this, and then, I guess, I don't know, I canít see now. But letís say if T equals a hundred, it changes to a new value. B is here, and a 200 pops up here. So there is three changes. And you can sort of, visually, if you squint your eyes, you can change in the dynamics of the system here that, if you look left of there, you get one kind of dynamics. You can see a little Ė some Ė with a little squinting, you can see that the dynamics on the left in between a hundred and 200 looks different. And again, between 200 and 300, well, it helps that have I told you what happened.

But none of the Ė itís certainly consistent. Now the interesting thing, though, is imagine I hadnít told you this. I don't know that itís that obvious. I mean certainly this doesnít look very much like that. But would you really know that something happened here? I don't know. You could Ė I could have made the coefficients change in such a way that you would Ė your eyeball Ė you couldnít do it. So if you run this total variation heuristic, on the left this is the estimate of the parameters. And I want to point out this thing is already like very good. Itís estimated the parameters to be here. It jumps down, and it does some weird thing here.

I can explain that a little bit here. Itís some little false positives. Thatís a false positive where this thing jumps up. Every time this thing jumps up, there is a bunch of false positives in here, and some false positive jumps in here, and so on. But actually, you know what, this is not bad at all. Not bad at all. This is kinda Ė itís kind of saying that there are weird changes in here and here. If you do the iterated heuristic, you can actually see visually exactly what happens. By the way, this guy is pulled down here because itís charged a lot. It only makes this big shift for which it pays a lot in this objective to make the Ė to try to make this overall objective small.

But what happens is actually really, really cool. What happens is this. If you iterate it, this difference is really big. And on the next step, itís going to get a less weight. So it basically says, oh, you really want to jump here? I am going to charge you less for the jump at this time step. Here, these little guys Ė by the way, where itís flat, it says, okay, you donít want to jump at all. I am going to make Ė I am going to charge you the maximum amount. Thatís the one over epsilon in the weight. And these little ones, thatís what these L_1 Ė iterated L_1 heuristics do. They go up, and they clean up things. They just get totally nailed.

And thatís the final estimate there. And you can see that itís much, much better. I mean you are actually tracking the parameters very nicely. You might ask why the error here? Why the error here? Why did it miss the time point here? And the answer would be because there was noise. Thatís why, but itís still Ė itís awfully good to do this. Okay. It works very, very well. Okay. And our last topic is going to be the extension of these ideas to make matrices and rank. So if you have Ė if you have cardinality of a vector Ė thatís a number of nonzeroes, there is a very natural analog for matrices, and thatís the rank.

And by the way, both of these things come up as measures of complexity of something. So in other words, sort of the complexity of a set of coefficients is something like Ė I mean this is very rough number of zeroes or something. And the complexity of a matrix also comes up a lot, and thatís the rank. Now a convex rank problem, thatís a convex problem except you have a rank constraint or rank objective, these come up all the time. And they are actually related. If you have a diagonal matrix and the rank of it is the cardinality of the diagonal, thatís kind of obvious. But the interesting part is whatís the analog of the L_1 heuristic?

And it turns out itís the nuclear norm, which is the dual of the spectral norm or maximum singular value. And itís the sum of the singular values of a matrix. So thatís it. Itís not simple, but thatís what it is. It is the sum of the singular values. And thatís the dual of a spectral norm, which you probably didnít know but it kinda make sense. Because somewhere in your mind you should have this map there that associates Ė well, it should associate LP and LQ, where one over P equals one over Q is one. But particular pairings should be burned into neurons directly. Thatís two and two, so dual of L_2 is L_2 type thing.

And dual of L_1 is L-infinity; dual L-infinity is L_1. These you should just know. So it shouldnít be surprising that if itís the maximum singular value, thatís like and L-infinity on the singular value, roughly; that the dual norm should be the sum of the singular values. Thatís it. Now if a matrix is positive semi-definite, then symmetric Ė positive semi-definite, then the eigenvalues are the singular values, and the sum of the singular values are therefore some of the eigenvalues. Thatís a trace, okay; whereas, for a vector, if I have a non-negative vector, and I think the one norm itís the same as the sum.

So, oh, and by the way, thatís why a lot of things would Ė I would still call them sort of L_1 heuristics, but you might sort of grip through the paper and never see L_1 mentioned because if a vector is non-negative, itís just a sum. But L_1 sounds fancier than to say L Ė you know, than the sum. The sum heuristic seems kinda dumb. Okay, so this is the Ė this is the nuclear norm. And weíll do an example. Actually, itís a very interesting example. It goes like this. You are given a positive semi-definite matrix. And what you would like to do is you want to find Ė you want to put this in a factor model.

Now a factor model looks like this. Itís an outer product, a low rank outer Ė a low rank part plus a diagonal. What this Ė I mean it would come up this way. It basically says that if covariance of a matrix, it basically said that thatís Ė that random variable is explained by a small number of factors. In fact, itís the R Ė R is the dimension of F, the number of columns. Itís a small number of factors. And then D is sort of an extra variation. So this would be the factor model. Now by the way, there are some very easy ways to check factor models. If D is zero, how would you approximate Ė how do you approximate a positive semi-definite matrix as just low rank? How do you do that?

I give you a covariance matrix like the covariance matrix of 500 returns. And I want you to tell me Ė approximate it as a matrix of rank five, how do you do it?

Student:Itís the [inaudible].

Student:Itís [inaudible] symmetric.

Instructor (Stephen Boyd):So you should say Ė I can [inaudible] decomposition, but itís the same as the SVD. So you take the eigenvalue decomposition, and you take the top five ones. So we know how to do factor modeling without this thing. But if I want to Ė letís do one more. How about factor modeling plus Ė suppose all of the Ė suppose instead of D this was sigma squared I. Can you do factor modeling for that? How?

Student:[Inaudible] eigenvalues almost [inaudible].

Instructor (Stephen Boyd):Exactly. So the way you know it is you look at the eigenvalues of something, and you see Ė you would see five large ones and a whole bunch of small ones all clustered. And that would be your hint. Okay, so that would do that. That would be one way to do it. So you can do this with this. But when these actually numbers are all different, no one can solve that problem. Actually, itís a hard problem in general. Well, in any case, this is the Ė this is the factor Ė simplest factor modeling problem. So C is a set of acceptable approximations to sigma. And they can be Ė I mean it could be simple, like some normal, or it can be very complicated and statistically motivated.

For example, it could be a Kullback-Leibler divergence, which would be this for a Gaussian random variable. Okay. And thatís just a very sophisticated way of saying that the two matrices are close. The two variances are Ė two co-variances are close. This is the one that would give you the statistical stamp of approval. The statistics stamp of approval would come from using something like that. Then a trace heuristic for minimizing rank is pretty simple. It goes like this: X plus D is your original matrix, and so your variables here are going to be X and Ė oh, oh, this should either be Ė well, I should either write Ė capital D is the diag of little D.

So itís Ė but anyway, it doesnít say that here which is weird. But thatís it. So this would be the problem. And thatís a convex problem. If you put rank here, itís a convex rank problem. So weíll look at an example. So here is an example where, in fact, I have a bunch of data, which are Ė well, we know it. Itís actually generated by three factors, all right. So what happens is you get snapshots of 20 numbers. They are all varying. They are random. You look at these 20 Ė you look at a bunch of these things, they Ė you get a full covariance matrix, full rank. But it turns out that three factors describe it plus the diagonal elements.

By the way, in a Ė if these were asset returns, the diagonal elements would be called the firm specific. They would be Ė thatís the firm specific variation. Basically, each Ė it said that you have a bunch of factors. I think one is the overall market, typically. And then you get some other things. Some very obvious things if you look at factor models in sort of finance you get these things. And then the Dís are the firm-specific volatilities. Okay. Weíll just use a simple norm Ė a norm approximation. And you get a trace Ė you get a trace heuristic that looks like itís a convex problem. And what weíll do is weíll generate 3,000 samples from here.

Now, by the way, we are estimating, I guess, itís 20 by 20 covariance matrix. You have got about 200 numbers. So you are maybe about 15 times as many numbers or samples are there numbers you are supposed to get. I guess each sample is 20. So, okay, itís a couple of hundred times in terms of the estimate, but you asked me in covariance, so okay. And this is sort of what happens. Itís rank three. And what happens is, is you crank up beta. You start with a rank Ė by the way, the top rank is 20. But you immediately get a 15-rank model. Then, as you increase beta, thatís the Ė that multiplies the trace thing or something like that.

What happens is the rank of that X goes down, and down, and down. You get a very steep drop down at three. And by the way, thatís Ė this is the hint that rank three is going to be Ė give you a nice fit. If you keep going, if you increase beta enough, it goes from Ė it goes down there to two and then stays there. Actually, I guess this would go down to zero at some point. We didnít show beta [inaudible] large. Whatís interesting is as we scan beta, this shows the eignenvalues. And so up here you have 15 nonzero. And you can see at different values of beta, eigenvalues basically being extinguished. They go to zero.

And so what happens is, right here, you end up with three from here on. And in fact, these are the right ones. So if we take beta as some number in here like this, we actually do get a rank three model. You can do polishing in a case like this, too, obviously. Well, you can figure out what polishing is in this case. But in this case, if we take 0.1357, thatís the tradeoff curve, you find that the angle between the subspace, which is the range of X and the range transposed is 6.8 degrees. And that we nailed the diagonal entries; we actually got the firm, specific volatilities within about 7 percent. So this is just an example of this kind of thing.

Okay, so this actually pretty much covers up this Ė the whole topic of sort of L_1 and cardinality. And the idea is instead of just thinking it as sort of a basic method where you just reflexively throw in an L_1 norm, actually these extensions show that, first of all, it comes up in other areas like rank, minimization. These internet methods, actually very few people know about them. They are not even used. Most people arenít even using polishing. Most people arenít even using the asymmetric L_1 stuff. So if you are interested in getting sparse solutions, there is actually better things than L_1 available.

And certainly, these things like these LP Ė I canít say it, LP norms Ė LP measures, I don't know. I don't know what word to say there that is okay. Iíll just say it; LP, quote, norms, unquote, for P less than one. Those were these log approximations and these iterations. All of these things work. And I should also say Ė so I guess Emanuel Candiz and I had a long conversation like a year ago. And he said that L_1 is the least squares of the 21st Century. And, okay, itís a good Ė thatís a good Ė I mean thatís good, actually. Itís not bad. I think itís a little bit of an exaggeration, but itís not too far off, right.

It basically says that they are going to be the same way everybody needed to know about least squares and throughout the 20th Century, eventually everybody did. And by the way, by least square I mean fancy methods like Kalman Filtering, and quadratic control, and things like that. If you call that least squares, then you know a lot of signal processing, and image processing, and all sorts of other stuff ended up being least squares, period. And I think a lot is going to end up being L_1 as people move forward. So, okay, so weíll quit here unless there is some questions about this material.

And then the next topic is actually going to be Ė we are going to jump to model predictive control. So weíll Ė thatís going to be our next topic. Good. Good. Weíll quit here.

[End of Audio]

Duration: 63 minutes