Instructor (Stephen Boyd):I have a feeling that weíre on. Confirmed. Can you go down to the pad and Iíll make a couple of announcements. The first is that homework four is posted. I said that I wouldnít announce these types of things. In fact, I think we posted it yesterday. The next one Ė I shouldnít have to say this, but can you turn off all amplification in here? I shouldnít have to say this, but the midterm is actually the end of next week, so itís actually eight days from now. Weíre more panicked than you are, just for the record. We have a lot of work to do on that. Itís coming along actually quite nicely, I have to say.

Thatís of course next Friday to Saturday or Saturday to Sunday. What weíll do, I think, is today post last yearís midterm, if you just want to see what one looks like. You will also find out where homework problems come from. Many homework problems started life as midterm and final exam problems. Weíll post one so that you know what it looks like and so on, and then maybe a bit later weíll post the solutions. In fact, as to whether or not you really have to go over last yearís midterm, I think you actually donít really have to unless you want to.

If youíve been following the homework and understanding all of it and the lectures, youíre welcome to do last yearís midterm. Let me not discourage you from that. I donít think you really need to. Let me say a couple of other things about the midterm. The midterm will cover through lecture eight, which is material weíll cover a little bit today and finish up next Tuesday. It will cover through homework four, so thatís the coverage of the midterm. Weíll probably put something to that effect on the website so you know. Any questions about the material from last time or the midterm?

Weíll continue. What weíre doing today is actually just looking at some extremely useful extensions of least squares, and many of them involve this idea of multi objectively squares, so in multi objectively squares, instead of having just one objective like AX-Y norm squared that you want small, there are actually two, and so you want them simultaneously small.

Now one of the problems there is the semantics of that is not clear. It doesnít really make any sense to say please minimize these two things. It makes absolutely no sense. The first thing you have to work out is what is the semantics? By the way, if theyíre not competing, it does make sense, but thatís an extremely rare case. Otherwise, you have to figure out what even does it mean to minimize two objectives.

So we started looking at that last time. As a thought experiment, we did the following. We simply took ever XNRN and we evaluated J1 and J2, the two objectives. You want both small. For every X, we put a point. All the shaded region shows you pairs as weíve written it, J2, J1, which are achievable, and then the clear area here are pairs that are not achievable. We talked about this last time. We talked about the following idea, that if a certain X corresponds to this point, then basically, all the Xs corresponding that map into what is lower and to the left Ė these are actually points that are unambiguously better than this one.

Everything up and to the right is unambiguously worse. Unambiguously better or worse means that on both objectives, itís better. The interesting parts are the second and fourth quadrants, because here, itís ambiguous. In fact, this is where weíre going to have to really work out what the semantics is. Here, a point Ė if you want to compare a point here and here, in fact, one of the ways youíd say this is youíd say that in fact, these two points are not comparable. One has better J1 but worse J2. The other has better J2 but worse J1. Theyíre incomparable is what you say.

The boundary here, which is called the Pareto optimal Ė these are Pareto optimal points. This is the Pareto boundary. Itís also called the optimal tradeoff curve. These points are characterized in the following way. Any point on here, thereís no other point thatís better. Thatís what it means. The least you can say, and in fact, thatís all you can say.

If someone says you have a multi-objective problem, you want J1 and J2 small, and theyíre not yet willing to commit to how you trade off one and the other. If they simply say no, I want both objectives small, you can already say something of substance. You can say the following. You can say that that point is a stupid choice. Why? All of these are better. You can say thatís an infractive choice, but it canít be done.

If someone wants to minimize these two objectives, the only non-stupid choice Ė non-stupid and feasible choice Ė are gonna be the points on this boundary. Youíve already done a lot to say that you can focus your effort on these points on this optimal tradeoff curve. Thatís the basic idea in this. This extends to three objectives, four and so on and so forth. Itís an idea thatís completely obvious and that you probably already have in your head anyway.

Thereís a very common method to find points on that curve, and it works something like this. Before we do that, I want to talk about what the curve might look like qualitatively. Letís talk about that, and let me try to do this consistently. Iíll draw my first objective there and my second here. This is what it looks like. That tradeoff curve can have lots of different looks. Iím gonna draw a couple of them. One would be something like this. It actually has to go down. These are all achievable. Thatís actually quite interesting. Iíll make this three and Iíll make this one.

This is very, very interesting. In fact, if you work out that this is the tradeoff curve, and youíll see how to do this very soon. This has huge meaning and implication for this problem. The way you would describe this sort of casually or informally is this. Basically, you would say there isnít much of a tradeoff because the lowest you could ever get J1 might be over here, and that might be 0.9. Thatíd be the lowest value you could get J1 while ignoring J2. It might be here. This might have an [inaudible] or something, and this might be 2.6. So here, youíd say the smallest you could ever make J2 ignoring J1 is 2.6.

On the other hand, look at this. Thereís these points right around here, which give up ten percent in each objective and yet get both. This is so obvious that I almost hate to go over this. This is the proverbial knee of the curve. Itís an efficient point. These ideas are extremely important to have because I guarantee you will be working on problems. You will finish something or do something and the point you will find will be right here. Thatís whatís gonna happen. My opinion is itís not good enough to simply return that point. Itís not responsible. The correct thing to do is to say oh, yeah, I got J1 down to 0.9.

Letís say itís ride quality. It doesnít really matter what it is. Say I got the ride quality down to 0.9. Theyíd say thatís great. Youíd say but you know what? This is when you go back and thereís a design review. Youíd say you know what, though? It turns out if we accept a ride quality of 1.0, I can do it with one quarter of the fuel. I think if you donít point out that thereís this point here, I think youíre actually being irresponsible. The same goes for over here. If someone says find me a point and you find this point and you say Ė itíd be like if youíre doing circuit design. You could say oh, I can make that thing clock at 2.6 gigahertz. But actually, if it clocks at 2.45, Iíll use one-half the power.

As to which is the best choice, it depends. But the point is to me, itís irresponsible if you donít point this fact out. Basically, you donít ever even do Ė when you do least squares and things like that. Anything involving this, you should always just as a matter of responsible engineering Ė you will do studies like this just to check. You wiggle things around to see if things could dramatically change.

This is one where thereís essentially no tradeoff. To really get no tradeoff, you do this. Thatís absolutely no tradeoff. This point Ė actually now, itís great. This is the one case where you can say that is the optimal point. Thatís the only time when a biobjective or multi-objective problem has a unique, well-defined answer where the semantics is clear. That point is good. Itís the best one. No other point would be reasonable here. Any other point would be worse than that point. This is when thereís absolutely no tradeoff.

Now, letís look at the other extreme. The other extreme happens and the other extreme looks like this. You have a point there and a point there, and this might look something like that. That might be the tradeoff curve. Now, thereís a tradeoff. In fact, this is the opposite. Now the tradeoff is almost linear in the sense that when you give up one, you actually gain in the other by a fixed amount. This is what people call a strong tradeoff, and the slope actually Ė one of the names for the slope is the exchange rate.

Youíre actually exchanging J1 for J2 when you move here. When you go from this design to this design, what have you done? Youíre doing something like youíre exchanging J2 for J1. Youíre doing better on J1 by giving up on J2. The slope literally is sometimes called on the streets the exchange rate.

This is conceptual models. Weíre gonna leave them up here and weíre gonna come back to them. Letís look at the idea of the weighted sum objective, which comes up independent of discussion of tradeoff curves and things like that. Itís also completely normal if you have two objectives to Ė if you want to come up with some answer to actually just add them with some weight in between them. So I add this function plus Mu, this objective plus Mu times that one, and the idea is that Mu is supposed to give a relative weight between J1 and J2. Question?

Great question. I was trying to go real fast and kind of avoid that question. Iíll do it. Could the optimal tradeoff curve look like that? You want me to draw it in like that? It cannot. It actually has to be convex here. It has to curve up, and thatís because for least square problems, the J1 and J2 are both convex functions. Thatís not part of this class. I was drawing them the way they must look in 263. If you have non-convex functions, they can absolutely look like that. I will show you one that they cannot look like. It cannot look like this ever. It canít look like this. Thatís not possible.

First of all, these points are not Pareto optimal because if thatís a feasible design, everything above and to the right of this point has a technical Ė the technical name is itís a bad design. That means that this is not part of the tradeoff curve. In this case, the tradeoff curve for something that would look like that actually is discontinuous. Itís got this point and then itís got this line segment. These things can happen in the general case with general objectives. They canít happen with quadratic objectives like youíll see in 263.

Before a weighted sum objective, it turns out that you can interpret this easily on a J1 J2 plot or J2 J1 plot, and thatís this way. If I look at level curves of this composite objective Ė thatís J1 plus Mu J2 Ė in the J2 J1 plane, these are nothing but lines with slope minus Mu. If you were to minimize J1 plus Mu J2, hereís what youíre doing. You are doing nothing but this. Youíre moving this line with a slope of minus Alpha, which is fixed, and you simply move it down until you last have contact with the set of achievable points.

That is always a point on the Pareto optimal curve, and actually, thereís a lot more interesting stuff about that point. Another one is this Ė if you were to zoom in locally, these local slope here would be exactly Mu. Let me summarize that. By minimizing J1 plus Mu J2, you will actually find a point on this tradeoff curve. Thatís fact number one. Number two, you will find a point where the local exchange rate is exactly Mu or where the angle is Mu. This picture, though simple, explains everything. If I increase Mu, what happens?

If you increase Mu Ė letís think about what happens. If I increase Mu, what youíre really saying is you know what? I care more about J2 than I said before. Presumably, weíll find a new point where J2 is smaller. Youíre gonna pay for that. J1 is gonna go up. Thatís the way these things work.

Letís just see if you can get that visually. Itís very simple. If you crank Mu up, thatís the weight. You simply change the slope like that and you do the same experiment. You take this thing and you move it until it just touches and sure enough, thatís the new point for the new slope. Here I cranked up Mu by a factor of three or something. Thatís a new point. Sure enough, itís a new point on the optimal curve, and indeed, it has reduced J2 and to pay for that, it has increased J1.

If you have the ability to minimize the weighted sum of the objectives, you can actually now sweep out the optimal tradeoff curve by simply sweeping Mu over some range and minimizing this weighted sum objective and you will sweep out points. By the way, if the Mus you choose are not over a big enough range, you will sweep out just a little tiny thing.

In fact, in practice, a lot of people use Mu on a log scale because it has to go for usually a pretty big range. This is just a practical detail. Conceptually, you simply solve this problem for lots of values for Mus, store the design and the J2 J1 achieved, and plot that. You have the optimal tradeoff curve. Thatís exactly how these curves were created.

Not only that, but the picture gives you a lot of geometric intuition about what happens when you mess with Mu. Now, I want to go back to my two tradeoffs. Hereís a problem where there is no tradeoff. Letís do the one where thereís a slight but very small tradeoff. Letís do that one first. J2 J1 and Iím gonna put a slight but small tradeoff like that. Now, letís talk about minimizing J1 plus Mu J2. What will happen when I minimize J1 plus Mu J2? As I vary Mu, what happens?

Well, when you fix Mu, you get a slope like this, and you simply march down this thing until you first lose contact with it and you get a point there. Now, you change Mu a lot like that, and you go down here and you get a new point. What you should see here is that in fact the Xs are not changing much. Youíre always getting Ė over a huge range of Mu, youíre getting points right around there. In other words, youíre getting the knee of the curve over some huge range of Mus. Everybody see this? The two things Ė hereís what youíll notice. Number one, the actual design you get is largely insensitive to Mu.

Thatís the first thing. If you crank Mu to ten to the eight, then you might start getting something up here. And if Mu is ten to the minus eight, you might start getting a point down here. But the point is that for this huge range of Mus in the middle, you have a lot of Mus and youíre just tracing out this little tiny thing here. By the way, if you see that, it means youíre seeing a problem where thereís not that much tradeoff. The two objectives are not particularly competing in this case. Thatís kind of the idea.

Letís do the other one now. Letís do this one. Honestly, I donít know why I drew it with J1 vertical. Letís do the other one where thereís a strong tradeoff. Hereís the curve like that. Now, letís talk about minimizing a weighted sum. What happens now? How sensitive Ė what happens is you vary Mu and minimize the weighted sum objective. What happens? Itís very sensitive. Basically, for Mu below some number, you kind of get points here. Letís say it flattens out over here. For Mu above some number, you start getting points over here.

Right when Mu is around this slope, right as you sweep Mu through that point, this thing jumps tremendously. Everybody see that? Thatís the point here. You will see this, and it has a meaning. This is the meaning. It means youíve got a linear tradeoff. If the tradeoff in here were exactly linear, youíd actually get an amazing thing where the weighted sum objective would jump from this point all the way to that point with nothing in between. It would be absolutely discontinuous. For quadratic functions like weíre looking at, that canít happen. That could happen.

When you get many dimensions, all of these things Ė you can get all of these phenomena. You can get parts where the surface is kind of angled. You can get other parts where itís very flat, and as you mess with weights, things will jump from one place to another. Other regions where you mess with the weights and in that case, itís a tangent hyper plane touching this optimal tradeoff surface. You mess with the normal and it kind of rolls around and doesnít do very much. You can get all of these kind of phenomena, but itís important to understand these ideas.

Now letís talk about how you would specifically do this for a biobjective least squares problem. How do you minimize this? The way we can take two quadratic objectives and reduce it to a problem weíve already solved is all we have to do is say that the sum of this norm squared plus that norm squared is just this. Itís absolutely nothing more. You can check.

The top part of this is AX minus Y. The bottom part is square root Mu FX minus square root Mu G. Now, the norm squared of a stacked vector is the norm squared of the top plus the norm squared of the bottom. Norm squared of the top is that term. Norm squared of the bottom is that. If you like, you can put the square of Mu in both of these places, but when you square it and pull it outside, it looks like that.

That means weíre done, because this we know how to do. This is no problem. Weíll call that A~ or something like that. It just means in [inaudible] itís even simpler. Itís something like AN Ė if you really want to do this, something like that times F. I hope Iím doing this right and then backslash and then Y and square root Mu times G. There you go. Thereís the code for it. You shouldnít have to write that down.

The formula for it is this. Itís gonna be A~ transpose A, one of the inverse, times A~ transpose. Well, A~ transpose A Ė you can work out analytically what that is. Thatís it. Thatís A~ transpose A~ like that. You see something actually quite beautiful here. You see A transpose a plus Mu of transpose F, and thatís how this works. Over here, you get A transpose Y plus Mu F transpose G. It just sort of works out. Itís a very pretty formula.

Letís look at some examples just to see how this works. These are going to be really simple examples, but just to see what happens. This is our friend the unit less mass on a frictionless able. We have a ten second period. We apply forces X1 through X10, each one for a second in turn. Weíre just gonna have one. We donít care about the velocity. We care about the position, FT equals ten, and so we have Y equals A transpose X where A is in [inaudible], and I think you may remember what A is. I think it goes down by one. Itís large for the first one and then goes down a half or something.

By the way, when you see a vehicle or objection motion problem where the person specifying the problem tells you where this object has to be but doesnít seem to care how fast itís going, generally speaking, that corresponds to a non-positive social use. Usually, this would be something like a missile hitting something. When someone says no, Iíd like you to be Ė if you say what about the velocity and they go it doesnít matter, that Ė you should be suspicious at that point.

This is one of those cases where there are no specifications about the velocity of this mass, so no specification on the velocity. Here, you just want to be near the point one. This is a stupid problem. We could solve it by hand. Itís totally idiotic, but just to give you a rough idea, we could work out what this is. J2 is the sum of the squares of the forces used. This has units of Newtons squared.

You might ask why would you care about the sum of the squares of the forces applied? Let me ask that question. Many people take whole courses here where the entire course, everything is quadratic. You go take a controlled course in aero astro, everything Ė squared Newtons, integrated Ė thatís all you see. Why do you care about it? It corresponds to energy.

Iím going to tell you something. Thatís not true. Thatís what they tell you in those classes. Thatís what I should be telling you now. Thatís the party line. That's what weíre supposed to say, that it corresponds to energy. Thatís total nonsense. Generally speaking, I know of almost no case where it actually corresponds to energy. By the way, any case where it does correspond to energy like in a disk drive servo, thereís actually no limit on energy and no one really cares. I said it. Iím being honest.

Now why do we really care? Why do we work with the sum of the squares? What do you think? Thank you. Itís easy to analyze. Right. Because we can. Thatís why. I just wanted to be clear on this. There are lots of other things here. If this was a thruster, you could probably care. Thatís something you would really care about. Thatís the field use. You might care about this. Thatís the maximum force applied.

Why? Because this dictates how big a thruster you have to get. This has practical use. The sum of the squares, it might, but itís very unlikely. You never get an actuator, and this comes up in signal processing, too, where it says here it is. It needs 28 volts in, five amps, and it never says under no circumstances should the norm of the input exceed this. You just wonít see that. Thatís not what it is.

Weíre gonna go back, now that Iíve said that we do this because we can. Letís go back. If anyone asks you and you donít want to get into this big argument, you just say it represents energy. If they buy it, move on quickly. Thatís my recommendation. Itís kind of a stupid problem, so letís talk about some things here that we can just know immediately.

This says that basically youíre gonna apply a force for ten seconds. Youíre gonna move this mass and youíll be charged Ė there are two things you care about Ė how much you miss being displaced one meter and the sum of the squares. Let me ask some questions. Does it make any sense to move the mass backwards before moving it forwards? Obviously not because youíre running up a J2 bill and not for any particularly good reason in terms of J1.

Does it make sense to overshoot the target, which is one point, and to say here are my masses and say oh, look at that. My final position was 1.1. I overshot. No, thatís totally idiotic because youíll run up a bill here for overshooting and itís stupid because for the same amount, you actually could have landed right at the point and run up zero bill here. Youíre gonna always undershoot. You can also figure out in this problem that youíre always gonna push, and youíre gonna push more at first because itís more efficient. You can figure out a lot of this before you ever even form a formula.

The optimal X is this. Thatís a function of Mu. Thatís that tradeoff parameter. As we vary Mu, weíre gonna get different full trajectories. By the way, you can even work out an analytical formula for this. Thatís not the point and it doesnít really matter. Hereís the optimal tradeoff curve. So there it is. Itís very pretty. Hereís the energy. Notice how I said that without making any apology. Hereís the energy and here is the square of the missing distance. This curve actually hits zero at a point. It doesnít approach zero [inaudible]. It hits it.

This is a very interesting point that weíre going to discuss either later today or maybe Tuesday. At this point, you are hitting the target exactly, and youíre using the minimum energy possible. Thatís what it means to say that this curve hits this line. Zero means Y is one. At ten seconds, that mass is exactly at a position of one. Thatís the energy bill you run up. There are many, many force programs that will displace the mass one meter after ten seconds. They all lie along this line, and theyíre characterized by using more energy. This is gonna be the least energy thing that gets you right there.

What about here? Does this curve hit or is this [inaudible]? Does it hit it? Letís ask the question. Would it be possible to run up a bill J2 of zero? Could you? It doesnít move. You could do that. You could just have X equals zero. So you do nothing. Youíre doing very well in terms of J2. You couldnít do any better. You just take the hit, which is the cost on J1, and thatís here.

By the way, this is a beautiful example. Take a look at what that curve looks like near zero. So basically, if someone comes to you and says Iím sorry, Iím just not gonna do anything, this curve Ė not only does it have a steep slope, it has infinite slope there. That says that with extremely small levels of force applied, you can reduce your miss-hit distance by a correspondingly very large amount. Thatís the picture.

This is a silly example. You could have done all of this analytically or figured it all out, and thereís no surprises here. Trust me, if this was a vehicle with 12 states and 13 different inputs representing different thrusters and control surfaces you can actuate and things like that, this is not obvious. You already have four lines or five lines of code that will beat anything any person could ever come up with, and Iím talking about good pilots and things like that. Same code. Three lines. I think itís just the one I wrote before. Itís not even three. Three with a lot of comments, actually.

This stuff looks simple. This example is stupid. Trust me, even if these matrices Ė if the dimensions of vectors get to be five, ten, let alone 500 or a thousand, youíre doing stuff that is absolutely impossible for someone doing intuitive based stuff to even come close to.

Now thereís a very famous special case of biobjective least squares. Itís where the second objective is really simple. Itís just the norm squared. Here, the way to understand the semantics of it is you have a problem where you say I want a good fit. I want AX minus Y norm squared to be small, but I donít want to do it if the only way to do that is to have a giant X. I want some tradeoff there. I will accept a poorer fit in return for a modest X.

Where you operate on that curve determines the tradeoff of the optimal size of X versus the fit. At least one end of the trade we actually know. When you only care about J1, thatís just least squares. Thatís classic least squares. We know the solution. If you only cared about J2, letís get that out of the way right now.

If you only cared about J2, whatís the best choice of X? Zero. And the objective on the other side? Norm Y squared. The two end points of this tradeoff curve are now known. By the way, thatís an exercise you should always do is figure out what you can say about the endpoints, because all the action then goes down in between the two.

In this case, you get X is A transpose A plus Mu I inverse A transpose Y. This has got lots of names. Maybe the most common is tickenoff regularization. In statistics, you will hear the following phrase. Thereís probably many others. In statistics, this is called ridge regression, and Mu is called the ridge parameter. In tickenoff regularization, Mu is called the regularization parameter.

Iíll show you something kind of cool about this. This formula makes sense for any A Ė skinny, fat, full rank or not full rank. I have to kind of justify that, so Iím going to. Hereís my claim. My claim is that A transpose A Ė thereís lots of ways to prove this, but Iíll do one. Iím gonna claim that thatís invertible provided only that Mu is positive. This formula even makes sense for A equals zero. Itís a stupid formula, but it makes perfect sense. Iím saying provided here Mu is positive. If A is zero, no problem. Itís X equals Mu I inverse. Mu I is perfectly invertible times zero, so X is zero in that case.

By the way, when Mu is zero, you recover least squares, and now this formula is one you better watch out for, because that formula only makes sense when A is skinny and full ranking. It parses, but it doesnít pass the semantics test if, for example, A is fat. The weird thing is you can do tickenoff regularization if A is fat and this makes perfect sense.

This is why some people use tickenoff regularization because theyíre lazy or they tried this and they got some error or something somewhere. Some software told them that they were trying to invert something that was singular into working precision, and they were like, well, whatever, and then they just put in plus 1E minus 6I, and they said now itís working. Trust me, you see a lot of that.

Let me justify why this matrix here is in fact invertible provided Mu is positive. Totally irrespective of A Ė I donít care about the size of A. I donít care about the values, rank Ė couldnít care less. Letís check. What we have to do is we have to do an [inaudible] experiment, and youíd say, well, look, thatís a square matrix. Suppose it were singular. That means thereís some vector Z, which gets mapped to zero, and Z is non-zero. Thatís what it means. It means this matrix here [inaudible] a square being singular means itís got an element in a null space. Itís non-zero. Thatís the case.

Letís do something here. If this is the case, Iím gonna multiply that equation on the left by A transpose, and Iím gonna write this down. Thatís surely zero, and this is a zero vector. Thatís zero number. Why? Because Z transpose times the zero vector is zero, obviously. Now what Iím gonna do is Iím gonna take this and letís expand it. Iím gonna write it this way. Iím just going slow here.

Plus Mu and Iíve commuted the Mu and I get that. Now, Iím gonna write this as norm AZ squared plus Mu norm Z squared equals zero, and now Iím gonna use a very deep mathematical fact. If you have a sum of two non-negative numbers and the sum is zero, you can make an amazing conclusion. That is that both the numbers are zero. Everybody follow me? Norm AZ square to zero, so norm AZ is zero, and norm Z is zero.

If you want to know where does the Mu positive come in, itís right now, because if Mu is zero, all I can say is that. This says because this is not zero up here, we have our contradiction and weíre done. Thatís why this works. Thatís one way to say it. Another way to say it is to look at the stacked matrix and just show that the stacked matrix [inaudible] is full rank. Thatís the other way to look at it Ė that if you take any matrix at all, any size, any shape, any values, and you stick below it, square root Mu a positive number times the identity, that matrix is full rank and skinny. Thatís the other way to think of it, which is also probably a better way.

This is called tickenoff regularization and has lots of applications. Here are the types of applications you would see. Itís very common in estimation inversion, and it works this way. Typically something like X minus Y is a sensor residual. Here, if you choose the X that minimizes sensor residual and you like the X you see, great. No problem.

But in fact, you might have some prior information that X is small. If you have prior information that X is small or another application is where [inaudible] is actually your model Y is AX is actually only approximately valid, and itís certainly only valid for X small. Weíre gonna see an example of that momentarily.

So the regularization trades off sensor fit in the size of X. Thatís what youíd do. By the way, this comes up in control, estimation, communications, and itís always the same story. Thereíll be some parameter and some method or algorithm, and when you turn this knob, because thatís really what Mu is. Itís a knob that basically tunes your irritation is what it really does. It tells the algorithm what youíre more irritated by.

As you turn Mu, what will happen is at first youíll get Ė if you turn it all the way one way, youíll get a control system thatís kind of very slow and doesnít do such a great job but it doesnít use huge forces. You turn the knob all the way the other way and youíll get something thatís very snappy and is using very big forces or things like that.

In communications, you get something that equalizes things out beautifully, but itís very sensitive to noise. You turn the parameter the other way around and you get something that is gonna be really Ė itís kind of very calm, doesnít overreact any noise or anything like that. It cleans it up a little bit but not much. These are the types of things youíll see all over the place.

Let me mention some examples in image processing. Thatís a very famous one. We should actually add that to the notes. A very famous one in image processing is this. Itís called La Placean regularization. Let me say what that is. Youíve already done one, or you will at 5:00 p.m. today. Look at an image reconstruction problem. That image reconstruction problem, the sensor measurements are pretty good and there wonít be any problem. It will just work. Why? Because we arranged it to.

However, in general, youíll have the same sort of thing, and actually what you want to do there is you want to add Ė let me just explain what weíre doing. I want to estimate an image, so I have an image here and I have some pixels. What Iím estimating, my X is sort of the value in each of these things. If with your sensors you estimate X and you get some crazy numbers that vary pixel by pixel by huge amounts, this kind of hints trouble, because normally when a person writes down pixels, thereís sort of an implicit assumption that youíre sampling fine enough.

So for example, if this were plus ten and that was minus 30 and there were wild swings here, what youíd do when you looked at that is that youíd probably say you need to sample finer is probably what youíd say. So now my question is letís say that AX minus Y Ė X is a vectorized version Ė itís a rasterized version of the image. This is gonna be misfit from my sensor readings. I want the following. I want to trade off the fit Ė thatís this thing Ė with the smoothness of the image. Now, Iím waiting for you to tell me what do I do.

Letís add a new objective. To do this, we have to add a new objective, and the new objective is gonna be not smoothness, actually. Itís the roughness. We have to write down a roughness objective. Someone please suggest a roughness objective, which is kind of a norm. Perfect. Weíre gonna form a vector, which is the difference between neighboring pixels.

We could have vertical or horizontal. We could have diagonal. It doesnít matter. We could have all of them. Iíll skip a few details here. Theyíre not that complicated. When you do that, you can actually write that as a matrix D, because itís really a differentiation. Iíll take DX and I could have things like D horizontal and D vertical X.

This is a new image whose value at a certain pixel is the horizontal difference here, and thatís the vertical difference. By the way, if both of these matrices Ė describe the null space, actually, of these two matrices. Whatís the null space of these two? Constant. [Inaudible], but in fact this would be any image that is constant along horizontal lines but can vary in X. The null space in this one would be any image which is constant along vertical lines but can vary this way. I donít know if I got that right. I think I did. Something like that.

So what you do is simply this. Weíre done. There. Thatís a least squares problem. Now youíve turned Mu. You turned Mu all the way to zero, and you get the problem youíre doing now. You get no regularization. In other words, if you like what you see, in this case, thereís no problem.

If, however, you do this Ė by the way, in many problems with noise and things like that, when you actually just minimize this, youíll get an image which is way too wiggly and stuff like that. Now, you turn Mu up. By the way, if you turn Mu all the way up, what does the solution of this look like? Donít do the math. I just want the intuition. Yeah, itís totally smeared out. Itís just a big, gray mess, and itís just equal to the average value or something like that.

Somewhere with Mu in between, youíre gonna see the right picture, and then you might ask how do people choose the regularization? Everyone see what Iím saying? This is how you use regularization. This is it. I donít know why we donít have an example of that in the notes. Weíll add one, I guess. Then thereís the question of how do you choose Mu? How do people choose Mu? We should distinguish two things Ė how do they really choose Mu and then when someone asks them, how do they choose Mu? What do they do?

How do they really do it? They try this for one Mu and they go no, itís too smeared out. Reduce Mu and they go nah, itís getting too much speckle in there. I can still see some weird artifacts and they go increase Mu. They iterate over Mu. Theyíre just tweaking Mu. Thatís how they really do it. What happens if someone says in a formal design review how did you choose Mu?

Then they go, well, I calculated this optimal tradeoff curve and statistically, this corresponds to the posterior variance of this, that and the other thing, and I used the such and such model and thatís how I came up with Mu equals two times ten to the minus three. Thatís what they would say. Whereas in fact, they tried Mu equals 0.1. It got too smeared out and they tried Mu equals 1E minus six, and they didnít get enough regularization. Thatís how they really did it. Any questions about this?

By the way, if you know about least squares and regularization and tricks like this, youíre actually on your way to Ė this can be very effective in a lot of problems. By the way, one more point about this. If you go back down to the pad here, this penalizes smoothness, but suppose you also care about the size of X. Suppose you run this but X is huge Ė itís smooth, but huge. How would you reign in the size? Iíll take my pen out. What do you do?

You got it. Less Lambda norm X squared. How would you choose Lambda? By messing around. How would you say you chose Lambda? You would talk about the optimal tradeoff surface and tangent and exchange rates and then if youíve had some statistics, you could throw in some statistical justification. But you found it by fiddling with it. Itís not just total fiddling with it. As you increase Lambda, you can say one thing about the image Ė what happens? It gets smaller. Itís gonna be a little bit rougher and itís gonna have a little bit of a worse fit to the measurements, if thatís what these are. Thatís it.

So now you know how regularization works. It works quite well. Next topic is related. Itís also a huge topic Ė non-linear least squares, NLLS, and here it is. It says I have a bunch of functions. Now, for us, we have the residuals as AX minus Y. Thatís an affine function. Itís linear plus a constant. What we do in least squares is we minimize the sum of the components of the residuals. Now the question is what if these RIs are non-affine? Thatís the general case. This is called the non-linear least squares problem. Actually, of course, this residualís not linear, either. Itís affine. Linear sometimes means affine.

How do you solve a problem like that? Weíll get to that in a minute, but letís just look at some examples. These come up all the time, non-linear least squares. A perfect example is GPS type problems where you have ranges so you have measured ranges and from that, you want to estimate where a point is. There, you donít have to linearize. You would just minimize.

This is actually now the exact range error squared. This comes up in tons of places. In estimation problems, it would come up because you have some kind of non-linear sensor. Instead of something like a line integral through something, you might have something that is non-linear. This comes up all the time. Thatís an example.

How do you solve these problems, and here, I have to tell you something. The first thing that has to be admitted, if weíre being honest Ė thereís nothing that great about being honest, but to be totally honest, no one can actually solve this problem in general. Thatís the truth is basically this problem cannot be solved. Instead, we have heuristics that solve this problem. They donít really solve it. That would be they solve it like that. You donít actually really solve. So non-linear least squares problems in general are not solved period.

If you go to Wikipedia, if you go to Google and type in non-linear least squares, whole books, everything all over the place. You will probably find nothing that admits this fact. Thatís a very big difference from linear least squares or least squares that weíd been looking at so far. We said that A transpose A inverse A transpose Y is the least square solution. We werenít lying. That vector minimizes the norm of R if R is AX minus Y absolutely. Thereís no fine print, nothing. Thatís the minimizer. All methods for [inaudible] a least squares problem donít have that property. They are all heuristics.

You probably wonít find out certainly from people who have an algorithm for this. It gets kind of weird after awhile to say things like howíd you do that? After awhile, in [inaudible], you say I solved a non-linear least squares problem. Technically, that is false. Youíll see that in papers. Itís false. When someone says that in a paper, thereís always the question because thereís two options Ė either A, they know they havenít solved it and theyíre a liar because theyíre saying in a paper they solved it or B, they donít even know that they may not have solved the problem.

Itís usually the latter. Thatís usually the problem. Theyíre just totally clueless. Theyíre like I donít know, I got the software. I downloaded it from the web. It was called non-linear least squares. It didnít complain. You have to know itís all a heuristic. You donít get the answer.

By the way, in practice, you very often do get the answer, but you donít know that you got the answer. I made enough of a point about that. Having said all that, and also having said that I will probably slip into the informal language where Iíll say how do you solve non-linear least square problems, the answer is you donít because you canít because no one knows how to do it in general.

How do you approximately solve or something like that, so you have to put a qualifier like an asterisk, and then at the bottom of all these pages you just have a little note where it says solve, and then down here, Iíll just write it in so weíre all cool here, it says not really, but maybe. That would be the right thing. Thatís just in force until I say itís not.

So how do you solve least square problems? Thereís lots of methods that are heuristics and they often do a very good job for whatever application youíre doing. The very, very famous one is Gauss Newton. By the way, the name of the method suggests that this was not developed five years ago. This is not exactly a new method. Itís actually pretty straightforward. It goes like this. You have a starting guess for X, and what you do is you linearize R near the current guess.

Now, linearize means find an affine approximation of R near there. Then what you do is you replace that non-linear function with this affine one. Affinely squares, which on the streets is called linearly squares Ė we know how to do that. Thatís what weíve been doing for three days. Thatís no problem. Then you update. Thatís the new X. Then you repeat so you get a different model. This is called the Gauss Newton method. Hereís some more detail, and hereís the way it works.

You take the current residual and you form Ė basically, what youíre doing is you form an approximation that says if X were to be near XK, then I would have R of X is about equal to R of XK plus Ė thatís the Jacobean Ė times the deviation. Here, if you wanted to, you could say something like provided something like X is near XK. I can put a little comment there like that.

Now what we do is we write down the linearized approximation. We rewrite this to make it look like an affine function, but we can put it like this. It doesnít really matter. Now what you do is you minimize the sum of the squares of these things over choice of X. That gives you your next iterate. The next iterate is that, and itís just a formula for that, which is this thing. You repeat until this converges, which, by the way, it need not. Question?

Absolutely. Youíre right, and Iím very glad you brought that up. The question was this, that last lecture, I believe I was ranting and raving about calculus versus least squares where calculus is getting a linear model of something thatís super duper accurate right near the point, but itís vague about the range over which itís Ė did you notice that in your calculus class? They go this is a really good estimate near this thing, and you go, well, what does near mean? And theyíre like, you know, near. Donít you Ė thatís the beauty of all this, right? I donít have to say how near it is. Itís the limit.

The point was how about doing Gauss Newton where you donít use the derivative of the Jacobean but you get a least squares based model? Instead of this, use a least squares based model, and that was your question, and I can tell you this. That is an outstanding method, often far superior to Gauss Newton. My answer to your question is yes, thatís a good idea. It often works really well. In fact, thereís a name for this. In the context of filtering, when youíre doing estimation of a dynamic system, thatís called a particle filter. Let me just say a little bit about that.

By the way, I had some pseudo code. At this level, itís so vague that the linearized near current guess, you could use several methods there. One would be calculus, and the one I just described is calculus. In fact, linearized near current guess Ė that could be done by a least squares fit, just as you recommend. You wouldnít call this Gauss Newton anymore, but you might or something.

Anyway, in the context of filtering, itís called a particle filter. These work unbelievably well. Instead of this thing, which is sort of calculus, what you would do instead is youíd take XK and youíd add to XK little [inaudible], and youíd actually evaluate R of XK for a whole bunch of points right around there, and then youíd fit the best affine model to what youíve got, and then all the method would work the same. That would work really well, by the way.

By the way, one little comment here. This says provided X is near XK, so you could well get into trouble here, because you linearize Ė in this case, we did it by calculus Ė you linearize and at least it says if X stays near XK, this is a good model. But you just solved this problem, and thereís no reason to believe that this point is near XK. If itís not near XK, then the whole premise for how you got XK plus one is in doubt, because youíre using this model which need not be valid. Thereís a very cool trick, which since we just did regularization, I can tell you what it is. The super cool trick is this. You add this. There you go.

This says minimize the sum of the squares of my model for the residual. Thatís what this says. This says but please donít go very far from where you are now, because if you do, thereís no reason to believe thatís a good model. By the way, this is called something.

This is sometimes called a trust region term because when you form this model Ė by the way, either with least squares or with calculus, you have the idea of what is the area around XK where you trust that model? Thatís the trust region. This basically is your trust region term. Everybody got this? Once you know about regularization, you start realizing you should be using it in a lot of places, and this would be a perfect example.

What I described here was the pure Gauss Newton. Letís look at an example and see how this works. Hereís an example. We have ten beacons. No linearization here, at least not in the main problem. The true position, thatís this point right here, and thatís where we started off, and let me tell you what the measurements are. I have ten measurements. What I know is the distance of a point to each of these beacons. These are noisy.

By the way, if they were noise free, this problem would be trivial. If they were noise free Ė if I knew the range to this point, I would draw a circle with that radius around that point. I would do that for all the points and they would all intersect in one point and Iíd say thatís where you are. The problem is the range measurements have errors. In fact, theyíre considerable errors. Theyíre plus/minus 0.5. So theyíre not really going to all come to one single point. Weíll use Gauss Newton for that. Actually, here, thereís other methods you could use, but this is just a simple example.

By the way, this is what the non-linear least squares objective looks like as you vary X. This is going to be Ė that happens to be the true global solution in this problem, but you see, this is not a quadratic function, which would be a nice, smooth bowl. Itís got all these horrible little bumps and things like that.

In this problem, it doesnít have any local minima that are not global, but we could easily have done that by putting some point here, and then they would have had a little valley here that would have filled up with a lake or something. As soon as you have that, I guarantee you Gauss Newton method will, given the wrong initial point, land right there very happily, at which point you have not solved the problem and you donít know it.

Okay. If you run Gauss Newton on this starting from here and youíre way, way far away. In this case, it actually works. In a sense, you actually do compute the solution, but you donít know it. I only know it because I plotted it. Itís got two variables and I plotted it for everything. But the minute youíve got five variables, youíre not going to know it anyway. What happens now is the objective and the Gauss Newton iterate just keeps going down, and you can see it in about five steps. It hits what appears to be a minimum, and it takes maybe five, six steps or something like that, and thatís it.

By the way, you donít really know Ė the final estimate, by the way, was quite good. The final estimateís minus 3.3 plus 3.3, so thatís the actual true position. Thatís where you started. After one step, you were here, and two you were here, then there, then there, and now you can see youíre sort of in the region where youíre going to get the answer. Itís pretty good. You are getting from the least squares part of it.

You are actually getting the blending of ten sensor measurements, so youíre getting the power of blending. Some people call that a blending gain or something like that, and I forget what they call it in GPS. They have some beautiful, colorful word for blending lots of sensors and measurements and then actually ending up with an estimate thatís better than any of the individual sensors.

Here, you actually end up with an unusually good estimate Ė better, actually, than the accuracy of your individual sensors. Thatís the picture. In this case, it actually worked. In other words, we got the global solution, but thatís only because I plotted it. I think I already mentioned this.

Let me ask Ė well, I can ask you. Suppose you have a problem. Real problems donít have two variables, I might add, right? You have two variables, you plot it and you use your eyeball, okay? This is silly. You have three, you write three or four loops and you go to lunch. Letís just bear that in mind. Real problems have ten variables, 100, 1,000. You cannot plot pictures like that.

So you have a non-linear least squares problem where you are estimating, letís say, some non-linear estimate. Maybe itís a topography thing. You have a non-linear sensor. Itís a variation on the problem youíre doing for homework right now, except instead of having a linear sense, you have a non-linear one. How big is that problem we gave you? Thirty by thirty? Tiny. Thatís 900 variables. Thatís pretty small.

Now you run a Gauss Newton in 900 variables and you get an image. By the way, if youíre imaging somebodyís head and it came out looking like a head, thatís good. That would be your feedback that something is approximate. If it came out not looking like a head, that wouldnít be good.

What would you do as a practical matter there to check whether you got in fact or just to enhance your confidence that you may have actually minimized the non-linear least squares? What would you do? Exactly. Youíd run it once and youíd see what you got. By the way, if you had a pretty good estimate ahead of time, and thatís actually likely to help. You start with that. But what you might do is exactly what was just suggested. You run it multiple times from different starting points and you just see what happens.

Here are some of the things that can happen. The first is that no matter where you started from, it always goes back to the same thing. What do you know if you do that 50 times? Letís be very careful in our verbs. I used the word know, so what do you know when you run it 50 times and you keep getting the same point? Hereís what you know. You know nothing. I mean both in theory and in practice. You know absolutely nothing, because R900 is a huge place, and the fact that you just sampled 50 points out of it, thatís nothing. You know nothing.

Now as a practical matter, what can you say when someone says to you I did the imagining. There it is. I believe thatís the image. It looks good. Itís clearly a head and thereís somebodyís brain there. It looks good. Someone says do you know thatís the global minimum and you go no. But Iíll tell you what I did do. Last night when I went home, I started up a batch of doing 1,000 of these from different initial conditions, and I found that in something like Ė letís suppose you found in all of them, you say in every single case, I converged the exact same final estimate. Theyíd go cool.

Youíre getting the global minimum. If thereís no lawyers present, you can say good bet. But other than that, if someone says what can you really say you know, you canít say anything. You can say I donít know. You would say if you ran 50 and got the same answer each time, youíd say that as a practical matter, you have enhanced your confidence in the design. Actually, thereís a great phrase, which makes absolutely no sense and is wrong. Itís very useful. Itís this. You could say exhaustive simulation. Have you heard this phrase? Thatís great.

So this is what you say to someone when you open the door and you say hop it. You go are you sure this works? You go no problem. We used exhaustive simulation. Hop in. Thatís a very useful phrase. Of course, it makes absolutely no sense if thereís more than three or four things varying, and is generally speaking just wrong. You say we checked a million values of bursts of wind and things like that, all sorts of terrible things. Weíve simulated them all. Of course, you cannot.

If you find yourself in a situation, you can always use that phrase. Then you mention the number, and as long as this person doesnít take the nth root of that number where N is the number of parameters, everything is cool, because a million simulations in R10 is like zero. Itís like the tenth root of a million, which is a small number.

Thatís non-linear least squares. I hope we have a problem on that, but I have this weird feeling we didnít. It just didnít make it? Thatís horrible. Fortunately, we have some recourse, donít we? It seems that over the next week, you might not see non-linear least squares. Itís an important topic, and really one that should be covered in the first half of the course, and I mean covered like I think you should do one. Letís look at the next topic. Iíll just say a little bit about the beginning of the topic. Itís actually the dual of least squares. Youíll get used to this idea of duality.

I donít think weíll ever get very formal about it, but at least Iíll give you the rough idea. Let me say a little bit about duality. Itís going to involve ideas like this. Itís going to involve transposes, so there are going to be transposes. Rows are going to become columns and things like that. Null spaces are going to become ranges and thatís the kind of thing. By the way, thereís a duality between control and design and things like that and estimation, because there youíre switching the roles of X and Y, typically. These are all sort of the ideas.

Weíve done least squares so far. The dual of that or a dual of that is going to be least norm solutions [inaudible], because weíve so far been looking at over determined equations, and weíll see what this is. This is actually pretty straightforward stuff. Now weíre going to take Y equals AX, but A is fat now. That means M Ė you have fewer equations than you have variables. You have more variables than equations. Another way to say this is X is underspecified. So even if there is one solution here, thereís going to be lots, because you can have anything in null space of A, which, by the way, has to be more than just the zero element now because itís got a dimension at least N minus M here.

Weíll assume A is full rank, so that means that you have M equations and theyíre actually independent equations, so the rows of A are independent. Then all solutions look like this. The set of all Xs to satisfy AX equals Y is you find any particular solution here. You will very shortly see a particular solution. You take any particular solution and you add anything in the null space. By the way, if a person chooses a different XP here, you get the same set, because the difference of any two solutions is in the null space, and you get the same thing.

Here in this description, Z essentially parameterizes the available choices in the solution of Y equals AX, and you can say roughly that the dimension of the null space of A, which is N minus M, because A is full rank, that gives you the degrees of freedom. That says you can choose X to satisfy other specs or optimize among solutions. I guess as we talked about before, as to whether or not thatís a good or bad thing, that depends on the problem. If this is an estimation problem, degrees of freedom are not good.

Basically, thatís stuff you donít know and cannot know with the measurements you have. If this is a design problem, this is good, because it means you can do exactly what you want many ways, and therefore you can choose a way thatís to your liking. Thatís the idea. I might as well just show this and then weíll continue next time. Here is a particular solution. Itís A transpose times AA transpose inverse times Y. It should look very familiar but be a little bit off.

You are used to A transpose A inverse A transpose Y. Youíre used to that formula, and this looks very different. Itís a rearrangement. You move a few things around. It looks perfectly fine. You have to be very, very careful here, because itís very easy to write down things like that and that. What you must do, and Iíll show you my pneumonic. My pneumonic is this. You look at that and you look at that and you quickly do the syntax check. The syntax scan goes like this. If you saw A inverse, your syntax alarm would go off unless you know A is square. But here, A transpose A, thatís square, and therefore at least by syntax can be passed to the inversion function.

Thatís cool. In fact, you can multiply it by A transpose on the right. You can also form AA transpose, and thatís invertible, too, so syntactically, this one is cool, too. Both of these pass the syntax test regardless of the size of A, fat or skinny. Now letís get to the semantics test. Now, if A is fat, then this one is basically a fat times a tall matrix, and the result is you get a little one. Over here, if A is skinny, this is also a fat times a tall, and the result is a little one.

So here is the semantic pass. The semantic pass says if you propose to invert a fat times a tall matrix, unless thereís something else going on like some rank condition, youíre not in trouble yet. If these two reverse, youíre in trouble independent of what the entries of A are, because if you reverse these Ė if A is fat and you go to this formula, that is a square matrix. Thatís fine by syntax. It fails on semantics because A transpose A is going to be square, but it is low rank, and you should not invert low rank matrices. Weíll quit here and continue next week.

[End of Audio]

Duration: 78 minutes