Convex Optimization I - Lecture 11

Instructor (Stephen Boyd):Well, we're starting on our next application topic, which is statistical estimation. Last time we looked at estimation from a simple fitting point of view, approximation and fitting. Now we'll actually look at it from a statistical point of view. Actually, it's interesting, because you end up with kind of the same stuff; you just get a different viewpoint. So let me give the broad setting, which is actually just pure statistics. It's this: you have a probability density that depends on a parameter. That's pure statistics, and I guess our statistics department contingent is center front and will keep me honest, or at least object when needed.

So in pure statistics there are just parameterized probability distributions. We have a parameter X, you get one or more samples from one of these distributions, and your charge is to say something intelligent about which distribution, which is to say which parameter value, generated the sample. So that's statistics. A standard technique is maximum likelihood estimation. In maximum likelihood estimation you do the following. You have an observation Y and you look at the density at Y, or the probability at Y if it's a distribution with atoms.

You look at this and you actually choose the parameter value X that maximizes this quantity. That's the same as maximizing the log-likelihood. We'll see why the log goes in there. The log goes in there for many reasons. One is that if you have independent samples, it splits into a sum. The second is that many distributions are actually log-concave, this time in the parameter. So far we've looked at distributions that are log-concave in the argument. This is log-concave in the parameter, and those are actually slightly different concepts. We'll see this in examples. All right.

So the log of the - I'll just call it a density - the log of the density, as a function now of the parameter, not the argument, is called the log-likelihood function. In this problem it's often the case that the parameters are limited to some set: for example, if some of the parameters are things like variances or covariance matrices, or if they're parameters that have to be positive because they represent rates or intensities. So you can add to this problem either an explicit constraint that X has to be in some subset C, or another way to effect the same thing is simply to define p_X(Y) to be zero when X is outside the domain of p, as a function of the parameter argument.

That, of course, makes the log minus infinity, and that makes it a very, very poor choice if you're maximizing. So that's the same as just making the constraint implicit. Okay. These are very often convex optimization problems. By the way, they are sometimes not; if you do machine learning or statistics you'll get a feel for when they're not. The most obvious and common example is missing data. But we'll see some examples, and we'll focus on the cases where it is convex. So this is often a convex optimization problem. Remember, the variable here is X, which is the parameter in the distribution. Y here is a sample, an observed sample. Okay.
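
A minimal sketch of this setup in Python - not from the lecture, and the exponential model and all numbers are invented for illustration. The point is just that the variable is the parameter, the data are fixed, and the positivity constraint is implicit in the domain of the density.

```python
import numpy as np

# Toy illustration: ML estimation of the rate of an exponential density
# p_lambda(y) = lambda * exp(-lambda * y) for y >= 0. The log-likelihood of
# IID samples is l(lambda) = n*log(lambda) - lambda*sum(y), concave in the
# parameter, with the implicit constraint lambda > 0 (outside the domain the
# likelihood is zero, i.e. the log is -inf).
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=500)      # samples; true rate is 1/2

lam = np.linspace(0.01, 2.0, 2000)            # grid over the feasible set
loglik = len(y) * np.log(lam) - lam * y.sum()
print(lam[np.argmax(loglik)], len(y) / y.sum())   # grid max vs. closed form n/sum(y)
```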

Let's look at some examples. The first is just linear measurements with IID noise. So here is what you have. You have a bunch of measurements: y_i = a_i^T x + v_i. Here X is the set of parameters that you want to estimate, but I guess if you're a statistician you would say the following: the data y_i are generated by a distribution, and the distribution is parameterized by X. That's what it is, and the way it's parameterized is this way. In fact, X only affects, as you can see, the mean of the distribution. It doesn't affect, for example, the variance, but that doesn't have to be the case.

So, I mean, the way a normal person would say this is that X is a set of parameter values that you want to estimate, but this is fine. The conceptual way to say it is that Y is generated from a distribution parameterized by X. Okay. So we'll take the v_i's to be IID measurement noises with density p(z). We're just gonna make these scalar measurements; it's very easy to work out what happens in the case when these are vector measurements. Okay. So the density of y_i, once you fix X - this is the parameter - is clearly the density p evaluated at y_i minus a_i^T x, and they're independent. So the probability density of the whole set of measurements is simply the product of these, because they're independent samples.

So you get this, and that's the density. You take the log of that - that's the log-likelihood function - and you get this thing here. So this is maximized - you know, I think I'm missing a sign or something here. No, I'm not. Everything's fine. Sorry. So you take the log of this, it splits out, and you get the sum of the logs of these probability densities, and your job is to maximize that. Now, remember, the y_i are samples, they're data; X is the variable here in the optimization problem. For this to be a convex problem in X, what you need is for p to be log-concave, because the argument that appears in p is affine in X.

So in this case, because the parameter appears affinely in the argument of p, it's good enough for p to be log-concave for this to be a convex problem. So let's look at some examples. Here's one. If the noise is Gaussian, the density is just this, nothing but that. Then the log-likelihood function looks like this. There's a constant here that has nothing to do with X, and then minus this term. But if you look at that, it's a positive constant times nothing but an L2 residual, right? Nothing else. So, in this case, the maximum likelihood estimate is the least squares solution. So now you know that if you're doing least squares you're actually doing maximum likelihood estimation, assuming the noises are Gaussian.

In fact, it doesn't matter what the variance is - oh, sorry, all the noises have to have the same variance. If you do maximum likelihood estimation and the noises of the different measurements have different variances, what does that translate to?

Student:Weighted.

Instructor (Stephen Boyd):Yes, it's weighted least squares, of course - diagonally weighted least squares. So that's what you do. And you can turn it around: if you're doing diagonally weighted least squares and someone asks you why, you could say, because I trust that measurement more than I trust that one, or something like that. But if a statistician friend stops you and says, what are you doing? You say, I'm doing maximum likelihood estimation with the noises having different variances. So that's what that means. Okay. That's fine. So now you know what least squares is. It has a beautiful statistical interpretation.
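
A minimal sketch of that equivalence (not from the slides; the problem data below are synthetic): with Gaussian noise of per-measurement standard deviation sigma_i, maximizing the log-likelihood is exactly diagonally weighted least squares, with weights 1/sigma_i.

```python
import numpy as np

# Synthetic instance of y_i = a_i^T x + v_i with v_i ~ N(0, sigma_i^2).
rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
sigma = rng.uniform(0.5, 2.0, m)              # different noise level per measurement
y = A @ x_true + sigma * rng.standard_normal(m)

# ML estimate = weighted least squares: scale each row and its measurement
# by 1/sigma_i, then solve an ordinary least squares problem.
w = 1.0 / sigma
x_ml, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
print(np.linalg.norm(x_ml - x_true))
```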

Let's try Laplacian. Laplacian noise looks like this: it's exponential, and you can say a lot of things about it compared to a Gaussian. The most important by far is that the tails are huge compared to a Gaussian. e^(-x^2) is a whole lot smaller than e^(-x) when x is big. So this has huge, huge tails. There are obviously distributions with even huger tails; however, those distributions are not log-concave. So that's the main difference here. You work out what this is, and all you're doing is maximizing the sum of the logs of the probability densities. You get a constant - that's from this thing - and then over here you get a positive constant times the L1 norm. So that's what you're doing.

So if you do L1 estimation, what you're really doing is maximum likelihood estimation with Laplacian noise, and this sort of makes sense, actually. You know what L1 estimation does: it allows you to have some large outliers, where L2 estimation would never allow that. In return, it will actually drive a lot of the smaller residuals to zero or to small numbers. And now you can actually justify it statistically.

You can say, well, you're really assuming that the noise, in this case, is more erratic. You're saying that, in fact, V does not fall off like a Gaussian. If something falls off like a Gaussian, then being six sigma out is a very, very unlikely event, and you will change X considerably to avoid a point where your estimate implies that one of these things is six sigma out. Okay? If this is Laplacian, you are much more relaxed about that. It's a rare event, but it is not, for all practical purposes, an impossible event. And you'll actually allow a large outlier there.
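
A minimal sketch of the Laplacian case (synthetic data; CVXPY is just one convenient way to pose it): the ML estimate minimizes the l1 norm of the residual, and a few gross outliers barely move it.

```python
import cvxpy as cp
import numpy as np

# Same linear measurement model, but with Laplacian noise plus a few outliers.
rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + rng.laplace(scale=0.5, size=m)
y[::20] += 10.0                               # occasional huge errors (fat tails)

x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(y - A @ x))).solve()   # ML under Laplacian noise
print(np.linalg.norm(x.value - x_true))
```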

So it all, sort of, makes perfect sense statistically. As another example, suppose the noise is uniform. That's interesting because it basically says this noise cannot be bigger than a in absolute value. Absolutely cannot be. On the other hand, between minus a and a you have absolutely no idea. You're completely uncommitted as to what its value is. It's centered at zero, but it's not concentrated at zero - it's not more likely to be small than it is to be large. It's completely uniform in there. Sure enough, the log-likelihood function is this: it's minus infinity unless all of these measurements are consistent to within a, which they have to be.

But then it's just a constant. So here the log-likelihood function actually takes on only two values: a constant and minus infinity. So any point inside this polyhedron is a maximum likelihood estimate. Okay. So that's what happens here. And, of course, it's not unique and so on, and if you were to add to the uniform noise even the slightest tilt towards the center you might get something else. So what this does is it allows you to talk statistically about your fitting, about your penalty function. So, for example, let's do one.

Suppose I'm doing fitting and I'm using a penalty function that looks like this. Okay. It means that if your residual is within plus or minus one, I don't care at all. I'm not even interested. As long as its absolute value is less than one, it's fine. Then it grows linearly. And I want to know: what's the statistical interpretation? This is maximum likelihood estimation - with what noise density? What's the imputed noise density here? What would it be? This corresponds exactly to maximum likelihood estimation. What is it?

Student:It's uniformly [inaudible] and the [inaudible].

Instructor (Stephen Boyd):Exactly. Yeah. So if you're doing a dead-zone linear penalty, you could defend it on very simple grounds and say, well, look, I can't even measure it that accurately, so if the residual is between plus and minus one I don't even care. That's as good as my measurements are anyway. And then you say, well, what about linear? Why not quadratic out here? Linear would say something like, well, sometimes there are some pretty large residuals or pretty large errors, and that's why I have this linear and not quadratic. Okay? So that would be it.

The statistical interpretation is exactly this. You're doing maximum likelihood estimation of Y = AX + V, where the v_i's are IID with a distribution that looks like this. That's supposed to be flat, by the way. So it's a uniform distribution between plus and minus one, and then it falls off - it has exponential tails outside. So that's what you're doing statistically. Makes perfect sense. Here you'd adjust everything to have integral one, of course, but that makes no difference, in fact; it's just that that's the shape of what you have. By the way, this is log-concave, because the negative log of this is that. In fact, that's how this came about: by flipping that and taking the exponential. Okay. So that's how that works.
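
A minimal sketch of the dead-zone linear fit just described (synthetic data; the half-width of 1 and the CVXPY formulation are the only ingredients): the penalty is phi(r) = max(0, |r| - 1), which is the ML objective for the flat-middle, exponential-tail density.

```python
import cvxpy as cp
import numpy as np

# Dead-zone linear penalty: zero for |r| <= 1, grows linearly beyond that.
rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + rng.uniform(-1.0, 1.0, m)    # noise we mostly "don't care" about

x = cp.Variable(n)
residual = y - A @ x
penalty = cp.sum(cp.pos(cp.abs(residual) - 1))   # dead zone of half-width 1
cp.Problem(cp.Minimize(penalty)).solve()
```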

So that's all you do. Okay. There's more on that, but that's something you can do in homework and reading to see how it works. And this is a very simple case. You can actually get to the case where you're estimating things like covariance matrices or other parameters, and it's actually more interesting, but that's the basic idea. Okay. Let's look at another example of something different, where the noises are not additive at all - they enter in a complex, nonlinear way. We'll look at logistic regression as sort of a canonical example of other classes of maximum likelihood estimation problems that are convex. So let's look at that.

I have a random variable that's in {0, 1} and its distribution is the following. It has a logistic distribution, so the probability that Y is one is exp(a^T u + b) / (1 + exp(a^T u + b)) - e to a number divided by one plus e to the number. So that looks like this, I guess, if you plot this function, like that, so that's zero. And let's actually take a look at this. For now just treat a^T u + b as a number. If this number is, let's say, zero, that's sort of the equiprobable point: it says that the probability that Y is one is a half. If a^T u + b is like two or three, then Y is very likely, but not perfectly certain, to be one.

If this number is, say, minus two or three, then Y is very likely to be zero, but not quite certain. So the transition zone is where this number is, let's say, between plus and minus one - you can choose that number some other way. But roughly speaking, when this number is zero, that's the equiprobable point. When this number is between plus and minus a half, or plus and minus one - you can throw in your favorite number there - the probability is somewhere between something like 10 percent and 90 percent. You can work it all out. When this number is three or four, Y is overwhelmingly probable to be one, and if it's minus three or four, it's overwhelmingly probable to be zero. So that's, sort of, the picture. Okay.

Now, I guess to the statistician here, A and B are parameters. Those are the parameters of the distribution, and our job is gonna be to estimate A and B. Now, you can't estimate A and B if I just give you one sample from this, so you're gonna be given multiple samples. In other words, I'll give you a bunch of samples, and for each sample I'm gonna give you U. The U's here are often called explanatory variables. They would be other things that you measure and associate with the sample.

So this would be, for example - I don't know - if you were admitted to a hospital, things like blood pressure, weight, blood gas concentrations, all things like this. And this could maybe be whether you died or something like that, right? So that would be that - yeah, hopefully things will end up over here. All right. So these are the explanatory variables, and what you want to estimate are the parameters A and B that parameterize this distribution. Okay. So we're given a bunch of samples, and a sample looks like this: it's a pair (u_i, y_i). These are people who checked in, and this is what happened to them. In this case, zero being the good outcome - well, we have to decide that; we'll make zero the good outcome.

So you get past data, like from last year. You get a couple hundred of these or something like that, and let's even just talk about what you do with this. We're gonna fit A and B - we're gonna do maximum likelihood estimation of the parameters A and B. By the way, once we have them we can do interesting things, because someone can arrive tomorrow at the hospital, we can evaluate their U, evaluate this, and, for example, if this turns out to be plus three, that's not good for them. If it's minus three, I guess we can reassure them or something like that. Okay. All right.

So this is the idea. Now, how do you do this? If you want to work out the log-likelihood: the likelihood function is the product of these things, because we're assuming these are independent samples from a logistic distribution. So you take the product of these, which is this, and what we've done is simply reorder the samples so that the first k were the ones, I guess, who died in this case, and the remaining m minus k are the ones who didn't. So you just reorder them. And, in that case, you just multiply all this out. You can see that in the denominator you always get this thing - that's for all of them.

And in the numerator you take this, and you take the log of this, and this thing is the log of a product of exponentials. That comes out here as this affine term, and this one over here is minus log-sum-exp. Now, by the way, why did I call this log-sum-exp? I'm going fast now, but does someone want to justify it? How did I know immediately that this thing was convex?

Student:A log-sum-exp of the two.

Instructor (Stephen Boyd):Yeah. But there's a one there. That one is exp of zero. This is the log of e^0 plus e^(a^T u_i + b), like that. Okay. So it's log-sum-exp - the two-term log-sum-exp evaluated at zero and a^T u_i + b. This is convex, its argument is affine, so the whole thing is convex. With a minus sign it's concave. So this is concave. Okay. So that's the picture. By the way, if you want, you can actually draw this function as a function of a^T u_i + b - if I put in a negative sign here, which I haven't done yet, it would be called the logistic loss function. And it would be something like this. I wonder if I can get it right. Let's see. If this is small, then that is about zero, so I think it looks like that at zero.

I may have gotten that right, but if it doesn't look like this, then flip it over, and if that doesn't work, flip it this way, and if that doesn't work, flip it this way again. So in one of those orientations, with some number of minus signs and so on - when people do logistic loss, they refer to it as this. It's a function that looks like that, and it gets approximately linear out here. Anyway, this is something to be maximized. But is this covered in - I know a bunch of you have taken 229, CS229. Is this there? Come on. Is logistic regression covered in it?

Student:It is, yeah. Yes.

Instructor (Stephen Boyd):It is. Okay. All right. So here's just an example - a super simple example. As with most examples that you can show on a slide, it's an example where you definitely did not need any advanced theory to figure out how these things work. If, in fact, there's only one explanatory variable, which is U here, and then a bunch of outcomes, then you definitely don't need anything advanced - just look at the data with your eye. So that's not the point of this. I'll give you some examples of where logistic regression really works, where we're not talking about what you can do with your eye.

So here's a whole bunch of data - 50 measurements. Each measurement consists of a value of U and then a one or a zero. That's all it is. This could be anything. And you can see roughly the following: when U is larger, you get more ones - you can see that because there's sort of a concentration up here and fewer down here - and if U is smaller, you're more likely to get zero. But you can see there are lots of exceptions. Here's somebody with a very low U that turned out to be one, and somebody with a high U that turned out to be zero. Okay.

Oh, I should mention one thing. What this shows is A and B having been estimated. What do A and B do in the logistic distribution? B controls the point where this is neutral, and A controls the spread - the width of the transition zone. So A controls how stretched out this thing is and B controls, sort of, where it is. And you can see that it is lined up at about what would be a very good decision point if you were forced to make a hard classifier here. It's lined up quite well with something that approximately separates - not perfectly, but approximately - these cases from these.

And the spread here tells you roughly how much indecision there is. Okay. Let me ask you a couple of questions. What do you imagine would happen if the data had looked like this - what if there were none there and none there? So basically there's this data and there's this data. What do you think the logistic maximum likelihood estimate is gonna be? Just draw a picture. What's it gonna look like? Like this? No. It's gonna be much sharper. In fact, it's gonna be infinitely sharp. It's gonna basically look like this.

It's gonna be a perfect step like that. Right? And the truth is, this thing can actually move left or right and it makes absolutely no difference. The point is that the data here is actually linearly separable - there's a perfect classifier. That corresponds exactly to the logistic regression problem being unbounded; if you're doing maximization, it's unbounded above. And unboundedness above of the log-likelihood means you don't get an estimate: it basically means you can make the log-likelihood function as large as you want. Okay. And that corresponds to this perfect separation like this. Okay.

You have to check me on this. So what would be uses for this? I am not quite sure, but I believe some of the best spam filters are done using logistic regression. You hear both things: something called the support vector machine, which you'll see very shortly, and logistic regression. So my guess is it's both. That's how that works. There, of course, the dimensions are very different - they're not one, right? You've got thousands and thousands of features, and you're estimating over very, very large data sets. Okay. This makes sense?
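
A minimal sketch of fitting the one-dimensional example by maximum likelihood (synthetic stand-in data; cp.logistic(z) is CVXPY's log(1 + e^z)): the concave log-likelihood from the previous slide, maximized over (a, b).

```python
import cvxpy as cp
import numpy as np

# Synthetic (u_i, y_i) pairs in the spirit of the 50-point example.
rng = np.random.default_rng(0)
m = 50
u = rng.uniform(0.0, 10.0, m)
p1 = 1.0 / (1.0 + np.exp(-(1.0 * u - 5.0)))   # "true" a = 1, b = -5
y = (rng.uniform(size=m) < p1).astype(float)

a, b = cp.Variable(), cp.Variable()
z = a * u + b
# Log-likelihood: sum over i with y_i = 1 of z_i, minus sum_i log(1 + exp(z_i)).
loglik = y @ z - cp.sum(cp.logistic(z))
cp.Problem(cp.Maximize(loglik)).solve()
print(a.value, b.value)
# Note: if the data were linearly separable, this problem would be unbounded
# above, exactly as discussed - the fitted transition becomes a perfect step.
```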

So this is an example where it's not remotely linear - the parameters enter in a very, very complicated way. If you like to think of this as a measurement or an outcome, it works this way. By the way, I should mention to those of you in areas like signal processing: you might think, well, this doesn't have anything to do with me. Actually, I beg to differ. This is a one-bit measurement - well, with a particular distribution. You change the distribution, it's something else. If you do signal processing with low bit rate measurements, including, for example, single-bit measurements, you will get exactly problems like this. Not quite this one: if it's Gaussian and you take the sign, you're actually gonna get what's called probit regression, and you get a different function here, but everything else is the same.

It's kind of the same. Okay. So that was our whirlwind tour of a more complex maximum likelihood estimation problem. And with this one I can actually mention some of the canonical cases where you get non-convex problems - you probably look at this in machine learning and other applied classes. What would happen is you get a bunch of data samples like this, but for some of the samples some components of the u_i's would be missing. That's the missing data problem. And there the resulting problem is not convex. Okay.

There are some very good heuristics that you can imagine, like assuming a value of U, carrying out the regression, then going back, plugging in the most likely value - estimating U and alternating between these two - and these are all schemes that are used, and so on. Okay. Now we're gonna look at hypothesis testing. So how does hypothesis testing work? It works like this, and for this we go back to the simplest possible case. You can get more complicated cases, like multiple hypotheses - not just two, but many. Two-hypothesis testing is actually quite simple, so we'll go to the simplest one.

This is where you have two, and we'll just take a random variable on the letters one through n - a very, very simple random variable. And what's gonna happen is a sample, which is literally just an integer between one and n, was either generated by the distribution P or the distribution Q. These are non-negative numbers that add up to one; we can represent P and Q as vectors, actually, because we have a finite set here. So that's the question. I give you a sample, or a couple of samples, and your job is to estimate: was it this distribution or this one?

Now, of course, this could be very easy. For example, if X turns out to be one - and this is kind of intuitive - and P1 is .9 and Q1 is .002, right? Then it's just intuitively clear. It's a very good guess - by the way, not a perfect guess, but a very good guess - that, in fact, X came from the distribution P and not Q. Okay. So it's not that unobvious what to do here, as in these other things. The question is how exactly you estimate, and how you make a better estimator than an intuitive one. So we'll look at the idea of a randomized detector. A randomized detector looks like this.

It's a two by n matrix, and what it means is this. Each column is actually a probability distribution on the two outcomes, hypothesis one and hypothesis two. And it says that if this is the outcome that occurs, you should guess one with this probability or guess two with this probability. So that's the idea. Now, of course, often these things look like that - something like that. What this means is: if this is what you observe for X, you simply guess it came from one; if this is what is observed, you guess it came from hypothesis two, or Q. Okay. So if these are zero-one entries, it's a deterministic detector.

And a deterministic detector is simple. It just means basically you have partitioned the outcome space into two groups: if the outcome is in one set, you declare that you guess it's hypothesis one; if it's in the complement, you say it's hypothesis two. That, sort of, makes sense. However, you can have something weird like this. That means that if this is the outcome observed, you go off and toss a coin, and with 80 percent probability you guess that it was hypothesis one, and with 20 percent probability you guess it's hypothesis two. Now, by the way, this is sort of like randomized algorithms in computer science.

This just looks weird: why on earth, if you're trying to make an intelligent statement or guess about what happened, would gambling on your own help anything? I mean, why on earth would you say, well, tell me what happened, then go, excuse me for a minute, and get out a coin and flip it? There's something that doesn't seem to make sense. Which is kind of like a randomized algorithm, right? Someone says, here's my problem instance - I'm talking computer science - and you say, how do you solve it? And you go, hang on just a minute, and you get out the coin and you start flipping it, and they say, look, what are you doing? And you say, no, I'm gonna try to solve your problem. They say, my problem has no probability in it.

I gave you a problem instance, please give me the answer. You go, hey, back off, come on, and you keep flipping the coin. You know what I'm saying - after you get used to this intellectually it's okay, but at first it looks very odd. So the idea of a randomized detector is also weird, because you say, hey, outcome three happened, and you go, yeah, sure, that came from hypothesis one. Then the next day you say outcome three happened, and you go, yep, that was hypothesis two. So this looks very inconsistent and strange. Actually, we're gonna soon see what a randomized detector can do for you. I mean, it's still weird, I guess. It's like physics: some day you have to come to some equilibrium, for example, with quantum mechanics and things like that. It doesn't make any sense, and it's the same with these. Or not. But then you just know it's an unresolved intellectual issue. Okay.

So this is what a detector matrix is; it describes a randomized detector, and if you're uncomfortable with that you can make it a zero-one matrix - of course, there's a single one in each column - and this becomes an encoding of a partition of the outcome space. Okay. Now, if I simply multiply this matrix T by [P Q] - if I take T and multiply by P and Q lined up like this - then I get four probabilities here, and they're actually quite interesting. So I'll tell you what they are. I guess we should start with the one-one entry.

The one-one entry is the probability that you guess hypothesis one when, in fact, the sample is generated from distribution one. So this is an entry that you want near one. Actually, if it were one, it would mean that you would be absolutely infallible on those samples. Let's go down and look at this one. This entry is the probability that you guess hypothesis two when X is generated by distribution one. Okay.

So that's a false positive - we're taking, I guess, hypothesis two to be positive. Now, this one is the two-two entry, and these two add up to one; they're conditional probabilities, in fact. This entry is the probability that, when the sample was in fact generated by distribution two, you correctly guessed distribution two. Again, that's something you want near one. And the ones you want small are these off-diagonals: the probability of false positive and the probability of false negative. So, in fact, what you really have here is a bi-criterion problem, because, of course, you'd really like this matrix to be the identity matrix.

That means you're absolutely infallible: whenever the sample comes from distribution one you correctly say distribution one, and so on. You make no mistakes of any kind - no false positives, no false negatives. But, in general, you have a tradeoff between these two error probabilities. And I guess there are lots of names for this tradeoff. I actually don't know what it's called in statistics. I know what it's called in signal processing: it's called the ROC, the receiver operating characteristic, which goes back to, like, World War II or something like that. What's it called in statistics?

Student:[Inaudible]

Instructor (Stephen Boyd):Really. Maybe it came from you guys. Who knows. It doesn't sound like it to me, but - okay. All right. So that's the tradeoff you want, and we can actually talk about some of the extreme points. Let's see. How small could I make the probability of a false positive, and how might I do that?

Student:Zero.

Instructor (Stephen Boyd):I can make it zero. How?

Student:Always guess one.

Instructor (Stephen Boyd):Yeah. You just guess one no matter what happens. Then, in that case, you will never be accused of having guessed two when, in fact, one generated the sample. By the way, that scheme also has another huge advantage, which is great simplicity: you don't even have to listen to what happened to guess where it came from. And, obviously, I could make the false negative probability zero as well, by always guessing two. Okay.

So this is correctly viewed as a bi-criterion problem. Oh, and by the way, what happens if there are multiple hypotheses, like 12? Let's talk about it. Suppose there are 12, because this is not exactly complicated stuff - we're just multiplying the matrix out. What happens in the case of 12 hypotheses? Wow, this would be a good homework problem or something like that. So what happens?

Student:[Inaudible]

Instructor (Stephen Boyd):You have a 12 by 12 matrix. And how many of these bad guys off the diagonal do you have?

Student:144 minus 12, so 132.

Instructor (Stephen Boyd):Yeah. But some of them aren't - oh, okay. That's fine. Yeah. Okay. So roughly 70 or something like that, right? So we've got a bunch of them, right? And now you have a big tradeoff, where you might have different concerns about guessing hypothesis three when, in fact, the truth came from, whatever, hypothesis two. That might be one you really care about making low, and you can imagine it would get interesting. But we're just looking at the simple case. Okay.

So you end up with this. It's a bi-criterion problem: you want to minimize these two things subject to - this just says the column sums of T are one, obviously, and this says the entries are non-negative. By the way, notice that everything here is linear, so this is a bi-criterion LP. Actually, it's a trivial bi-criterion LP. So let's scalarize it: we put in a weight lambda. If you put in a weight lambda and add these up, that's a linear function of the entries of T. It's extremely straightforward, subject to that. So this is one of those trivial linear programs that you can just directly solve.

It's kind of obvious. In fact, you can optimize each column of T completely separately, because the objective is linear and separable in the columns of T - it's a sum of contributions from each column. The constraints on T are simply that each column is in the unit simplex, and I think you just did a problem on that, or you did one a while ago. Anyway, it's extremely simple to choose: if p_k is bigger than lambda times q_k, you choose the first hypothesis; the other way around, you choose the second. So you end up with a deterministic detector, and this is called a likelihood ratio test, because what you really do is you look at this ratio, compare it to a threshold, and then guess which distribution it came from.
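
A minimal sketch of the scalarized solution (the distributions below are invented): for each outcome k you just compare p_k against lambda times q_k, which gives a deterministic likelihood ratio test, and T[P Q] then gives the 2x2 matrix of detection probabilities.

```python
import numpy as np

# Two invented distributions on three outcomes.
p = np.array([0.70, 0.20, 0.10])
q = np.array([0.10, 0.30, 0.60])
lam = 1.0                                     # lambda = 1: minimize P(error)

guess_one = p >= lam * q                      # per-outcome likelihood ratio test
T = np.vstack([guess_one, ~guess_one]).astype(float)   # deterministic detector

D = T @ np.column_stack([p, q])               # 2x2 detection probability matrix
print("P(false positive) =", D[1, 0])         # guess 2 when the truth is 1
print("P(false negative) =", D[0, 1])         # guess 1 when the truth is 2
```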

By the way, this extends to continuous distributions and things like that, and it's very old - it's from easily the '20s or something like that. Maybe even earlier, right? This is -

Student:I don't know. I think that's about right.

Instructor (Stephen Boyd):The '20s? Okay. We'll just say the '20s. So now you have a nice mathematical way to go back to the simple case. How do you guess which one it is? Oh, by the way, if lambda is one, then you simply choose whichever one is the more likely. A maximum likelihood test would simply choose whichever had the maximum likelihood - that corresponds exactly to lambda equals one. So what's the meaning in the bi-criterion problem of lambda equals one? It has a very specific meaning. What does it minimize? It minimizes something involving false positives and false negatives. What is it?

Student:[Inaudible] of error.

Instructor (Stephen Boyd):The sum? Yeah. If lambda is one here it means, actually, that you're minimizing the sum of this and this. But the sum of this and this has an interpretation: it's called being wrong. Right? It's called an error. So you're minimizing the probability of error. That corresponds exactly to lambda equals one. By the way, this ties back to maximum likelihood - this is, sort of, the justification for maximum likelihood. You could argue, and you'd be correct, that the maximum likelihood detector minimizes the probability of being wrong when you're absolutely neutral: a false positive has the same weight as a false negative. You have a question?

Student:Yeah. For this do we have to assume that there's kind of an equal likelihood of it being distribution P and distribution Q?

Instructor (Stephen Boyd):Nope.

Student:Because it's -

Instructor (Stephen Boyd):We're doing statistics.

Student:Okay. But, I mean -

Instructor (Stephen Boyd):You're not even allowed to say that in statistics.

Student:Okay.

Instructor (Stephen Boyd):I mean, if there are -

Student:If you're Bayesian.

Instructor (Stephen Boyd):Yeah. If you're Bayesian - no, no, we're not doing that right now. If you say that in front of statisticians, be careful; it could be physically dangerous. If there are some Bayesians around in the room with you, and they're large, you're safe. Go ahead and say it, but watch out.

Student:Okay. So, I guess, it seems then that if we get a data point, and we look at P and Q, and there's a very small probability that that data point came from distribution P and a very large probability that it came from Q, but if we know that 90 percent of the time -

Instructor (Stephen Boyd):By the way, if you're already using priors, you've identified yourself as a Bayesian, and you'd already be in big trouble.

Student:Okay. Well, I don't -

Instructor (Stephen Boyd):That's okay. I'll tell you in a minute. Okay. You're safe here.

Student:Okay.

Instructor (Stephen Boyd):Don't try this out in certain parts of Sequoia. Okay. Don't try it. But go on.

Student:Okay. But, anyway, if you know that 90 percent of the time it's distribution P, then even though there's a small probability that your outcome came from distribution P, the fact that distribution P is much more likely should probably influence your choice.

Instructor (Stephen Boyd):Absolutely. So, actually, this is a very good point to say this. We're doing statistics. In statistics there is no prior distribution. You can't say something like, this one is more likely, or, what's the likelihood of it being this or that? You're just neutral. Okay. By the way, I'm neutral on this - I'm not partisan in any way. We have to be careful because a lot of people in this room are in machine learning too, and I don't know where their sympathies lie. A pure statistician would never - you're not even allowed to ask which one is more probable. The minute you say that, you're doing Bayesian estimation.

Now, I should say this. If you do Bayesian estimation, then instead of maximum likelihood you do something called MAP, which is maximum a posteriori probability. And it's, of course, quite sensible if you have some ideas ahead of time about which of these things is more likely. By the way, you could redo all of this, and it's very, very easy in that case. If you say, well, it could come from these two distributions, but, in fact, with 30 percent chance it comes from P and 70 from Q, you could work out everything that would happen here, and you'd want something like the MAP detector. Then you could talk about the actual probability of being wrong. Things like that.

Now, the good news is this. By the way, we could have done that earlier as well, in maximum likelihood estimation - for example, here we could have done MAP. All that happens is very cool: you add an extra term. The log-likelihood function turns into the log of a conditional density, then you add the log of a prior density, and you get regularization terms here. So that's what would happen. But you need to know, just socially speaking, that you're treading on dangerous ground - just saying things like that can get you in big trouble. What's that?

Student:That wasn't a good thing to start.

Instructor (Stephen Boyd):No, I'm neutral. I'm totally neutral, but you can say that in the wrong place and be very sorry. So the good news is that from the convex optimization point of view they all lead to the same kind of problem. So the way that would happen is, if I had some Bayesians coming in from one direction - I had Daphne Koller coming in on one side - and who would be the most extreme one in statistics? Who?

Student:No, I -

Instructor (Stephen Boyd):Whoever - you're not gonna name any names.

Student:People are pretty flexible now.

Instructor (Stephen Boyd):They're pretty flexible, yeah. So that was the right answer. And suppose I had a statistical fundamentalist coming in - a group of them coming in on the left. For us it's very simple. You just add a term here, which we would just call regularization. And they'd say, that's not regularization, that's the log of a prior. And I'd say, no, no, this is just regularized maximum likelihood or whatever. Thanks for bringing that up, because I just wanted to - I mean, I think if you read the book it's neutral and, I believe, honest about what it says. It just says: if you choose to believe that you're getting a sample from a parameterized distribution and your job is to estimate the parameter, that's a statistician. That's statistics. You can do that.

Student:Bayesians are also statisticians.

Instructor (Stephen Boyd):Now you know who I hang out with and what they are. Okay, sorry.

Student:It's frequentists - that's the -

Instructor (Stephen Boyd):Frequentists, okay. So I'm not totally up on the details of these schisms, but I do know that there have been a lot of wars about it, and a lot of bloodshed actually, over these issues. But for us they're all just convex problems, and they're just innocent little extra terms that end up in here. So that was a very long answer to your question. But it's good to know, just as a matter of safety: asking that innocent question in the wrong place could get you in deep trouble. Not really - well, actually, maybe it could. We won't go into it. All right. So let's look at some examples now. Oh, we have our interpretation of maximum likelihood.

Maximum likelihood says: if outcome three has occurred, you scan the two probability distributions, you go to outcome three, and you see which has the higher likelihood. That's the maximum likelihood detector, and it corresponds exactly to lambda equals one. Lambda equals one corresponds precisely to treating false positives and false negatives equally. That will minimize the sum of the two, which is something like the probability of being wrong. Okay. Now, you could also do a minimax detector.

You could say, well, I want to minimize the maximum of the two error probabilities. Now, this one, of course, translates to an LP immediately, but this one, generally speaking - in fact, in the finite case essentially always - has a non-deterministic solution, so let's see what that is. Here's an example with these two distributions, and it's kind of - look, anybody can figure out what to do here. If it's outcome one you should probably guess it came from P, and if it's outcome three you should probably guess it came from Q. Everyone's gonna agree on that. Okay.

And there's actually not much else here to do, right? Because for the other outcome - obviously, if it's outcome two, you should kind of lean towards saying it was Q, but you should equivocate or something like that. Okay. All right. So here's the ROC, or receiver operating characteristic, although it's generally drawn this way - I don't know why, but it often is. So here are the two: there's the probability of false positive and of false negative. We've already talked about these things. This point is the detector that simply - sorry, yeah: you can have zero false negative probability, which means you just always announce it's positive.

Here's the tradeoff curve, and of course it's piecewise linear. That's obvious, because you're minimizing a piecewise linear function over some linear inequality constraints. So this region is polyhedral, and the vertices correspond to different things - in fact, the vertices correspond to the thresholds in lambda. Lambda is the slope here, and as you vary lambda to get the tradeoff curve, it's kind of boring: if lambda is smaller than this, you get that point. As you increase lambda you click over to this point, and so on. These are the deterministic detectors: one, two, three, and four. As you increase lambda like this, it's always the same.

Right at this point you switch over to a different threshold. Then you keep going like this, and eventually lambda gets so high that the safest thing to do is simply to guess that nothing ever happened, and that way you have zero false positive probability, always. So there are not many points on the tradeoff curve. But the minimax detector is very simple: it's the maximum of the two, and the level curves of the maximum - this dashed line here is the line of equal false positive and false negative probabilities - and you can see very clearly that the solution is right in between these two vertices, and it's not a deterministic detector.

So if you want a detector here that minimizes the maximum probability of making either a false positive or a false negative, then you're gonna have to use a randomized detector. Okay. Now, this is all kind of stupid-looking in cases like this. That's fine. But if you want to make it non-trivial, it's very, very simple: just imagine something with 12 outcomes and probability distributions on a thousand points. Now you want to decide on a detector, and you start throwing in things like how much you care about guessing it's No. 7 when, in fact, the truth was No. 3, and you start throwing in how much you care about all these things, and all of a sudden it's not obvious exactly how to do this.
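
A minimal sketch of the minimax detector as an LP in CVXPY (distributions invented, in the spirit of the example): minimize the larger of the two error probabilities over all randomized detectors T.

```python
import cvxpy as cp
import numpy as np

p = np.array([0.70, 0.20, 0.10])
q = np.array([0.10, 0.30, 0.60])

T = cp.Variable((2, len(p)), nonneg=True)     # randomized detector matrix
pfp = T[1, :] @ p                             # P(guess 2 | truth 1): false positive
pfn = T[0, :] @ q                             # P(guess 1 | truth 2): false negative
prob = cp.Problem(cp.Minimize(cp.maximum(pfp, pfn)),
                  [cp.sum(T, axis=0) == 1])   # each column is a distribution
prob.solve()
print(T.value)    # typically one fractional column: the solution is genuinely randomized
```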

Then you solve an LP - actually, it's always going to be an LP - and you'll get a detector that will soundly beat, according to whatever your criterion is, anything else. Okay. So that's the picture. There's a lot more you can do with detectors here, and there are a couple more topics in the book, which you will read about. Okay. Now we're gonna look at our last topic in this whirlwind tour: experiment design. This is quite useful, very useful, and I think knowledge of experiment design is not that well diffused in some fields.

Even in areas where people actually often do experiments and construct models from data and things like that. Okay. So experiment design goes like this, and we'll talk about how it works. We'll do the simplest case, which is linear measurements like this, and we'll just say that the noises are IID N(0, 1). You can change all of this, but let's just do that. Well, the maximum likelihood or least squares estimate is just this. And the error, X hat minus X, is zero mean and has a covariance matrix, which is this thing. So that's the covariance.

By the way, here the noises all have power one, so the norm of a_i is something like a signal-to-noise ratio - now I'm talking to those of you in signal processing or communications. That's because I made the noises have power one. If the norm of a_i is large, that's a very good measurement: it's a high quality, high signal-to-noise-ratio measurement. If the norm of a_i is zero, that's an utterly useless measurement, because it's just a sample of noise and has nothing to do with X. Okay. So, to get a rough idea: a larger a_i is larger signal-to-noise. The norm of a_i is literally something like the signal-to-noise ratio, because the noise power is one. Okay.

All right. And I'm sort of assuming the norm of X is on the order of one; with that assumption, this just scales the signal-to-noise. It's still the case that if you double a_i, it's sort of twice as good from a noise point of view - standard deviation, anyway. Okay. Now, this is the error covariance, and it's the inverse of a sum of dyads. That's very interesting. We can say a lot of interesting things about this, but the first is: if that matrix were singular, that would not be good. What that means, if the matrix is singular, is that you basically haven't taken enough measurements - measurements whose vectors span the whole space. Okay. So that would be the case there.

So if you've taken enough measurements that the measurement vectors span R^n, then this thing is nonsingular, you take the inverse, and, of course, that's the error covariance. And now - I'm gonna be very rough here - you want this matrix to be small, and we'll talk about what small can mean; it can mean lots of things. Roughly speaking, the first thing you want is for what's inside, since it gets inverted, to be big. That corresponds perfectly: if the a_i's are big, then you have high signal-to-noise ratio, and that makes the error covariance small.

Now, the interesting thing is it's not a scalar inverse, it's a matrix inverse, and that's very interesting, because it means the error covariance doesn't depend just on the individual a_i's - it's actually how they all go together. Okay. Oh, there's one exception: if the a_i's are mutually orthogonal, then you can change coordinates, do the inverse directly, and they don't interact in any way whatsoever. But, in general, what this says is the following: you take a bunch of measurements, characterized by a_1, a_2, a_3, and so on, and then you calculate the error covariance, and the error covariance depends on the geometry of the whole set of these things. I mean, this is kind of obvious, but that's the idea.

And, for example, if you want to make a confidence ellipsoid, it would depend on this covariance. Okay. So experiment design is this. I give you a palette of possible experiments you can carry out - we'll call those v_1 through v_p - and your job is to choose some number of experiments that will make this error covariance matrix as small as possible. The simplest version would be: I give you 50 possible measurements, and I say you may choose 1,000 measurements from these 50. And you want that choice of a thousand from those 50 to be maximally informative, because what's gonna happen is you're gonna take all those measurements together, blend them, do least squares, and get all the good blending that least squares is gonna do.

Together they're gonna give you an error covariance like that. Let's talk about some choices. Suppose someone said, well, v_1 has the highest signal-to-noise ratio - ten, where all the others have signal-to-noise ratio one - so I'm gonna choose all 1,000 measurements to be v_1. Any comments on that? Is it a good choice? Why would you not always choose the first measurement? It's ten times cleaner than any of the other measurements. What's the problem?

Student:This is not an [inaudible].

Instructor (Stephen Boyd):Yeah. You do unbelievably badly here, because if all the a_i's are equal to v_1, this matrix is rank one, and it is by a long shot not invertible. Everybody see that? So the point is: you can't do a greedy thing and just choose high quality measurements. It's actually something where you have to choose all of them together. Does this make sense? And actually, if you're confused, it's probably because this is actually quite trivial and I'm just making it sound complicated. But I do want to point this out. Okay.
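
A tiny numeric check of that point (all numbers invented): put the whole budget on the single high-SNR vector and the sum of dyads is rank one, so it has no inverse; spread even a few measurements across other directions and the error covariance is finite.

```python
import numpy as np

n = 3
v1 = 10.0 * np.ones(n)                        # the "SNR 10" measurement
e = np.eye(n)                                 # three "SNR 1" measurements

G_greedy = 1000 * np.outer(v1, v1)            # all 1,000 picks are v1
G_mixed = 994 * np.outer(v1, v1) + sum(2 * np.outer(e[i], e[i]) for i in range(n))

print(np.linalg.matrix_rank(G_greedy))        # 1: singular, no error covariance
print(np.trace(np.linalg.inv(G_mixed)))       # finite total error variance
```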

So you can write it this way. It's a vector optimization problem over the positive semidefinite cone, and here's what it says. Obviously, it doesn't matter what order you choose the measurements in - that's clear, because all you're gonna do is sum these dyads. So it doesn't matter which measurement I do first. What matters, if I have a thousand measurements to take from a palette of 50 choices, is how many measurements of type one I take, how many of type two, and so on. We're gonna call those m_k: m_k is the number of measurements of choice k I make. And so I get this problem here. Now, the m_k's are integers, they have to be non-negative, and they add up to my budget. By the way, this is just the simplest version; I can also put a cost on each measurement.

I can not only put a palette of measurements in front of you, I can also put a price tag on every measurement, and you can have a budget in money or time - then this budget constraint would be something different. I can add other constraints to this if I want: time, money, all sorts of others. This is just the simplest case, where all the measurements cost the same and you're allowed to choose [inaudible]. Okay. Now, if you just say experiment design, this is what experiment design is - it's this problem right here. And this problem in general is hard to solve, but there are some regimes where it's relatively easy.

Actually, the feasible set is essentially the set of partitions of M. By the way, if M is really small, like three or four, then this is pretty easy to solve. The other extreme is when M is very large, because then what you do is rewrite it this way: you let lambda_k be m_k over M, so that's the fraction of the total budget of M experiments for which you take measurement type k. Okay. Now, I haven't changed anything here. Actually, the truth would be something like this: the real problem would have M times lambda_k an integer, like that. Okay.
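
A minimal sketch of the relaxed problem in CVXPY (the palette is random, and the D-optimal scalarization - maximizing log det of the sum of dyads - is an assumption here, since the lecture hasn't fixed a scalarization):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 10
V = rng.standard_normal((p, n))               # palette of p candidate measurements

lam = cp.Variable(p, nonneg=True)             # fraction of the budget per experiment
G = sum(lam[k] * np.outer(V[k], V[k]) for k in range(p))
cp.Problem(cp.Maximize(cp.log_det(G)), [cp.sum(lam) == 1]).solve()
print(np.round(lam.value, 3))                 # in practice, round M*lam back to integers
```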

So this real problem, this is the problem that gets you back to that one. It says that the lambdas I choose have to be integer multiples of one over M; if M is a thousand, multiples of .001, right? So I comment this out, because we can't handle it, basically. You comment this out and you get something called the relaxed experiment design problem, because I've relaxed this constraint. You already know about that. Wait, is that on the current homework? I don't know, because we're working on homeworks like one and two ahead. Are you doing something with a relaxation and Boolean variables now?

Student:Yeah.

Instructor (Stephen Boyd):Okay. Good. Then you know what relaxation is. Okay. So relaxation is commenting out the constraints that you can't handle. That's really what it is. There was a question? Yeah?

Student:Yeah. I was wondering, if you can't handle [inaudible] problem, is the stuff convex? It's like your M is not convex.

Instructor (Stephen Boyd):The only problem, the only thing that's not convex, is that.

Student:Yeah. But you can't handle it.

Instructor (Stephen Boyd):You could, but not with what we're doing in the class. No, in general, you can't. So these problems can be very difficult.

Student:[Inaudible]

Instructor (Stephen Boyd):I believe it is, but don't...

Student:Because it looks like [inaudible].

Instructor (Stephen Boyd):Yeah. I'm pretty sure I could make this NP-hard, but I'd go to Google and type "experiment design NP-hard" and there'd probably be five papers showing versions of experiment design are NP-hard. I'm guessing, but I don't know. Just to make sure.

Student:Could you also get rid of that constraint if you could use, like, a stochastic experiment design?

Instructor (Stephen Boyd):Yes. Okay. So that actually is an excellent question, so let me explain how this works. When you do a relaxation, and I mention this because you're doing one now, so it's a good time to talk about it; I'll say more about it later in the class. Now, first let's say what the truth is. The truth is you have a constraint you can't handle and you just comment it out. It's kind of like the truth of why you do least squares. You do least squares because you can and you know how to. That's why you do it, okay. Then you can construct all sorts of stories, some of which are sort of true, some not true. And someone says, why are you doing this? You go, oh, well, maximum likelihood asymptotically blah blah blah estimator Gaussian blah blah blah.

And they say, is the noise really Gaussian? And you go, yeah, yeah, sure it is. And, by the way, if you repeat those lame excuses often enough you'll actually start to believe them, right? You start off just not being able to handle it, but after you've been successful doing least squares for 15 or 20 years you actually start believing that things are Gaussian. Anyway. All right. So this is a great example. You're doing experiment design. You can't handle this constraint, so you simply comment it out and you're gonna solve this problem here. Now, by the way, in this problem you could get very close, because you can bound how far off you can be if M is a thousand. Right?

So the point is, if you solve this problem with lambda being just a real vector, these are numbers between zero and one; it's a probability distribution. Each such number is at most .0005 away from an integer multiple of .001. So you could actually bound that, right? In this case. So you don't have to fall back on some totally lame excuse or something like that. But the alternative is actually better, because it works universally. It was your suggestion and it goes like this. When someone says, what are you doing? You say, I'm solving this problem here, and they go, yeah, but this thing has to be an integer multiple of one over M. And you go, oh, no, that's very unsophisticated. What I'm really doing is stochastic experiment design. It's randomized; actually, it's exactly like the randomized detector.

And you go, this is very sophisticated. This is like a randomized algorithm, a randomized detector. This is much more sophisticated than just committing to 179 of type one, 28 of type two, and so on, adding up to a thousand. You go, no, no, no, this is much more sophisticated. I'm coming up with a randomized experiment design. I'm gonna come up with a probability distribution on the possible measurements, and then each time someone says I can carry out a measurement, I'll flip a coin, do a randomized experiment, and use those probabilities.

Now, of course, this argument is totally lame, but it often works. So I do recommend, when you relax something, trying it and seeing if you get away with it, because it just makes things easier. Did I answer your question?

Student:One other quick one. Can you do better like that?

Instructor (Stephen Boyd):What? With this?

Student:Using stochastic design, can you do better than if you do it deterministically?

Instructor (Stephen Boyd):Can you do better? Oh, yes, because you've removed a constraint. So you always do better, in some sense, right? The only problem with solving this problem and not the original one is when someone comes along and says, yeah, but I have a thousand measurements; you've gotta come up with some numbers that add up to a thousand. And you go, yeah, well, 187.22. And they go, what does that mean? I can't do 187.22 experiments of type one. I can do 187 or 188; which is it? So that's the problem. And you go, no, but the .22 is more sophisticated, because if you had to do a hundred thousand you'd be doing 18,722 or something like that. So anyway. In this case, if M is large, you can bound how far off you can be, and it's not a big deal.
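
Here is that bound checked numerically; the relaxed weights below are invented for illustration:

```python
import numpy as np

m = 1000
lam = np.array([0.18722, 0.51278, 0.30000])  # hypothetical relaxed design weights

counts = np.round(m * lam)                   # nearest integer design
print(np.abs(lam - counts / m).max())        # at most .0005, half of 1/m
# (In general the rounded counts may be off from m by a few; a small patch
# to the total fixes that without changing the bound much.)
```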

In the Boolean case you're doing right now, by the way, it's often not the case that you can bound how far off you are. Okay. All right. So this is the problem. By the way, these things work really, really well. I should also add: the same methods you're using in your homework now, or will be using shortly, work for experiment design. They work unbelievably well. So if someone actually said, choose a thousand experiments to carry out from this palette of 20, you'd solve this problem, you'd get some lambdas, and you'd just round them to integers. Like, you'd say, okay, 187 of the first one. By the way, that would be a valid choice of experiment design, right?

And it would have a certain value of E, which is a covariance matrix. Okay. The relaxation gives you a lower bound, and then someone could say, is that the optimal? And you could honestly say, I don't know, but I can't be more than .01 percent off. So it's good enough. You'll only know what I'm talking about if you've started on the homework, which is maybe not very many people, but anyway, we'll just move on.

Okay. So the other question is, how do you scalarize, given that it's a vector problem? Well, it is a vector problem; it's over covariance matrices. And, since it's experiment design, you get interesting stuff, and it's just what you think. Basically what happens is this. I'll just draw a confidence ellipsoid. You make one experiment design and you might get this confidence ellipsoid. Okay. You choose another blend of experiments, and if you get this, it's very simple: the second was a better choice than the first. Okay. That's the clear, unambiguous case, but, in fact, the way it really works is something like this.

You choose one set of experiments and you get that as your confidence ellipsoid, and you choose another set of experiments and you get this. Okay. And now, can you say which experiment design is better? The answer is no; that's nonsense. It's a multi-criterion problem. Now, by the way, if you're estimating X1, which one is better? Two. Obviously the second one is better, right? If for some reason you wanted to estimate something along this axis (oh, sorry, that axis), then one would be better. Okay. So you have to scalarize this, and there are lots of ways to scalarize it, and each way to scalarize it, by the way, ends up being a named type of experiment design.

The most common one by far, with many books written on it, is D-Optimal experiment design. I'm guessing D comes from determinant, but honestly I don't know. In D-Optimal experiment design you minimize the log det of the error covariance or, equivalently, you maximize the determinant of its inverse, the information matrix. Okay. So that's what you do. It has a beautiful interpretation in terms of confidence ellipsoids: you are minimizing the volume of the confidence ellipsoid. Okay. So that would be the geometric interpretation. There are others. If you minimize the trace of this instead, that, by the way, has another beautiful interpretation: the trace is the expected value of the norm squared of X hat minus X true. So you minimize the trace, and that's called A-Optimal design. And there are others.

There's E-Optimal design, and they go on and on. Okay. So this is clearly, I mean, it's a convex problem, because we're doing relaxed experiment design. It's a convex problem. You know what we should do? We should just do a homework problem on that, now that they know what relaxations are. Just a super simple one. Sorry, we're way behind on that. You should do one where you get the bound and then do the design. Okay. We'll do that. So I think I'll quit here soon, but we'll look at an example just to see how this works, and then I'll talk about this last business.
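
Since the recipe just promised is exactly "get the bound, then do the design," here is a hedged sketch of relaxed D-Optimal design using the cvxpy modeling package, followed by naive rounding. The palette is random, the budget m = 1000 is invented, and this is one way to express the problem, not a reference implementation:

```python
import cvxpy as cp
import numpy as np

# Invented palette: p = 20 candidate measurement vectors in R^2.
rng = np.random.default_rng(0)
p, n = 20, 2
V = rng.standard_normal((p, n))

# Relaxed D-Optimal design: maximize log det of the information matrix
# over design weights lam (a probability distribution over the palette).
lam = cp.Variable(p)
info = sum(lam[k] * np.outer(V[k], V[k]) for k in range(p))
prob = cp.Problem(cp.Maximize(cp.log_det(info)),
                  [lam >= 0, cp.sum(lam) == 1])
prob.solve()

# Round to an integer design with budget m, patching the total if needed.
m = 1000
counts = np.round(m * lam.value).astype(int)
counts[np.argmax(counts)] += m - counts.sum()

# The relaxed optimum lower-bounds the log det of the error covariance;
# the gap to the rounded design certifies how suboptimal rounding can be.
info_round = sum(counts[k] * np.outer(V[k], V[k]) for k in range(p)) / m
lower_bound = -prob.value
achieved = -np.linalg.slogdet(info_round)[1]
print(lower_bound, achieved, achieved - lower_bound)
```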

So here are some possible experiments you can carry out. These are the AI's. So basically it says you're gonna make a linear measurement, and notice that these ones are farther out; they have a larger magnitude than these. That means this is a set of sensors here with a higher signal-to-noise ratio than these other measurements. Okay. Now, it should be kind of obvious that what you really want, if possible, is high signal-to-noise-ratio measurements that are orthogonal, because if they're orthogonal the information matrix is large and well conditioned, and large translates to small when you invert the matrix, and so on. And, sure enough, the palette of experiments offered was something like 20, and two were selected.

And the two that were selected make tons of sense. It didn't select any of these, because other measurements do kind of the same thing with a better signal-to-noise ratio, so those were opted for instead. No. 2, it chose the ones that were, sort of, farthest apart from each other, the measurements closest to orthogonal. By the way, if you do things like GPS and stuff like that, these things will make perfect sense to you, right? In fact, you'd even call this something like geometric gain. So this is what happened.
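
The orthogonality intuition can be checked with a toy comparison (the vectors and budget below are invented; smaller log det of the covariance means a smaller confidence ellipsoid):

```python
import numpy as np

def cov_logdet(vectors, m=1000):
    # log det of the error covariance for m measurements split evenly
    # across the given vectors (noise standard deviation taken as one).
    info = sum((m / len(vectors)) * np.outer(v, v) for v in vectors)
    return -np.linalg.slogdet(info)[1]

orthogonal = [np.array([10.0, 0.0]), np.array([0.0, 10.0])]
nearly_parallel = [np.array([10.0, 0.0]), np.array([10.0, 1.0])]

print(cov_logdet(orthogonal))       # much smaller ellipsoid
print(cov_logdet(nearly_parallel))  # much larger ellipsoid
```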

So obviously in this case you did not need to solve a convex problem to be told that you should take high signal-to-noise-ratio measurements instead of low ones, and that you should, if possible, take ones that are nearly orthogonal. You don't need that; trust me. But if you have a problem where you're estimating 50 variables, or even ten, and you have a hundred possible measurements, there is absolutely no way you can intuit what the right experiment design is, and the result can be very, very good. But anyway, I thought we already decided: you'll find out; you'll do one. Okay. We'll quit here.

[End of Audio]

Duration: 77 minutes