MachineLearning-Lecture11

Instructor (Andrew Ng):Okay. Good morning. Welcome back.

What I want to do today is actually wrap up our discussion on learning theory and sort of on Ė and Iím gonna start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of todayís lecture will actually be on various pieces of that, so applying machine learning algorithms to problems like, you know, like the project or other problems you may go work on after you graduate from this class.

But letís start the talk about Bayesian statistics and regularization. So you remember from last week, we started to talk about learning theory and we learned about bias and variance. And I guess in the previous lecture, we spent most of the previous lecture talking about algorithms for model selection and for feature selection. We talked about cross-validation. Right?

So most of the methods we talked about in the previous lecture were ways for you to try to simply the model. So for example, the feature selection algorithms we talked about gives you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have less parameters and you may be less likely to overfit. Right?

What I want to do today is to talk about a different way to prevent overfitting. And thereís a method called regularization and thereís a way that lets you keep all the parameters. So hereís the idea, and Iím gonna illustrate this example with, say, linear regression. So you take the linear regression model, the very first model we learned about, right, we said that we would choose the parameters via maximum likelihood. Right? And that meant that, you know, you would choose the parameters theta that maximized the probability of the data, which is parameters theta that maximized the probability of the data we observe. Right?

And so to give this sort of procedure a name, this is one example of most common frequencies procedure, and frequency, you can think of sort of as maybe one school of statistics. And the philosophical view behind writing this down was we envisioned that there was some true parameter theta out there that generated, you know, the Xs and the Ys. Thereís some true parameter theta that govern housing prices, Y is a function of X, and we donít know what the value of theta is, and weíd like to come up with some procedure for estimating the value of theta. Okay? And so, maximum likelihood is just one possible procedure for estimating the unknown value for theta.

And the way you formulated this, you know, theta was not a random variable. Right? Thatís what why said, so theta is just some true value out there. Itís not random or anything, we just donít know what it is, and we have a procedure called maximum likelihood for estimating the value for theta. So this is one example of whatís called a frequencies procedure.

The alternative to the, I guess, the frequency school of statistics is the Bayesian school, in which weíre gonna say that we donít know what theta, and so we will put a prior on theta. Okay? So in the Bayesian school students would say, ďWell donít know what the value of theta so letís represent our uncertainty over theta with a prior.Ē So for example, our prior on theta may be a Gaussian distribution with mean zero and curvalence matrix given by tau squared I. Okay?

And so Ė actually, if I use S to denote my training set, well Ė right, so theta represents my beliefs about what the parameters are in the absence of any data. So not having seen any data, theta represents, you know, what I think theta Ė it probably represents what I think theta is most likely to be. And so given the training set, S, in the sort of Bayesian procedure, we would, well, calculate the probability, the posterior probability by parameters given my training sets, and Ė letís write this on the next board.

So my posterior on my parameters given my training set, by Bayesí rule, this will be proportional to, you know, this. Right? So by Bayesí rule. Letís call it posterior. And this distribution now represents my beliefs about what theta is after Iíve seen the training set.

And when you now want to make a new prediction on the price of a new house, on the input X, I would say that, well, the distribution over the possible housing prices for this new house Iím trying to estimate the price of, say, given the size of the house, the features of the house at X, and the training set I had previously, it is going to be given by an integral over my parameters theta of probably of Y given X comma theta and times the posterior distribution of theta given the training set. Okay?

And in particular, if you want your prediction to be the expected value of Y given the input X in training set, you would say integrate over Y times the posterior. Okay? You would take an expectation of Y with respect to your posterior distribution. Okay?

And you notice that when I was writing this down, so with the Bayesian formulation, and now started to write up here Y given X comma theta because this formula now is the property of Y conditioned on the values of the random variables X and theta. So Iím no longer writing semicolon theta, Iím writing comma theta because Iím now treating theta as a random variable.

So all of this is somewhat abstract but this is Ė and it turns out Ė actually letís check. Are there questions about this? No? Okay.

Letís try to make this more concrete. It turns out that for many problems, both of these steps in the computation are difficult because if, you know, theta is an N plus one-dimensional vector, is an N plus one-dimensional parameter vector, then this is one an integral over an N plus one-dimensional, you know, over RN plus one. And because numerically itís very difficult to compute integrals over very high dimensional spaces, all right?

So usually this integral Ė actually usually itís hard to compute the posterior in theta and itís also hard to compute this integral if theta is very high dimensional. There are few exceptions for which this can be done in closed form, but for many learning algorithms, say, Bayesian logistic regression, this is hard to do.

And so whatís commonly done is to take the posterior distribution and instead of actually computing a full posterior distribution, chi of theta given S, weíll instead take this quantity on the right-hand side and just maximize this quantity on the right-hand side. So let me write this down. So commonly, instead of computing the full posterior distribution, we will choose the following. Okay?

We will choose whatís called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most likely value of theta, most probable value of theta onto your posterior distribution. And thatís just ont max chi of theta. And then when you need to make a prediction, you know, you would just predict, say, well, using your usual hypothesis and using this MAP value of theta in place of Ė as the parameter vector youíd choose. Okay?

And notice, the only difference between this and standard maximum likelihood estimation is that when youíre choosing, you know, the Ė instead of choosing the maximum likelihood value for theta, youíre instead maximizing this, which is what you have for maximum likelihood estimation, and then times this other quantity which is the prior. Right?

And letís see, when intuition is that if your prior is theta being Gaussian and with mean zero and some covariance, then for a distribution like this, most of the [inaudible] mass is close to zero. Right? So thereís a Gaussian centered around the point zero, and so [inaudible] mass is close to zero. And so the prior distribution, instead of saying that you think most of the parameters should be close to zero, and if you remember our discussion on feature selection, if you eliminate a feature from consideration thatís the same as setting the source and value of theta to be equal to zero.

All right? So if you set theta five to be equal to zero, thatís the same as, you know, eliminating feature five from the your hypothesis. And so, this is the prior that drives most of the parameter values to zero Ė to values close to zero. And youíll think of this as doing something analogous, if Ė doing something reminiscent of feature selection. Okay? And it turns out that with this formulation, the parameters wonít actually be exactly zero but many of the values will be close to zero.

And I guess in pictures, if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial Ė well I think that had too many bumps in it, but never mind. If you fit it a Ė if you fit very high polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right?

In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial that still get sort of a smoother and smoother fit to the data as you decrease tau. So as you decrease tau, youíre driving the parameters to be closer and closer to zero. And that in practice Ė itís sort of hard to see, but you can take my word for it. As tau becomes smaller and smaller, the curves you tend to fit your data also become smoother and smoother, and so you tend less and less overfit, even when youíre fitting a large number of parameters. Okay?

Letís see, and one last piece of intuition that I would just toss out there. And you get to play more with this particular set of ideas more in Problem Set 3, which Iíll post online later this week I guess. Is that whereas maximum likelihood tries to minimize, say, this, right?

Whereas maximum likelihood for, say, linear regression turns out to be minimizing this, it turns out that if you add this prior term there, it turns out that the authorization objective you end up optimizing turns out to be that. Where you add an extra term that, you know, penalizes your parameter theta as being large. And so this ends up being an algorithm thatís very similar to maximum likelihood, expect that you tend to keep your parameters small. And this has the effect.

Again, itís kind of hard to see but just take my word for it. That strengthening the parameters has the effect of keeping the functions you fit to be smoother and less likely to overfit. Okay? Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But letís check questions about all this.

Student:The smoothing behavior is it because [inaudible] actually get different [inaudible]?

Instructor (Andrew Ng):Letís see. Yeah. It depends on Ė well most priors with most of the mass close to zero will get this effect, I guess. And just by convention, the Gaussian prior is whatís most Ė used the most common for models like logistic regression and linear regression, generalized in your models. There are a few other priors that I sometimes use, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects.

All right. Cool.

And so it turns out that for problems like text classification, text classification is like 30,000 features or 50,000 features, where it seems like an algorithm like logistic regression would be very much prone to overfitting. Right? So imagine trying to build a spam classifier, maybe you have 100 training examples but you have 30,000 features or 50,000 features, that seems clearly to be prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with [inaudible] Gaussian, logistic regression becomes a very effective text classification algorithm with this sort of Bayesian regularization.

Alex?

Student:[Inaudible]?

Instructor (Andrew Ng):Yeah, right, and so pick Ė and to pick either tau squared or lambda. I think the relation is lambda equals one over tau squared. But right, so pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool.

So all right, that was all I want to say about methods for preventing overfitting. What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. And so, you know, when youíre designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but canít find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class but I just want to tell you a little bit about it.

Okay. So hereís the idea. So far, all the learning algorithms weíve talked about are whatís called batch learning algorithms, where youíre given a training set and then you get to run your learning algorithm on the training set and then maybe you test it on some other test set. And thereís another learning setting called online learning, in which you have to make predictions even while you are in the process of learning. So hereís how the problem sees. All right?

Iím first gonna give you X one. Letís say thereís a classification problem, so Iím first gonna give you X one and then gonna ask you, you know, ďCan you make a prediction on X one? Is the label one or zero?Ē And youíve not seen any data yet. And so, you make a guess. Right? You guess Ė weíll call your guess Y hat one. And after youíve made your prediction, I will then reveal to you the true label Y one. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent, right, if you guess randomly.

And then I show you X two. And then I ask you, ďCan you make a prediction on X two?Ē And so you now maybe are gonna make a slightly more educated guess and call that Y hat two. And after youíve made your guess, I reveal the true label to you. And so, then I show you X three, and then you make your guess, and learning proceeds as follows.

So this is just a lot of machine learning and batch learning, and the model settings where you have to keep learning even as youíre making predictions, okay? So I don't know, setting your website and you have users coming in. And as the first user comes in, you need to start making predictions already about what the user likes or dislikes. And thereís only, you know, as youíre making predictions you get to show more and more training examples.

So in online learning what you care about is the total online error, which is sum from I equals one to MC if you get the sequence of M examples all together, indicator Y hat I not equal to Y hi. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that, you know, many of the learning algorithms you have Ė when you finish all the learning algorithms, youíve learned about and can apply to this setting. One thing you could do is when youíre asked to make prediction on Y hat three, right, one simple thing to do is well youíve seen some other training examples up to this point so you can just take your learning algorithm and run it on the examples, you know, leading up to Y hat three. So just run the learning algorithm on all the examples youíve seen previous to being asked to make a prediction on certain example, and then use your learning algorithm to make a prediction on the next example.

And it turns out that there are also algorithms, especially the algorithms that we saw that you could use the stochastic gradient descent, that, you know, can be adapted very nicely to this. So as a concrete example, if you remember the perceptron algorithms, say, right, you would say initial the parameter theta to be equal to zero.

And then after seeing the Ith training example, youíd update the parameters, you know, using Ė youíve see this reel a lot of times now, right, using the standard perceptron learning rule. And the same thing, if you were using logistic regression you can then, again, after seeing each training example, just run, you know, essentially run one-step stochastic gradient descent on just the example you saw. Okay?

And so the reason Iíve put this into the sort of ďlearning theoryĒ section of this class was because it turns that sometimes you can prove fairly amazing results on your total online error using algorithms like these. I will actually Ė I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features XI, maybe infinite dimensional feature vectors, like we saw for simple vector machines. And sometimes, infinite feature dimensional vectors may use kernel representations. Okay?

But so it turns out that you can prove that when you a perceptron algorithm, even when the data is maybe extremely high dimensional and it seems like youíd be prone to overfitting, right, you can prove that so as the long as the positive and negative examples are separated by a margin, right.

So in this infinite dimensional space, so long as, you know, there is some margin down there separating the positive and negative examples, you can prove that perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And then so after seeing only a finite number of examples, itíll converge to digital boundary that perfectly separates the positive and negative examples, even though you may in an infinite dimensional space. Okay?

So letís see. The proof itself would take me sort of almost an entire lecture to do, and there are sort of other things that I want to do more than that. So you want to see the proof of this yourself, itís actually written up in the lecture notes that I posted online. For the purposes of this classí syllabus, the proof of this result, you can treat this as optional reading. And by that, I mean, you know, it wonít appear on the midterm and you wonít be asked about this specifically in the problem sets, but I thought itíd be Ė

I know some of you are curious after the previous lecture so why you can prove that, you know, SVMs can have bounded VC dimension, even in these infinite dimensional spaces, and how do you prove things in these Ė how do you prove learning theory results in these infinite dimensional feature spaces. And so the perceptron bound that I just talked about was the simplest instance I know of that you can sort of read in like half an hour and understand it.

So if youíre interested, there are lecture notes online for how this perceptron bound is actually proved. Itís a very [inaudible], you can prove it in like a page or so, so go ahead and take a look at that if youíre interested. Okay?

But regardless of the theoretical results, you know, the online learning setting is something that you Ė that comes reasonably often. And so, these algorithms based on stochastic gradient descent often go very well. Okay, any questions about this before I move on? All right. Cool.

So the last thing I want to do today, and was the majority of todayís lecture, actually can I switch to PowerPoint slides, please, is I actually want to spend most of todayís lecture sort of talking about advice for applying different machine learning algorithms.

And so, you know, right now, already you have a, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully because, you know, the same tool Ė it turns out that you can take the same machine learning tool, say logistic regression, and you can ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and itíll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right?

And so what I want to do today, in the rest of the time I have today, is try to convey to you, you know, some of the methods for how to make sure youíre one of Ė you really know how to get these learning algorithms to work well in problems. So just some caveats on what Iím gonna, I guess, talk about in the rest of todayís lecture.

Something I want to talk about is actually not very mathematical but is also some of the hardest, most conceptually most difficult material in this class to understand. All right? So this is not mathematical but this is not easy. And I want to say this caveat some of what Iíll say today is debatable. I think most good machine learning people will agree with most of what I say but maybe not everything I say. And some of what Iíll say is also not good advice for doing machine learning either, so Iíll say more about this later.

What Iím focusing on today is advice for how to just get stuff to work. If you work in the company and you want to deliver a product or youíre, you know, building a system and you just want your machine learning system to work. Okay? Some of what Iím about to say today isnít great advice if you goal is to invent a new machine learning algorithm, but this is advice for how to make machine learning algorithm work and, you know, and deploy a working system.

So three key areas Iím gonna talk about. One: diagnostics for debugging learning algorithms. Second: sort of talk briefly about error analyses and ablative analysis. And third, I want to talk about just advice for how to get started on a machine-learning problem. And one theme thatíll come up later is it turns out youíve heard about premature optimization, right, in writing software. This is when someone over-designs from the start, when someone, you know, is writing piece of code and they choose a subroutine to optimize heavily. And maybe you write the subroutine as assembly or something.

And thatís often Ė and many of us have been guilty of premature optimization, where weíre trying to get a piece of code to run faster. And we choose probably a piece of code and we implement it an assembly, and really tune and get to run really quickly. And it turns out that wasnít the bottleneck in the code at all. Right? And we call that premature optimization. And in undergraduate programming classes, we warn people all the time not to do premature optimization and people still do it all the time. Right?

And turns out, a very similar thing happens in building machine-learning systems. That many people are often guilty of, what I call, premature statistical optimization, where they heavily optimize part of a machine learning system and that turns out not to be the important piece. Okay? So Iíll talk about that later, as well.

So letís first talk about debugging learning algorithms. As a motivating example, letís say you want to build an anti-spam system. And letís say youíve carefully chosen, you know, a small set of 100 words to use as features. All right? So instead of using 50,000 words, youíre chosen a small set of 100 features to use for your anti-spam system. And letís say you implement Bayesian logistic regression, implement gradient descent, and you get 20 percent test error, which is unacceptably high. Right?

So this is Bayesian logistic regression, and so itís just like maximum likelihood but, you know, with that additional lambda squared term. And weíre maximizing rather than minimizing as well, so thereís a minus lambda theta square instead of plus lambda theta squared. So the question is, you implemented your Bayesian logistic regression algorithm, and you tested it on your test set and you got unacceptably high error, so what do you do next? Right?

So, you know, one thing you could do is think about the ways you could improve this algorithm. And this is probably what most people will do instead of, ďWell letís sit down and think what couldíve gone wrong, and then weíll try to improve the algorithm.Ē Well obviously having more training data could only help, so one thing you can do is try to get more training examples. Maybe you suspect, that even 100 features was too many, so you might try to get a smaller set of features.

Whatís more common is you might suspect your features arenít good enough, so you might spend some time, look at the email headers, see if you can figure out better features for, you know, finding spam emails or whatever. Right. And right, so and just sit around and come up with better features, such as for email headers. You may also suspect that gradient descent hasnít quite converged yet, and so letís try running gradient descent a bit longer to see if that works. And clearly, that canít hurt, right, just run gradient descent longer.

Or maybe you remember, you know, you remember hearing from class that maybe Newtonís method converges better, so letís try that instead. You may want to tune the value for lambda, because not sure if that was the right thing, or maybe you even want to an SVM because maybe you think an SVM might work better than logistic regression.

So I only listed eight things here, but you can imagine if you were actually sitting down, building machine-learning system, the options to you are endless. You can think of, you know, hundreds of ways to improve a learning system. And some of these things like, well getting more training examples, surely thatís gonna help, so that seems like itís a good use of your time. Right?

And it turns out that this [inaudible] of picking ways to improve the learning algorithm and picking one and going for it, it might work in the sense that it may eventually get you to a working system, but often itís very time-consuming. And I think itís often a largely Ė largely a matter of luck, whether you end up fixing what the problem is.

In particular, these eight improvements all fix very different problems. And some of them will be fixing problems that you donít have. And if you can rule out six of eight of these, say, you could Ė if by somehow looking at the problem more deeply, you can figure out which one of these eight things is actually the right thing to do, you can save yourself a lot of time. So letís see how we can go about doing that.

The people in industry and in research that I see that are really good, would not go and try to change a learning algorithm randomly. There are lots of things that obviously improve your learning algorithm, but the problem is there are so many of them itís hard to know what to do. So you find all the really good ones that run various diagnostics to figure out the problem is and they think where a problem is. Okay?

So for our motivating story, right, we said Ė letís say Bayesian logistic regression test error was 20 percent, which letís say is unacceptably high. And letís suppose you suspected the problem is either overfitting, so itís high bias, or you suspect that, you know, maybe you have two few features that classify as spam, so thereís Ė

Oh excuse me; I think I wrote that wrong. Letís firstly Ė so letís forget Ė forget the tables.

Suppose you suspect the problem is either high bias or high variance, and some of the text here doesnít make sense. And you want to know if youíre overfitting, which would be high variance, or you have too few features classified as spam, itíd be high bias. I had those two reversed, sorry. Okay?

So how can you figure out whether the problem is one of high bias or high variance? Right? So it turns out thereís a simple diagnostic you can look at that will tell you whether the problem is high bias or high variance. If you remember the cartoon weíd seen previously for high variance problems, when you have high variance the training error will be much lower than the test error. All right?

When you have a high variance problem, thatís when youíre fitting your training set very well. Thatís when youíre fitting, you know, a tenth order polynomial to 11 data points. All right? And thatís when youíre just fitting the data set very well, and so your training error will be much lower than your test error.

And in contrast, if you have high bias, thatís when your training error will also be high. Right? Thatís when your data is quadratic, say, but youíre fitting a linear function to it and so you arenít even fitting your training set well.

So just in cartoons, I guess, this is a Ė this is what a typical learning curve for high variance looks like. On your horizontal axis, Iím plotting the training set size M, right, and on vertical axis, Iím plotting the error. And so, letís see, you know, as you increase Ė if you have a high variance problem, youíll notice as the training set size, M, increases, your test set error will keep on decreasing. And so this sort of suggests that, well, if you can increase the training set size even further, maybe if you extrapolate the green curve out, maybe that test set error will decrease even further. All right?

Another thing thatís useful to plot here is Ė letís say the red horizontal line is the desired performance youíre trying to reach, another useful thing to plot is actually the training error. Right? And it turns out that your training error will actually grow as a function of the training set size because the larger your training set, the harder it is to fit, you know, your training set perfectly. Right?

So this is just a cartoon, donít take it too seriously, but in general, your training error will actually grow as a function of your training set size. Because smart training sets, if you have one data point, itís really easy to fit that perfectly, but if you have 10,000 data points, itís much harder to fit that perfectly. All right?

And so another diagnostic for high variance, and the one that I tend to use more, is to just look at training versus test error. And if thereís a large gap between them, then this suggests that, you know, getting more training data may allow you to help close that gap. Okay? So this is what the cartoon would look like when Ė in the case of high variance.

This is what the cartoon looks like for high bias. Right? If you look at the learning curve, you see that the curve for test error has flattened out already. And so this is a sign that, you know, if you get more training examples, if you extrapolate this curve further to the right, itís maybe not likely to go down much further. And this is a property of high bias: that getting more training data wonít necessarily help.

But again, to me the more useful diagnostic is if you plot training errors well, if you look at your training error as well as your, you know, hold out test set error. If you find that even your training error is high, then thatís a sign that getting more training data is not going to help. Right?

In fact, you know, think about it, training error grows as a function of your training set size. And so if your training error is already above your level of desired performance, then getting even more training data is not going to reduce your training error down to the desired level of performance. Right? Because, you know, your training error sort of only gets worse as you get more and more training examples. So if you extrapolate further to the right, itís not like this blue line will come back down to the level of desired performance. Right? This will stay up there. Okay?

So for me personally, I actually, when looking at a curve like the green curve on test error, I actually personally tend to find it very difficult to tell if the curve is still going down or if itís [inaudible]. Sometimes you can tell, but very often, itís somewhat ambiguous. So for me personally, the diagnostic I tend to use the most often to tell if I have a bias problem or a variance problem is to look at training and test error and see if theyíre very close together or if theyíre relatively far apart. Okay?

And so, going back to the list of fixes, look at the first fix, getting more training examples is a way to fix high variance. Right? If you have a high variance problem, getting more training examples will help. Trying a smaller set of features: that also fixes high variance. All right? Trying a larger set of features or adding email features, these are solutions that fix high bias. Right? So high bias being if youíre hypothesis was too simple, you didnít have enough features. Okay?

And so quite often you see people working on machine learning problems and theyíll remember that getting more training examples helps. And so, theyíll build a learning system, build an anti-spam system and it doesnít work. And then they go off and spend lots of time and money and effort collecting more training data because theyíll say, ďOh well, getting more dataís obviously got to help.Ē But if they had a high bias problem in the first place, and not a high variance problem, itís entirely possible to spend three months or six months collecting more and more training data, not realizing that it couldnít possibly help. Right?

And so, this actually happens a lot in, you know, in Silicon Valley and companies, this happens a lot. There will often people building various machine learning systems, and theyíll often Ė you often see people spending six months working on fixing a learning algorithm and you couldíve told them six months ago that, you know, that couldnít possibly have helped. But because they didnít know what the problem was, and theyíd easily spend six months trying to invent new features or something.

And this is Ė you see this surprisingly often and this is somewhat depressing. You couldíve gone to them and told them, ďI couldíve told you six months ago that this was not going to help.Ē And the six months is not a joke, you actually see this. And in contrast, if you actually figure out the problemís one of high bias or high variance, then you can rule out two of these solutions and save yourself many months of fruitless effort. Okay?

I actually want to talk about these four at the bottom as well. But before I move on, let me just check if there were questions about what Iíve talked about so far. No? Okay, great.

So bias versus variance is one thing that comes up often. This bias versus variance is one common diagnostic. And so, for other machine learning problems, itís often up to your own ingenuity to figure out your own diagnostics to figure out whatís wrong. All right? So if a machine-learning algorithm isnít working, very often itís up to you to figure out, you know, to construct your own tests. Like do you look at the difference training and test errors or do you look at something else? Itís often up to your own ingenuity to construct your own diagnostics to figure out whatís going on.

What I want to do is go through another example. All right? And this one is slightly more contrived but itíll illustrate another common question that comes up, another one of the most common issues that comes up in applying learning algorithms.

So in this example, itís slightly more contrived, letís say you implement Bayesian logistic regression and you get 2 percent error on spam mail and 2 percent error non-spam mail. Right? So itís rejecting, you know, 2 percent of Ė itís rejecting 98 percent of your spam mail, which is fine, so 2 percent of all spam gets through which is fine, but is also rejecting 2 percent of your good email, 2 percent of the email from your friends and thatís unacceptably high, letís say.

And letís say that a simple vector machine using a linear kernel gets 10 percent error on spam and 0.01 percent error on non-spam, which is more of the acceptable performance you want. And letís say for the sake of this example, letís say youíre trying to build an anti-spam system. Right? Letís say that you really want to deploy logistic regression to your customers because of computational efficiency or because you need retrain overnight every day, and because logistic regression just runs more easily and more quickly or something. Okay?

So letís say you want to deploy logistic regression, but itís just not working out well. So question is: What do you do next? So it turns out that this Ė the issue that comes up here, the one other common question that comes up is a question of is the algorithm converging. So you might suspect that maybe the problem with logistic regression is that itís just not converging. Maybe you need to run iterations.

And it turns out that, again if you look at the optimization objective, say, logistic regression is, letís say, optimizing J of theta, it actually turns out that if you look at optimizing your objective as a function of the number of iterations, when you look at this curve, you know, it sort of looks like itís going up but it sort of looks like thereís absentiles.

And when you look at these curves, itís often very hard to tell if the curve has already flattened out. All right? And you look at these curves a lot so you can ask: Well has the algorithm converged? When you look at the J of theta like this, itís often hard to tell. You can run this ten times as long and see if itís flattened out. And you can run this ten times as long and itíll often still look like maybe itís going up very slowly, or something. Right? So a better diagnostic for what logistic regression is converged than looking at this curve.

The other question you might wonder Ė the other thing you might suspect is a problem is are you optimizing the right function. So what you care about, right, in spam, say, is a weighted accuracy function like that. So A of theta is, you know, sum over your examples of some weights times whether you got it right. And so the weight may be higher for non-spam than for spam mail because you care about getting your predictions correct for spam email much more than non-spam mail, say.

So letís say A of theta is the optimization objective that you really care about, but Bayesian logistic regression is that it optimizes a quantity like that. Right? Itís this sort of maximum likelihood thing and then with this two-nom, you know, penalty thing that we saw previously. And you might be wondering: Is this the right optimization function to be optimizing. Okay? And: Or do I maybe need to change the value for lambda to change this parameter? Or: Should I maybe really be switching to support vector machine optimization objective?

Okay? Does that make sense?

So the second diagnostic Iím gonna talk about is letís say you want to figure out is the algorithm converging, is the optimization algorithm converging, or is the problem with the optimization objective I chose in the first place? Okay?

So hereís the diagnostic you can use. Let me let Ė right. So to just reiterate the story, right, letís say an SVM outperforms Bayesian logistic regression but you really want to deploy Bayesian logistic regression to your problem. Let me let theta subscript SVM, be the parameters learned by an SVM, and Iíll let theta subscript BLR be the parameters learned by Bayesian logistic regression.

So the optimization objective you care about is this, you know, weighted accuracy criteria that I talked about just now. And the support vector machine outperforms Bayesian logistic regression. And so, you know, the weighted accuracy on the support-vector-machine parameters is better than the weighted accuracy for Bayesian logistic regression.

So further, Bayesian logistic regression tries to optimize an optimization objective like that, which I denoted J theta. And so, the diagnostic I choose to use is to see if J of SVM is bigger-than or less-than J of BLR. Okay? So I explain this on the next slide.

So we know two facts. We know that Ė well we know one fact. We know that a weighted accuracy of support vector machine, right, is bigger than this weighted accuracy of Bayesian logistic regression. So in order for me to figure out whether Bayesian logistic regression is converging, or whether Iím just optimizing the wrong objective function, the diagnostic Iím gonna use and Iím gonna check if this equality hold through. Okay?

So let me explain this, so in Case 1, right, itís just those two equations copied over. In Case 1, letís say that J of SVM is, indeed, is greater than J of BLR Ė or J of theta SVM is greater than J of theta BLR. But we know that Bayesian logistic regression was trying to maximize J of theta; thatís the definition of Bayesian logistic regression.

So this means that theta Ė the value of theta output that Bayesian logistic regression actually fails to maximize J because the support back to machine actually returned the value of theta that, you know does a better job out-maximizing J. And so, this tells me that Bayesian logistic regression didnít actually maximize J correctly, and so the problem is with the optimization algorithm. The optimization algorithm hasnít converged.

The other case is as follows, where J of theta SVM is less-than/equal to J of theta BLR. Okay? In this case, what does that mean? This means that Bayesian logistic regression actually attains the higher value for the optimization objective J then doesnít support back to machine. The support back to machine, which does worse on your optimization problem, actually does better on the weighted accuracy measure.

So what this means is that something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta, you know, doesnít really correspond that well to maximizing your weighted accuracy criteria. And therefore, this tells you that J of theta is maybe the wrong optimization objective to be maximizing. Right? That just maximizing J of theta just wasnít a good objective to be choosing if you care about the weighted accuracy. Okay?

Can you raise your hand if this made sense? Cool, good.

So that tells us whether the problem is with the optimization objective or whether itís with the objective function. And so going back to this slide, the eight fixes we had, you notice that if you run gradient descent for more iterations that fixes the optimization algorithm. You try and use this method fixes the optimization algorithm, whereas using a different value for lambda, in that lambda times norm of data squared, you know, in your objective, fixes the optimization objective. And changing to an SVM is also another way of trying to fix the optimization objective. Okay?

And so once again, you actually see this quite often that Ė actually, you see it very often, people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm. Thatís another very common pattern that the problem is in the formula from your J of theta, that often you see people, you know, just running more and more iterations of gradient descent. Like trying Newtonís method and trying conjugate and then trying more and more crazy optimization algorithms, whereas the problem was, you know, optimizing J of theta wasnít going to fix the problem at all. Okay?

So thereís another example of when these sorts of diagnostics will help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective. Okay?

Let me think how much time I have. Hmm, letís see. Well okay, we have time. Letís do this. Show you one last example of a diagnostic. This is one that came up in, you know, my studentsí and my work on flying helicopters. This one actually, this example is the most complex of the three examples Iím gonna do today. Iím going to somewhat quickly, and this actually draws on reinforcement learning which is something that Iím not gonna talk about until towards Ė close to the end of the course here, but this just a more complicated example of a diagnostic weíre gonna go over.

What Iíll do is probably go over this fairly quickly, and then after weíve talked about reinforcement learning in the class, Iíll probably actually come back and redo this exact same example because youíll understand it more deeply. Okay?

So some of you know that my students and I fly autonomous helicopters, so how do you get a machine-learning algorithm to design the controller for helicopter? This is what we do. All right? This first step was you build a simulator for a helicopter, so, you know, thereís a screenshot of our simulator. This is just like a Ė itís like a joystick simulator; you can fly a helicopter in simulation.

And then you choose a cost function, itís actually called a [inaudible] function, but for this actually Iíll call it cost function. Say J of theta is, you know, the expected squared error in your helicopterís position. Okay? So this is J of theta is maybe itís expected square error or just the square error. And then we run a reinforcement-learning algorithm, youíll learn about RL algorithms in a few weeks. You run reinforcement learning algorithm in your simulator to try to minimize this cost function; try to minimize the squared error of how well youíre controlling your helicopterís position. Okay?

The reinforcement learning algorithm will output some parameters, which Iím denoting theta subscript RL, and then youíll use that to fly your helicopter. So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. Then what do you do next? And in particular, you know, corresponding to the three steps above, there are three natural things you can try. Right?

You can try to Ė oh, the bottom of the slide got chopped off. You can try to improve the simulator. And maybe you think your simulatorís isnít that accurate, you need to capture the aerodynamic effects more accurately. You need to capture the airflow and the turbulence affects around the helicopter more accurately. Maybe you need to modify the cost function. Maybe your square error isnít cutting it. Maybe what a human pilot does isnít just optimizing square area but itís something more subtle. Or maybe the reinforcement-learning algorithm isnít working; maybe itís not quite converging or something. Okay?

So these are the diagnostics that I actually used, and my students and I actually use to figure out whatís going on. Actually, why donít you just think about this for a second and think what youíd do, and then Iíll go on and tell you what we do. All right, so let me tell you what Ė how we do this and see whether itís the same as yours or not. And if you have a better idea than I do, let me know and Iíll let you try it on my helicopter.

So hereís a reasoning that I wanted to experiment, right. So, yeah, letís say the controller output by our reinforcement-learning algorithm does poorly. Well suppose the following three things hold true. Suppose the contrary, I guess. Suppose that the helicopter simulator is accurate, so letís assume we have an accurate model of our helicopter. And letís suppose that the reinforcement learning algorithm, you know, correctly controls the helicopter in simulation, so we tend to run a learning algorithm in simulation so that, you know, the learning algorithm can crash a helicopter and itís fine. Right?

So letís assume our reinforcement-learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And letís suppose that minimizing J of theta does indeed correspond to accurate or the correct autonomous flight. If all of these things held true, then that means that the parameters, theta RL, should actually fly well on my real helicopter. Right? And so the fact that the learning control parameters, theta RL, does not fly well on my helicopter, that sort of means that ones of these three assumptions must be wrong and Iíd like to figure out which of these three assumptions is wrong. Okay?

So these are the diagnostics we use. First one is we look at the controller and see if it even flies well in simulation. Right? So the simulator of the helicopter that we did the learning on, and so if the learning algorithm flies well in the simulator but it doesnít fly well on my real helicopter, then that tells me the problem is probably in the simulator. Right? My simulator predicts the helicopterís controller will fly well but it doesnít actually fly well in real life, so could be the problemís in the simulator and we should spend out efforts improving the accuracy of our simulator.

Otherwise, let me write theta subscript human, be the human control policy. All right? So letís go ahead and ask a human to fly the helicopter, it could be in the simulator, it could be in real life, and letís measure, you know, the means squared error of the human pilotís flight. And letís see if the human pilot does better or worse than the learned controller, in terms of optimizing this objective function J of theta. Okay?

So if the human does worse, if even a very good human pilot attains a worse value on my optimization objective, on my cost function, than my learning algorithm, then the problem is in the reinforcement-learning algorithm. Because my reinforcement-learning algorithm was trying to minimize J of theta, but a human actually attains a lower value for J of theta than does my algorithm. And so that tells me that clearly my algorithmís not managing to minimize J of theta and that tells me the problemís in the reinforcement learning algorithm.

And finally, if J of theta Ė if the human actually attains a larger value for theta Ė excuse me, if the human actually attains a larger value for J of theta, the human actually has, you know, larger mean squared error for the helicopter position than does my reinforcement learning algorithms, thatís I like Ė but I like the way the human flies much better than my reinforcement learning algorithm.

So if that holds true, then clearly the problemís in the cost function, right, because the human does worse on my cost function but flies much better than my learning algorithm. And so that means the problemís in the cost function. It means Ė oh excuse me, I meant minimizing it, not maximizing it, thereís a typo on the slide, because that means that minimizing the cost function Ė my learning algorithm does a better job minimizing the cost function but doesnít fly as well as a human pilot. So that tells you that minimizing the cost function doesnít correspond to good autonomous flight. And what you should do it go back and see if you can change J of theta. Okay?

And so for those reinforcement learning problems, you know, if something doesnít work Ė often reinforcement learning algorithms just work but when they donít work, these are the sorts of diagnostics you use to figure out should we be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm.

And again, if you donít know which of your three problems it is, itís entirely possible, you know, to spend two years, whatever, changing, building a better simulator for your helicopter. But it turns out that modeling helicopter aerodynamics is an active area of research. There are people, you know, writing entire PhD theses on this still. So itís entirely possible to go out and spend six years and write a PhD thesis and build a much better helicopter simulator, but if youíre fixing the wrong problem itís not gonna help.

So quite often, you need to come up with your own diagnostics to figure out whatís happening in an algorithm when something is going wrong. And unfortunately I don't know of Ė what Iíve described are sort of maybe some of the most common diagnostics that Iíve used, that Iíve seen, you know, to be useful for many problems. But very often, you need to come up with your own for your own specific learning problem.

And I just want to point out that even when the learning algorithm is working well, itís often a good idea to run diagnostics, like the ones I talked about, to make sure you really understand whatís going on. All right? And this is useful for a couple of reasons. One is that diagnostics like these will often help you to understand your application problem better. So some of you will, you know, graduate from Stanford and go on to get some amazingly high-paying job to apply machine-learning algorithms to some application problem of, you know, significant economic interest. Right?

And youíre gonna be working on one specific important machine learning application for many months, or even for years. One of the most valuable things for you personally will be for you to get in Ė for you personally to get in an intuitive understanding of what works and what doesnít work your problem. Sort of right now in the industry, in Silicon Valley or around the world, there are many companies with important machine learning problems and there are often people working on the same machine learning problem, you know, for many months or for years on end.

And when youíre doing that, I mean solving a really important problem using learning algorithms, one of the most valuable things is just your own personal intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, will be one way for you to get a better and better understanding of these problems.

It turns out, by the way, there are some of Silicon Valley companies that outsource their machine learning. So thereís sometimes, you know, whatever. Theyíre a company in Silicon Valley and theyíll, you know, hire a firm in New York to run all their learning algorithms for them.

And Iím not a businessman, but I personally think thatís often a terrible idea because if your expertise, if your understanding of your data is given, you know, to an outsource agency, then if you donít maintain that expertise, if thereís a problem you really care about then itíll be your own, you know, understanding of the problem that you build up over months thatíll be really valuable. And if that knowledge is outsourced, you donít get to keep that knowledge yourself. I personally think thatís a terrible idea. But Iím not a businessman, but I just see people do that a lot, and just.

Letís see. Another reason for running diagnostics like these is actually in writing research papers, right? So diagnostics and error analyses, which Iíll talk about in a minute, often help to convey insight about the problem and help justify your research claims.

So for example, rather than writing a research paper, say, thatís says, you know, ďOh well hereís an algorithm that works. I built this helicopter and it flies,Ē or whatever, itís often much more interesting to say, ďHereís an algorithm that works, and it works because of a specific component X. And moreover, hereís the diagnostic that gives you justification that shows X was the thing that fixed this problem,Ē and thatís where you made it work. Okay?

So that leads me into a discussion on error analysis, which is often good machine learning practice, is a way for understanding what your sources of errors are. So what I call error analyses Ė and letís check questions about this. Yeah?

Student:What ended up being wrong with the helicopter?

Instructor (Andrew Ng):Oh, donít know. Letís see. Weíve flown so many times. The thing that is most difficult a helicopter is actually building a very Ė I don't know. It changes all the time. Quite often, itís actually the simulator. Building an accurate simulator of a helicopter is very hard. Yeah. Okay.

So for error analyses, this is a way for figuring out what is working in your algorithm and what isnít working. And weíre gonna talk about two specific examples. So there are many learning Ė there are many sort of IA systems, many machine learning systems, that combine many different components into a pipeline. So hereís sort of a contrived example for this, not dissimilar in many ways from the actual machine learning systems you see.

So letís say you want to recognize people from images. This is a picture of one of my friends. So you take this input in camera image, say, and you often run it through a long pipeline. So for example, the first thing you may do may be preprocess the image and remove the background, so you remove the background. And then you run a face detection algorithm, so a machine learning algorithm to detect peopleís faces. Right?

And then, you know, letís say you want to recognize the identity of the person, right, this is your application. You then segment of the eyes, segment of the nose, and have different learning algorithms to detect the mouth and so on. I know; she might not want to be friend after she sees this. And then having found all these features, based on, you know, what the nose looks like, what the eyes looks like, whatever, then you feed all the features into a logistic regression algorithm. And your logistic regression or soft match regression, or whatever, will tell you the identity of this person. Okay?

So this is what error analysis is. You have a long complicated pipeline combining many machine learning components. Many of these would be used in learning algorithms. And so, itís often very useful to figure out how much of your error can be attributed to each of these components. So what weíll do in a typical error analysis procedure is weíll repeatedly plug in the ground-truth for each component and see how the accuracy changes.

So what I mean by that is the figure on the bottom left Ė bottom right, letís say the overall accuracy of the system is 85 percent. Right? Then I want to know where my 15 percent of error comes from. And so what Iíll do is Iíll go to my test set and Iíll actually code it and Ė oh, instead of Ė actually implement my correct background removal. So actually, go in and give it, give my algorithm what is the correct background versus foreground.

And if I do that, letís color that blue to denote that Iím giving that ground-truth data in the test set, letís assume our accuracy increases to 85.1 percent. Okay? And now Iíll go in and, you know, give my algorithm the ground-truth, face detection output. So Iíll go in and actually on my test set Iíll just tell the algorithm where the face is. And if I do that, letís say my algorithmís accuracy increases to 91 percent, and so on.

And then Iíll go for each of these components and just give it the ground-truth label for each of the components, because say, like, the nose segmentation algorithmís trying to figure out where the nose is. I just in and tell it where the nose is so that it doesnít have to figure that out. And as I do this, one component through the other, you know, I end up giving it the correct output label and end up with 100 percent accuracy.

And now you can look at this table Ė Iím sorry this is cut off on the bottom, it says logistic regression 100 percent. Now you can look at this table and see, you know, how much giving the ground-truth labels for each of these components could help boost your final performance. In particular, if you look at this table, you notice that when I added the face detection ground-truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. Right?

So this tells me that if only I can get better face detection, maybe I can boost my accuracy by 6 percent. Whereas in contrast, when I, you know, say plugged in better, I don't know, background removal, my accuracy improved from 85 to 85.1 percent. And so, this sort of diagnostic also tells you that if your goal is to improve the system, itís probably a waste of your time to try to improve your background subtraction. Because if even if you got the ground-truth, this is gives you, at most, 0.1 percent accuracy, whereas if you do better face detection, maybe thereís a much larger potential for gains there. Okay?

So this sort of diagnostic, again, is very useful because if your is to improve the system, there are so many different pieces you can easily choose to spend the next three months on. Right? And choosing the right piece is critical, and this sort of diagnostic tells you whatís the piece that may actually be worth your time to work on.

Thereís sort of another type of analyses thatís sort of the opposite of what I just talked about. The error analysis I just talked about tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baselines, some really bad performance and your current performance.

So for this example, letís suppose youíve built a very good anti-spam classifier for adding lots of clever features to your logistic regression algorithm. Right? So you added features for spam correction, for, you know, sender host features, for email header features, email text parser features, JavaScript parser features, features for embedded images, and so on.

So now letís say you preview the system and you want to figure out, you know, how well did each of these Ė how much did each of these components actually contribute? Maybe you want to write a research paper and claim this was the piece that made the big difference. Can you actually document that claim and justify it?

So in ablative analysis, hereís what we do. So in this example, letís say that simple logistic regression without any of your clever improvements get 94 percent performance. And you want to figure out what accounts for your improvement from 94 to 99.9 percent performance. So in ablative analysis and so instead of adding components one at a day, weíll instead remove components one at a time to see how it rates.

So start with your overall system, which is 99 percent accuracy. And then we remove spelling correction and see how much performance drops. Then weíll remove the sender host features and see how much performance drops, and so on. All right? And so, in this contrived example, you see that, I guess, the biggest drop occurred when you remove the text parser features. And so you can then make a credible case that, you know, the text parser features where what really made the biggest difference here. Okay?

And you can also tell, for instance, that, I don't know, removing the sender host features on this line, right, performance dropped from 99.9 to 98.9. And so this also means that in case you want to get rid of the sender host features to speed up computational something that would be a good candidate for elimination. Okay?

Student:Are there any guarantees that if you shuffle around the order in which you drop those features that youíll get the same Ė

Instructor (Andrew Ng):Yeah. Letís address the question: What if you shuffle in which you remove things? The answer is no. Thereís no guarantee youíd get the similar result. So in practice, sometimes thereís a fairly natural of ordering for both types of analyses, the error analyses and ablative analysis, sometimes thereís a fairly natural ordering in which you add things or remove things, sometimes thereís isnít. And quite often, you either choose one ordering and just go for it or Ė

And donít think of these analyses as sort of formulas that are constants, though; I mean feel free to invent your own, as well. You know one of the things thatís done quite often is take the overall system and just remove one and then put it back, then remove a different one then put it back until all of these things are done. Okay.

So the very last thing I want to talk about is sort of this general advice for how to get started on a learning problem. So hereís a cartoon description on two broad to get started on learning problem. The first one is carefully design your system, so you spend a long time designing exactly the right features, collecting the right data set, and designing the right algorithmic structure, then you implement it and hope it works. All right?

The benefit of this sort of approach is you get maybe nicer, maybe more scalable algorithms, and maybe you come up with new elegant learning algorithms. And if your goal is to, you know, contribute to basic research in machine learning, if your goal is to invent new machine learning algorithms, this process of slowing down and thinking deeply about the problem, you know, that is sort of the right way to go about is think deeply about a problem and invent new solutions.

Second sort of approach is what I call build-and-fix, which is we input something quick and dirty and then you run error analyses and diagnostics to figure out whatís wrong and you fix those errors.

The benefit of this second type of approach is that itíll often get your application working much more quickly. And especially with those of you, if you end up working in a company, and sometimes Ė if you end up working in a company, you know, very often itís not the best product that wins; itís the first product to market that wins. And so thereís Ė especially in the industry. Thereís really something to be said for, you know, building a system quickly and getting it deployed quickly.

And the second approach of building a quick-and-dirty, Iím gonna say hack and then fixing the problems will actually get you to a system that works well much more quickly. And the reason is very often itís really not clear what parts of a system are easier to think of to build and therefore what you need to spends lot of time focusing on.

So thereís that example I talked about just now. Right? For identifying people, say. And with a big complicated learning system like this, a big complicated pipeline like this, itís really not obvious at the outset which of these components you should spend lots of time working on. Right? And if you didnít know that preprocessing wasnít the right component, you could easily have spent three months working on better background subtraction, not knowing that itís just not gonna ultimately matter.

And so the only way to find out what really works was inputting something quickly and you find out what parts Ė and find out what parts are really the hard parts to implement, or what parts are hard parts that could make a difference in performance. In fact, say that if your goal is to build a people recognition system, a system like this is actually far too complicated as your initial system. Maybe after youíre prototyped a few systems, and you converged a system like this. But if this is your first system youíre designing, this is much too complicated.

Also, this is a very concrete piece of advice, and this applies to your projects as well. If your goal is to build a working application, Step 1 is actually probably not to design a system like this. Step 1 is where you would plot your data. And very often, and if you just take the data youíre trying to predict and just plot your data, plot X, plot Y, plot your data everywhere you can think of, you know, half the time you look at it and go, ďGee, how come all those numbers are negative? I thought they should be positive. Somethingís wrong with this dataset.Ē

And itís about half the time you find something obviously wrong with your data or something very surprising. And this is something you find out just by plotting your data, and that you wonít find out be implementing these big complicated learning algorithms on it. Plotting the data sounds so simple, it was one of the pieces of advice that lots of us give but hardly anyone follows, so you can take that for what itís worth.

Let me just reiterate, what I just said here may be bad advice if your goal is to come up with new machine learning algorithms. All right? So for me personally, the learning algorithm I use the most often is probably logistic regression because I have code lying around. So give me a learning problem, I probably wonít try anything more complicated than logistic regression on it first. And itís only after trying something really simple and figure our whatís easy, whatís hard, then you know where to focus your efforts.

But again, if your goal is to invent new machine learning algorithms, then you sort of donít want to hack up something and then add another hack to fix it, and hack it even more to fix it. Right? So if your goal is to do novel machine learning research, then it pays to think more deeply about the problem and not gonna follow this specifically.

Shoot, you know what? All right, sorry if Iím late but I just have two more slides so Iím gonna go through these quickly. And so, this is what I think of as premature statistical optimization, where quite often, just like premature optimization of code, quite often people will prematurely optimize one component of a big complicated machine learning system. Okay?

Just two more slides. This was Ė this is a sort of cartoon that highly influenced my own thinking. It was based on a paper written by Christos Papadimitriou. This is how progress Ė this is how developmental progress of research often happens. Right? Letís say you want to build a mail delivery robot, so Iíve drawn a circle there that says mail delivery robot. And it seems like a useful thing to have. Right? You know free up people, donít have to deliver mail.

So what Ė to deliver mail, obviously you need a robot to wander around indoor environments and you need a robot to manipulate objects and pickup envelopes. And so, you need to build those two components in order to get a mail delivery robot. And so Iíve drawing those two components and little arrows to denote that, you know, obstacle avoidance is needed or would help build your mail delivery robot.

Well for obstacle for avoidance, clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles. Now weíre gonna use computer vision to detect the objects. And so, we know that, you know, lighting sometimes changes, right, depending on whether itís the morning or noontime or evening. This is lighting causes the color of things to change, and so you need an object detection system thatís invariant to the specific colors of an object. Right? Because lighting changes, say.

Well color, or RGB values, is represented by three-dimensional vectors. And so you need to learn when two colors might be the same thing, when two, you know, visual appearance of two colors may be the same thing as just the lighting change or something. And to understand that properly, we can go out and study differential geometry of 3d manifolds because that helps us build a sound theory on which to develop our 3d similarity learning algorithms.

And to really understand the fundamental aspects of this problem, we have to study the complexity of non-Riemannian geometries. And on and on it goes until eventually youíre proving convergence bounds for sampled of non-monotonic logic. I donít even know what this is because I just made it up. Whereas in reality, you know, chances are that link isnít real. Color variance just barely helped object recognition maybe. Iím making this up. Maybe differential geometry was hardly gonna help 3d similarity learning and that linkís also gonna fail. Okay?

So, each of these circles can represent a person, or a research community, or a thought in your head. And thereís a very real chance that maybe there are all these papers written on differential geometry of 3d manifolds, and they are written because some guy once told someone else that itíll help 3d similarity learning.

And, you know, itís like ďA friend of mine told me that color invariance would help in object recognition, so Iím working on color invariance. And now Iím gonna tell a friend of mine that his thing will help my problem. And heíll tell a friend of his that his thing will help with his problem.Ē And pretty soon, youíre working on convergence bound for sampled non-monotonic logic, when in reality none of these will see the light of day of your mail delivery robot. Okay?

Iím not criticizing the role of theory. There are very powerful theories, like the theory of VC dimension, which is far, far, far to the right of this. So VC dimension is about as theoretical as it can get. And itís clearly had a huge impact on many applications. And thereís, you know, dramatically advanced data machine learning. And another example is theory of NP-hardness as again, you know, is about theoretical as it can get. Itís like a huge application on all of computer science, the theory of NP-hardness.

But when you are off working on highly theoretical things, I guess, to me personally itís important to keep in mind are you working on something like VC dimension, which is high impact, or are you working on something like convergence bound for sampled non-monotonic logic, which youíre only hoping has some peripheral relevance to some application. Okay?

For me personally, I tend to work on an application only if I Ė excuse me. For me personally, and this is a personal choice, I tend to trust something only if I personally can see a link from the theory Iím working on all the way back to an application. And if I donít personally see a direct link from what Iím doing to an application then, you know, then thatís fine. Then I can choose to work on theory, but I wouldnít necessarily trust that what the theory Iím working on will relate to an application, if I donít personally see a link all the way back.

Just to summarize. One lesson to take away from today is I think time spent coming up with diagnostics for learning algorithms is often time well spent. Itís often up to your own ingenuity to come up with great diagnostics. And just when I personally, when I work on machine learning algorithm, itís not uncommon for me to be spending like between a third and often half of my time just writing diagnostics and trying to figure out whatís going right and whatís going on.

Sometimes itís tempting not to, right, because you want to be implementing learning algorithms and making progress. You donít want to be spending all this time, you know, implementing tests on your learning algorithms; it doesnít feel like when youíre doing anything. But when I implement learning algorithms, at least a third, and quite often half of my time, is actually spent implementing those tests and you can figure out what to work on. And I think itís actually one of the best uses of your time.

Talked about error analyses and ablative analyses, and lastly talked about, you know, different approaches and the risks of premature statistical optimization. Okay.

Sorry I ran you over. Iíll be here for a few more minutes for your questions. Thatís [inaudible] today.

[End of Audio]

Duration: 82 minutes