Instructor (Andrew Ng):Okay, good morning. Welcome back. Just one quick announcement for today, which is that this next discussion section as far as for the TAís will mostly be on, sort of, a tutorial on Matlab and Octaves. So I know many of you already have program Matlab or Octave before, but in case not, and you want to, sort of, see along the tutorial on how direct terms and Matlab, please come to this next discussion section.
What I want to do today is continue our discussion of NaÔve Bayes, which is the learning algorithm that I started to discuss in the previous lecture and talk about a couple of different event models in NaÔve Bayes, and then Iíll take a brief digression to talk about neural networks, which is something that I actually wonít spend a lot of time on, and then I want to start to talk about support vector machines, and support vector machines is the learning algorithms, the supervised learning algorithm that many people consider the most effective, off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people that hold that point of view, and weíll start discussing that today, and this will actually take us a few lectures to complete.
So letís talk about NaÔve Bayes. To recap from the previous lecture, I started off describing spam classification as the most [inaudible] example for NaÔve Bayes in which we would create feature vectors like these, right, that correspond to words in a dictionary. And so, you know, based on what words appear in a piece of email were represented as a feature vector with ones and zeros in the corresponding places, and NaÔve Bayes was a generative learning algorithm, and by that I mean itís an algorithm in which we model PFX given Y, and for NaÔve Bayes, specifically, we modeled it as product from I equals one to N, PFXI given Y, and also we model PFY, and then we use Bayes Rule, right, to combine these two together, and so our predictions, when you give it a new piece of email you want to tell if itís spam or not spam, you predict RFX over Y, PFY given X, which by Bayes Rule is RFX over Y, PFX given Y, times BY, okay?
So this is NaÔve Bayes, and just to draw attention to two things, one is that in this model, each of our features were zero, one, so indicating whether different words appear, and the length or the feature vector was, sort of, the length N of the feature vector was the number of words in the dictionary. So it might be on this version on the order of 50,000 words, say.
What I want to do now is describe two variations on this algorithm. The first one is the simpler one, which itís just a generalization to if XI takes on more values. So, you know, one thing thatís commonly done is to apply NaÔve Bayes to problems where some of these features, XI, takes on K values rather than just two values, and in that case, you actually build, sort of, a very similar model where PFX given Y is really the same thing, right, where now these are going to be multinomial probabilities rather than Bernoulliís because the XIís can, maybe, take on up to K values.
It turns out, the situation where Ė one situation where this arises very commonly is if you have a feature thatís actually continuous valued, and you choose to dispertise it, and you choose to take a continuous value feature and dispertise it into a finite set of K values, and so itís a perfect example if you remember our very first supervised learning problem of predicting the price of houses. If you have the classification problem on these houses, so based on features of a house, and you want to predict whether or not the house will be sold in the next six months, say.
Thatís a classification problem, and once you use NaÔve Bayes, then given a continuous value feature like the living area, you know, one pretty common thing to do would be take the continuous value living area and just dispertise it into a few Ė discreet buckets, and so depending on whether the living area of the house is less than 500 square feet or between 1,000 and 1500 square feet, and so on, or whether itís greater than 2,000 square feet, you choose the value of the corresponding feature, XI, to be one, two, three, or four, okay? So that was the first variation or generalization of NaÔve Bayes I wanted to talk about. I should just check; are there questions about this? Okay. Cool. And so it turns out that in practice, itís fairly common to use about ten buckets to dispertise a continuous value feature. I drew four here only to save on writing.
The second and, sort of, final variation that I want to talk about for NaÔve Bayes is a variation thatís specific to classifying text documents, or, more generally, for classifying sequences. So the text document, like a piece of email, you can think of as a sequence of words and you can apply this, sort of, model Iím about to describe to classifying other sequences as well, but let me just focus on text, and hereís the idea.
So the NaÔve Bayes algorithm as Iíve described it so far, right, given a piece of email, we were representing it using this binary vector value representation, and one of the things that this loses, for instance, is the number of times that different words appear, all right? So, for example, if some word appears a lot of times, and you see the word, you know, ďbuyĒ a lot of times. You see the word ďViagraĒ; it seems to be a common email example. You see the word Viagra a ton of times in the email, it is more likely to be spam than it appears, I guess, only once because even once, I guess, is enough.
So let me just try a different, whatís called an event model for NaÔve Bayes that will take into account the number of times a word appears in the email, and to give this previous model a name as well this particular model for text classification is called the Multivariate Bernoulli Event Model. Itís not a great name. Donít worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but itís really Ė donít worry about what the name means.
In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the Multinomial Event Model, and, again, there is a rationale behind the name, but itís slightly cryptic, so donít worry about why itís called the Multinomial Event Model; itís just called that. And hereís what weíre gonna do, given a piece of email, Iím going to represent my email as a feature vector, and so my IF training example, XI will be a feature vector, XI sub group one, XI sub group two, XI subscript NI where NI is equal to the number of words in this email, right?
So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector Ė lets see. Let me just write this as X subscript J. These will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values corresponding to what word appeared in the J position of my email, okay?
So, in other words, Iím gonna take all the words in my email and you have a feature vector that just says which word in my dictionary was each word in the email, okay? So a different definition for NI now, NI now varies and is different for every training example, and this XJ is now indexed into the dictionary. You know, the components of the feature vector are no longer binary random variables; theyíre these indices in the dictionary that take on a much larger set of values.
And so our generative model for this will be that the joint distribution over X and Y will be that, where again N is now the length of the email, all right? So the way to think about this formula is you imagine that there was some probably distribution over emails. Thereís some random distribution that generates the emails, and that process proceeds as follows: First, Y is chosen, first the class label. Is someone gonna send you spam email or not spam emails is chosen for us.
So first Y, the random variable Y, the class label of spam or not spam is generated, and then having decided whether they sent you spam or not spam, someone iterates over all 300 positions of the email, or 300 words that are going to compose them as email, and would generate words from some distribution that depends on whether they chose to send you spam or not spam. So if they sent you spam, theyíll send you words Ė theyíll tend to generate words like, you know, buy, and Viagra, and whatever at discounts, sale, whatever. And if somebody chose to send you not spam, then theyíll send you, sort of, the more normal words you get in an email, okay?
So, sort of, just careful, right? XI here has a very different definition from the previous event model, and N has a very different definition from the previous event model. And so the parameters of this model are Ė letís see. Phi subscript K given Y equals one, which is the probability that, you know, conditioned on someone deciding to spend you spam, whatís the probability that the next word they choose to email you in the spam email is going to be word K, and similarly, you know, sort of, same thing Ė well, Iíll just write it out, I guess Ė and Phi Y and just same as before, okay?
So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of the parameters will be equal to Ė and now Iím gonna write one of these, you know, big indicator function things again. Itíll be a sum over your training sets indicator whether that was spam times the sum over all the words in that email where N subscript I is the number of words in email I in your training set, times indicator XIJ, SK times that I, okay?
So the numerator says sum over all your emails and take into account all the emails that had class label one, take into account only of the emails that were spam because if Y equals zero, then this is zero, and this would go away, and then times sum over all the words in your spam email, and it counts up the number of times you observed the word K in your spam emails. So, in other words, the numerator is look at all the spam emails in your training set and count up the total number of times the word K appeared in this email. The denominator then is sum over I into our training set of whenever one of your examples is spam, you know, sum up the length of that spam email, and so the denominator is the total length of all of the spam emails in your training set.
And so the ratio is just out of all your spam emails, what is the fraction of words in all your spam emails that were word K, and thatís your estimate for the probability of the next piece of spam mail generating the word K in any given position, okay? At the end of the previous lecture, I talked about LaPlace smoothing, and so when you do that as well, you add one to the numerator and K to the denominator, and this is the LaPlace smoothed estimate of this parameter, okay? And similarly, you do the same thing for Ė and you can work out the estimates for the other parameters yourself, okay? So itís very similar. Yeah?
Student:Iím sorry. On the right on the top, I was just wondering what the X of I is, and what the N of Ė
Instructor (Andrew Ng):Right. So in this second event model, the definition for XI and the definition for N are different, right? So here Ė well, this is for one example XY. So here, N is the number of words in a given email, right? And if itís the I email subscripting then this N subscript I, and so N will be different for different training examples, and here XI will be, you know, these values from 1 to 50,000, and XI is essentially the identity of the Ith word in a given piece of email, okay? So thatís why this is grouping, or this is a product over all the different words of your email of their probability the Ith word in your email, conditioned on Y. Yeah?
Instructor (Andrew Ng):Oh, no, actually, you know what, I apologize. I just realized that overload the notation, right, and I shouldnít have used K here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, Iím sorry. Youíre absolutely right. Thank you. All right. So in LaPlace smoothing, that shouldnít be K. This should be, you know, 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didnít translate it properly. So LaPlace smoothing, right, this is the number of possible values that the random variable XI can take on. Cool. Raise your hand if this makes sense? Okay. Some of you, are there more questions about this? Yeah.
Student:On LaPlace smoothing, the denominator and the plus A is the number of values that Y could take?
Instructor (Andrew Ng):Yeah, letís see. So LaPlace smoothing is a method to give you, sort of, hopefully, better estimates of their probability distribution over a multinomial, and so was I using X to Y in the previous lecture? So in trying to estimate the probability over a multinomial Ė I think X and Y are different. I think Ė was it X or Y? I think it was X, actually. Well Ė oh, I see, right, right. I think I was using a different definition for the random variable Y because suppose you have a multinomial random variable, X which takes on Ė letís use a different alphabet. Suppose you have a multinomial random variable X which takes on L different values, then the maximum likelihood estimate for the probability of X, PFX equals K, will be equal to, right, the number of observations. The maximum likelihood estimate for the probability of X being equal to K will be the number of observations of X equals K divided by the total number of observations of X, okay? So thatís the maximum likelihood estimate. And to add LaPlace smoothing to this, you, sort of, add one to the numerator, and you add L to the denominator where L was the number of possible values that X can take on. So, in this case, this is a probability that X equals K, and X can take on 50,000 values if 50,000 is the length of your dictionary; it may be something else, but thatís why I add 50,000 to the denominator. Are there other questions? Yeah.
Student:Is there a specific definition for a maximum likelihood estimation of a parameter? Weíve talked about it a couple times, and all the examples make sense, but I donít know what the, like, general formula for it is.
Instructor (Andrew Ng):I see. Yeah, right. So the definition of maximum likelihood Ė so the question is whatís the definition for maximum likelihood estimate? So actually in todayís lecture and the previous lecture when I talk about Gaussian Discriminant Analysis I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work this out is to actually write down the likelihood.
So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters, phi K given Y being zero, phi Y, right? And so given a training set, the likelihood, I guess, I should be writing log likelihood will be the log of the product of I equals one to N, PFXI, YI, you know, parameterized by these things, okay? Where PFXI, YI, right, is given by NI, PFX, YJ given YI. They are parameterized by Ė well, Iíll just drop the parameters to write this more simply Ė oh, I just put it in Ė times PFYI, okay?
So this is my log likelihood, and so the way you get the maximum likelihood estimate of the parameters is you Ė so if given a fixed training set, given a set of fixed IYIís, you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that Iíve been writing out. So in a previous section of todayís lecture I wrote out some maximum likelihood estimates for the Gaussian Discriminant Analysis model, and for NaÔve Bayes, and then this Ė I didnít prove them, but you get to, sort of, play with that yourself in the homework problem as well and for one of these models, and youíll be able to verify that when you maximize the likelihood and maximize the log likelihood that hopefully you do get the same formulas as what I was drawing up on the board, but a way is to find the way these are derived is by maximizing this, okay? Cool.
All right. So that wraps up what I wanted to say about Ė oh, so that, more or less, wraps up what I wanted to say about NaÔve Bayes, and it turns out that for text classification, the NaÔve Bayes algorithm with this second event model, the last NaÔve Bayes model I presented with the multinomial event model, it turns out that almost always does better than the first NaÔve Bayes model I talked about when youíre applying it to the specific case Ė to the specific of text classification, and one of the reasons is hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesnít.
I should say that in truth that actually turns out not to be completely understood why the latter model does better than the former one for text classification, and, sort of, researchers are still debating about it a little bit, but if you ever have a text classification problem, you know, NaÔve Bayes Classify is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and itís a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.
Student:So the second model is still positioning a variant, right? It doesnít actually care where the words are.
Instructor (Andrew Ng):Yes, all right.
Student:And, I mean, X variable, if my model like you had exclamation in, does that usually do better if you have enough data?
Instructor (Andrew Ng):Yeah, so the question is, sort of, the second model, right? The second model, the multinomial event model actually doesnít care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing. So in natural language processing, thereís actually another name; itís called a Unique Grand Model in natural language processing, and thereís some other models like, sort of, say, higher order markup models that take into account some of the ordering of the words. It turns out that for text classification, the models like the bigram models or trigram models, I believe they do only very slightly better, if at all, but thatís when youíre applying them to text classification, okay?
All right. So the next thing I want to talk about is to start again to discussion of non-linear classifiers. So it turns out Ė well, and so the very first classification algorithm we talked about was logistic regression, which had the forming form for hypothesis, and you can think of this as predicting one when this estimator probability is greater or equal to 0.5 and predicting zero, right, when this is less than 0.5, and given a training set, right? Logistic regression will maybe do grade and descends or something or use Newtonís method to find a straight line that reasonably separates the positive and negative classes.
But sometimes a data set just canít be separated by a straight line, so is there an algorithm that will let you start to learn these sorts of non-linear division boundaries? And so how do you go about getting a non-linear classifier? And, by the way, one cool result is that remember when I said Ė when we talked about generative learning algorithms, I said that if you assume Y given X is exponential family, right, with parameter A, and if you build a generative learning algorithm using this, right, plus one, if this is A to one. This is exponential family with natural parameter A to zero, right.
I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true, then you end up with a logistic posterior. It actually turns out that a NaÔve Bayes model actually falls into this as well. So the NaÔve Bayes model actually falls into this exponential family as well, and, therefore, under the NaÔve Bayes model, youíre actually using this other linear classifier as well, okay?
So the question is how can you start to get non-linear classifiers? And Iím going to talk about one method today which is Ė and we started to talk about it very briefly which is taking a simpler algorithm like logistic regression and using it to build up to more complex non-linear classifiers, okay? So to motivate this discussion, Iím going to use the little picture Ė letís see. So suppose you have features X1, X2, and X3, and so by convention, Iím gonna follow our earlier convention that X0 is set to one, and so Iím gonna use a little diagram like this to denote our logistic regression unit, okay?
So think of a little picture like that, you know, this little circle as denoting a computation note that takes this input, you know, several features and then it outputs another number thatís X subscript theta of X, given by a sigmoid function, and so this little computational unit Ė well, will have parameters theta.
Now, in order to get non-linear division boundaries, all we need to do Ė well, at least one thing to do is just come up with a way to represent hypotheses that can output non-linear division boundaries, right, and so this is Ė when you put a bunch of those little pictures that I drew on the previous board, you can then get whatís called a Neural Network in which you think of having my features here and then I would feed them to say a few of these little sigmoidal units, and these together will feed into yet another sigmoidal unit, say, which will output my final output H subscript theta of X, okay? And just to give these things names, let me call the values output by these three intermediate sigmoidal units; let me call them A1, A2, A3.
And let me just be completely concrete about what this formula represents, right? So each of these units in the middle will have their own associated set of parameters, and so the value A1 will be computed as G of X transpose, and then some set of parameters, which Iíll write as theta one, and similarly, A2 will be computed as G of X transpose theta two, and A3 will be G of X transpose, theta three, where G is the sigmoid function, all right? So G of Z, and then, finally, our hypothesis will output G of A transpose theta four, right? Where, you know, this A vector is a vector of A1, A2, A3. We can append another one to it at first if you want, okay?
Let me just draw up here this Ė Iím sorry about the cluttered board. And so H subscript theta of X, this is a function of all the parameters theta one through theta four, and so one way to learn parameters for this model is to write down the cost function, say, J of theta equals one-half sum from Y equals one to M, YI minus H subscript theta of XI squared, say. Okay, so thatís our familiar quadratic cost function, and so one way to learn the parameters of an algorithm like this is to just use gradient interscent to minimize J of theta as a function of theta, okay? See, in the phi gradient descent to minimize this square area, which stated differently means you use gradient descent to make the predictions of your neural network as close as possible to what you observed as the labels in your training set, okay?
So it turns out green descent on this neural network is a specific name, the algorithm that implements grand descent is called back propagation, and so if you ever hear that all that means is Ė it just means gradient interscent on a cost function like this or a variation of this on the neural network that looks like that, and Ė well, this algorithm actually has some advantages and disadvantages, but let me actually show you. So, letís see.
One of the interesting things about the neural network is that you can look at what these intermediate notes are computing, right? So this neural network has whatís called a hidden layer before you then have the output layer, and, more generally, you can actually have inputs feed into these computation units, feed into more layers of computation units, to even more layers, to more layers, and then finally you have an output layer at the end
And one cool thing you can do is look at all of these intermediate units, look at these units and whatís called a hidden layer of the neural network. Donít worry about why itís called that. Look at computations of the hidden unit and ask what is the hidden unit computing the neural network? So to, maybe, get a better sense of neural networks might be doing, let me show you a video Ė Iím gonna switch to the laptop Ė this is made by a friend, Yann LeCun whoís currently a professor at New York University. Can I show a video on the laptop?
So let me show you a video from Yann LeCun on a neural network that he developed for Hammerton Digit Recognition. There was one other thing he did in this neural network that Iím not gonna talk about called a Convolutional Neural Network that Ė well, his system is called LeNet, and letís see. Would you put on the laptop display? Hum, actually maybe if Ė or you can just put on the screen on the side; that would work too if the big screen isnít working. Letís see. Iím just trying to think, okay, how do I keep you guys entertained while weíre waiting for the video to come on?
Well, let me say a few more things about neural network. So it turns out that when you write a quadratic cost function like I wrote down on the chalkboard just now, it turns out that unlike logistic regression, that will almost always respond to non-convex optimization problem, and so whereas for logistic regression if you run gradient descent or Newtonís method or whatever, you converse the global optimer. This is not true for neural networks. In general, there are lots of local optimer and, sort of, much harder optimization problem.
So neural networks, if youíre, sort of, familiar with them, and youíre good at making design choices like what learning rate to use, and how many hidden units to use, and so on, you can, sort of, get them to be fairly effective, and thereís, sort of, often ongoing debates about, you know, is this learning algorithm better, or is that learning algorithm better? The vast majority of machine learning researchers today seem to perceive support vector machines, which is what Iíll talk about later, to be a much more effective off-the-shelf learning algorithm than neural networks. This point of view is contested a bit, but so neural networks are not something that I personally use a lot right now because thereís a hard optimization problem and you should do so often verge, and it actually, sort of works. It, sort of, works reasonably well. Itís just because this is fairly complicated, thereís not an algorithm that I use commonly or that my friends use all time. Oh, cool.
So but let me just go and show you an example of neural network, which was for many years, you know, the most effective learning algorithm before support vector machines were invented. So hereís Yann LeCunís video, and Ė well, thereís actually audio on this too, the soundboard. So Iíll just tell you whatís happening. What youíre seeing is a trained neural network, and this display where my mouse pointer is pointing, right, this big three there is the input to the neural network.
So youíre showing the neural network this image, and itís trying to recognize what is this. The final answer output by the neural network is this number up here, right below where it says LeNet-5, and the neural network correctly recognizes this image as a three, and if you look to the left of this image, whatís interesting about this is the display on the left portion of this is actually showing the intermediate computations of the neural network. In other words, itís showing you what are the hidden layers of the neural network computing.
And so, for example, if you look at this one, the third image down from the top, this seems to be computing, you know, certain edges into digits, right? Weíre just computing digits on the right-hand side of the bottom or something of the input display of the input image, okay? So let me just play this video, and you can see some of the inputs and outputs of the neural network, and those are very different fonts. Thereís this robustness to noise. All right. Multiple digits, thatís, kind of, cool. All right.
So, just for fun, let me show you one more video, which was Ė letís see. This is another video from the various CVís, the machine that changed the world, which was produced by WGBH Television in corporation with British Foreclass Incorporation, and it was aired on PBS a few years ago, I think. I want to show you a video describing the NETtalk Neural Network, which was developed by Terry Sejnowski; heís a researcher. And so NETtalk was actually one of the major milestones in the history of neural network, and this specific application is getting the neural network to read text.
So, in other words, can you show a piece of English to a computer and have the computer read, sort of, verbally produce sounds that could respond to the reading of the text. And it turns out that in the history of AI and the history of machine learning, this video created a lot of excitement about neural networks and about machine learning. Part of the reason was that Terry Sejnowski had the foresight to choose to use, in his video, a child-like voice talking about visiting your grandmotherís house and so on.
Youíll see it in a second, and so this really created the perception of Ė created the impression of the neural network being like a young child learning how to speak, and talking about going to your grandmothers, and so on. So this actually helped generate a lot of excitement within academia and outside academia on neural networks, sort of, early in the history of neural networks. Iím just gonna show you the video.
[Begin Video] Youíre going to hear first what the network sounds like at the very beginning of the training, and it wonít sound like words, but itíll sound like attempts that will get better and better with time. [Computerís voice] The network takes the letters, say the phrase, ďgrandmotherís house,Ē and makes a random attempt at pronouncing it. [Computerís voice] Grandmotherís house. The phonetic difference between the guess and the right pronunciation is sent back through the network. [Computerís voice] Grandmotherís house. By adjusting the connection strengths after each attempt, the net slowly improves.
And, finally, after letting it train overnight, the next morning it sounds like this: Grandmotherís house, Iíd like to go to my grandmotherís house. Well, because she gives us candy. Well, and we Ė NETtalk understands nothing about the language. It is simply associating letters with sounds. [End Video]
All right. So at the time this was done, I mean, this is an amazing piece of work. I should say today there are other text to speech systems that work better than what you just saw, and youíll also appreciate getting candy from your grandmotherís house is a little bit less impressive than talking about the Dow Jones falling 15 points, and profit taking, whatever. So but I wanted to show that just because that was another cool, major landmark in the history of neural networks. Okay. So letís switch back to the chalkboard, and what I want to do next is tell you about Support Vector Machines, okay?
That, sort of, wraps up our discussion on neural networks. So I started off talking about neural networks by motivating it as a way to get us to output non-linear classifiers, right? I donít really approve of it. It turns out that youíd be able to come up with non-linear division boundaries using a neural network like what I drew on the chalkboard earlier.
Support Vector Machines will be another learning algorithm that will give us a way to come up with non-linear classifiers. Thereís a very effective, off-the-shelf learning algorithm, but it turns out that in the discussion Iím gonna Ė in the progression and development Iím gonna pursue, Iím actually going to start off by describing yet another class of linear classifiers with linear division boundaries, and only be later, sort of, in probably the next lecture or the one after that, that weíll then take the support vector machine idea and, sort of, do some clever things to it to make it work very well to generate non-linear division boundaries as well, okay? But weíll actually start by talking about linear classifiers a little bit more.
And to do that, I want to convey two intuitions about classification. One is you think about logistic regression; we have this logistic function that was outputting the probability that Y equals one, and it crosses this line at zero. So when you run logistic regression, I want you to think of it as an algorithm that computes theta transpose X, and then it predicts one, right, if and only if, theta transpose X is greater than zero, right? IFF stands for if and only if. It means the same thing as a double implication, and it predicts zero, if and only if, theta transpose X is less than zero, okay?
So if itís the case that theta transpose X is much greater than zero, the double greater than sign means these are much greater than, all right. So if theta transpose X is much greater than zero, then, you know, think of that as a very confident prediction that Y is equal to one, right? If theta transpose X is much greater than zero, then weíre gonna predict one then moreover weíre very confident itís one, and the picture for that is if theta transpose X is way out here, then weíre estimating that the probability of Y being equal to one on the sigmoid function, it will be very close to one. And, in the same way, if theta transpose X is much less than zero, then weíre very confident that Y is equal to zero.
So wouldnít it be nice Ė so when we fit logistic regression of some of the classifiers is your training set, then so wouldnít it be nice if, right, for all I such that Y is equal to one. We have theta transpose XI is much greater than zero, and for all I such that Y is equal to zero, we have theta transpose XI is much less than zero, okay? So wouldnít it be nice if this is true? That, essentially, if our training set, we can find parameters theta so that our learning algorithm not only makes correct classifications on all the examples in a training set, but further itís, sort of, is very confident about all of those correct classifications. This is the first intuition that I want you to have, and weíll come back to this first intuition in a second when we talk about functional margins, okay? Weíll define this later.
The second intuition that I want to convey, and it turns out for the rest of todayís lecture Iím going to assume that a training set is linearly separable, okay? So by that I mean for the rest of todayís lecture, Iím going to assume that there is indeed a straight line that can separate your training set, and weíll remove this assumption later, but just to develop the algorithm, letís take away the linearly separable training set. And so thereís a sense that out of all the straight lines that separate the training set, you know, maybe that straight line isnít such a good one, and that one actually isnít such a great one either, but maybe that line in the middle is a much better linear separator than the others, right?
And one reason that when you and I look at it this one seems best is because this line is just further from the data, all right? That is separates the data with a greater distance between your positive and your negative examples and division boundary, okay? And this second intuition, weíll come back to this shortly, about this final line that I drew being, maybe, the best line this notion of distance from the training examples. This is the second intuition I want to convey, and weíll formalize it later when we talk about geometric margins of our classifiers, okay?
So in order to describe support vector machine, unfortunately, Iím gonna have to pull a notation change, and, sort of, unfortunately, it, sort of, was impossible to do logistic regression, and support vector machines, and all the other algorithms using one completely consistent notation, and so Iím actually gonna change notations slightly for linear classifiers, and that will actually make it much easier for us Ė thatíll make it much easier later today and in next weekís lectures to actually talk about support vector machine.
But the notation that Iím gonna use for the rest of today and for most of next week will be that my B equals Y, and instead of be zero, one, theyíll be minus one and plus one, and a development of a support vector machine we will have H, have a hypothesis output values to the either plus one or minus one, and so weíll let G of Z be equal to one if Z is greater or equal to zero, and minus one otherwise, right? So just rather than zero and one, we change everything to plus one and minus one.
And, finally, whereas previously I wrote G subscript theta of X equals G of theta transpose X and we had the convention that X zero is equal to one, right? And so X is an RN plus one. Iím gonna drop this convention of letting X zero equals a one, and letting X be an RN plus one, and instead Iím going to parameterize my linear classifier as H subscript W, B of X equals G of W transpose X plus B, okay? And so B just now plays the role of theta zero, and W now plays the role of the rest of the parameters, theta one through theta N, okay? So just by separating out the interceptor B rather than lumping it together, itíll make it easier for us to develop support vector machines. So Ė yes.
Instructor (Andrew Ng):Oh, yes. Right, yes. So W is Ė right. So W is a vector in RN, and X is now a vector in RN rather than N plus one, and a lowercase b is a real number. Okay.
Now, letís formalize the notion of functional margin and germesh margin. Let me make a definition. Iím going to say that the functional margin of the hyper plane WB with respect to a specific training example, XIYI is Ė WRT stands for with respect to Ė the function margin of a hyper plane WB with respect to a certain training example, XIYI has been defined as Gamma Hat I equals YI times W transpose XI plus B, okay?
And so a set of parameters, W, B defines a classifier Ė it, sort of, defines a linear separating boundary, and so when I say hyper plane, I just mean the decision boundary thatís defined by the parameters W, B. You know what, if youíre confused by the hyper plane term, just ignore it. The hyper plane of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And interpretation of this is that if YI is equal to one, then for each to have a large functional margin, you want W transpose XI plus B to be large, right? And if YI is equal minus one, then in order for the functional margin to be large Ė we, sort of, want the functional margins to large, but in order for the function margins to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay?
So this captures the intuition that we had earlier about functional margins Ė the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small, and this, sort of, practice of two cases into one statement that weíd like the functional margin to be large. And notice this is also that so long as YI times W transpose XY plus B, so long as this is greater than zero, that means we classified it correctly, okay?
And one more definition, Iím going to say that the functional margin of a hyper plane with respect to an entire training set is going to define gamma hat to be equal to min over all your training examples of gamma hat, I, right? So if you have a training set, if you have just more than one training example, Iím going to define the functional margin with respect to the entire training set as the worst case of all of your functional margins of the entire training set. And so for now we should think of the first function like an intuition of saying that we would like the function margin to be large, and for our purposes, for now, letís just say we would like the worst-case functional margin to be large, okay? And weíll change this a little bit later as well.
Now, it turns out that thereís one little problem with this intuition that will, sort of, edge us later, which it actually turns out to be very easy to make the functional margin large, all right? So, for example, so as I have a classifiable parameters W and B. If I take W and multiply it by two and take B and multiply it by two, then if you refer to the definition of the functional margin, I guess that was what? Gamma I, gamma hat I equals YI times W times transpose B. If I double W and B, then I can easily double my functional margin.
So this goal of making the functional margin large, in and of itself, isnít so useful because itís easy to make the functional margin arbitrarily large just by scaling other parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that de-norm, the alter-norm of the parameter W is equal to one, and weíll come back to this in a second. All right. And then so Ė
Okay. Now, letís talk about Ė see how much time we have, 15 minutes. Well, see, Iím trying to decide how much to try to do in the last 15 minutes. Okay. So letís talk about the geometric margin, and so the geometric margin of a training example Ė [inaudible], right? So the division boundary of my classifier is going to be given by the plane W transpose X plus B is equal to zero, okay? Right, and these are the X1, X2 axis, say, and weíre going to draw relatively few training examples here. Letís say Iím drawing deliberately few training examples so that I can add things to this, okay?
And so assuming we classified an example correctly, Iím going to define the geometric margin as just a geometric distance between a point between the training example Ė yeah, between the training example XI, YI and the distance given by this separating line, given by this separating hyper plane, okay? Thatís what Iím going to define the geometric margin to be.
And so Iím gonna do some algebra fairly quickly. In case it doesnít make sense, and read through the lecture notes more carefully for details. Sort of, by standard geometry, the normal, or in other words, the vector thatís 90 degrees to the separating hyper plane is going to be given by W divided by the norm of W; thatís just how planes and high dimensions work. If this stuff Ė some of this you have to use, take a look t the lecture notes on the website.
And so letís say this distance is gamma I, okay? And so Iím going to use the convention that Iíll put a hat on top where Iím referring to functional margins, and no hat on top for geometric margins. So letís say geometric margin, as this example, is gamma I. That means that this point here, right, is going to be XI minus gamma I times W over normal W, okay? Because W over normal W is the unit vector, is the length one vector that is normal to the separating hyper plane, and so when we subtract gamma I times the unit vector from this point, XI, or at this point here is XI. So XI minus, you know, this little vector here is going to be this point that Iíve drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I time W over norm of W, okay?
And so because this heavy point is on the separating hyper plane, right, this point must satisfy W transpose times that point equals zero, right? Because all points X on the separating hyper plane satisfy the equation W transpose X plus B equals zero, and so this point is on the separating hyper plane, therefore, it must satisfy W transpose this point Ė oh, excuse me. Plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, Iím, sort of, being slightly fast in this geometry. So if youíre not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes.
And so what Iím going to do is Iíll just take this equation, and Iíll solve for gamma, right? So this equation I just wrote down, solve this equation for gamma or gamma I, and you find that Ė you saw that previous equation from gamma I Ė well, why donít I just do it? You have W transpose XI plus B equals gamma I times W transpose W over norm of W; thatís just equal to gamma times the norm of W because W transpose W is the norm of W squared, and, therefore, gamma is just Ė well, transpose X equals, okay? And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyper plane defined by the parameters W and B can be computed by this formula, okay?
So the last thing I want to do is actually take into account the sign of the Ė the correct classification of the training example. So Iíve been assuming that weíve been classifying an example correctly. So, more generally, to find the geometric margin of a training example to be gamma I equals YI times that thing on top, okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so as before, you know, this says that so long as Ė we would like the geometric margin to be large, and all that means is that so long as weíre classifying the example correctly, we would ideally hope of the example to be as far as possible from the separating hyper plane, so long as itís on the right side of the separating hyper plane, and thatís what YI multiplied into this does.
And so a couple of easy facts, one is if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily, and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Letís see, okay. And so one final definition is so far Iíve defined the geometric margin with respect to a single training example, and so as before, Iíll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right?
And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and so I just write that down. The maximum margin classified poses the following optimization problem. It says choose gamma, W, and B so as to maximize the geometric margin, subject to that YI times Ė well, this is just one way to write it, subject to Ė actually, do I write it like that? Yeah, fine. There are several ways to write this, and one of the things weíll do next time is actually Ė Iím trying to figure out if I can do this in five minutes. Iím guessing this could be difficult.
Well, so this maximizing your classifier is the maximization problem over parameter gamma W and B, and for now, it turns out that the geometric margin doesnít change depending on the norm of W, right? Because in the definition of the geometric margin, notice that weíre dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesnít change the geometric margin. This will actually be important, and weíll come back to this later. Notice that you can take the parameters WB, and you can impose any normalization constant to it, or you can change W and B by any scaling factor and replace them by ten W and ten B whatever, and it does not change the geometric margin, okay?
And so in this first formulation, Iím just gonna impose a constraint and say that the norm of W was one, and so the function of the geometric margins will be the same, and then weíll say maximize the geometric margins subject to Ė you maximize gamma subject to that every training example must have geometric margin at least gamma, and this is a geometric margin because when the norm of W is equal to one, then the functional of the geometric margin are identical, okay?
So this is the maximum margin classifier, and it turns out that if you do this, itíll run, you know, maybe about as well as a Ė maybe slight Ė maybe comparable to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to allow us to change this algorithm to let it work in infinite dimensional feature spaces and come up with very efficient non-linear classifiers. So thereís a ways to go before we turn this into a support vector machine, but this is the first step. So are there questions about this? Yeah.
Instructor (Andrew Ng):For now, letís just say youíre given a fixed training set, and you canít Ė yeah, for now, letís just say youíre given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything Iíve said is for a fixed training set, so that you canít change the Xís, and you canít change the Yís. Are there other questions?
Okay. So all right. Next week we will take this, and weíll talk about authorization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms, and just a final reminder again, this next discussion session will be on Matlab and Octaves. So show up for that if you want to see a tutorial. Okay. See you guys in the next class.
[End of Audio]
Duration: 74 minutes