Instructor (Andrew Ng):Okay. Good morning. I just have one quick announcement of sorts. So many of you know that it was about two years ago that Stanford submitted an entry to the DARPA Grand Challenge which was the competition to build a car to drive itself across the desert. So some of you may know that this weekend will be the next DARPA Grand Challenge phase, and so Stanford Ė the team that Ė one of my colleagues Sebastian Thrun has a team down in OA now and so theyíll be racing another autonomous car.
So this is a car that incorporates many tools and AI machines and everything and so on, and itíll try to drive itself in midst of traffic and avoid other cars and carry out the sort of mission. So if youíre free this weekend Ė if youíre free on Saturday, watch TV or search online for Urban Challenge, which is the name of the competition. It should be a fun thing to watch, and itíll hopefully be a cool demo or instance of AI and machines in action.
Letís see. My laptop died a few seconds before class started so let me see if I can get that going again. If not, Iíll show you the things I have on the blackboard instead. Okay. So good morning and welcome back. What I want to do today is actually begin a new chapter in 229 in which Iím gonna start to talk about [inaudible]. So [inaudible] today is Iím gonna just very briefly talk about clusteringís, [inaudible] algorithm. [Inaudible] and a special case of the EM, Expectation Maximization, algorithm with a mixture of [inaudible] model to describe something called Jensen and Equality and then weíll use that derive a general form of something called the EM or the Expectation Maximization algorithm, which is a very useful algorithm. We sort of use it all over the place and different unsupervised machine or any application. So the cartoons that I used to draw for supervised learning was youíd be given the data set like this, right, and youíd use [inaudible] between the positive and negative crosses and weíd call it the supervised learning because youíre sort of told what the right cross label is for every training example, and that was the supervision. In unsupervised learning, weíll study a different problem. Youíre given a data set that maybe just comprises a set of points. Youíre just given a data set with no labels and no indication of what the ďright answersĒ or where the supervision is and itís the job of the algorithm to discover structure in the data.
So in this lecture and the next couple of weeks weíll talk about a variety of unsupervised learning algorithms that can look at data sets like these and discover thereís different types of structure in it. In this particular cartoon that Iíve drawn Ė one has the structure that you and I can probably see as is that this data lives in two different crosses and so the first unsupervised learning algorithm that Iím just gonna talk about will be a clustering algorithm. Itíll be an algorithm that looks for a data set like this and automatically breaks the data set into different smaller clusters. So letís see. When my laptop comes back up, Iíll show you an example. So clustering algorithms like these have a variety of applications. Just to rattle off a few of the better-known ones I guess in biology application you often cross the different things here. You have [inaudible] genes and they cluster the different genes together in order to examine them and understand the biological function better. Another common application of clustering is market research. So imagine you have a customer database of how your different customers behave. Itís a very common practice to apply clustering algorithms to break your database of customers into different market segments so that you can target your products towards different market segments and target your sales pitches specifically to different market segments.
Something weíll do later today Ė I donít want to do this now, but you actually go to a website, like, use.google.com and thatís an example of a website that uses a clustering algorithm to everyday group related news articles together to display to you so that you can see one of the thousand news articles today on whatever the top story of today is and all the 500 news articles on all the different websites on different story of the day. And a very solid [inaudible] actually talks about image segmentation, which the application of when you might take a picture and group together different subsets of the picture into coherent pieces of pixels to try to understand whatís contained in the picture. So thatís yet another application of clustering. The next idea is given a data set like this, given a set of points, can you automatically group the data sets into coherent clusters. Letís see. Iím still waiting for the laptop to come back so I can show you an example. You know what, why donít I just start to write out the specific clustering algorithm and then Iíll show you the animation later. So this is the called the k-means clustering algorithm for finding clusteringís near the inset. The input to the algorithm will be an unlabeled data set which I write as X1, X2, [inaudible] and because weíre now talking about unsupervised learning, you see a lot of this as [inaudible] with just the Xs and no cross labels Y. So what a k-means algorithm does is the following.
This will all make a bit more sense when I show you the animation on my laptop. To initialize a set of points, called the cluster centroids, [inaudible] randomly and so if youíre [inaudible] of training data are [inaudible] then your cluster centroids, these muse, will also be vectors and [inaudible] and then you repeat until convergence the following two steps. So the cluster centroids will be your guesses for where the centers of each of the clusters are and so in one of those steps you look at each point, XI and you look at which cluster centroid J is closest to it and then this step is called assigning your point XI to cluster J. So looking at each point and picking the cluster centroid thatís closest to it and the other step is you update the cluster centroids to be the median of all the points that have been assigned to it. Okay. Letís see. Could you please bring down the display for the laptop? Excuse me. Okay. Okay. There we go. Okay. So hereís an example of the k-means algorithm and hope the animation will make more sense. This is an inch chopped off. This is basically an example I got from Michael Jordan in Berkley. So these points in green are my data points and so Iím going to randomly initialize a pair of cluster centroids. So the [inaudible] blue crosses to note the positions of New1 and New2 say if Iím going to guess that thereís two clusters in this data. Sets of k-means algorithms as follow. Iím going to repeatedly go to all of the points in my data set and Iím going to associate each of the green dots with the closer of the two cluster centroids, so visually, Iím gonna denote that by painting each of the dots either blue or red depending on which is the closer cluster centroid. Okay.
So all the points closer to the blue cross points are painted blue and so on. The other step is updating the cluster centroids and so Iím going to repeatedly look at all the points that Iíve painted blue and compute the average of all of the blue dots, and Iíll look at all the red dots and compute the average of all the red dots and then I move the cluster centroids as follows to the average of the respective locations. So this is now [inaudible] of k-means on here, and now Iíll repeat the same process. I look at all the points and assign all the points closer to the blue cross to the color blue and similarly red. And so now I have that assignments of points to the cluster centroids and finally, Iíll again compute the average of all the blue points and compute the average of all the red points and update the cluster centroids again as follows and now k-means is actually [inaudible]. If you keep running these two sets of k-means over and over, the cluster centroids and the assignment of the points closest to the cluster centroids will actually remain the same. Yeah.
Instructor (Andrew Ng):Yeah, Iíll assign that in a second. Yeah. Okay. So [inaudible]. Take a second to look at this again and make sure you understand how the algorithm I wrote out maps onto the animation that we just saw. Do you have a question?
Instructor (Andrew Ng):I see. Okay. Let me answer on that in a second. Okay. So these are the two steps. This step 2.1 was assigning the points to the closest centroid and 2.2 was shifting the cluster centroid to be the mean of all the points assigned to that cluster centroid. Okay. Okay. [Inaudible] questions that we just had, one is, does the algorithm converge? The answer is yes, k-means is guaranteed to converge in a certain sense. In particular, if you define the distortion function to be J of C [inaudible] squared. You can define the distortion function to be a function of the cluster assignments, and the cluster centroids and [inaudible] square distances, which mean the points and the cluster centroids that theyíre assigned to, then you can show Ė I wonít really prove this here but you can show that k-means is called [inaudible] on the function J. In particular, who remembers, itís called in a sense as an authorization algorithm, I donít know, maybe about two weeks ago, so called in a sense is the algorithm that weíll repeatedly [inaudible] with respect to C. Okay. So thatís called [inaudible]. And so what you can prove is that k-means Ė the two steps of k-means, are exactly optimizing this function with respect to C and will respect a new alternately. And therefore, this function, J of C, new, must be decreasing monotonically on every other variation and so the sense in which k-means converges is that this function, J of C, new, can only go down and therefore, this function will actually eventually converge in the sense that it will stop going down.
Okay. Itís actually possible that there may be several clusteringís they give the same value of J of C, new and so k-means may actually switch back and forth between different clusteringís that they [inaudible] in the extremely unlikely case, if thereís multiple clusteringís, they give exactly the same value for this objective function. K-means may also be [inaudible] itíll just never happen. That even if that happens, this function J of C, new will converge. Another question was how do you choose the number of clusters? So it turns out that in the vast majority of time when people apply k-means, you still just randomly pick a number of clusters or you randomly try a few different numbers of clusters and pick the one that seems to work best. The number of clusters in this algorithm instead of just one parameters, so usually I think itís not very hard to choose automatically. There are some automatic ways of choosing the number of clusters, but Iím not gonna talk about them. When I do this, I usually just pick of the number of clusters randomly. And the reason is I think for many clustering problems the ďtrueĒ number of clusters is actually ambiguous so for example if you have a data set that looks like this, some of you may see four clusters, right, and some of you may see two clusters, and so the right answer for the actual number of clusters is sort of ambiguous. Yeah.
Student:[Inaudible]. [Inaudible] clusters [inaudible] far away from the data point [inaudible] points and the same cluster?
Instructor (Andrew Ng):I see. Right. So yes. K-means is susceptible to [inaudible] so this function, J of C, new is not a convex function and so k-means, sort of called in a sense on the non-convex function is not guaranteed to converge the [inaudible]. So k-means is susceptible to local optimal and [inaudible]. One thing you can do is try multiple random initializations and then run clustering a bunch of times and pick the solution that ended up with the lowest value for the distortion function. Yeah.
Instructor (Andrew Ng):Yeah, letís see. Right. So what if one cluster centroid has no points assigned to it, again, one thing you could do is just eliminate it exactly the same. Another thing you can is you can just reinitialize randomly if you really [inaudible]. More questions. Yeah.
Student:[Inaudible] as a norm or can you [inaudible] or infinity norm or Ė
Instructor (Andrew Ng):I see. Right. Is it usually two norms? Letís see. For the vast majority of applications Iíve seen for k-means, you do take two norms when you have data [inaudible]. Iím sure there are others who have taken infinity norm and one norm as well. I personally havenít seen that very often, but there are other variations on this algorithm that use different norms, but the one I described is probably the most commonly used there is. Okay. So that was k-means clustering. What I want to do next and this will take longer to describe is actually talk about a closely related problem. In particular, what I wanted to do was talk about density estimation. As another k-means example, this is a problem that I know some guys that worked on. Letís say you have aircraft engine building off an assembly. Letís say you work for an aircraft company, youíre building aircraft engines off the assembly line and as the aircraft engines roll off the assembly line, you test these aircraft engines and measure various different properties of it and to use [inaudible] example Iím gonna write these properties as heat and vibrations. Right.
In reality, youíd measure different vibrations, different frequencies and so on. Weíll just write the amount of heat produced and vibrations produced. Letís say that maybe it looks like that and what you would like to do is estimate the density of these [inaudible] of the joint distribution, the amount of heat produced and the amount of vibrations because you would like to detect [inaudible] so that as a new aircraft engine rolls off the assembly line, you can then measure the same heat and vibration properties. If you get a point there, you can then ask, ďHow likely is it that there was an undetected flaw in this aircraft engine that it needs to go undergo further inspections?Ē And so if we look at the typical distribution of features we get, and we build a model for P of X and then if P of X is very small for some new aircraft engine then that would raise a red flag and weíll say thereís an anomaly aircraft engine and we should subject it to further inspections before we let someone fly with the engine. So this problem I just described is an instance of what is called anomaly detection and so a common way of doing anomaly detection is to take your training set and from this data set, build a model, P of X of the density of the typical data youíre saying and if you ever then see an example with very low probability under P of X, then you may flag that as an anomaly example.
Okay? So anomaly detection is also used in security applications. If many, very unusual transactions to start to appear on my credit card, thatís a sign to me that someone has stolen my credit card. And what I want to do now is talk about specific algorithm for density estimation, and in particular, one that works with data sets like these, that, you know, this distribution like this doesnít really fall into any of the standard text book distributions. This is not really, like, a Gaussian or a [inaudible] explanation or anything. So can we come up with a model to estimate densities that may look like these somewhat unusual shapes? Okay. So to describe the algorithm a bit a more Iím also going to use a one dimensional example rather than a two D example, and in the example that Iím going to describe Iím going to say that letís imagine maybe a data set that looks like this where the horizontal access here is the X axis and these dots represent the positions of the data set that I have. Okay. So this data set looks like itís maybe coming from a density that looks like that as if this was the sum of two Gaussian distributions and so the specific model Iím gonna describe will be whatís called a mixture of Gaussianís model.
And just be clear that the picture I have is that when visioning that maybe there were two separate Gaussianís that generated this data set, and if only I knew what the two Gaussianís were, then I could put a Gaussian to my crosses, put a Gaussian to the Os and then sum those up to get the overall density for the two, but the problem is I donít actually have access to these labels. I donít actually know which of the two Gaussianís each of my data points came from and so what Iíd like to do is come up with an algorithm to fit this mixture of Gaussianís model even when I donít know which of the two Gaussianís each of my data points came from. Okay. So hereís the idea. In this model, Iím going to imagine thereís a latent random variable, latent is just synonymous with hidden or unobserved, okay. So weíre gonna imagine thereís a latent random variable Z and XI, ZI have a joint distribution that is given as follows. We have that P of X, ZI by the chamber of probability, this is always like that. This is always true. And moreover, our [inaudible] is given by the following ZI is distributed multinomial with parameters I. And in the special case where I have just to make sure that two Gaussianís and ZI will be [inaudible], and so these parameter [inaudible] are the parameters of a multinomial distribution.
And the distribution of XI conditioned on ZI being equal to J so itís P of XI given ZI is equal to J. Thatís going to be a Gaussian distribution with [inaudible] and covariant sigler. Okay. So this should actually look extremely familiar to you. What Iíve written down are pretty much the same equations that I wrote down for the Gaussian Discriminant Analysis algorithm that we saw way back, right, except that the differences Ė instead of, I guess supervised learning where we were given the cross labels Y, Iíve now replaced Y in Gaussian Discriminant Analysis with these latent random variables or these unobserved random variables Z, and we donít actually know what the values of Z are. Okay. So just to make the link to the Gaussian Discriminant Analysis even a little more explicit Ė if we knew what the Zs were, which was actually donít, but suppose for the sake of argument that we actually knew which of, say the two Gaussianís, each of our data points came from, then you can use [inaudible] estimation Ė you can write down the likelihood the parameters which would be that and you can then use [inaudible] estimation and you get exactly the same formula as in Gaussian Discriminant Analysis. Okay. So if you knew the value of the Z, you can write down the law of likelihood and do maximum likeliness this way, and you can then estimate all the parameters of your model. Does this make sense? Raise your hand if this makes sense. Cool. Some of you have questions? Some of you didnít raise your hands. Yeah.
Student:So this ZI is just a label, like, an X or an O?
Instructor (Andrew Ng):Yes. Basically. Any other questions? Okay. So if you knew the values of Z, the Z playing a similar role to the cross labels in Gaussianís Discriminant Analysis, then you could use maximum likeliness estimation parameters. But in reality, we donít actually know the values of the Zs. All weíre given is this unlabeled data set and so let me write down the specific bootstrap procedure in which the idea is that weíre going to use our model to try and guess what the values of Z is. We donít know our Z, but weíll just take a guess at the values of Z and weíll then use some of the values of Z that we guessed to fit the parameters of the rest of the model and then weíll actually iterate. And now that we have a better estimate for the parameters for the rest of the model, weíll then take another guess for what the values of Z are. And then weíll sort of use something like the maximum likeliness estimation to set even parameters of the model. So the algorithm IĎm gonna write down is called the EM Algorithm and it proceeds as follows. Repeat until convergence and the E set, weíre going to guess the values of the unknown ZIs and in particular, Iím going to set WIJ. Okay. So Iím going to compute the probability that ZI is equal to J. So Iím going to use the rest of the parameters in my model and then Iím gonna compute the probability that point XI came from Gaussian number J. And just to be sort of concrete about what I mean by this, this means that Iím going to compute P of XI.
This step is sort of [inaudible], I guess. And again, just to be completely concrete about what I mean about this, the [inaudible] rate of P of XI given ZI equals J, you know, well thatís the Gaussian density. Right? Thatís one over E to the Ė [inaudible] and then divided by sum from O equals 1 to K of [inaudible] of essentially the same thing, but with J replaced by L. Okay. [Inaudible] for the Gaussian and the numerator and the sum of the similar terms of the denominator. Excuse me. This is the sum from O equals 1 through K in the denominator. Okay. Letís see. The maximization step where you would then update your estimates of the parameters. So Iíll just lay down the formulas here. When you see these, you should compare them to the formulas we had for maximum likelihood estimation. And so these two formulas on top are very similar to what you saw for Gaussian Discriminant Analysis except that now, we have these [inaudible] so WIJ is Ė you remember was the probability that we computed that point I came from Gaussianís. I donít want to call it cluster J, but thatís what Ė point I came from Gaussian J, rather than an indicator for where the point I came from Gaussian J. Okay. And the one slight difference between this and the formulas who have a Gaussianís Discriminant Analysis is that in the mixture of Gaussianís, we more commonly use different covariant [inaudible] for the different Gaussianís.
So in Gaussianís Discriminant Analysis, sort of by convention, you usually model all of the crosses to the same covariant matrix sigma. I just wrote down a lot of equations. Why donít you just take a second to look at this and make sure it all makes sense? Do you have questions about this? Raise your hand if this makes sense to you? [Inaudible]. Okay. Only some of you. Letís see. So let me try to explain that a little bit more. Some of you recall that in Gaussianís Discriminant Analysis, right, if we knew the values for the ZIs so letís see. Suppose I was to give you labeled data sets, suppose I was to tell you the values of the ZIs for each example, then Iíd be giving you a data set that looks like this. Okay. So hereís my 1 D data set. Thatís sort of a typical 1 D Gaussianís Discriminant Analysis. So for Gaussianís Discriminant Analysis we figured out the maximum likeliness estimation and the maximum likeliness estimate for the parameters of GDA, and one of the estimates for the parameters for GDA was [inaudible] which is the probability that ZI equals J. You would estimate that as sum of I equals sum of I from 1 to M indicator ZI equals J and divide by N. Okay. When weíre deriving GDA, [inaudible]. If you knew the cross labels for every example you cross, then this was your maximum likeliness estimate for the chance that the labels came from the positive [inaudible] versus the negative [inaudible]. Itís just a fraction of examples.
Your maximum likeliness estimate for probability of getting examples from cross J is just the fraction of examples in your training set that actually came from cross J. So this is the maximum likeliness estimation for Gaussianís Discriminant Analysis. Now, in the mixture of Gaussianís model and the EM problem we donít actually have these cross labels, right, we just have an unlabeled data set like this. We just have a set of dots. Iím trying to draw the same data set that I had above, but just with the cross labels. So now, itís as if you only get to observe the XIs, but the ZIs are unknown. Okay. So the cross label is unknown. So in the EM algorithm weíre going to try to take a guess for the values of the ZIs, and specifically, in the E step we computed WIJ was our current best guess for the probability that ZI equals J given that data point. Okay. So this just means given my current hypothesis, the way the Gaussianís are, and given everything else, can I compute the [inaudible] probability Ė what was the [inaudible] probability that the point XI actually came from cross J? What is the probability that this point was a cross versus O? Whatís the probability that this point was [inaudible]? And now in the M step, my formula of estimating for the parameters [inaudible] will be given by 1 over M sum from I equals 1 through M, sum of WIJ. So WIJ is right. The probability is my best guess for the probability that point I belongs to Gaussian or belongs to cross J, and [inaudible] using this formula instead of this one. Okay.
And similarly, this is my formula for the estimate for new J and it replaces the WIJs with these new indicator functions, you get back to the formula that you had in Gaussianís Discriminant Analysis. Iím trying to convey an intuitive sense of why these algorithmís make sense. Can you raise your hand if this makes sense now? Cool. Okay. So what I want to do now is actually present a broader view of the EM algorithm. What you just saw was a special case of the EM algorithm for specially to make sure of Gaussianís model, and in the remaining half hour I have today Iím going to describe a general description of the EM algorithm and everything you just saw will be devised, sort of thereís a special case of this more general view that Iíll present now. And as a pre-cursor to actually deriving this more general view of the EM algorithm, Iím gonna have to describe something called Jensenís and Equality that we use in the derivation.
So hereís Jensenís and Equality. Just let F be a convex function. So a function is a convex of the second derivative, which Iíve written F prime prime to [inaudible]. The functions donít have to be differentiatable to be convex, but if it has a second derivative, then F prime prime should be creating a 0. And let X be a random variable then the F applied to the expectation of X is less than the equal of 2D expectation of F of F. Okay. And hopefully you remember I often drop the square back, so E of X is the [inaudible], Iíll often drop the square brackets.
So let me draw a picture that would explain this and I think Ė Many of my friends and I often donít remember is less than or great than or whatever, and the way that many of us remember the sign of that in equality is by drawing the following picture. For this example, letís say, X is equal to 1 with a probability of one-half and X is equal to 6 worth probability 1 whole. So Iíll illustrate this inequality with an example. So letís see. So X is 1 with probability one-half and X is 6 with probably with half and so the expected value of X is 3.5. It would be in the middle here. So thatís the expected value of X. The horizontal axis here is the X axis. And so F of the expected value of X, you can read of as this point here. So this is F of the expected value of X. Where as in contrast, letís see. If X is equal to 1 then hereís F of 1 and if X equaled a 6 then hereís F of 6 and the expected value of F of X, it turns out, is now averaging on the vertical axis. Weíre 50 percent chance you get F of 1 with 50 percent chance you get F of 6 and so these expected value of F of X is the average of F of 1 and F of 6, which is going to be the value in the middle here. And so in this example you see that the expected value of F of X is greater than or equal to F of the expected value of X. Okay.
And it turns out further that if F double prime of X makes [inaudible] than Z row, if this happens, we say F is strictly convex then the inequality holds an equality or in other words, E of F of X equals F of EX, if and only if, X is a constant with probability 1. Well, another way of writing this is X equals EX. Okay. So in other words, if F is a strictly convex function, then the only way for this inequality to hold its equality is if the random variable X always takes on the same value. Okay. Any questions about this? Yeah.
Instructor (Andrew Ng):Say that again?
Student:What is the strict [inaudible]?
Instructor (Andrew Ng):I still couldnít hear that. What is Ė
Student:What is the strictly convex [inaudible]?
Instructor (Andrew Ng):Oh, I see. If double prime of X is strictly greater than 0 thatís my definition for strictly convex. If the second derivative of X is strictly greater than 0 then thatís what it means for F to be strictly convex.
Instructor (Andrew Ng):I see. Sure. So for example, this is an example of a convex function thatís not strictly convexed because thereís part of this function is a straight line and so F double prime would be zero in this portion. Letís see. Yeah. Itís just a less formal way of saying strictly convexed just means that you canít have a convex function within a straight line portion and then [inaudible]. Speaking very informally, think of this as meaning that there arenít any straight line portions. Okay. So hereís the derivation for the general version of EM. The problem was face is as follows. We have some model for the joint distribution of X of Z, but we observe only X, and our goal is to maximize the law of likelihood of the parameters of model.
Right. So we have some models for the joint distribution for X and Z and our goal is to find the maximum likeliness estimate of the parameters data where the likelihood is defined as something equals 1 to M [inaudible] probably of our data as usual. And here X is parameterized by data is now given by a sum over all the values of ZI parameterized by data. Okay. So just by taking our model of the joint distribution of X and Z and marginalizing out ZI that we get P of XI parameterized by data. And so the EM algorithm will be a way of performing this maximum likeliness estimation problem, which is complicated by the fact that we have these ZIs in our model that are unobserved. Before I actually do the math, hereís a useful picture to keep in mind. So the horizontal axis in this cartoon is the [inaudible] axis and thereís some function, the law of likelihood of theta zero that weíre trying to maximize, and usually maximizing our [inaudible] derivatives instead of the zero that would be very hard to do. What the EM algorithm will do is the following. Letís say it initializes some value of theta zero, what the EM algorithm will end up doing is it will construct a lower bound for this law of likelihood function and this lower bound will be tight [inaudible] of equality after current guessing the parameters and they maximize this lower boundary with respect to theta so weíll end up with say that value. So that will be data 1. Okay.
And then EM algorithm look at theta 1 and theyíll construct a new lower bound of theta and then weíll maximize that. So you jump here. So thatís the next theta 2 and you do that again and you get the same 3, 4, and so on until you converge to local optimum on [inaudible] theta function. Okay. So this is a cartoon that displays what the EM algorithm will do. So letís actually make that formal now. So you want to maximize with respect to theta sum of [inaudible] Ė thereís my theta, so this is sum over 1 [inaudible] sum over all values of Z. Okay. So what Iím going to do is multiply and divide by the same thing and Iím gonna write this as Q Ė okay. So Iím going to construct the probability distribution QI, that would be over the latent random variables ZI and so these QI would get distribution so each of the QI would bring in a 0 and sum over all the values of ZI of QI would be 1, so these Qs will be a probability distribution that I get to construct. Okay. And then Iíll later go describe the specific choice of this distribution QI. So this QI is a probability distribution over the random variables of ZI so this is [inaudible]. Right. I see some frowns. Do you have questions about this? No. Okay.
So the log function looks like that and thereís a concave function so that tells us that the log of E of X is greater than and equal to the expected value of log X by the other concave function form of Jensenís and Equality. And so continuing from the previous expression, this is a summary of a log and an expectation, that must therefore be greater than or equal to the expected value of the log of that. Okay. Using Jensenís and Equality. And lastly just to expand out this formula again. This is equal to that. Okay. Yeah.
Instructor (Andrew Ng):
[Inaudible]. Yeah. Okay. So this has the [inaudible] so letís say
Random variable Z, right, and Z has some distribution. Letís denote it G. And letís say I have some function G of Z. Okay. Then by definition, the expected value of G of Z, by definition, thatís equal to sum over all the values of Z, the probability of that value of Z times Z of G. Right. Thatís sort of the definition of a random variable. And so the way I got from this step to this step is by using that. So in particular, now, Iíve been using distribution QI to denote the distribution of Z, so this is, like, sum over Z of P of Z times [inaudible]. And so this is just the expected value with respect to a random variable Z joined from the distribution Q of G of Z. Are there questions?
Student:So in general when youíre doing maximum likelihood estimations, the likelihood of the data, but in this case you only say probability of X because you only have observed X whereas previously we said probability of X given the labels?
Instructor (Andrew Ng):Yes. Exactly. Right. Right. [Inaudible] we want to choose the parameters that maximizes the probability of the data, and in this case, our data comprises only the Xs because we donít reserve the Zs, and therefore, the likelihood of parameters is given by the probability of the data, which is [inaudible]. So this is all weíve done, right, we wanted to maximize the law of likelihood of theta and what weíve done, through these manipulations, weíve know constructed a lower bound on the law of likelihood of data. Okay.
And in particular, this formula that we came up, we should think of this as a function of theta then, if you think about it, theta are the parameters of your model, right, if you think about this as a function of your parameters theta, what weíve just shown is that the law of likelihood of your parameters theta is lower bounded by this thing. Okay. Remember that cartoon of repeatedly constructing a lower bound and optimizing the lower bound. So what weíve just done is construct a lower bound for the law of likelihood for theta. Now, the last piece we want for this lower bound is actually we want this inequality to hold with equality for the current value for theta.
So just refrain back to the previous cartoon. If this was the law of likelihood for theta, weíd then construct some lower bound of it, some function of theta and if this is my current value for theta, then I want my lower bound to be tight. I want my lower bound to be equal to the law of likelihood of theta because thatís what I need to guarantee that when I optimize my lower bound, then Iíll actually do even better on the true objective function. Yeah.
How do [inaudible]
Instructor (Andrew Ng):Excuse me. Yeah. Great question. How do I know that function is concave? Yeah. I donít think Iíve shown it. It actually turns out to be true for all the models we work with. Do I know that the law of bound is a concave function of theta? I think youíre right. In general, this may not be a concave function of theta. For many of the models we work with, this will turn out to be a concave function, but thatís not always true. Okay. So let me go ahead and choose a value for Q. And Iíll refer back to Jensenís and Equality. We said that this inequality will become an equality if the random variable inside is a constant. Right. If youíre taking an expectation with respect to constant valued variables.
So the QI of ZIs must sum to 1 and so to compute it you should just take P of XI, ZI, parameterized by theta and just normalize the sum to one. There is a step that Iím skipping here to show that this is really the right thing to do. Hopefully, youíll just be convinced itís true. For the actual steps that I skipped, itís actually written out in the lecture notes. So you then have the denominator, by definition, is that and so by the definition of conditional probability QI of ZI is just equal to P of ZI given XI and parameterized by theta. Okay.
And so to summarize the algorithm, the EM algorithm has two steps. And the E step, we set, we choose the distributions QI, so QI of ZI will set to be equal to a P of ZI given [inaudible] by data. Thatís the formula we just worked out. And so by this step weíve now created a lower bound on the law of likelihood function that is now tight at a current value of theta. And in the M step, we then optimize that lower bound with respect to our parameters theta and specifically to the [inaudible] of theta. Okay. And so thatís the EM algorithm. I wonít have time to do it today, but Iíll probably show this in the next lecture, but the EM algorithmís that I wrote down for the mixtures of Gaussianís algorithm is actually a special case of this more general template where the E step and the M step responded. So pretty much exactly to this E step and this M step that I wrote down. The E step constructs this lower bound and makes sure that it is tight to the current value of theta. Thatís in my choice of Q, and then the M step optimizes the lower bound with respect to [inaudible] data. Okay. So lots more to say about this in the next lecture. Letís check if thereís any questions before we close. No. Okay. Cool. So letís wrap up for today and weíll continue talking about this in the next session.
[End of Audio]
Duration: 76 minutes