Instructor (Christopher Manning):Okay. Hi, everyone. And back to CS224n. So this week is gonna be the remainder of the content on machine translation. And today Iím gonna talk more about the kind of alignment models that people have used to learn about how words translate from one language to another language. And then Iím going to imbed inside of that a discussion of the EM algorithm that we use for a lot of NLP tasks, including this alignment algorithm.

So the stuff today is really the core stuff thatís the second programming assignment. And then on Wednesday, I do more of a survey of more modern and recent work in machine translation.

So youíll hopefully remember from last time that the general picture is that weíve made this sentence-aligned date, whereby in large, the sentences are aligned so that their translations of each other. And we weíre essentially then wanting to run this algorithm, which is looking at words that could translate each other, and settle down and learn possible translations for a word, as in this baby corpus.

And so, formally, starting off for what we did for IBM Model 1, that the overall picture we have is we have a lot of French and English parallel sentences. So we have a complete conditional probability of our corpus, that the probability of all of our French stuff times the probability of all of our English stuff. And I tried to add this into this slide that wasnít there last time. And I made a boo boo in my copy and paste already since this should be a product not a sum. So we take that probability, and itís just simply the product of the probability of each French sentence given each English sentence. (Product, right there on the top line.)

So then for an individual French and English sentence, whatís its probability? Well, thereís, first of all, a probability of how long the translation is going to be. So the length of the French sentence J, given the length of the English sentence I. But then the core of it is that what weíre doing is considering over here how different French words align to English words.

And we do that in a slightly funny way. Once weíve decided the length of the French sentence, we can just take these positions and say, okay, that the probability distribution over this position being aligned to any English word, including the possibility of it being aligned to no English word, the NULL, and so we have a probability of aj = i. And then given that we make a certain alignment, we then have the probability of having generated [inaudible] given that the English is being. So the probability is fj given ei.

And in particular, in Model 1, the alignment probabilities are just taken to be uniform. So thereís kind of no real content in them at all, and so what we really have is these probabilities of fj given ei.

So what we want to do to learn these alignment models is run the EM algorithm to maximize the likelihood of big F given big E, thatís ultimately what we want to do. But given that some of the assumptions this model makes about chances of different alignments being uniform, etc., if you actually boil it down, what you end up having to do is whatís in this slide. And, as I said last time, essentially, this slide can just be a recipe for what you have to implement for the first half of the second assignment for re-implementing Model 1.

So what you do is that you just start off with fj given ei just being uniform. You just give every English word a uniform chance of being translated by the different French words that you find in your training corpus. And that includes, in particular, that you can have the special English word null generate various French words. (The French words that appear in French that arenít really translations of any English various kinds of function words.)

So what you then do is you iterate through your data and you consider each position in turn in the French sentence, and you calculate a posterior over alignments. So you want to work out whatís the probability that this French position is aligned with a particular English word. And the way that you do that is that you use your current estimates of whatís the probability of fj given ei. Because, remember, thereís no independent information about which alignments are likely. Thatís assumed to just be uniform probability distribution that weíre not changing. And then you just normalize it based on all the different possibilities for English words that you could have.

So this gives you a probability of a particular alignment for position j. So you work these out for all positions j. And then based on doing that, you then increment these counts. And these counts are kind of fractional counts, of according to your current model, how much of an evidence do you have for there being alignments between particular French words and particular English words.

So if you work out here that thereís a .3 chance that j=5 aligns to i=2, you then look at what the French and English words there are, and you increment the count for that pair based on that probability that you add .3 to it. So youíre adding these fractional counts of different alignments.

So you do this over the entire corpus. And so this is the E-Step of the EM algorithm. And then at the end, you have all of these fractional counts from going through the corpus. And what you do is you just renormalize them to turn these into probabilities. So you have fractional counts for every fj given ei. You total up how many of those there are for a particular ei, and then you divide through by that total again. And so these become the probability of fj given ei. And so then you have new estimates for this quantity. And you go back up to the top, and you go through and do it again. And you keep repeating this for a while and lo and behold you get a translation model.

So the first crucial question for understanding this is why does this work at all?

Hereís the argument why it shouldnít work at all.

Weíve got this simple model. And Iíve told you that thereís no information about the length of the French sentence that translates the English sentence; we just assume a uniform distribution up to some convenient large upper bound like 10,000 words. Thereís no information, when we went back a slide, thereís no information about the probability of aj = i; we just assume that thatís uniform and we never change those estimates. And when we start off, we start off by assuming that the probability that fj = ei is uniform. And so it sort of looks like every one of these is the same.

And when youíre considering different ei prime down here, this quantity is the same for every possible ei. So it sort of seems like youíve got the same thing dividing over some of the same things. And it kind of looks like itíll just come out the same for everything. So youíll do this for your normalization, and all your estimates will still be the same and it wonít go anywhere. Thatís not actually true. So let me pause for a moment. I mean why does this algorithm learn anything?


Instructor (Christopher Manning):Yeah, we are gonna get something out of this. I guess I want something a little bit more precise. Why does that manage to give us something out of it?


Instructor (Christopher Manning):Yeah. So in each iteration Ė like, yeah, I guess I kind of just say it here, I donít actually have an equation. At each iteration, once Iíve made these counts, count of fj given ei, I then renormalize them. So I sum these counts for all j to get the total of the count mass of things involved ei, and I divide through it so Iíve got a new estimate for next time around for fj given ei. So yes, Iím changing this at each iteration through.

But the question is, you know, given that it starts off uniform, and Iím having uniform divided by some of uniform, why do the estimates change? Why do I actually learn something? Any good ideas for that?


Instructor (Christopher Manning):Well, so what is the evidence that Iím using?


Instructor (Christopher Manning):Whatís that?


Instructor (Christopher Manning):I guess the way of focusing this question is, I mean, what is the crucial source of constraint that this algorithm is exploiting that means it actually learned something?



Instructor (Christopher Manning):I think thatís kind of close to it, yeah. Iíll call that near enough. And Iíll say the answer. So thatís kind of it, yeah.

So if here you were considering all French words and all English words in your vocabulary, and you were doing this recalculation, I mean, you would just learn nothing because it would be uniform over some of uniform, which would be the same every time, and youíd get the same thing out every time, and youíd learn absolutely nothing.

The reason that this algorithm learned something is exploits one crucial source of information, which is that youíve started off with these aligned sentences. So you have some information if for all the sentences that have the word bongo in it that 80 percent of them have some particular word in their translation then you learn something.

So the crucial reason this algorithm learned something is kind of hidden effectively over what weíre summing over here, where weíre not summing over every possible word in the vocabulary, weíre summing over the candidate words in the English sentence. And so itís precisely because weíve actually got real sentences, and weíre considering the alternatives in a real sentence, and only those words, that that gives us a source of constraint.

In particular, if we run it on a one-word sentence, where thereís a French word and an English word, what weíre going to learn from this (if we ignore nulls for a minute there are also the chance of nulls, but if we ignore nulls, and null wasnít a possibility), if we had a one-word sentence, we would learn that the probability of this is whatever it is a. Thereís only one thing positioned to sum over, so it would be the same quantity a, so the probability of the alignment would come out one. So, therefore, weíd get one count mass, which would be given for saying that that English word is translated by that French word. So we are kind of learning from what we saw in the data.

And you can extend up from there, and say, well, what about if I get a two-word sentence? Well, the first time around, again, if you ignore the possibility of null generating words, well, this is just whatever it is. And then there are two possibilities down here. And, again, both of them have the same estimate, because we started off uniform, and so for each French word youíre gonna get half of a chance that it aligns with each English words, and so these counts here are gonna be a half.

And so, effectively, the first iteration through you run this, all youíre doing is counting sentence co-occurrence counts scaled by the number of words in a sentence (i.e., if itís a 12-word sentence, and the words occurs once each in a sentence, youíre counting a twelfth). So youíre just counting fractional counts like that for the number of times that things co-occur.

And that already gives you a lot of information. Because, you know, thatís like back to the example of the [inaudible]. And if you just see things co-occurring a lot in sentences that translate each other, thatís pretty good information that they have something to do with each other.

So Iíll go on then later to some to some of the later models of the IBM models. But let me just sort of get around this point. Iíll step back and say a few remarks and then go off and do the EM part.

So in terms of typologies of learning, how many people know about supervised versus unsupervised learning?

Almost everyone. What is this? Unsupervised.

Now, in general, in machine learning does unsupervised learning work well?

I mean, people use it sometimes. But I mean I think itís true to say that throughout most of machine learning unsupervised learning doesnít work that well. I mean, thatís being a bit harsh. I mean, so we have the classic contrasts between clustering algorithms, which are unsupervised. You throw in a bit of data and you cluster it somehow and see what you get out versus classification algorithms, where what you do is you hand label data. So for something like email spam, you say, this oneís spam, that oneís not spam, this oneís spam, that oneís not spam, and you learn a spam classify using kind of standard classification algorithms.

And so I think it is fair to say, that in most of machine learning that although doing unsupervised learning is kind of fun and sexy and interesting, that for a very large space of problems, you kind of sort of get a clustering that looks kind of tantalizing, like itís capturing some structure. But the results just arenít that good and you donít know what to do with it. And so, by and large, most of modern machine learning has been driven by supervised training, supervised classification methods, where precisely somebody sits around and hand labels pieces of data and you go and learn a classifier off of it.


Instructor (Christopher Manning):Well, Iíll say that in just a moment.

So itís kind of interesting that what youíre doing here for learning these alignment models is unsupervised learning. And this is actually one of the best cases of large scaled unsupervised learning that actually works really well. So all the state of the art big machine translation systems, that are statistical machines translation systems, thereís a lot of other stuff going on. But apart from the alignment algorithms, everyone uses supped up versions of this, where youíre doing unsupervised alignment algorithms. And they just work extremely well once you have tons of data.

But there has also been some work doing supervised alignment algorithms with small hand-done alignments of small numbers of sentences. But, effectively, you can get the unsupervised data at such scale, and the unsupervised algorithms work quite well enough that itís what everyone uses for big MT systems.

And then the question was, well, wait a minute, this isnít quite unsupervised because, well, you started off with translations of sentences, and thatís kind of like supervised data, because someone provided those translations and that gave you information. And, I mean, the short answer is, yeah, thatís totally right.

I guess I should have narrowed my claim that the part thatís purely unsupervised is learning this alignment model. So that learning of the alignment model is completely unsupervised, that, yeah, thereís sort of supervision over the overall problem of how to translate a particular source. And that may be a moment when I can say just a general comment that I think turns up a lot in natural language processing. That in machine learning, thereís this classic distinction between unsupervised algorithms versus supervised algorithms. And itís a sensible thing to think of.

But a lot of the time, in natural language processing, that the binarity of that distinction isnít very useful. Because a lot of the times in NLP the kinds of things that weíd like to do is work out how we can get a lot of value from a small amount of supervision.

And so a lot of things people do in all sorts of domains, theyíre not purely unsupervised algorithms, but on the other hand, to do all of the learning supervised would require an enormous investment of resources that may not be likely or practical. And so a lot of the time people would like to do things where you have a little bit of supervision.

And so the idea there is, well, you know, suppose I could easily get my hands on a dictionary, and there are lots of dictionaries around, suppose I let myself use a dictionary but I had no other information about how words that translate in context, and Iíll learn all the rest based on the context. That seems a useful thing to try and do because we know we can exploit dictionaries that are pre-existing. So a lot of the time people are wanting to exploit pre-existing resources and easily obtainable stuff, and then build the rest of it out unsupervised.

So at this point Iíll deviate off and say a bit about the EM algorithm. But before I do, just sort of to mention a couple of the announcements. A couple of the SCPD students said, gee, it would be much nicer if the quizzes were due on Sunday rather than Friday. And so weíre going to have the quizzes due on Sunday. Donít forget that Assignment Number 1 is due on Wednesday. And, now, thereís always one or two people that blow all their late days on Assignment 1, but youíre really much better off saving them until later. So do get working on Assignment 1, and then weíll be handing out Assignment 2 just to keep everyone busy.

And the final thing I thought Iíd just mention is for contacting me and the TAís for questions. We have both the mailing list and weíve got a newsgroup. If youíve got variety questions which arenít something about your personal problems but are just not understanding how the assignment works, we really encourage you to use the newsgroup if you can because that kind of makes it easy for us to give answers where other people can go looking for answers in case they have similar problems. Of if you get really lucky, maybe one of your fellow students will tell you the answer to the problem before we even get back to it. Well, thatís lucky for you and for us, actually. So do think about using the newsgroup.

Student:[Inaudible.] Itís almost like a [inaudible].

Instructor (Christopher Manning):I could do midnight if you want, sure. Sure, we can have them due at midnight if you want.

Okay. The EM algorithm. All right. So this is this expectation maximization algorithm, which is often used for unsupervised learning.

So I know itís always the case in CS224n that thereís kind of a spectrum between people whoíve never seen the EM algorithm before and people who have already seen it in three different classes. I mean, certainly, I know for myself, the EM algorithm confused me about the first five times I saw it. So, therefore, I hope that nobody has seen it enough times that they canít learn something from my presentation here.

But, I mean, in particular I hope to kind of work through a little practical example. Because, I mean, I think a lot of the time, certainly for the kind of discreet data that we work with mainly in NLP, that, you know, the EM algorithm theory is kind of confusing. But in practice, when youíre actually sort of working through a concrete pace of what you have to do, itís actually kind of straightforward what you work out and how you work with it. Letís hope so.

So what Iím gonna do here is just for discreet data, and for an easy case in these slides of estimating the parameters of an N-gram mixture model. So the general idea is EMís a method of doing maximum likelihood estimation.

So what we want to do is we have some data, and we have some parameters, which are meant to be a model of that data. And somehow we want to fiddle around with those parameters, just like usual, to make the observed data as likely as possible. And, you know, in theory, anything youíve ever learned about in optimization you could attempt to use for that problem. But in practice, the EM algorithm is used a lot as an iterative algorithm in places where you canít find the kind of gradients that you need to use other optimization methods.

So in particular, the classic kind of place where you see the EM algorithm is when youíre doing things like mixture modeling. So here we have some data, the xi data, which is mixture of some probability distributions. So this is exactly what we saw in the language modeling part of the class, where we said, well, we can do linear interpolation of several models. We could estimate things with a trigram model or a bigram model or a unigram model, and weíre going to combine those estimates together with some waste data.

So in particular, for the problem Iím looking at here, weíre now saying the trigram, bigram and unigram model are completely fixed. Their parameters arenít being changed. All weíre wanting to do is assign these theta weights that combine them together. So we really only have two parameters in this little baby example, for a trigram, as to how to weight these models.

So for when we have the entire data set, weíve got the entire data set x, and we want to say, okay, whatís the likelihood of the data, given our parameters? Well, what itís going to be is the product over predicting each word, which is then going to be worked out as a sum over our different component models weighted by the likelihood thatís given to each component model by our theta parameters.

So you have this picture where what you want to calculate is a product over a sum. And itís when you see that configuration, thatís the classic place where you always end up using the EM algorithm. So the EM algorithm is used when you have incomplete data.

And so when the EM algorithm was originally developed, the idea of incomplete data was you were somehow missing a couple of data points. So that the idea was you were doing some experiment in the lab, and some klutz knocked over a couple of test tubes, and youíre missing those two test tubes, and that youíd like to do analysis of your experiment anyway. And so what you wanted to do was reconstruct what would have been in the two test tubes that you didnít actually get to measure the content of. And so thatís the case of really incomplete data. Thatís never the case that weíre dealing with, and the kind of models that weíre making.

What weíre assuming is that thereís certain data that we can observe, which is here, the xís, but weíre assuming because we have some kind of generative story that underlying that data is other data that we canít observe, which is here being called the y. So the y is referred to as the completions of the data. And so we are pretending our data is artificially incomplete, that thereís some underlying extra stuff going on and that you canít actually see but that weíre going to be trying to reason with.

So what weíre doing here is that our complete data is saying, well, then for the N-gram model there are the actual words that we can observe. But, secondly, weíre assuming that there are these y choices. And the y choices, according to our generative story was, was this data point generated by the trigram model, the bigram model or the unigram model so that our generative model is corresponding to our mixture model.

So when you want to generate the next word, you first of all roll your dice to see whether to generate from the trigram, bigram or unigram component models. And then having chosen one, you roll the dice again, and then you choose a particular word to have generated. And so itís the choice of which model to generate from is our y variables. And so the y variables effectively have three values, 1, 2, 3, as to whether youíre generating from the unigram, bigram or trigram model.

Student:Does the y value [inaudible]? Like [inaudible] is the y gonna be the actual testing [inaudible], like the ideal actual claim, or is it what our modeling will be filling in the blank with?

Instructor (Christopher Manning):I think itís kind of what our model will be filling in the blank with. I mean, in particular, when we actually do this, I mean for the standard case, what weíre going to actually be interested in with the y, is the probability distribution over its value. So weíre using the model to say, well, the thing that we couldnít see, itís 80 percent likely it had this value, 15 percent likely it had this value, and five percent likely it had the remaining value.

Yeah, so our yís here are just one, two, three for unigram, bigram or trigram, so that he likelihood of the completed data, where we assume that the yís were observed, is just that weíre generating the probability of the word according to a different component model. So if we imagine completed data, that we knew where the yís were, the sum of the components is gone since a yi variable tells you which component itís coming from.

So an important thing to get straight, in general, is that there end up being two data likelihoods around here. Thereís the actual observed data likelihood, the probability of x given theta. And then if we assume completions, i.e., we give values to the y, we then have a complete data likelihood, which is the probability of x,y given theta. And the idea of the EM algorithm is that we want to maximize the observed theta likelihood, but weíre going to use completions to make that easier to do.

So why is that? Thatís because if we have products of sums thatís hard to work out the observed data likelihood, on the other hand, itís easy to work out the complete data likelihood if we know the yís. So the general idea of the EM algorithm is you alternate these two steps. That you assume some completions field data, and then work out what good values for the theta parameters would be. And then you use those theta parameters to try and work out what the completions are most likely to be, and you repeat over and again those two alternating phases. Which is what is expressed essentially in that slide. But Iíll go on fairly quickly to the more concrete part of it.

So the E-Step is we work out expectations which tell us how likely different completions are. And then the M-Step we maximize the likelihood of the complete data, which isnít the same as maximizing the logged likelihood of the observed data thatís the least close to it. And that we hope by doing that in an iterative algorithm weíll get better results.

An important thing to notice here is when youíre running the EM algorithm, the completed data isnít a constant, itís something that varies with each iteration as you come up with better estimates.

So hereís my final trying to do it a bit more formally. So for a certain set of parameters we have the likelihood of the data, which is the probability of all of the data given the parameters, which will then be, since we havenít observed the yís, the sum of the y of the probability of x,y. And in particular, since we assume that each word is generated independently, we have a product over the xís of the sum over of the yís of a probability of x,y given theta.

And so thatís precisely why you have this product of sums, which is hard to optimize. And the reason itís hard to optimize, is in general, products of sums give you a non-convex surface, so you get all of these local maxima. And so even running the EM algorithm you have problems with these local maxima.

And so what weíre going to want to do is sort of the answer to the question. Since we donít know what the yís are, weíre not actually going to want to have the sign the y values, what weíre actually gonna want to have is the distribution over how likely the different values for y are. And weíre going to effectively treat them like little mini partial data items, where say we have .7 of a data item where y is the trigram model; .15 of a data item where y is the bigram model; and .15 of the data item where y is the trigram model.

Now, around this point is when conventional presentations of the EM algorithm people drag out Jensenís inequality. People heard of Jensenís inequality? Some people? Well, anyway, I was gonna try and do it without doing Jensenís inequality in an easier way. And so this is an easier way.

This way of doing it uses a very simple fact. What is the arithmetic meaning of the numbers one and nine? Five. Okay. What is the geometric mean of the numbers one and nine? Three. And so just like in that example, the geometric mean is always less than the arithmetic mean of numbers. And so if you stay out of log space, you can represent what you can do with Jensenís inequality just with arithmetic means and geometric means.

So if you just remember that the square root of 1 times 9, 3, is less than 1 plus 9 divided by 2, 5. And thatís the arithmetic mean, geometric mean inequality, which is the same idea that you use for doing EM algorithm. And so here it is in its general form. So you can put arbitrary weights on the different components. And where these weights add up to one, and that your geometric mean is less than your arithmetic mean.

And weíre gonna use this to do the EM algorithm. Because what we have is a product of sums. And if we can turn that inside sum into a product, we then have something smaller, a lower bound, which would be a product of a product, and so that would lower bound what weíre interested in and would give us a basis. Because if we can kind of push up that lower bound, then weíd have made progress and weíll be able to have an attempt at optimizing things.

And so where we have this probability of a completion, the probability of x,y given theta, so that can be z,i here. But somehow we want to get in a notion of what can that w,i be to run this algorithm. And so how can we do that? The way we do that is by introducing an iterative algorithm, where we go through a succession of updates of updating our weights.

So that means we start off with some guess for our parameters. It doesnít have to be a very good guess, it has to be some guess for our parameters, which is called theta prime. And so according to our original guess for the parameters that gave the data some likelihood value. It doesnít matter what it is, it gave the data some likelihood value.

Now what weíre gonna want to do is come up with an improved set of estimates theta that is better than our old theta prime. Well, our old estimate was just whatever it was, you know, like it was some estimate of the parameters that gave some likelihood to the data. So this previous likelihood we can regard as a constant because it was just whatever it was. And what we want to do is come up with a new theta that makes the observed theta more likely.

And so what we can do is instead of directly trying to optimize theta, we can instead try and make this relative change in likelihood as big as possible. So the likelihood, according to the old thetas, is some constant. And so if we can make the likelihood given out new thetas, such that that ratio is big, weíve improved our estimates of theta.


Instructor (Christopher Manning):Yes. Well, youíre training data Ė strictly speaking for the case of doing the mixture model for a language model, your training data for learning the mixing weights is the validation data. So the validation data serves as the training data for the purpose of learning the mixing weights.

So itís sufficient if we can make this likelihood ratio large. Well, how can we do that? Well, hereís a little derivation thatís so slowly spelled out that even I can understand it. So we just write that down.

So hereís the likelihood of the data given our new model theta. And thereís the likelihood of the data given our old parameters theta prime. We can pull the product out the front. We can pull the sum out the front, where the sum is just concerned with the numerator not the denominator. In this step here, we multiply by one. So the terms on the right-hand side is just this number over number, so thatís multiplying by one. And then in the final step, we take this term over to the left, and we multiply those two denominator terms to get the product of x,y given theta prime. Thatís not too hard math. That hopefully makes sense.

But if we look at this, what weíve now got on the inside is that weíve got a sum of relative likelihoods that are weighted by the probability of y given x, theta prime. So this is something in the shape of the arithmetic mean, geometric mean, inequality, because here we have a sum over some quantities with some weights. So we can apply the arithmetic mean, geometric mean, inequality right there. And so thatís what we do on the next slide.

So this is just we had at the end of the last slide. And if you quickly look. Remember back to the arithmetic mean, geometric mean, inequality, we take the weights by which weíre combining things and turn them into exponents and the we have a product. And so we can say that this quantity is greater than this quantity. So now we have a lower bound, which is expressed just in terms of products. So thatís kind of pretty.

But in terms of maximizing this lower bound, we kind of donít need the denominator here, because that denominator is just a constant. So for the purpose of maximizing the lower bound, we can throw it away and instead just maximize something thatís proportional to lower bound. And so that gives this q equation, which gets called the auxiliary equation in the EM algorithm. Where weíve got the likelihood according to the new theta estimates raised to the power of the likelihood given the old theta estimates.


Instructor (Christopher Manning):In our mixture model, what weíre guessing generated the individual data points. So the values of y are just 1, 2, or 3. As to did our unigram model generate this data point; did our bigram model generate this data point; did our trigram model generate this data point. I bet it will make more sense when I get to the concrete example in just a minute. This is to just confuse you before I do the concrete example to show you all how easy it is.

So this is sort of a summary of that. And the crucial thing is this here is something we can easily maximize. Because products of products, theyíre nice, easy things to maximize. We can use our calculus on that, and we can work out how to maximize them. In particular, as Iíve already pointed out earlier, for these discreet probability distributions, actually the way you can maximize them is just using relative frequency. You donít even have to remember calculus. But if you remember some calculus, you can prove that the way to maximize them is just using relative frequencies.

So we run this algorithm. In each step we guess the completions. We then change the parameters to make the guess completions as likely as possible. We then re-guess the completions, change the parameters again, and do that. And so provably from doing that, and Iíve sort of almost proved, is that each iteration your theta guesses must make the observed data more likely.

What Iíve presented here isnít then a complete proof that the EM algorithm works. Because the other half of it is actually showing that my lower bound gets tight as I approach a solution. Which I havenít actually tried to do here. But Iíll leave that part out. And let me move instead for a bit to my [inaudible].

So I think all of this makes a ton more sense, and is really quite easy if you see it being done in a spreadsheet. So what Iím wanting to estimate a probability distribution over words that come after comes across. And so in my original training corpus, I saw comes across ten times. And eight times after comes across was the word as (comes across as a common idiom). Once there was comes across the, and once there was comes across three. And that was what I saw in my training corpus. I saw ten instances of comes across.

So based on my training corpus, I trained, by maximum likelihood estimates letís say, a trigram model. So thatís easy. This is .8; this is .1; this is .1; and every other words is estimated at 0. And the Iíve also estimated a bigram model, and a unigram model. And while I havenít shown you the rest of my training corpus, letís just assume this is my bigram model and my unigram model.

Now, what I want to learn is mixing weights between these three models to make the likelihood of some held out data as likely as possible. So in my held out corpus, I get to see comes across five times. So I see comes across as three times; I see comes across the zero times; comes across three zero times; and I see comes across a once; and comes across one once. So this is now my hold out corpus.

And so what Iím going to do in the EM algorithm is Iím going to, I call them Lambda here, but here are my Theta, or Lambda, my mixing weights between the unigram, bigram and trigram model. And so Iím going to want to set these weights so as to make the likelihood of the held out corpus as good as possible. And Iím gonna do that with the EM algorithm.

So to start off, I have to start off with some initial guesses for what these mixing parameters are. And I guess .7 for the trigram. I mean, itís not gonna really matter what I pick, as Iíll show you in a minute. So hereís how concretely I run the EM algorithm.

So what I want to do is say, okay, the things I observed in my observed corpus is as, a, and like. Let me work out how likely each model is to have generated them, or how likely it is to have generated them in different ways. So if I see the word as, well, whatís the kind of chance of it being generated by the trigram model? The chance of as being generated by the trigram model is thereís a .7 chance of the trigram model being selected, and then thereís a .8 chance of it generating the word as. So if I run my mixture model, the chance of it generating the word as, by having used the trigram model, is just the product of those two quantities. Does that make sense?


Instructor (Christopher Manning):Yeah. All through this, Iím assuming that the context is this comes across before it. Iíve got my mixture model. Itís gonna generate the next word. Whatís the probability that itíll generate as when using the trigram model to do it itís .56?

It can also generate the word as using the bigram model. And whatís its chance of doing that? Well, itís chance of .2 of selecting the bigram model, then given its selected the bigram model, this is a bit of a high estimate, but the model isnít sitting here, thereís a 20 percent chance of it generating the word as after across. So this is effectively the probability of as given across. So, again, we just multiply those together, and we get .04. And so I fill in all of this table.

Well, if I then want to ask for mixture model whatís the total chance of it generating the word as after comes across, well, then Iím summing this column. Because the mixture model is the sum over the probability of generating it with the component models. So these are the overall estimates that my model gives to generating these different words after comes across.

So my initial data likelihood is then to take, since I saw as three times, and the other words once, is to take this probability and cube it, multiply it by this probability, and multiply it by that probability, which is what I do with this equation. So I take this probability cubes times that probability times that probability. And so I get my small number as my data likelihood. And so then at this point, I run the EM algorithm.

So for the EM algorithm what I want is the probability distribution over completions, which is over the choice of whether a unigram, bigram or trigram model generated different things.

And this is actually easy, right. So thereís a total .6 chance of generating as next. And of that .6, 0.56 out of .6 chance it came from the trigram model. So I literally take this by that, and thatís the probability distribution the trigram model generated it. And, similarly, I just do the same, divide through and say thereís a six percent chance the bigram model generated it, a tiny chance the unigram model generated it. So I get here. My chances of different models generating things.

In particular, since the trigram model gave zero probability estimate to generating a after comes across, the probability of the trigram model, being the thing that generated it, is also zero. So this is my probability distribution over my completions. And so from there I work out my expectations.

And so the expectations is then just a count of how often I expect to have seen different completions. So since I saw as three times in my held out corpus, my expectations for this column are simply three times my probability as distribution here. So Iím taking that number and multiplying it by three. And so this column Iím multiplying by three there. And since these words I saw one time each, their expectations are simply the probability estimates.

So this gives me expectations of how many times I used the trigram model to generate as after comes across. If I then want to just work out my total expectation for my held out corpus, as to how often I was in the trigram model, what I then do is sum across the row. But since these are zeros, the expected number of times I was using the trigram model to generate something was 2.7 times. But then for the bigram model and the unigram model, I sum these rows, and I say, okay, my best guesses were I used the trigram model 2.7 times, the bigram model 1.4 times, and the unigram model 0.7 times. And so if I sum those expectations they add up to five. The number of items in my validation corpus.

If for some reason they donít add up to the size of your data set, you know youíve done something wrong during the course of it. So up until there thatís my E-Step. Iíve worked out expectations over the completions.

And so now I do the M-Step, which is to change the parameters to make my data more likely. And so the way I do that is I just re-estimate the mixing weights based on my current guesses of how much I use different models.

So it looks like I used the trigram 2.7 out of five times, i.e., slightly more than 50 percent of the time. So I do that division, and I get a new estimate for the mixing weight of the trigram model, which has shrank. Remember I started at .7, but it shrunk to .55, and then these two weights have grown.

And intuitively that makes sense, right? Because for my held out corpus, since the trigram model can only generate three of the observed five words, because itís giving a zero estimate to the other two words, it just canít be the best thing that gives highest likelihood to the data, to give a mixing weight of .7 to the trigram model, because at most it can have generated 3/5 of the data. So it sort of seems like its weight must be less than or equal to .6.

Student:So do we do this for every [inaudible]?

Instructor (Christopher Manning):So youíre kind of getting to a point where Iím making this example sort of easy. So if youíre doing this for real, youíre certainly, yes, summing over every history to work out good mixing weights. And then thereís a question of how are the mixing weights done? And I sort of mentioned in class that there are a couple of possibilities for that. One possibility was if youíre just doing the simplest form of linear interpolation. Your mixing weights are you just have one for each order model. And so then you sum this over all possible contexts, and then in the same way you just re-estimate and you get weights on the trigram, bigram, and unigram model.

There are more subtle ways to do it, in which you kind of put different estimates based on things like the count of the context so you have more parameters in your mixture. And you can explore that if you want for the homework. That makes things a little bit more complicated but itís in principle the same. What you donít want to do is actually have different mixing weights for each possible history.

In real life, if you have different mixing weights for each possible history, itís hopeless because then your mixing weights are being estimated with just as, or even more sparse data than your original distributions, and so the mixing weights you learn are completely over trained. Because if in your held out data you saw only things that you saw in the trigram model before, youíll put all the weight on the trigram model and zero on the other models and then youíre not achieving what youíd like to achieve of having this doing smoothing for you between the models.

So at that point, Iíve changed my estimates for my mixing weights. And then I can go back to the start and I can work out, okay, for my new model with these weights, which Iíve just again put down here, I can work out the likelihood of the observed data, i.e., the held out data according to my new model.

So itís exactly the same as before. I work out the chance of the trigram model, the trigram component generating as, which is simply the weight here times the probability in the trigram model. And then I sum these. And now the probability of as is 0.5, which is lower than before. The probability of a is this and this. And if I work out my data likelihood in the same way as before, the crucial thing to notice is the hold. My data likelihood has gone much bigger than it was before. It used to be 4.9 times 10 to the minus 5, and now itís 6.4 times 10 to the minus 5. So my data likelihood has gone up by 30 percent. So fiddling these Lambda weights is doing something useful for me.

Student:[Inaudible] the actual data [inaudible].

Instructor (Christopher Manning):Right. This is the actual data likelihood.


Instructor (Christopher Manning):No, you have to see an increase. But because the completion lower bounds the actual data likelihood, it Ė I think thatís right. Yeah, no, you really have to see it if Ė well, strictly, it could stay the same. That it has to stay the same or increase. You canít possibly have a reversal in the actual data likelihood.

Yeah, but this is just working out the actual data likelihood of my hold out data. Iím just working it out by taking a product of the sum, which is the sum is going down these columns.

So at that point I just repeat. So I work out probability over completions again. I then work out expectations to be in different states again. I work out probabilities for my mixing weights by, again, just dividing through these expectations via the total. And, again, I get these different weights. Where, again, the probability estimate of using the trigram model shrinks a little bit. Itís not nearly as much as the first time.

But I do it again, and lo and behold my data likelihood has increased a little. Itís now increased from 6.4 times 10 the minus 5 to 6.6 times 10 to the minus 5. And I keep on doing this, and now itís 6, 6, 4, 4 and 6, 6, 6, 6 and 6, 6, 8, 0. I was able to do this by cut and paste. I didnít have to fight this all in. And I keep on going, 6, 6, 8, 9; 6, 6, 9, 4; 6, 6, 9, 8; 6, 6, 7 Ė one of the things that you should learn from this about the EM algorithm is convergence of the EM algorithm is pretty slow. You kind of have to run it for a while and go through iterations. But, nevertheless, the likelihood of the data keeps on going up, and I keep on jugging by. And essentially what itís converging to is that the weight on the trigram model is a half. The weight on the bigram model is .4. And the weight on the unigram model is a 1/10.

So let me just ask kind of a couple of questions on this. I chose these mixing weights at the beginning. What would happen if I had sort of started with the bigram rate as 0, and the unigram weight as 0.3? What would I expect to be the end result after Iíd run this convergence?


Instructor (Christopher Manning):No, that is not the right answer.


Instructor (Christopher Manning):Yes, that is the right answer.

So your answer is right most of the time, but itís not right if something is zero.

So if something is zero, itís got no chance of generating particular data points. And so when I work out whatís the probability of having used the bigram model to have generated different things, itís still zero because it would have no chance of generating them. So the expectations is zero, and it stays zero. And, well, as you can see, I had these strings hard coded rather than them dynamically generating the truth. So this oneís growing, but that one isnít changing at all. And if we run it to convergence, the weight is just being split between the trigram and the bigram model.


Your answer is right most of the time, so I didnít really have to start with these numbers. If I started off with 0.1, 0.6 and 0.3, it actually makes no difference. And if you go down to the bottom youíre getting the same, kind of a half, four-tenths, a tenth distribution.

So a lot of the time what you find with the EM algorithm is that if you choose extremely extreme values of these parameters, that you can get to bad local maxima and youíll get different answers in the time. But fairly often thereíll be a fairly broad range of middle values of starting parameters and thatíll converge the same local maxima.


Instructor (Christopher Manning):That certainly should be the case.

So if I choose any kind of middling values, my data likelihood ends up as about 6.705 times 10 to the minus fifth. If I go to the case where this was zero, and Ė well, hereís another trick. It doesnít actually matter on the first iteration of what you give it as a probability distribution because when it re-estimates things itíll turn it into a probability distribution. So if I start off with the middle one zero, I guess my data likelihood never gets nearly as good, yeah. Because the bigram model is useful.


Instructor (Christopher Manning):I can try it. I think that will be bad news. Minus 0.3. Student:


Instructor (Christopher Manning):Yeah, you donít actually want to have zeros. Providing theyíre all positive youíll be in good shape.

So this spreadsheet is on the web page. You can play around with it more if you want to. Another thing that you can play around with is you can also just change these underlying models. So if you, for example, suppose the bigram model had given a slightly higher probability to like coming after comes across. So suppose it was 0.02, that that estimate is still less than the unigramís estimate. But, nevertheless, if you run it to convergence, whatís happening as it runs to convergence is the unigramís weight is converging to zero, and weight is just being split between the trigram and bigram model.

But it turns out that even though the unigram model assigns a higher probability to like, and even though the word like appears in the held out corpus, that the optimization essentially finds it can just get a lot more value in terms of increasing likelihood by giving weight to the bigram model. Because the bigram model gives much higher likelihood to as, and much higher likelihood to a, so that data likelihood is actually maximized by giving just more and more weight to the bigram model, despite the fact that the unigram model is more likely to generate 1/5 of the training data.

So you guys can play with that. Let me just for the last couple of minutes then say a moment about what comes after Model 1, and is then the heart of the rest of what you have to do for the assignment. So the first part of the assignment is implement that and get Model 1 to work. And then the second part of the assignment is we go onto IBM Model 2. So thereís the sequence of IBM models the rest of which I will talk about next time.

But the idea of these models is to make further refinements to the probability models that make them more complex, but also make them do a slightly better job at modeling how human languages work.

So if you look at human languages as translations, commonly what you get is these alignment grids kind of look like this example I showed earlier. That thereís various stuff going on reflecting the fact that different languages put words in different orders from each other. They donít all use the same word order. But generally it tends to be the case that you sort of get this sort of overall kind of, you know, goes along the diagonal.

And if you know something about languages, you might say, well, wait minutes, thatís not true of some languages. There are some languages that have very different words orders. In English you get subject, verb, object. There are some languages that have completely opposite, they have object, verb, subject, they should get completely the opposite word order.

And thatís a little bit true. I mean, so for language pairs with similar typology, something like French/English, you get a very strong diagonal effect like this. Whereas if you kind of think of languages that have very different word orders, the diagonal effect becomes less strong, itís true, but it normally is still there.

And the reason itís still there is in written pros, or even when people are speaking, most sentences, except in kindergartener readers, most sentences arenít, The boy sees the bal. The boy grabs the ball. Right. Theyíre not those s, v, o sentences. What you get is these long multi clause sentences of, While Jim was listening to the morning news, his wife called out it was time to make the coffee, and he went over and started the machine. And so that even if each of those individual clauses sort of has its order flipped in the language with very different word order, normally the translation will still have those three clauses in the same order. Because thatís basically just the iconical order in which events happen. And so youíll still get a kind of blocked diagonal structure.


Instructor (Christopher Manning):I mean there is a bit of bias, youíre right. Because clearly in some sense itís more mental effort for translators to reorganize the clauses than just to do them one at a time. Although, you know, translators really will do that if itís appropriate to do sometimes. But I agree thatís a source of bias.

But itís something more than that. Thereís also Ė I mean this is the kind of thing people talk about in functional linguistics, there is just an iconic ordering of events. And although you donít have to present events to a reader in the order in which they happened, and sometimes people donít, and then you use things like, before clauses, where you kind of invert the order. But by and large 90 percent of the time people do present events in the iconic order, and that exists independent of it being a translation. At any rate, weíre learning from translations.

Okay. So what we want to do in Model 2 is then to say, well, we somehow want to capture that translations normally lie along the diagonal. And so in the programming Assignment 2, the general idea is when youíre translating a word here, that it could be translated by any other word. But a word somewhere around here should be more likely, and a word somewhere up here should be less likely to be the translation of it. And so Iíll put that into our models, which gives it a bias on absolute the kind of learned roughly diagonal translations. And youíll see that that makes the MT system work a lot better.

Okay. Iíll stop for now.

[End of Audio]

Duration: 74 minutes