
Juice is word 4834 in our vocabulary of 10,000 words. And so that's the input x that you want to learn to map to that output y.

So to represent the input, such as the word orange, you can start out with a one-hot vector, which we're going to write as O subscript c, the one-hot vector for the context word. And then, similar to what you saw in the last video, you can take the embedding matrix E, multiply E by the vector O subscript c, and this gives you the embedding vector for the input context word. So here e_c is equal to capital E times that one-hot vector. Then in this new network that we've formed, we're going to take this vector e_c and feed it to a softmax unit. So I've been drawing the softmax unit as a node in the neural network; that's not an O, that's a softmax unit. And the job of the softmax unit is to output y hat.


So we'll use y to represent the target word, and we use a one-hot representation for y hat and y here. Then the loss will be the negative log likelihood: minus the sum from i equals 1 to 10,000 of y_i log y hat_i. So that's the usual loss for softmax, where we're representing the target y as a one-hot vector. So this would be a one-hot vector with just one 1 and the rest zeros, and if the target word is juice, then it'd be element 4834 from up here that's equal to 1, with the rest equal to 0. And similarly, y hat will be a 10,000-dimensional vector output by the softmax unit, with probabilities for all 10,000 possible target words.

So to summarize, this is the overall little model: a little neural network that basically looks up the embedding and then just has a softmax unit. And the matrix E has a lot of parameters, corresponding to all of these embedding vectors e_c. And then the softmax unit also has its own parameters, the theta_t parameters. But if you optimize this loss function with respect to all of these parameters, you actually get a pretty good set of embedding vectors.
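The pieces described so far can be sketched in a few lines of code. This is only an illustrative sketch, assuming a 10,000-word vocabulary, a made-up embedding dimension of 300, and placeholder word indices; it is not the lecture's actual implementation.

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)

E = rng.normal(scale=0.01, size=(embed_dim, vocab_size))      # embedding matrix E
theta = rng.normal(scale=0.01, size=(vocab_size, embed_dim))  # softmax parameters theta_t

def skip_gram_loss(context_idx, target_idx):
    o_c = np.zeros(vocab_size)
    o_c[context_idx] = 1.0                         # one-hot vector O_c
    e_c = E @ o_c                                  # embedding e_c = E * O_c
    logits = theta @ e_c                           # one score per vocab word
    logits -= logits.max()                         # numerical stability
    y_hat = np.exp(logits) / np.exp(logits).sum()  # softmax over all 10,000 words
    return -np.log(y_hat[target_idx])              # negative log likelihood

# Hypothetical indices: context "orange" -> target "juice" (word 4834).
loss = skip_gram_loss(context_idx=6257, target_idx=4834)
print(loss)
```

With near-zero random initialization, y hat is close to uniform, so the loss starts out near log of the vocab size.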

So this is called the skip-gram model, because it takes as input one word like orange and then tries to predict some word skipping a few words to the left or the right; that is, to predict what comes a little bit before or a little bit after the context word.
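As a rough sketch of how such (context, target) pairs might be generated, here is one hypothetical way to do it; the sentence, the window size, and the helper name are illustrative choices, not details from the lecture.

```python
import random

def make_pairs(tokens, window=10, n_pairs=5, seed=0):
    """Pick a context position, then a target within +/- `window` words of it."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        c = rng.randrange(len(tokens))                        # context position
        lo, hi = max(0, c - window), min(len(tokens), c + window + 1)
        t = rng.choice([i for i in range(lo, hi) if i != c])  # nearby target
        pairs.append((tokens[c], tokens[t]))
    return pairs

sentence = "i want a glass of orange juice to go along with my cereal".split()
print(make_pairs(sentence))
```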

Now, it turns out there are a couple of problems with using this algorithm, and the primary problem is computational speed. In particular, for the softmax model, every time you want to evaluate this probability, you need to carry out a sum over all 10,000 words in your vocabulary. And maybe 10,000 isn't too bad, but if you're using a vocabulary of size 100,000 or 1,000,000, it gets really slow to sum over this denominator every single time. In fact, even at 10,000 it's already quite slow, and it only gets harder to scale to larger vocabularies.

So there are a few solutions to this. One which you see in the literature is to use a hierarchical softmax classifier. What that means is, instead of trying to categorize something into all 10,000 categories in one go, imagine you have one classifier that tells you: is the target word in the first 5,000 words of the vocabulary, or in the second 5,000 words? Let's say this binary classifier tells you it's in the first 5,000 words; then a second classifier tells you whether it's in the first 2,500 words of the vocab or in the second 2,500 words, and so on, until eventually you get down to classifying exactly which word it is, at the leaf of this tree. So having a tree of classifiers like this means that each of the inner nodes of the tree can be just a binary classifier, and you don't need to sum over all 10,000 words in your vocab in order to make a single classification. In fact, the computational cost of classifying with a tree like this scales like the log of the vocab size rather than linearly in the vocab size. So this is called a hierarchical softmax classifier.
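To illustrate just the log-scaling intuition, here is a toy sketch of walking a perfectly balanced tree of binary decisions over the vocabulary. A real hierarchical softmax learns a binary classifier at each node; this sketch only counts the decisions.

```python
def path_to_word(word_idx, vocab_size):
    """Binary decisions needed to isolate one word in a balanced tree."""
    lo, hi, decisions = 0, vocab_size, []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if word_idx < mid:       # "is the target in the first half?"
            decisions.append(0)
            hi = mid
        else:
            decisions.append(1)
            lo = mid
    return decisions

path = path_to_word(4834, 10_000)  # e.g. the word juice
print(len(path))                   # about log2(10,000) ~ 14 decisions
```

So a single prediction costs roughly 14 binary decisions instead of one sum over 10,000 terms.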

I should mention that in practice, the hierarchical softmax classifier doesn't use a perfectly balanced or perfectly symmetric tree, with equal numbers of words on the left and right sides of each branch. In practice, the hierarchical softmax classifier can be developed so that the common words tend to be on top, whereas the less common words like durian can be buried much deeper in the tree. Because you see the more common words more often, you might need only a few traversals to get to common words like the and of, whereas you see less frequent words like durian much less often, so it's okay for those to be buried deep in the tree, since you don't need to go that deep as often. So there are various heuristics for building the tree used in the hierarchical softmax classifier, discussed in the papers referenced on the first slide.

But I won't spend too much more time on this, because in the next video we'll talk about a different method, called negative sampling, which I think is even simpler, and which also works really well for speeding up the softmax classifier and the problem of needing to sum over the entire vocab size in the denominator. So you'll see more of that in the next video. But before moving on, one quick topic I want you to understand is how to sample the context c.

So once you sample the context c, the target t can be sampled within, say, a plus-or-minus ten-word window of the context c. But how do you choose the context c? One thing you could do is just sample uniformly at random from your training corpus. When you do that, you find that there are some words, like the, of, a, and, to, and so on, that appear extremely frequently. And so, if you sample that way, your context-to-target mapping pairs will contain these types of words extremely frequently, whereas other words like orange, apple, and also durian don't appear that often. And maybe you don't want your training set to be dominated by these extremely frequently occurring words, because then you'd spend almost all the effort updating e_c for those frequent words. Instead, you want to make sure that you spend some time updating the embeddings even for less common words like durian.

So in practice, the distribution P(c) over context words isn't taken entirely uniformly at random from the training set. Instead, there are different heuristics that you can use to balance out sampling the common words together with the less common words.
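One such heuristic, subsampling of frequent words, appears in the original word2vec work: a word is kept with a probability that shrinks as its corpus frequency grows. The sketch below assumes that form; the frequencies and the threshold t are made-up illustrative values.

```python
import math

def keep_prob(freq, t=1e-5):
    """Keep word w with probability min(1, sqrt(t / f(w))).

    f(w) is the word's corpus frequency; t=1e-5 is an assumed threshold.
    """
    return min(1.0, math.sqrt(t / freq))

# Illustrative (made-up) frequencies: "the" is very common, "durian" is rare.
freqs = {"the": 0.05, "of": 0.03, "orange": 1e-4, "durian": 1e-6}
kept = {w: keep_prob(f) for w, f in freqs.items()}
print(kept)  # frequent words are sampled rarely; rare words are always kept
```

Under this rule, "the" would be kept only about 1.4% of the time, while "durian" is always kept, which shifts training effort toward the rarer embeddings.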


So that's it for the Word2Vec skip-gram model. If you read the original paper that I referenced earlier, you'll find that the paper actually had two versions of this Word2Vec model; the skip-gram was one, and the other is called CBOW, the continuous bag-of-words model, which takes the surrounding context of a middle word and uses those surrounding words to try to predict the middle word. That algorithm also works; it has some advantages and disadvantages.

But the key problem with the skip-gram model as presented so far is that the softmax step is very expensive to calculate, because of the need to sum over your entire vocabulary size in the denominator of the softmax. In the next video, I'll show you an algorithm that modifies the training objective to make it run much more efficiently, which lets you apply it to a much bigger training set as well, and therefore learn much better word embeddings. Let's go on to the next video.
