Introduction to Word2Vec

Word2Vec is a neural-network algorithm that learns word representations from the context in which words appear. In a Vector Space Model (VSM), every textual document is represented as a vector, and the simplest way to convert text into vectors is One Hot Encoding: with a vocabulary of three words, each word becomes a vector in a three-dimensional space.
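As a minimal sketch of One Hot Encoding, the toy vocabulary below (the words themselves are illustrative assumptions) maps each word to a vector with a single 1:

```python
# One Hot Encoding for a toy vocabulary of three words.
vocab = ["king", "brave", "man"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("brave", vocab))  # [0, 1, 0]
```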

The problem with One Hot Encoding is that it does not capture similarity. In fact, as the graph above shows, every pair of one-hot vectors is at the same distance from every other, so a measure such as Euclidean Distance tells us nothing about which words are related. That is why the Word2Vec data generation scheme, also known as Skipgram, is used. Word2Vec is a Word Embedding in which similarity comes from neighboring words.
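Skipgram data generation can be sketched as follows: each focus word is paired with the words inside a context window around it. The sentence and window size below are illustrative assumptions:

```python
# Generate (focus word, neighbor) training pairs for Skipgram.
def skipgram_pairs(tokens, window=1):
    pairs = []
    for i, focus in enumerate(tokens):
        # Neighbors within `window` positions, skipping the focus word itself.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "king", "is", "brave"], window=1))
# [('the', 'king'), ('king', 'the'), ('king', 'is'),
#  ('is', 'king'), ('is', 'brave'), ('brave', 'is')]
```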

In the example above, we converted each word into a One Hot Encoding, and we also encoded its neighbor as a One Hot vector. The architecture of Word2Vec is described below.

The example in the figure above trains on the word king as input and brave as its neighbor, using gradient descent as the optimizer. During backpropagation, the weights in the hidden layer are updated for each combination of words, and the inputs are multiplied by the updated weights. The weights continue to be updated for every word combination, based on the context of each phrase. The Softmax function creates the probability distribution over the vocabulary, and gradient descent minimizes the prediction error.
There is an interesting simulation here where we can train an ANN and watch how it develops.
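The training step described above can be sketched with numpy. This is a toy version, not the full Word2Vec implementation: the vocabulary, embedding size, and learning rate are illustrative assumptions.

```python
# A one hot input ("king") passes through a hidden layer, softmax produces a
# probability over the vocabulary, and gradient descent nudges the weights
# toward predicting the neighbor ("brave").
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "brave", "man"]
V, H = len(vocab), 2          # vocabulary size, hidden (embedding) size
W1 = rng.normal(size=(V, H))  # input -> hidden weights (the embeddings)
W2 = rng.normal(size=(H, V))  # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.eye(V)[vocab.index("king")]   # one hot focus word
y = np.eye(V)[vocab.index("brave")]  # one hot neighbor

for _ in range(100):
    h = x @ W1                     # hidden layer (embedding lookup)
    p = softmax(h @ W2)            # predicted neighbor distribution
    grad_out = p - y               # cross-entropy gradient at the output
    grad_h = grad_out @ W2.T       # backpropagate through the output layer
    W2 -= 0.1 * np.outer(h, grad_out)
    W1 -= 0.1 * np.outer(x, grad_h)

print(softmax((x @ W1) @ W2))  # probability mass shifts toward "brave"
```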

The crucial point is to be able to predict the Context Word from the Focus Word, that is, from the word at the current position in the sentence.

\[ p(c \mid w ; \theta)=\frac{\exp \left(v_{c} \cdot v_{w}\right)}{\sum_{c^{\prime} \in C} \exp \left(v_{c^{\prime}} \cdot v_{w}\right)} \]

From the function above, the probability of the context word given the focus word comes from the dot product between the context vector vc and the focus vector vw, normalized over all context words. The formula is a Softmax, which reminds us of the Sigmoid function of Logistic Regression, of which it is the multi-class generalization.
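The formula above can be computed directly from the vectors. The vectors below are illustrative assumptions:

```python
# p(c | w) as the softmax of the dot product between the context vector v_c
# and the focus vector v_w, normalized over all candidate context vectors.
import numpy as np

def context_probability(v_c, v_w, all_context_vectors):
    numerator = np.exp(np.dot(v_c, v_w))
    denominator = sum(np.exp(np.dot(v, v_w)) for v in all_context_vectors)
    return numerator / denominator

contexts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
v_w = np.array([0.5, 0.5])
probs = [context_probability(v_c, v_w, contexts) for v_c in contexts]
print(probs, sum(probs))  # the probabilities sum to 1
```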
One important detail of Word2Vec concerns the distribution used to sample context words. In practice, the unigram probability of each word is raised to the power of 3/4; the result is called the Negative Sampling Distribution.

As we can see from the figure above, raising to the power of 3/4 brings down frequent terms and brings up infrequent ones. As a result, we no longer focus only on super-frequent words: we also consider words in the middle range of the distribution, and we can explore more of its long tail. Raising to the power of 3/4 makes the Negative Sampling Distribution a little bit fatter and longer.
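The effect of the 3/4 power can be sketched with a small example. The word counts below are illustrative assumptions:

```python
# Unigram counts raised to the power 3/4, then renormalized: frequent words
# lose probability mass and rare words gain some, flattening the distribution.
import numpy as np

counts = np.array([1000, 100, 10, 1], dtype=float)  # frequent -> rare words
unigram = counts / counts.sum()
smoothed = counts ** 0.75
smoothed /= smoothed.sum()

print(unigram)   # the most frequent word dominates
print(smoothed)  # a flatter distribution with a "fatter and longer" tail
```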

The related theory can be found here.