Word2vec Algorithm

Buğra YELER
Jan 26, 2022 · 8 min read


Abstract

NLP is an actively researched field of computer science and one that is expected to shape the future as it develops. The most common way to apply deep learning to text is to map words into numerical matrices, and this can be done in several different ways. Word2vec, one of these methods, has been growing in popularity in recent years. The word2vec algorithm uses a neural network model to learn word associations from a large body of text. Word2vec is not a single algorithm; rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. It provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words.

1 Introduction

Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large body of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function, such as cosine similarity, indicates the level of semantic similarity between the words represented by those vectors.

It is important to understand word embeddings, because the overall output of word2vec is an embedding associated with each unique word passed through the algorithm. Word embedding is a technique in which individual words are transformed into a numerical representation, a vector. The vector that represents a word is learned with a model that resembles a neural network.
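As a minimal sketch of what this looks like in practice, such pre-trained vectors can be loaded with Gensim (which this article uses again below); the package name "glove-wiki-gigaword-50" is an assumption about which 50-dimensional GloVe model is used, not something stated in the article:

```python
# Minimal sketch: load 50-dimensional GloVe vectors through Gensim's downloader
# and inspect the embedding of "king".
# "glove-wiki-gigaword-50" is an assumed package name, chosen for illustration.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use

king = vectors["king"]   # a 50-dimensional NumPy array
print(king.shape)        # (50,)
print(king[:5])          # first few components of the embedding
```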

This is a word embedding for the word “king” (GloVe vector trained on Wikipedia):

[0.50451, 0.68607, -0.59517, -0.022801, 0.60046, -0.13498, -0.08813, 0.47377, -0.61798, -0.31012, -0.076666, 1.493, -0.034189, -0.98173, 0.68229, 0.81722, -0.51874, -0.31503, -0.55809, 0.66421, 0.1961, -0.13495, -0.11476, -0.30344, 0.41177, -2.223, -1.0756, -1.0783, -0.34354, 0.33505, 1.9927, -0.04234, -0.64319, 0.71125, 0.49159, 0.16754, 0.34344, -0.25663, -0.8523, 0.1661, 0.40102, 1.1685, -1.0137, -0.21585, -0.15155, 0.78321, -0.91241, -1.6106, -0.64426, -0.51042]

It’s a list of 50 numbers. We can’t tell much by looking at the values, but let’s visualize it a bit so we can compare it to other word vectors. Let’s put all these numbers in one row:

Figure 1: The List of 50 Elements of the “king” Word Vector

Let’s color code the cells based on their values: red if they’re close to 2, white if they’re close to 0, and blue if they’re close to -2:

Figure 2: The Color-Coded List of 50 Elements of the “king” Word Vector

Let’s now contrast “King” against other words:

Figure 3: Comparison of the Color-Coded “king” Vector with Other Word Vectors
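The figures above were produced elsewhere, but a rough sketch of a similar color-coded comparison, assuming the GloVe vectors loaded earlier and a diverging matplotlib colormap, could look like this:

```python
# Sketch: color-code a few 50-dimensional GloVe vectors as rows of a heatmap,
# roughly in the spirit of Figures 2 and 3 (red near +2, white near 0, blue near -2).
import gensim.downloader as api
import matplotlib.pyplot as plt
import numpy as np

vectors = api.load("glove-wiki-gigaword-50")    # assumed package name
words = ["king", "man", "woman", "queen"]
matrix = np.stack([vectors[w] for w in words])  # shape (4, 50)

plt.figure(figsize=(12, 2))
plt.imshow(matrix, cmap="RdBu_r", vmin=-2, vmax=2)  # diverging colormap centered at 0
plt.yticks(range(len(words)), words)
plt.xlabel("embedding dimension")
plt.colorbar()
plt.tight_layout()
plt.show()
```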

These vector representations capture quite a bit of the information, meaning, and associations of these words. According to these vectors, “Man” and “Woman” are much more similar to each other than either is to “King”.

The famous example that shows an incredible property of embeddings is the concept of analogies: we can add and subtract word embeddings and arrive at interesting results. The most famous example is the expression “king” - “man” + “woman”:

Using the Gensim library in Python, we can add and subtract word vectors, and Gensim will find the words most similar to the resulting vector. Figure 4 shows a list of the most similar words, each with its cosine similarity.

Figure 4: Result of the Analogy Computed with Word2vec Using the Gensim Library in Python
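A minimal sketch of how such a result could be reproduced with Gensim’s most_similar, assuming the same 50-dimensional GloVe vectors as above, is:

```python
# Sketch of the "king - man + woman" analogy: most_similar adds the "positive"
# vectors, subtracts the "negative" ones, and returns the nearest words by
# cosine similarity.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # assumed package name
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=5)
for word, similarity in result:
    print(f"{word}: {similarity:.3f}")
```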

Also, we can visualize this analogy in Figure 5 as we did previously:

Figure 5: Visualized Result of the Analogy Computed with Word2vec Using the Gensim Library in Python

According to the results in Figure 4 and Figure 5, “king - man + woman” does not exactly equal “queen”, but “queen” is the closest word to the resulting vector in our data.

There are two main architectures behind the success of word2vec. Word2vec can utilize either of these two model architectures to produce a distributed representation of words: Continuous Bag-of-Words (CBOW) or Continuous Skip-Gram.

2 Continuous Bag-of-Words

This method takes the context of each word as the input and tries to predict the word corresponding to the context. The idea is that given a context, we want to know which word is most likely to appear in it.

For CBOW, all the examples in which the target word appears as the target are fed into the network, and the average of the extracted hidden layer values is taken. For example, assume that we only have two sentences, “This is a book.” and “It is a car.” To compute the representation of the word “a”, we need to feed these two examples into the neural network and take the average of the values in the hidden layer.
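To make the toy example concrete, here is a small sketch that enumerates the (context, target) pairs those two sentences produce; the lower-casing, tokenization, and window size of one are illustrative choices, not part of the description above:

```python
# Sketch: build (context words, target word) CBOW training pairs with a window of 1
# from the two toy sentences above.
sentences = [["this", "is", "a", "book"], ["it", "is", "a", "car"]]
window = 1

pairs = []
for tokens in sentences:
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)

# The pairs (['is', 'book'], 'a') and (['is', 'car'], 'a') are the ones whose
# hidden-layer values would be averaged to obtain the representation of "a".
```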

There are two types of the Continuous Bag-of-Words model: the one-word context and the multi-word context.

2.1 One-Word Context

We assume that only one word is considered per context, which means the model will predict one target word given one context word, much like a bigram model.

Figure 6 shows the network model under the simplified context definition. In our setting, the vocabulary size is V, and the hidden layer size is N. The units on adjacent layers are fully connected. The input is a one-hot encoded vector, which means that for a given input context word, only one out of the V units, {x1, …, xV}, will be 1, and all other units are 0.

Figure 6: A Simple CBOW Model with One Word in the Context

W is the V×N weight matrix that maps the input x to the hidden layer, and W’ is the N×V weight matrix that maps the hidden layer outputs to the final output layer.

Figure 7: Calculation of One-word Context

Note that vw and v’w are two representations of the word w: vw comes from the rows of W, the input→hidden weight matrix, and v’w comes from the columns of W’, the hidden→output matrix. In the subsequent analysis, we call vw the “input vector” and v’w the “output vector” of the word w.

Figure 8: Vector Representation of the Input Word wI in the One-Word Context

The weights between the input layer and the hidden layer can be represented by the V × N matrix W. Each row of W is the N-dimensional vector representation vw of the associated word of the input layer; formally, row i of W is the transpose of the input vector of the i-th word. Given a context (a single word), and assuming xk = 1 and xk’ = 0 for k’ ≠ k, we obtain the expression in Figure 8, which essentially copies the k-th row of W to h.
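A tiny NumPy sketch, with toy sizes chosen here purely for illustration, makes this copying behavior explicit:

```python
# Sketch of the one-word-context forward step: with a one-hot input x (x_k = 1),
# h = W^T x is exactly the k-th row of W, i.e. the input vector of the context word.
import numpy as np

V, N = 10, 4                   # toy vocabulary size and hidden layer size
W = np.random.rand(V, N)       # input -> hidden weight matrix (rows are input vectors)

k = 3                          # index of the input context word
x = np.zeros(V)
x[k] = 1.0                     # one-hot encoding of the context word

h = W.T @ x                    # hidden layer output
assert np.allclose(h, W[k])    # identical to copying row k of W into h
```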

2.2 Multi-Word Context

When computing the hidden layer output, instead of directly copying the input vector of a single input context word, the CBOW model takes the average of the vectors of the input context words and uses the product of the input→hidden weight matrix and this average vector as the hidden layer output, as shown in Figure 9.

Figure 9: Vector Representation of the Input Words in the Multi-Word Context

The above model takes C context words. When W, the V×N input→hidden weight matrix, is used to calculate the hidden layer inputs, we take an average over all C context word inputs.
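A small NumPy sketch of this averaging step, again with toy sizes chosen for illustration:

```python
# Sketch of the multi-word-context hidden layer: average the C one-hot context
# inputs and multiply by W, which equals the average of the corresponding rows of W.
import numpy as np

V, N, C = 10, 4, 3                        # toy vocabulary size, hidden size, context size
W = np.random.rand(V, N)                  # input -> hidden weight matrix
context_indices = [1, 5, 7]               # indices of the C context words

X = np.zeros((C, V))
X[np.arange(C), context_indices] = 1.0    # one one-hot row per context word

h = W.T @ X.mean(axis=0)                  # h = (1/C) * W^T (x_1 + ... + x_C)
assert np.allclose(h, W[context_indices].mean(axis=0))
```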

Figure 10 shows the CBOW model with a multi-word context setting.

Figure 10: A Simple CBOW Model with Multiple Words in the Context

3 Continuous Skip-Gram

While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the neighbors of a word, given the word itself. It is the opposite of the CBOW model. The target word is now at the input layer, and the context words are on the output layer. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped. The context of a word can be represented through a set of skip-gram pairs of (target_word, context_word) where context_word appears in the neighboring context of target_word.

Take a look at this table of skip-grams for target words based on different window sizes.

Figure 11: Example of Skip-Gram
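One way to enumerate such pairs for a toy sentence is sketched below; the sentence and window sizes are illustrative choices, not taken from the table above:

```python
# Sketch: enumerate (target_word, context_word) skip-gram pairs for a toy sentence;
# window_size is the number of neighbors taken on each side of the target.
def skipgram_pairs(tokens, window_size):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in skipgram_pairs(sentence, window_size=2):
    print(pair)
```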

The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words w1, w2, … wT, the objective can be written as the average log probability

Figure 12: Average Log Probability
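Written out explicitly (this is the standard skip-gram objective from the word2vec literature, in the notation used here), the quantity in Figure 12 is:

```latex
% Average log probability maximized by the skip-gram model over a training
% sequence w_1, ..., w_T, where c is the size of the training context.
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\ j \ne 0} \log p\!\left(w_{t+j}\mid w_{t}\right)
```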

where c is the size of the training context. The basic skip-gram formulation defines this probability using the softmax function

Figure 13: The Basic Skip-Gram Formulation

where v and v’ are the target and context vector representations of the words, and W is the size of the vocabulary.
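In the same notation, the softmax in Figure 13 can be written as:

```latex
% Skip-gram softmax: probability of a context (output) word w_O given a target
% (input) word w_I, where v_w and v'_w are the input and output vectors of w
% and W is the number of words in the vocabulary.
p\!\left(w_{O}\mid w_{I}\right)
  = \frac{\exp\!\left({v'_{w_{O}}}^{\top} v_{w_{I}}\right)}
         {\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_{I}}\right)}
```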

The visualization of the above mathematical operations is shown in Figure 14.

Figure 14: A Simple Skip-Gram Model

4 Conclusion

Word embeddings are a crucial part of many NLP tasks, since they show a machine how humans understand language. Given a large corpus of text, word2vec generates an embedding vector for each word in the corpus. These embeddings are organized in such a way that words with comparable properties are clustered together. The two primary architectures connected with word2vec are the CBOW model and the skip-gram model. The CBOW model uses the surrounding context words to predict the missing target word, whereas the skip-gram model uses a single input word to predict its surrounding context words.

