A look under the hood of word2vec and node2vec


I work with networks, and node2vec is a topic that comes up often enough to make me curious. So, I wanted to understand what node2vec does. At a superficial level, the explanation is simple: we generate random walks on a network and pass them to word2vec. The problem with this answer is that it gives no insight unless you know what word2vec does.

Figuring out what word2vec does is not trivial. I could find two types of articles: those that give only a general overview without going into any depth, and technical ones that discuss the optimisation of the training but do not give a clear exposition of the idea behind it. I wanted a solid understanding of what the algorithm does before I started looking at the gory optimisation details. So, after I succeeded, I decided to write the article I would have wanted to find when I started reading about word2vec.

Structure

The article is divided into two parts, the main part and the appendix. I tried to include as little mathematics as possible in this article. However, I suspect that some readers will be left with questions. For this reason I have added an appendix, which contains some details I have omitted.

In the main part, I discuss the idea behind word embeddings and what the context of a word is, and I explain the word2vec algorithm using a toy example. In the appendix, I discuss the function which word2vec tries to maximise and a couple of details that were left untouched in the first part.


Table of Contents

  1. What is the point of word2vec?
  2. What is the context of a word?
  3. A more general view of word embedding
  4. Word2vec with skip-gram
  5. Word2vec with continuous bag of words
  6. Word similarity
  7. What is node2vec?
  8. Appendix
  9. What is the actual loss function of skip-gram?
  10. But what about CBOW?
  11. What about the beginning of sentences in node2vec?
  12. Links

What is the point of word2vec?

Word2vec is an algorithm for word embedding, i.e. a way to represent words by vectors in a meaningful way. The standard example is that if we take the vectors that represent the words Italy, France, Rome and Paris, then the vectors Rome - Italy and Paris - France are very similar (I explain what similar means later). In a way, this vector represents the relationship between a country and its capital. As a result, the vector France + Rome - Italy is very similar to the vector Paris. This means that a computer can guess the capital of France! This is truly mind blowing. Well, at least it was to me. The question I want to answer here is “How do we get such embeddings using word2vec?”. It is worth noting that word2vec is not the only way to get such embeddings, but it is one of the most popular ways.

What is the context of a word?

The general idea behind word embeddings is that the meaning of a word is captured in its context. This sentence is pretty vague, but we shouldn’t let this discourage us. In practice, it is assumed that the context of a word is the words which surround it in a sentence. But at which sentences should we look?

We start with what linguists call a corpus. This is a (usually huge) collection of texts that we believe is suitable for our needs. I will not go into the conversation about which corpus is suitable, as this is mostly a linguistic discussion. For the needs of this article, I chose my corpus to be the following two sentences:

London Bridge is falling down, falling down, falling down. London Bridge is falling down, my fair lady.

Then we need to decide which words should be considered distinct in our corpus. This is trivial in our toy example; in alphabetical order they are: bridge, down, fair, falling, is, lady, London, my.

In practice, this decision is far from trivial. There are many choices to be made here. For example, should the words am, are and is be considered different? What about am and was? Should phrasal verbs be considered one word or two? Should words with multiple meanings, like the word club, be split into multiple words? Should articles be discarded? Should any other words be discarded? Do commas carry important information? What about colons? However, since all these questions are of linguistic nature, I will just steer clear of them.

Once we have decided what our words are, we need to decide what the context of each word is. For our toy example, the context of each word will be defined by a symmetric window of width 1, that is, the word preceding and the word succeeding it. In practice, this is too narrow and the width is usually at least 4 or 5. Regardless, once we have decided the size of our window, we can create pairs between words and their contexts.

We immediately run into the next problem. The first word is London and there is no preceding word. We have several choices here. We may say that the preceding word in this case can be any other word with uniform probability. We may say that the probability is not uniform, but is equal to the frequency of each word in the corpus. We can use a ‘dummy’ word like BLANK. We can even drop the first few words until we get words with a well-defined context. As you can guess, this is also a linguistic choice and I will not go into this discussion.

Of course, the same problem appears at the beginning of each sentence. In this case, since the sentences are often related, we have the extra option of just ignoring the full stop and taking the words of the previous sentence. But what happens with a new paragraph? Is this still ok? What happens if it is the beginning of a new text in our corpus? You have probably guessed it by now, this discussion is a linguistic one and I will not go into it.

For the purpose of the present article, let’s assume that we have no such problems, we ignore full stops and commas and we decide that the word just before the first London is lady and the word after lady is London. Note that this is just a convenient choice and not a useful one. Now we can finally create the word-context pairs: (London: lady, bridge), (bridge: London, is) and so on. The results are shown in the following table.

Tally table of word-context pairs.
                    bridge  down  fair  falling  is  lady  London  my
(bridge,falling)         0     0     0        0   2     0       0   0
(down,bridge)            0     0     0        0   0     0       1   0
(down,down)              0     0     0        2   0     0       0   0
(down,fair)              0     0     0        0   0     0       0   1
(fair,London)            0     0     0        0   0     1       0   0
(falling,falling)        0     2     0        0   0     0       0   0
(falling,London)         0     1     0        0   0     0       0   0
(falling,my)             0     1     0        0   0     0       0   0
(is,down)                0     0     0        2   0     0       0   0
(lady,bridge)            0     0     0        0   0     0       1   0
(London,is)              2     0     0        0   0     0       0   0
(my,lady)                0     0     1        0   0     0       0   0
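If you want to reproduce this tally yourself, the following is a minimal Python sketch, not the code of any particular library. The variable names are mine, punctuation is dropped and the wrap-around convention described above is used (the word before the first London is lady).

```python
# Sketch: building the tally of word-context pairs for the toy corpus.
from collections import Counter

tokens = ("london bridge is falling down falling down falling down "
          "london bridge is falling down my fair lady").split()

tally = Counter()
n = len(tokens)
for i, word in enumerate(tokens):
    # symmetric window of width 1, wrapping around the ends of the corpus
    context = (tokens[(i - 1) % n], tokens[(i + 1) % n])
    tally[(context, word)] += 1

# e.g. tally[(("bridge", "falling"), "is")] == 2, as in the table above
```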

A more general view of word embedding

Recall that our declared goal is to embed the words in a vector space. If we take a second look at the table above, we will see that it actually does that. We can describe each word by its column in the table, which we can treat as a vector! This is actually the starting point of many word embedding algorithms.

Once we have these vectors, we need to decide which vectors are close to each other. Euclidean distance may seem the obvious choice. If the distance is small, then the two words appear in more or less the same contexts, so they seem to be used roughly interchangeably.

However, we may have two words that are synonyms, but one of them is more common than the other. The less common one will appear in the same contexts, but less often. This means that the Euclidean distance between the two vectors will be big, even though the vectors point in the same direction. We solve this problem by changing the way we measure distance in our vector space. Instead of using the Euclidean distance, we use the angle between the two vectors. If the angle is 0, then we expect that the words are synonyms. Well, in practice no one wants to compute the angle, so we instead compute the cosine of the angle.

The next problem we face is that the dimension of the vector space equals the number of contexts. In practice, this can be of the order of millions or more. However, since the vectors are sparse, we can use our favourite dimensionality reduction algorithm to get vectors in a vector space with a much smaller dimension. Speaking loosely, we could say that word2vec is one such algorithm.
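To make the classical, count-based approach concrete (this is not word2vec itself), here is a sketch using a truncated SVD from scikit-learn. The matrix is a random stand-in for a real word-context count matrix, with one row per word (the transpose of the tables in this article); names and shapes are illustrative.

```python
# Sketch: count-based embedding via truncated SVD (not word2vec).
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
counts = rng.random((8, 12))              # stand-in for a word-context count matrix

svd = TruncatedSVD(n_components=2)        # keep 2 dimensions, as in the toy example
embedding = svd.fit_transform(counts)     # one 2-dimensional vector per word
```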

Word2vec with skip-gram

Now we are ready to talk about the actual algorithm. First, we need to construct the frequency table from the tally table. We do that by dividing each column of the tally table by its sum, so that every column sums up to 1; the column of a word then gives the probability of each context given that word.

Frequency table of word-context pairs.
                    bridge  down  fair  falling  is  lady  London  my
(bridge,falling)         0     0     0        0   1     0       0   0
(down,bridge)            0     0     0        0   0     0     0.5   0
(down,down)              0     0     0      0.5   0     0       0   0
(down,fair)              0     0     0        0   0     0       0   1
(fair,London)            0     0     0        0   0     1       0   0
(falling,falling)        0   0.5     0        0   0     0       0   0
(falling,London)         0  0.25     0        0   0     0       0   0
(falling,my)             0  0.25     0        0   0     0       0   0
(is,down)                0     0     0      0.5   0     0       0   0
(lady,bridge)            0     0     0        0   0     0     0.5   0
(London,is)              1     0     0        0   0     0       0   0
(my,lady)                0     0     1        0   0     0       0   0

The way to read this table is that the word down is always preceded by the word falling and it is followed by the word falling 50% of the time, by the word London 25% of the time and by the word my 25% of the time.
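In code, the step from tally to frequencies is a single column-wise normalisation. Here is a minimal numpy sketch with the tally table hard-coded; the array layout mirrors the tables above.

```python
# Sketch: from the tally table to the frequency table.
import numpy as np

tally = np.array([
    # bridge down fair falling is lady London my
    [0, 0, 0, 0, 2, 0, 0, 0],   # (bridge, falling)
    [0, 0, 0, 0, 0, 0, 1, 0],   # (down, bridge)
    [0, 0, 0, 2, 0, 0, 0, 0],   # (down, down)
    [0, 0, 0, 0, 0, 0, 0, 1],   # (down, fair)
    [0, 0, 0, 0, 0, 1, 0, 0],   # (fair, London)
    [0, 2, 0, 0, 0, 0, 0, 0],   # (falling, falling)
    [0, 1, 0, 0, 0, 0, 0, 0],   # (falling, London)
    [0, 1, 0, 0, 0, 0, 0, 0],   # (falling, my)
    [0, 0, 0, 2, 0, 0, 0, 0],   # (is, down)
    [0, 0, 0, 0, 0, 0, 1, 0],   # (lady, bridge)
    [2, 0, 0, 0, 0, 0, 0, 0],   # (London, is)
    [0, 0, 1, 0, 0, 0, 0, 0],   # (my, lady)
], dtype=float)

# divide each column by its sum, so column j becomes P(context | word j)
frequencies = tally / tally.sum(axis=0)
```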

Then we represent each word by a one-hot vector, i.e. a vector whose components are all 0 except one which is 1. For example, the word bridge, which is alphabetically the first word, is represented by the vector (1, 0, 0, 0, 0, 0, 0, 0) and the word down, which is the second, by the vector (0, 1, 0, 0, 0, 0, 0, 0) and so on. Since we have 8 distinct words, the one-hot vectors have 8 components.

Now we construct a “neural network” and we teach it to spit out the frequency vector when it is given a word. I used quotes when I talked about the neural network because it is not exactly a neural network, at least not one of the types we are used to encountering.

The neural network we constructed for our toy example has 8 input nodes, 2 hidden nodes and 16 output nodes. The number of input nodes always equals the number of words in our corpus. The number of output nodes always equals the number of words in our corpus times the number of words in a context, i.e. twice the window width. The number of hidden nodes is our guess of how many dimensions we actually need. In this case I chose 2. Our neural network looks like the one in the following figure.

The skip-gram neural network: The input nodes are shown green, the hidden nodes yellow, the output nodes for the preceding word blue and the output nodes for the succeeding words red.

I said that this is not exactly a neural network. This is because the activation function of the hidden nodes is just the identity function, which means that the hidden nodes do nothing; they just spit out the input they get. The only non-linearity happens at the output nodes. These are split into 2 groups of 8. The first one corresponds to the preceding word in the context and the second to the succeeding word. We take the values of the first 8 output nodes and we consider them as one vector. We apply the softmax function to this vector. Then we take the remaining 8 output nodes and we apply the softmax function there too. The softmaxed vectors are our output and they are interpreted as probability distributions.

Now the neural network needs to be trained. This means that we need to find weights such that, when it is given the one-hot vector of a word, the neural network returns the probability distributions of the preceding word and the succeeding word. In our toy example, when we give it the vector (0, 1, 0, 0, 0, 0, 0, 0), which corresponds to the word down, we should get (0, 0, 0, 1, 0, 0, 0, 0) and (0, 0, 0, 0.5, 0, 0, 0.25, 0.25). In practice the loss function is not the obvious one, but something different. I will discuss this in the appendix.
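For readers who think in code, here is a minimal sketch of the forward pass of this toy network. The weight matrices and function names are mine, the weights are random (i.e. untrained), and no training loop is shown.

```python
# Sketch of the toy skip-gram forward pass: identity hidden layer,
# 16 outputs split into two groups of 8, each softmaxed separately.
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 2))    # input -> hidden weights
W_out = rng.normal(size=(2, 16))  # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(one_hot):
    hidden = one_hot @ W_in             # the 2-dimensional embedding lives here
    scores = hidden @ W_out
    p_preceding = softmax(scores[:8])   # distribution over the preceding word
    p_succeeding = softmax(scores[8:])  # distribution over the succeeding word
    return hidden, p_preceding, p_succeeding

down = np.array([0, 1, 0, 0, 0, 0, 0, 0], dtype=float)
hidden, p_prev, p_next = forward(down)
# After training, `hidden` would be the embedding of "down" and p_prev, p_next
# should approach the frequency vectors given in the text.
```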

Once the network is trained, we can give it the one-hot vectors of our words, read the values of the hidden nodes and use these values to represent our word. For instance, in our toy example the word down is represented by the vector (0.946, 1.041). And we are done! These values are the vector embedding of our words. Now the words can be positioned on the plane. Of course our toy example is too small to draw any conclusions from this embedding.

The skip-gram word embedding: The word embedding we get using skip-gram. Recall that we judge which words are similar by the angle between them. In this case all the words are similar.

Word2vec with continuous bag of words

Maybe when you were reading about skip-gram, you thought “Hang on, why do we try to guess the context of a word given the word? Wouldn’t the opposite make more sense?”. Well, we can do either. Continuous Bag of Words (CBOW) is just that.

The obvious way to do the opposite is by “flipping” our neural network and have 16 input nodes, 2 hidden nodes and 8 output nodes. Then we give the two one-hot vectors that correspond to the context and we ask to get a probability distribution of the word in the middle. There is a problem, however, with this approach. Let’s say we trained the network and now we want to get the embedding, i.e. the values of the hidden nodes given the one-hot vector of a word. We have two choices, we can give the one-hot vector as the preceding word or as the succeeding word, and we will get different values. The embedding is ambiguous!

The way to solve this is to throw away the order of the words in the context. We consider a context to be an unordered set (a bag of words). In our case it is just an unordered pair. Notice that we do not eliminate repeating words, we just ignore the order.

Now our neural network has 8 input nodes, 2 hidden nodes and 8 output nodes. The number of input nodes in this case is equal to the number of output nodes, which is equal to the number of words in our corpus. The number of hidden nodes is our guess of how many dimensions we need; in this case my choice is 2. The input is the average of the one-hot vectors in the context. For example, if the context is (bridge, falling), the input vector is (0.5, 0, 0, 0.5, 0, 0, 0, 0), which would have been the same if the context was (falling, bridge). Instead of the average, we could just take the sum, as this does not make any significant difference. I will not go into the details of why; I will just say that it boils down to the way we measure distance in this space. The neural network looks like the following figure.

The CBOW neural network: The input nodes are shown green, the hidden nodes yellow, the output nodes blue.

We train the network to give us the conditional probability distribution of a word given the average vector of a context. In our toy example, each context corresponds to only one word, so the network should be trained to return the one-hot vector of the word.
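As a small sketch of what the CBOW training pairs look like for the toy example (names and helpers are mine, not from any library):

```python
# Sketch: a CBOW training pair is (averaged context one-hots, target one-hot).
import numpy as np

vocab = ["bridge", "down", "fair", "falling", "is", "lady", "london", "my"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def cbow_pair(context, target):
    prev_word, next_word = context
    x = (one_hot(prev_word) + one_hot(next_word)) / 2  # order does not matter
    y = one_hot(target)
    return x, y

x, y = cbow_pair(("bridge", "falling"), "is")
# x == [0.5, 0, 0, 0.5, 0, 0, 0, 0], matching the example in the text
```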

Once our network is trained, we can get the word embedding the same way as before. We give the one-hot vector of a word and we note the values of the hidden nodes.

The CBOW word embedding: The word embedding we get using CBOW. In this case the words have a wider range.

As you can see, the two embeddings look quite different. But this shouldn’t surprise us; the corpus is so small that we should not expect results that are better than random.

Word similarity

Now that we have an embedding, either by using skip-gram or by using CBOW, we need to decide which words are similar. As we discussed before, Euclidean distance seems like the obvious choice, but the results are better when we use the angle. In the same spirit, we define the “similarity” of two vectors, \(\vec a\) and \(\vec b\), to be the cosine of their angle, i.e.

\[S(\vec a,\vec b) = \cos(\theta_{\vec a, \vec b}) = \frac{\vec a\cdot \vec b}{|\vec a||\vec b|}.\]

The values we get are between -1 and 1. If the similarity is 1, the two words are synonyms or at least that’s the case if we trained the neural network well. Unfortunately, similarity -1 does not mean that the two words are antonyms, it just means that they are not related.

This is how we end up saying that the vector Rome - Italy + France is similar to the vector Paris.
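In code, the similarity function and the analogy arithmetic are only a few lines. The vectors below are random placeholders (my own stand-ins, since our toy corpus is far too small for meaningful analogies); with a well-trained embedding the printed similarity would be close to 1.

```python
# Sketch: cosine similarity and the country-capital analogy.
import numpy as np

def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# pretend these four vectors came from a trained embedding
rome, italy, paris, france = np.random.default_rng(0).normal(size=(4, 100))

guess = france + rome - italy   # "the capital of France"
print(similarity(guess, paris)) # close to 1 for a good embedding
```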

What is node2vec?

Node2vec is what we get when we apply word2vec to a network. It is often presented as the best network classifier we have[citation needed]. I am not going to make any such claim, but it is definitely a useful method.

The goal of the method is to embed the nodes of a network in a vector space, and the idea behind this is quite simple. We give each node of our network a name, and these names are our words.

Then we need to generate our corpus. Our corpus consists of sentences which are generated by random walks on the graph. We choose the length of the random walks, say \(n\), and the number of random walks per node, say \(m\). We pick a node, we start a random walk there and we generate \(n\) steps. This is one sentence. We do this \(m\) times in total for each node, and these sentences form our corpus. Then we pass this corpus to the skip-gram version of the word2vec algorithm and it returns a vector embedding of our nodes.
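Here is a minimal sketch of this pipeline using networkx and gensim (assuming gensim 4.x, where the relevant parameter is vector_size). Note that the walks below are plain, unbiased random walks; the node2vec paper biases the walks with its return and in-out parameters, which I am glossing over here, and this is not the reference implementation.

```python
# Sketch: random-walk corpus + skip-gram word2vec on the walks.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, n=10, m=5):
    """Generate m walks of length n starting from every node."""
    walks = []
    for _ in range(m):
        for start in graph.nodes():
            walk = [start]
            for _ in range(n - 1):
                neighbours = list(graph.neighbors(walk[-1]))
                if not neighbours:
                    break
                walk.append(random.choice(neighbours))
            walks.append([str(node) for node in walk])  # word2vec expects "words"
    return walks

graph = nx.karate_club_graph()
corpus = random_walks(graph)
model = Word2Vec(corpus, vector_size=2, window=1, sg=1, min_count=1)
embedding = {node: model.wv[str(node)] for node in graph.nodes()}
```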

You may think “Wait, what happens with words at the beginning of the sentences? This is not a linguistic problem any more!” and you will be completely right. The short answer to this is that we don’t care, we let the specific implementation of word2vec deal with that. A longer answer is given in the appendix.


Appendix

I wanted to keep this article math-free as much as possible. This means that there are some important details which were not discussed, but if I were the reader, I would like to see them. So, if you are like me, read on.

What is the actual loss function of skip-gram?

I implied in the text that the objective of the neural network is to predict the probability distribution of the context words given the input word. However, for optimisation reasons, the algorithm does not do that in the obvious way.

By construction, the neural network takes a word and gives two normalised vectors which we interpret as probability distributions for the preceding and the succeeding word. So let’s define \(P_p(\text{word}_1\mid\text{word}_2)\) to be the conditional probability that, given word2, word1 is the preceding word. Let’s also define \(P_s(\text{word}_3\mid\text{word}_2)\) to be the conditional probability that, given word2, word3 is the succeeding word. These are the probabilities that we read from our neural network.

Let’s take our toy example and let’s look at the word down. The contexts for this word are: (falling, falling), (falling, falling), (falling, London), (falling, my). The preceding words are falling, falling, falling, falling and the succeeding words are falling, falling, London, my. This means that given the word down, the preceding word is falling 100% of the time and the succeeding word is falling 50% of the time, London 25% of the time and my 25% of the time.

The obvious optimisation goal would be to compute the difference between the softmaxed output vectors and the frequency vectors and try to minimise that. However, this is not what word2vec does. Instead, for the word down we sum, over all of its contexts, the logarithms of the conditional probabilities of the preceding words. We give this sum a name:

\[\require{color} \definecolor{wordcolor}{RGB}{232,62,140} F_p({\color{wordcolor}\text{down}}) = 4\log(P_p({\color{wordcolor}\text{falling}}|{\color{wordcolor}\text{down}})).\]

We also sum the logarithms of the conditional probabilities for the succeeding words and we give it a name too:

\[\require{color} \definecolor{wordcolor}{RGB}{232,62,140} F_s({\color{wordcolor}\text{down}}) = 2\log(P_s({\color{wordcolor}\text{falling}}|{\color{wordcolor}\text{down}}))+\log(P_s({\color{wordcolor}\text{London}}|{\color{wordcolor}\text{down}}))+\log(P_s({\color{wordcolor}\text{my}}|{\color{wordcolor}\text{down}})).\]

The loss function is the sum of \(F_p\) and \(F_s\) over all words. Technically, this should be called an objective function rather than a loss function, because word2vec does not try to minimise it but to maximise it; I will nevertheless keep calling it a loss function, since that is what it is usually called.
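Restating that sentence as a formula, the quantity that word2vec tries to maximise over the weights of the network is

\[F = \sum_{w\,\in\,\text{vocabulary}} \big( F_p(w) + F_s(w) \big),\]

where \(F_p(w)\) and \(F_s(w)\) are defined for every word exactly as they were defined for down above.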

However, the situation is not as straightforward as I have implied. This objective function does not have a unique maximum, which has the interesting implication that the result depends on the algorithm used to optimise it. In this case, the algorithm described in the articles [1,2] is mostly stochastic gradient descent on this function. I say mostly, because they also do what they call subsampling and negative sampling.

Subsampling means that the sample at each step of the stochastic gradient descent is not chosen uniformly. This helps because some words are a lot more common than others, and with uniform sampling we would need to train the model for extremely long times in order for the rare words to have any effect. Subsampling allows us to cut down the training time without suffering a loss of accuracy.

I will not go into the details of the optimisation algorithm, but briefly, negative sampling means the following. Keeping the same example with the word down: at every step of the stochastic gradient descent a few words that do not appear in the contexts of down are chosen, for example bridge and is, and the weights are updated so that the predicted probability for these words becomes smaller. So at each step, the algorithm does not only update the weights to make the context words more probable, but also actively tries to make some of the other words less probable. Given that, as we discussed, the result depends on the algorithm, this training method should be considered an integral part of word2vec.
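For reference, and glossing over many details, the negative-sampling objective from the word2vec papers replaces each log-softmax term with an expression roughly of the form

\[\log \sigma\!\left(\vec v_c \cdot \vec v_w\right) + \sum_{i=1}^{k} \log \sigma\!\left(-\vec v_{n_i} \cdot \vec v_w\right),\]

where \(\sigma\) is the logistic function, \(\vec v_w\) is the vector of the input word, \(\vec v_c\) is the (output-side) vector of a context word, and the \(n_i\) are the \(k\) randomly drawn negative words.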

But what about CBOW?

It is practically the same. We again try to maximise the sum of the logarithms of the conditional probabilities, this time of each word given its context, for every word-context pair.

What about the beginning of sentences in node2vec?

Since we are talking about networks now, this is not a linguistic problem any more. I wondered the same thing, but the original article does not go into these details. However, the code the group has shared uses the word2vec implementation of gensim.

The gensim version of word2vec deals with the beginning of sentences by, essentially, adding a dummy word, BLANK (I state this at a conceptual level; they do not actually add a new word).

In the case of our toy example, the first word-context pair becomes (London: BLANK, bridge). Now, when we compute the skip-gram loss function for our network, we ignore any terms which contain the word BLANK. In the case of CBOW, nothing really changes, we just take the average of all words in the context, ignoring BLANK.

