Word embeddings are probably one of the most beautiful and romantic ideas in the history of artificial intelligence. If Philosophy of Language is the branch of philosophy that explores the relationship between language and reality, and how we manage to hold meaningful conversations and understand one another, then this very specific technique in modern Natural Language Processing (NLP) is, in a sense, the empirical proof of the theories of Ludwig Wittgenstein, one of the most relevant philosophers of the last century. For Wittgenstein, using a word is a move in a social language game played by the members of a community who understand each other. The meaning of a word depends on its usefulness in a context, and not just on a 1:1 relationship with an object existing in the world.
For a large class of cases in which we employ the word “meaning” it can be defined thus: the meaning of a word is its use in the language.
Of course, knowing the exact meaning of a word is complex and involves many different aspects: the object in the world to which it may refer, its part of speech (POS), whether it is an idiomatic expression, all the nuances it may carry, and so on. In the end, all these aspects can be summarized in a single one: knowing how to use the word.
The concept of meaning, and why an ordered set of characters has a certain connotation in a language, is not only of philosophical interest: it is probably the greatest challenge that AI practitioners working on NLP face every day. As English speakers, we find it trivial to understand that a “dog” is an “animal” and that it is more similar to a “cat” than to a “dolphin”, but this task is far from easy to solve in a systematic way. Tweaking Wittgenstein’s theories a bit, we can say that dogs are similar to cats because they often appear in the same contexts: dogs and cats are more likely to be found near words like “house” and “garden” than near words like “sea” and “ocean”. This very intuition is the core of Word2Vec, one of the most famous and successful implementations of word embeddings. Machines today are still quite far from actually understanding long texts and passages, but word embeddings are arguably the single technique that has allowed the field to take the biggest step in that direction in the last decade.
An initial problem in a lot of Computer-Science-related tasks is representing data in numerical form; words and sentences are probably the most challenging type of data to represent that way. In our setting, words are selected from a vocabulary composed of D different words, and each word in the collection can be associated with a numerical index i. The classical approach is to represent each word as a numeric D-sized vector made of all zeros except for a 1 at position i (a one-hot vector). Consider as an example a vocabulary composed of only 3 words: “dog”, “cat” and “dolphin” (D=3). Each word can be represented as a 3-sized vector: “dog” corresponds to [1, 0, 0], “cat” to [0, 1, 0] and “dolphin”, obviously, to [0, 0, 1]. As a straightforward extension, a document can be represented as a D-sized vector in which the i-th element counts the occurrences of the i-th word in the document. This approach is called Bag of Words (BoW) and has been used for decades.
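To make the two representations concrete, here is a minimal sketch in plain Python using the three-word vocabulary from the example. The helper names `one_hot` and `bag_of_words` are mine, chosen only for illustration, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption.

```python
from collections import Counter

# Toy vocabulary from the text; the index of each word is its position.
vocabulary = ["dog", "cat", "dolphin"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a D-sized vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

def bag_of_words(document):
    """Return a D-sized vector counting the occurrences of each vocabulary word."""
    # Assumption: naive whitespace tokenization; unknown words are ignored.
    counts = Counter(token for token in document.lower().split()
                     if token in word_to_index)
    return [counts[word] for word in vocabulary]

print(one_hot("cat"))                           # [0, 1, 0]
print(bag_of_words("dog cat dog dolphin dog"))  # [3, 1, 1]
```

Note that the BoW vector is simply the sum of the one-hot vectors of the words in the document, which is why word order is lost.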
BoW, as successful as it was in the 90s, lacks the single coolest feature of words: meaning. We know that two words might have similar meanings even if they are completely different from an orthographic point of view. “Cat” and “dog” both denote pets, “king” and “queen” have almost the same meaning and differ only in gender, while “apple” and “cigarette” are completely unrelated to each other. We know that, but in the BoW model all these pairs are equally far apart in the vector space: any two distinct one-hot vectors are orthogonal, so their cosine distance is always 1. The same problem extends to documents: using BoW, two documents look similar only to the extent that they contain the exact same words a similar number of times. This is where Word2Vec comes to help, putting into Machine Learning terms a lot of the philosophical discussions that Wittgenstein made in his Philosophical Investigations 60 years earlier.
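A quick way to convince yourself of this is to measure the distance between every pair of one-hot vectors. The sketch below is only illustrative: the four-word vocabulary and the helper names are my own choices. Whichever metric is used, all pairs come out equally far apart.

```python
import math
from itertools import combinations

words = ["cat", "dog", "apple", "cigarette"]
# One-hot vectors: a 1 at the word's own index, 0 everywhere else.
one_hot = {word: [1 if i == j else 0 for j in range(len(words))]
           for i, word in enumerate(words)}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Every pair gets the same similarity and the same distance,
# regardless of how related the two words actually are.
for first, second in combinations(words, 2):
    similarity = cosine_similarity(one_hot[first], one_hot[second])
    distance = euclidean_distance(one_hot[first], one_hot[second])
    print(f"{first:>9} / {second:<9} cosine={similarity:.1f} euclidean={distance:.3f}")
```

“Cat” is exactly as far from “dog” as it is from “cigarette”: the representation carries no information about meaning at all.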
Given a dictionary of size D, where a word is identified by its index, the goal is to learn an N-sized vector representation for each word, with N<<D. Ideally, we want this N-sized vector to be dense and to capture some aspects of the word’s meaning. For example, we would like “dog” and “cat” to have similar representations, and “apple” and “cigarette” to be very distant in the vector space. We also want to be able to perform some basic algebraic operations on vectors, such as king+woman-man = queen, or to find that the offset between the vectors representing “actor” and “actress” is pretty much the same as the offset between “prince” and “princess”. Even though these goals sound quite utopian, experiments show that the vectors obtained via Word2Vec exhibit properties very close to these.
Word2Vec doesn’t learn these representations directly but gets them as a by-product of a classification task that needs no manual labels. A typical NLP dataset (called a corpus) is composed of a set of sentences; each word belonging to a sentence appears in a context of surrounding words. The goal of the classifier is to predict a target word given its context words as input. From a sentence like “a brown dog is in the garden”, the words [a, brown, is, in, the, garden] are provided as the model’s input and the word “dog” is the output to be predicted. This task is considered unsupervised (or, more precisely, self-supervised) learning because the corpus doesn’t need to be labeled using an external source of truth: given a set of sentences, it is always possible to create positive and negative examples automatically. Taking “a brown dog is in the garden” as a positive example, we can create plenty of negative samples such as “a brown plane is in the garden” or “a brown the is in the garden” by replacing the target word “dog” with random words extracted from the dataset.
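The automatic creation of positive and negative examples can be sketched in a few lines. This is a simplified illustration, not the actual Word2Vec sampling code: I assume the whole sentence acts as the context (real implementations use a fixed-size window), negatives are drawn uniformly (Word2Vec uses a frequency-based distribution), and `training_examples` is a hypothetical helper name.

```python
import random

def training_examples(tokens, vocabulary, num_negatives=2, seed=0):
    """Build (context, word, label) triples: label 1 for the real target word,
    label 0 for randomly sampled replacement words."""
    rng = random.Random(seed)
    examples = []
    for position, target in enumerate(tokens):
        # Assumption: every other word of the sentence is part of the context.
        context = tokens[:position] + tokens[position + 1:]
        examples.append((context, target, 1))
        # Negative samples: any vocabulary word except the true target.
        candidates = [word for word in vocabulary if word != target]
        for negative in rng.sample(candidates, num_negatives):
            examples.append((context, negative, 0))
    return examples

tokens = "a brown dog is in the garden".split()
vocabulary = sorted(set(tokens) | {"plane"})
examples = training_examples(tokens, vocabulary)

# Show the examples generated for the target word "dog".
for context, word, label in examples[6:9]:
    print(label, word, context)
```

No human ever had to label anything: the sentence itself tells us which word is the correct one for a given context.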
It is now quite clear where Wittgenstein’s theories come in: context is crucial for learning the embeddings, just as it is crucial in his theories for attaching meaning. Two words with similar meanings will have similar representations (a small distance in the N-dimensional space) simply because they often appear in similar contexts. So “cat” and “dog” will end up with close vectors: giving them similar embeddings is the most convenient thing the model can do to improve its performance in predicting the two words from their contexts.
The original paper proposes two different architectures: CBOW and Skip-gram. In both cases, the word representations are trained along with the architecture-specific classification task, yielding the vector embeddings that maximize the performance of the model.
CBOW stands for Continuous Bag of Words, and its goal is to correctly predict a word given its context as input. Inputs and output are provided as D-sized vectors and are projected into an N-sized space using shared weights. The weights used to project the D-sized vectors onto the N-sized internal ones are the embeddings we are looking for. Basically, the word embeddings are represented as a D x N matrix in which each row represents a word of the vocabulary. All the context words are projected into the same position, that is, their vector representations are averaged; therefore, the order of the words does not influence the projection.
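The forward pass of CBOW can be sketched as follows, under a few assumptions of mine: weights are randomly initialized and never trained, a second D x N matrix (`output_weights`) scores each vocabulary word against the averaged context, and a plain softmax is used instead of the faster approximations adopted in practice. The function name `cbow_forward` is hypothetical.

```python
import math
import random

def cbow_forward(context_ids, embeddings, output_weights):
    """Return a probability distribution over the vocabulary for the target word."""
    n = len(embeddings[0])
    # Multiplying a one-hot vector by the D x N matrix just selects a row,
    # so we look the rows up directly and average them.
    hidden = [sum(embeddings[i][k] for i in context_ids) / len(context_ids)
              for k in range(n)]
    # Score every vocabulary word against the averaged context vector.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    # Numerically stable softmax turns the scores into probabilities.
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
D, N = 5, 3  # toy sizes: 5 words in the vocabulary, 3-dimensional embeddings
embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(D)]
output_weights = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(D)]

probabilities = cbow_forward([0, 1, 3, 4], embeddings, output_weights)
print(probabilities)
```

Training would adjust both matrices so that the probability of the true target word goes up; once training is done, the rows of `embeddings` are the word vectors. Because of the averaging step, shuffling the context indices leaves the output unchanged.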
Skip-gram does the same thing in reverse: it tries to predict the C context words taking the target word as input. The problem of predicting multiple context words can be reshaped as a set of independent binary classification tasks, where the goal is to predict the presence (or absence) of each context word.
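As a rough sketch of this reshaping, each (target, context) pair becomes its own yes/no question, answered by a sigmoid over the dot product of the two vectors. The window size, the helper names and the use of a symmetric window are assumptions made for the example.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Return every (target, context) pair within `window` positions of each other."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_probability(target_vector, context_vector):
    """Binary classifier: probability that the context word occurs near the target."""
    return sigmoid(sum(t * c for t, c in zip(target_vector, context_vector)))

print(skipgram_pairs("a brown dog is in the garden".split(), window=1))
print(pair_probability([0.5, -0.2], [0.4, 0.1]))
```

Training pushes `pair_probability` towards 1 for pairs that really occur in the corpus and towards 0 for randomly sampled ones.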
As a rule of thumb, Skip-gram takes longer to train and often gives slightly better results but, as usual, different applications have different requirements and it’s hard to predict in advance which of the two will outperform the other. As simple as the concept looks, training this kind of architecture is quite a nightmare, due to the amount of data and the computational power needed to optimize the models’ weights. Luckily, some pre-trained word embeddings can be found online, and it’s possible to explore the vector space (the most fun part) with just a few lines of code.
On top of the classical Word2Vec, and following more or less the same approach, plenty of improvements have been proposed in recent years. The two most interesting and commonly used are GloVe (by Stanford University) and fastText (developed by Facebook), mainly because of the limitations of the original algorithm that they highlight and try to overcome.
In the original GloVe paper, the authors underline how training the model on separate local contexts poorly exploits the global statistics of the corpus. The first step to overcome this limitation is to create a global co-occurrence matrix X in which each element i,j counts the number of times word j appears in the context of word i. The second major contribution of the paper is the observation that the raw co-occurrence probabilities derived from X are not very robust in determining meaning on their own, whereas ratios of these probabilities expose certain aspects of meaning directly.
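Building the matrix X is conceptually simple. The following is a minimal sketch with a tiny made-up corpus: it stores X as a nested dictionary rather than a real matrix, uses a symmetric window of my choosing, and counts every co-occurrence as 1 (the GloVe implementation additionally down-weights distant word pairs, which is left out here).

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Return X where X[i][j] counts how often word j appears in the context of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    X[word][tokens[j]] += 1
    return X

# Made-up corpus, only meant to show the mechanics.
corpus = ["ice is solid", "steam is gas", "ice is cold water"]
X = cooccurrence_counts(corpus)
print(dict(X["ice"]))
print(dict(X["is"]))
```

Unlike the Word2Vec training loop, which sees one local window at a time, this single pass summarizes the whole corpus before any learning starts.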
Consider two words i and j that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take i = ice and j = steam. The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k. For words k related to ice but not steam, say k = solid, we expect the ratio Pik/Pjk will be large. Similarly, for words k related to steam but not ice, say k = gas, the ratio should be small. For words k like water or fashion, that are either related to both ice and steam, or to neither, the ratio should be close to one.
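The behaviour described in this passage is easy to reproduce numerically. In the sketch below the co-occurrence counts are invented by me purely for illustration (they are not the figures reported in the GloVe paper), and P_ik is estimated as the count X_ik divided by the total count of row i.

```python
# Invented co-occurrence counts X[i][k]; NOT real corpus statistics.
counts = {
    "ice":   {"solid": 80, "gas": 5,  "water": 300, "fashion": 2},
    "steam": {"solid": 4,  "gas": 90, "water": 280, "fashion": 2},
}

def probability(word, probe):
    """P_ik: how likely the probe word k is to appear in the context of word i."""
    row = counts[word]
    return row[probe] / sum(row.values())

def ratio(word_i, word_j, probe):
    """P_ik / P_jk: large if k relates to i only, small if k relates to j only."""
    return probability(word_i, probe) / probability(word_j, probe)

for probe in ["solid", "gas", "water", "fashion"]:
    print(f"P(ice,{probe}) / P(steam,{probe}) = {ratio('ice', 'steam', probe):.2f}")
```

With these toy numbers the ratio is well above one for “solid”, well below one for “gas”, and close to one for both “water” and “fashion”, even though the raw probabilities of those last two words differ by two orders of magnitude.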
This ratio of probabilities is the starting point for learning the embeddings. We want to compute representations that, when combined through a specific function F, reproduce this ratio in the embedding space.
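The function F and the dependence on the word k can be simplified: choosing F to be an exponential and absorbing the remaining terms into fixed biases gives the following weighted least-squares error function J, where w_i is the vector of word i, \tilde{w}_j is the vector of word j when it acts as a context word, and the sum runs over the D words of the vocabulary:

$$J = \sum_{i,j=1}^{D} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + b_j - \log X_{ij} \right)^2$$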
The function f is a weighting function that prevents both very frequent and very rare co-occurrences from being weighted too much, while b_i and b_j are biases used to restore the symmetry of the function. The last sections of the paper show that training this model is, in the end, not so different from training a classical Skip-gram model, even though the empirical tests reported there show GloVe outperforming both Word2Vec architectures.
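To see how the pieces fit together, here is a small sketch of the weighting function and of the error J, assuming the form given in the GloVe paper: each pair with a non-zero count contributes f(X_ij) times the squared difference between (w_i · w̃_j + b_i + b_j) and log X_ij. The cutoff x_max = 100 and the exponent 3/4 are the values suggested in the paper; the function names and the list-of-lists layout are my own choices.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows slowly for rare pairs and is capped at 1 for frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, W_context, b, b_context):
    """Weighted least-squares error J over all pairs with a non-zero count."""
    loss = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij > 0:  # f(0) = 0, so empty cells contribute nothing
                dot = sum(wi * wj for wi, wj in zip(W[i], W_context[j]))
                error = dot + b[i] + b_context[j] - math.log(x_ij)
                loss += weight(x_ij) * error ** 2
    return loss

# Tiny made-up example: 2 words, 2-dimensional vectors, all parameters at zero.
X = [[0.0, 20.0], [20.0, 0.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
W_context = [[0.0, 0.0], [0.0, 0.0]]
print(weight(1), weight(20), weight(500))
print(glove_loss(X, W, W_context, b=[0.0, 0.0], b_context=[0.0, 0.0]))
```

Training means running gradient descent on `W`, `W_context` and the biases until the dot products approximate the logarithm of the co-occurrence counts.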
fastText, on the other hand, raises a completely different criticism of Word2Vec: training a model starting from D-sized one-hot encoded vectors has the drawback of ignoring the internal structure of words. Instead of learning one representation per one-hot encoded word, fastText proposes to learn representations for character n-grams and to represent a word as the sum of its n-gram vectors. For example, with n=3, the word “flower” is encoded as 6 different 3-grams [<fl, flo, low, owe, wer, er>] plus the special sequence <flower>. Note how the angle brackets are used to mark the start and the end of a word. A word is thus represented by its index in the word dictionary and by the set of n-grams it contains, mapped to integers using a hashing function. This simple improvement makes it possible to share n-gram representations across words and to compute embeddings for words that did not appear in the training corpus.
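The n-gram extraction and the hashing step can be sketched like this. It is an illustration rather than fastText’s actual code: I use a single n instead of a range of lengths, a textbook 32-bit FNV-1a hash as a stand-in for the library’s hashing function, and an arbitrary number of buckets.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, plus the whole bracketed word."""
    wrapped = f"<{word}>"  # angle brackets mark the start and the end of the word
    ngrams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return ngrams + [wrapped]

def fnv1a_32(text):
    """32-bit FNV-1a hash (deterministic, unlike Python's built-in hash for strings)."""
    value = 2166136261
    for byte in text.encode("utf-8"):
        value ^= byte
        value = (value * 16777619) % 2 ** 32
    return value

def ngram_bucket(ngram, num_buckets=100_000):
    """Map an n-gram to an integer index into a fixed-size embedding table."""
    return fnv1a_32(ngram) % num_buckets

print(char_ngrams("flower"))
# ['<fl', 'flo', 'low', 'owe', 'wer', 'er>', '<flower>']
print([ngram_bucket(ngram) for ngram in char_ngrams("flower")])
```

The vector of “flower” is then the sum of the vectors stored at those indices. An unseen word such as “flowery” shares most of its n-grams with “flower”, so it still gets a sensible embedding.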
As promised, using these embeddings is just a matter of a few lines of Python code. I ran some experiments using a 50-dimensional GloVe model trained on 6 billion words, taken mainly from Wikipedia, and a 300-dimensional fastText model trained on Common Crawl (600 billion tokens). In this section, results coming from both are mixed, just to prove the concepts and to give a general understanding of the topic.
First of all, I wanted to test some basic word similarities, the simplest yet most important feature of word embeddings. As expected, the most similar words to “dog” are “cat” (0.92), “dogs” (0.85), “horse” (0.79), “puppy” (0.78) and “pet” (0.77). Note how the plural form has pretty much the same meaning as the singular. Again, for us this is quite trivial to say, but for a machine it is far from obvious. Now food: the most similar words to “pizza” are “sandwich” (0.87), “sandwiches” (0.86), “snack” (0.81), “bakery” (0.79), “fries” (0.79) and “burgers” (0.78). This makes sense: the results are satisfactory and the model behaves quite well.
The next step is to perform some basic algebra in the vector space to check whether some of the desired behaviors have been correctly learnt by our model. The word “actress” (0.94) can be obtained as the result of woman+actor-man, and “king” (0.86) as man+queen-woman. Generally speaking, if meaning-wise a : b = c : d, word d should be obtained as d = b-a+c. Taking this to the next level, it’s unbelievable how these vector operations are able to capture even geographical aspects: we know that Rome is the capital of Italy as Berlin is the capital of Germany, and indeed Berlin+Italy-Rome = Germany (0.88) and London+Germany-England = Berlin (0.83).
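The mechanics behind d = b-a+c fit in a handful of lines. The vectors below are hand-made 2-dimensional toys of my own invention (roughly “royalty” and “gender” axes), not real embeddings, and `analogy` is a hypothetical helper: it builds the query vector, discards the three input words, and returns the remaining word with the highest cosine similarity.

```python
import math

# Hand-made toy vectors (royalty, gender); purely illustrative, not learnt.
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Solve a : b = c : d by returning the word closest to b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The input words are excluded, otherwise they would often win trivially.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(query, vectors[word]))

print(analogy("man", "king", "woman", toy_vectors))  # queen
print(analogy("woman", "queen", "man", toy_vectors))  # king
```

With real pre-trained vectors the procedure is identical; only the dictionary is much bigger and the vectors have 50 or 300 dimensions instead of 2.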
And now the most fun part: following the very same idea, we can try adding and subtracting concepts to see what happens. For example, what is the American equivalent of pizza for Italians? pizza+America-Italy = burgers (0.60), followed by cheeseburgers (0.59). Since I moved to the Netherlands I have always said that this country is a mix of three things: a little bit of American capitalism, the Swedish cold and quality of life and, finally, a pinch of Neapolitan exuberance. Tweaking the original recipe a little by removing a bit of Swiss precision, we get Holland (0.68) as USA+Sweden+Naples-Switzerland: quite impressive, to be honest.
Good hands-on starting points for using these pre-trained embeddings can be found here and here. Gensim is an easy-to-use and complete library written in Python, with some ready-to-use algebraic and similarity functions. These pre-trained embeddings can be used in a lot of different (and useful) ways, for example to improve the performance of sentiment analyzers or language models. Rather than feeding these models (whatever their task is!) with one-hot-encoded words, using these N-sized vectors instead should improve performance significantly. Of course, training ad-hoc domain-specific embeddings can lead to even better performance but, again, the time and effort required to train this kind of architecture might be a little overkill.
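For reference, this is roughly what the Gensim side looks like. Treat it as a sketch under assumptions: it relies on Gensim’s downloader module and on the `glove-wiki-gigaword-50` package from its model catalogue, which may or may not be the exact file behind the numbers above. The three wrapper functions are hypothetical names of mine; `most_similar` with `positive` and `negative` lists is the Gensim method doing the actual work.

```python
def load_pretrained(name="glove-wiki-gigaword-50"):
    """Download (on first use) and load pre-trained vectors through Gensim."""
    # Imported lazily so the helpers below work without Gensim installed.
    import gensim.downloader as api
    return api.load(name)

def nearest_words(model, word, topn=5):
    """Return the `topn` words closest to `word`, with their cosine similarity."""
    return model.most_similar(positive=[word], topn=topn)

def solve_analogy(model, a, b, c, topn=1):
    """Solve a : b = c : d, i.e. look for the word closest to b - a + c."""
    return model.most_similar(positive=[b, c], negative=[a], topn=topn)
```

With Gensim installed and an internet connection, `model = load_pretrained()` followed by `nearest_words(model, "dog")` or `solve_analogy(model, "man", "actor", "woman")` is all it takes to start exploring. The GloVe vocabulary in that package is lowercased, so queries should be lowercase too.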