This post gives an overview of methods that learn a joint word embedding space across different languages.

Note: If you are looking for a survey paper, this blog post is also available as an article on arXiv.

In past blog posts, we discussed different models, objective functions, and hyperparameter choices that allow us to learn accurate word embeddings. However, these models are generally restricted to capturing representations of words in the language they were trained on. The availability of resources, training data, and benchmarks in English leads to a disproportionate focus on the English language and a neglect of the plethora of other languages spoken around the world.
In our globalised society, where national borders increasingly blur and the Internet gives everyone equal access to information, it is thus imperative that we not only seek to eliminate the gender and racial bias inherent in our representations, but also address our bias towards language.

To remedy this and level the linguistic playing field, we would like to leverage our existing knowledge in English to equip our models with the capability to process other languages.
Perfect machine translation (MT) would allow this. However, we do not actually need to translate examples, as long as we can project them into a common subspace such as the one in Figure 1.

Figure 1: A shared embedding space between two languages (Luong et al., 2015)
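As a concrete illustration of such a projection, here is a minimal sketch of one simple strategy: learning a linear map from a small bilingual seed dictionary via least squares, so that words of one language can be projected into the embedding space of another. The embeddings and dictionary entries below are hypothetical placeholders rather than real pre-trained vectors.

```python
import numpy as np

# Hypothetical pre-trained monolingual embeddings (word -> vector).
# In practice these would come from word2vec, GloVe, etc. trained
# separately on each language.
en_vecs = {"dog": np.random.randn(300), "cat": np.random.randn(300)}
de_vecs = {"Hund": np.random.randn(300), "Katze": np.random.randn(300)}

# A small seed dictionary of (source, target) translation pairs.
seed_dictionary = [("Hund", "dog"), ("Katze", "cat")]

# Stack the paired vectors into matrices X (source) and Y (target).
X = np.stack([de_vecs[src] for src, _ in seed_dictionary])
Y = np.stack([en_vecs[tgt] for _, tgt in seed_dictionary])

# Learn a linear projection W that maps German vectors into the
# English space by solving the least-squares problem min_W ||XW - Y||.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Any German word can now be projected into the shared (English) space.
hund_in_shared_space = de_vecs["Hund"] @ W
```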

Ultimately, our goal is to learn a shared embedding space between words in all languages. Equipped with such a vector space, we are able to train our models on data in any language. By projecting examples available in one language into this space, our model simultaneously obtains the capability to perform predictions in all other languages (we are glossing over some considerations here; for these, refer to this section). This is the promise of cross-lingual embeddings.
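To make this promise of zero-shot transfer more tangible, the following is a minimal sketch of cross-lingual document classification, assuming documents of both languages are already represented in a shared space. The document vectors below are random placeholders standing in for, e.g., averaged projected word embeddings; a classifier is trained on labelled English data only and applied directly to German documents.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 300

# Stand-ins for document representations in the shared space
# (hypothetical data in place of averaged, projected word vectors).
en_train_docs = rng.standard_normal((1000, dim))   # labelled English documents
en_train_labels = rng.integers(0, 2, size=1000)    # e.g. sentiment labels
de_test_docs = rng.standard_normal((200, dim))     # unlabelled German documents

# Train only on English data...
clf = LogisticRegression(max_iter=1000).fit(en_train_docs, en_train_labels)

# ...and predict directly on German documents: because both languages
# live in one shared embedding space, no German labels are required.
de_predictions = clf.predict(de_test_docs)
```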

Over the course of this blog post, I will give an overview of models and algorithms that have been used to come closer to this elusive goal of capturing the relations between words in multiple languages in a common embedding space.

Note: While neural MT approaches implicitly learn a shared cross-lingual embedding space by optimizing for the MT objective, we will focus on models that explicitly learn cross-lingual word representations throughout this blog post. These methods generally do so at a much lower cost than MT and can be considered to be to MT what word embedding models (word2vec, GloVe, etc.) are to language modelling.

Types of cross-lingual embedding models

In recent years, various models for learning cross-lingual representations have been proposed. In the following, we will order them by the type of approach that they employ. Note that while the nature of the parallel data used is equally discriminative and has been shown to account for inter-model performance differences, we consider the type of approach more conducive to understanding the assumptions a model makes and -- consequently -- its advantages and deficiencies.

Cross-lingual embedding models generally use four different approaches: