
Enriching Word Vectors with Subword Information
Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov
Facebook AI Research
{bojanowski,egrave,ajoulin,tmikolov}@fb.com
(The first two authors contributed equally.)
Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and makes it possible to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
1 Introduction
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on the left and two words on the right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.

In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.
Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017. Action Editor: Hinrich Schütze.
Submission batch: 9/2016; Revision batch: 12/2016; Published 6/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

2 Related work
Morphological word representations. In recent years, many methods have been proposed to incorporate morphological information into word representations. To model rare words better, Alexandrescu and Kirchhoff (2006) introduced factored neural language models, where words are represented as sets of features. These features might include morphological information, and this technique was successfully applied to morphologically rich languages, such as Turkish (Sak et al., 2010). Recently, several works have proposed different composition functions to derive representations of words from morphemes (Lazaridou et al., 2013; Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014). These approaches rely on a morphological decomposition of words, while ours does not. Similarly, Chen et al. (2015) introduced a method to jointly learn embeddings for Chinese words and characters. Cui et al. (2015) proposed to constrain morphologically similar words to have similar representations. Soricut and Och (2015) described a method to learn vector representations of morphological transformations, which makes it possible to obtain representations for unseen words by applying these rules. Word representations trained on morphologically annotated data were introduced by Cotterell and Schütze (2015). Closest to our approach, Schütze (1993) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations. Very recently, Wieting et al. (2016) also proposed to represent words using character n-gram count vectors. However, the objective function used to learn these representations is based on paraphrase pairs, while our model can be trained on any text corpus.
Character level features for NLP. Another area of research closely related to our work is character-level models for natural language processing. These models discard the segmentation into words and aim at learning language representations directly from characters. A first class of such models are recurrent neural networks, applied to language modeling (Mikolov et al., 2012; Sutskever et al., 2011; Graves, 2013; Bojanowski et al., 2015), text normalization (Chrupała, 2014), part-of-speech tagging (Ling et al., 2015) and parsing (Ballesteros et al., 2015). Another family of models are convolutional neural networks trained on characters, which were applied to part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (dos Santos and Gatti, 2014), text classification (Zhang et al., 2015) and language modeling (Kim et al., 2016). Sperr et al. (2013) introduced a language model based on restricted Boltzmann machines, in which words are encoded as a set of character n-grams. Finally, recent works in machine translation have proposed using subword units to obtain representations of rare words (Sennrich et al., 2016; Luong and Manning, 2016).
3 Model
In this section, we propose our model to learn word representations while taking into account morphology. We model morphology by considering subword units, and represent words by the sum of their character n-grams. We will begin by presenting the general framework that we use to train word vectors, then present our subword model, and finally describe how we handle the dictionary of character n-grams.
3.1 General model
We start by briefly reviewing the continuous skipgram model introduced by Mikolov et al. (2013b), from which our model is derived. Given a word vocabulary of size W, where a word is identified by its index w \in \{1, ..., W\}, the goal is to learn a vectorial representation for each word w. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context. More formally, given a large training corpus represented as a sequence of words w_1, ..., w_T, the objective of the skipgram model is to maximize the following log-likelihood:

    \sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t),

where the context C_t is the set of indices of words surrounding word w_t. The probability of observing a context word w_c given w_t will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function s which maps pairs of (word, context) to scores in \mathbb{R}.

One possible choice to define the probability of a context word is the softmax:

    p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.

However, such a model is not adapted to our case as it implies that, given a word w_t, we only predict one context word w_c.

The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:

    \log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in N_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),

where N_{t,c} is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function \ell : x \mapsto \log(1 + e^{-x}), we can re-write the objective as:

    \sum_{t=1}^{T} \sum_{c \in C_t} \Big[ \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, n)) \Big].

A natural parameterization for the scoring function s between a word w_t and a context word w_c is to use word vectors. Let us define for each word w in the vocabulary two vectors u_w and v_w in \mathbb{R}^d. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors u_{w_t} and v_{w_c}, corresponding, respectively, to words w_t and w_c. Then the score can be computed as the scalar product between word and context vectors as s(w_t, w_c) = u_{w_t}^\top v_{w_c}. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
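To make the objective above concrete, here is a minimal NumPy sketch of the per-position negative sampling loss. This is an illustration only, not the paper's C++ implementation; the array layout and function names are our own.

import numpy as np

def logistic_loss(x):
    # l(x) = log(1 + exp(-x)), the binary logistic loss used above
    return np.log1p(np.exp(-x))

def position_loss(u, v, t_idx, c_idx, neg_idx):
    """Negative log-likelihood for one (target, context) pair.

    u: (W, d) input word vectors; v: (W, d) output (context) vectors.
    t_idx, c_idx: indices of the target word w_t and the context word w_c.
    neg_idx: indices of the negative words in N_{t,c}.
    """
    s_pos = u[t_idx] @ v[c_idx]        # s(w_t, w_c)
    s_neg = v[neg_idx] @ u[t_idx]      # s(w_t, n) for every negative n
    return logistic_loss(s_pos) + logistic_loss(-s_neg).sum()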
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function s, in order to take this information into account.

Each word w is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, allowing us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams:

    <wh, whe, her, ere, re>

and the special sequence

    <where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater than or equal to 3 and smaller than or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
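The extraction step is easy to sketch in Python (an illustrative fragment with our own function name; the reference implementation is in C++):

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` with < and > boundary symbols,
    plus the special sequence for the word itself."""
    wrapped = "<" + word + ">"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    ngrams.add(wrapped)   # the word itself, e.g. "<where>"
    return ngrams

# char_ngrams("where", 3, 3) -> {'<wh', 'whe', 'her', 'ere', 're>', '<where>'}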
Suppose that we are given a dictionary of n-grams of size G. Given a word w, let us denote by G_w \subset \{1, \ldots, G\} the set of n-grams appearing in w. We associate a vector representation z_g to each n-gram g, and we represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

    s(w, c) = \sum_{g \in G_w} z_g^\top v_c.

This simple model allows sharing representations across words, and thus makes it possible to learn reliable representations for rare words.

In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant, see http://www.isthe.com/chongo/tech/comp/fnv). We set K = 2·10^6 below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
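Putting the pieces together, a minimal sketch of the hashing trick and of the subword scoring function s(w, c) might look as follows. It assumes the char_ngrams helper sketched earlier; the FNV-1a constants are the standard 32-bit ones, and everything else (such as initializing the n-gram vectors to zero) is simplified for illustration.

import numpy as np

K = 2_000_000   # number of hash buckets for n-grams (K = 2*10^6 in the text)
d = 300         # vector dimension

def fnv1a(s):
    # standard 32-bit FNV-1a hash of a UTF-8 encoded string
    h = 2166136261
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

z = np.zeros((K, d))   # one vector per n-gram hash bucket

def word_vector(word):
    """Represent a word as the sum of the vectors of its hashed n-grams."""
    buckets = [fnv1a(g) % K for g in char_ngrams(word)]
    return z[buckets].sum(axis=0)

def score(word, v_context):
    # s(w, c) = sum over the n-grams g of w of z_g . v_c
    return word_vector(word) @ v_context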
4 Experimental setup
4.1 Baseline
In most experiments (except in Sec. 5.3), we compare our model to the C implementation of the skipgram and cbow models from the word2vec package (https://code.google.com/archive/p/word2vec).
4.2 Optimization
We solve our optimization problem by performing stochastic gradient descent on the negative log-likelihood presented before. As in the baseline skipgram model, we use a linear decay of the step size. Given a training set containing T words and a number of passes over the data equal to P, the step size at time t is equal to \gamma_0 (1 - t / (T P)), where \gamma_0 is a fixed parameter. We carry out the optimization in parallel, by resorting to Hogwild (Recht et al., 2011). All threads share parameters and update vectors in an asynchronous manner.
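As a sketch, the decay schedule amounts to a one-liner (gamma_0, T and P as defined above; the running token count is what each thread would track in practice):

def step_size(gamma_0, tokens_processed, T, P):
    """Linearly decayed step size: gamma_0 * (1 - t / (T * P))."""
    return gamma_0 * (1.0 - tokens_processed / (T * P))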
4.3 Implementation details
For both our model and the baseline experiments, we use the following parameters: the word vectors have dimension 300. For each positive example, we sample 5 negatives at random, with probability proportional to the square root of the unigram frequency. We use a context window of size c, and uniformly sample the size c between 1 and 5. In order to subsample the most frequent words, we use a rejection threshold of 10^-4 (for more details, see (Mikolov et al., 2013b)). When building the word dictionary, we keep the words that appear at least 5 times in the training set. The step size \gamma_0 is set to 0.025 for the skipgram baseline and to 0.05 for both our model and the cbow baseline. These are the default values in the word2vec package and work well for our model too.
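The two sampling choices above can be sketched as follows. The negative-sampling distribution follows the text (square root of the unigram frequency), while the exact subsampling rule is our assumption, based on the formulation of Mikolov et al. (2013b):

import numpy as np

def negative_sampling_probs(counts):
    """Probability of drawing each word as a negative example,
    proportional to the square root of its unigram frequency."""
    p = np.sqrt(np.asarray(counts, dtype=np.float64))
    return p / p.sum()

def keep_probability(rel_freq, threshold=1e-4):
    """Probability of keeping an occurrence of a frequent word when
    subsampling (assumed form: sqrt(threshold / frequency), capped at 1)."""
    return min(1.0, float(np.sqrt(threshold / rel_freq)))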
Using this setting on English data, our model with character n-grams is approximately 1.5× slower to train than the skipgram baseline. Indeed, we process 105k words/second/thread versus 145k words/second/thread for the baseline. Our model is implemented in C++, and is publicly available (https://github.com/facebookresearch/fastText).
4.4 Datasets
Except for the comparison to previous work (Sec. 5.3), we train our models on Wikipedia data (https://dumps.wikimedia.org). We downloaded Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian. We normalize the raw Wikipedia data using Matt Mahoney's pre-processing perl script (http://mattmahoney.net/dc/textdata). All the datasets are shuffled, and we train our models by doing five passes over them.
5 Results
We evaluate our model in five experiments: an evaluation of word similarity and word analogies, a comparison to state-of-the-art methods, an analysis of the effect of the size of training data and of the size of character n-grams that we consider. We will describe these experiments in detail in the following sections.
5.1 Human similarity judgement
We first evaluate the quality of our representations on the task of word similarity / relatedness. We do so by computing Spearman's rank correlation coefficient (Spearman, 1904) between human judgement and the cosine similarity between the vector representations. For German, we compare the different models on three datasets: GUR65, GUR350 and ZG222 (Gurevych, 2005; Zesch and Gurevych, 2006). For English, we use the WS353 dataset introduced by Finkelstein et al. (2001) and the rare word dataset (RW), introduced by Luong et al. (2013). We evaluate the French word vectors on the translated dataset RG65 (Joubarne and Inkpen, 2011). Spanish, Arabic and Romanian word vectors are evaluated using the datasets described in (Hassan and Mihalcea, 2009). Russian word vectors are evaluated using the HJ dataset introduced by Panchenko et al. (2016).
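The evaluation protocol boils down to a few lines (a sketch using SciPy; dataset loading and the word-to-vector lookup, here the abstract function vec, are left out):

import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def evaluate_similarity(pairs, human_scores, vec):
    """pairs: list of (word1, word2); human_scores: gold similarity judgements;
    vec: function mapping a word to its vector (null vector or n-gram sum for OOV)."""
    model_scores = [cosine(vec(w1), vec(w2)) for w1, w2 in pairs]
    return spearmanr(human_scores, model_scores).correlation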
We report results for our method and baselines for all datasets in Table 1. Some words from these datasets do not appear in our training data, and thus, we cannot obtain word representations for these words using the cbow and skipgram baselines. In order to provide comparable results, we propose by default to use null vectors for these words. Since our model exploits subword information, we can also compute valid representations for out-of-vocabulary words. We do so by taking the sum of their n-gram vectors. When OOV words are represented using null vectors we refer to our method as sisg-, and as sisg otherwise (Subword Information Skip Gram).

            sg   cbow  sisg-  sisg
AR  WS353   51    52    54     55
DE  GUR350  61    62    64     70
DE  GUR65   78    78    81     81
DE  ZG222   35    38    41     44
EN  RW      43    43    46     47
EN  WS353   72    73    71     71
ES  WS353   57    58    58     59
FR  RG65    70    69    75     75
RO  WS353   48    52    51     54
RU  HJ      59    60    60     66

Table 1: Correlation between human judgement and similarity scores on word similarity datasets. We train both our model and the word2vec baseline on normalized Wikipedia dumps. Evaluation datasets contain words that are not part of the training set, so we represent them using null vectors (sisg-). With our model, we also compute vectors for unseen words by summing the n-gram vectors (sisg).
First, by looking at Table 1, we notice that the proposed model (sisg), which uses subword information, outperforms the baselines on all datasets except the English WS353 dataset. Moreover, computing vectors for out-of-vocabulary words (sisg) is always at least as good as not doing so (sisg-). This proves the advantage of using subword information in the form of character n-grams.

Second, we observe that the effect of using character n-grams is more important for Arabic, German and Russian than for English, French or Spanish. German and Russian exhibit grammatical declensions with four cases for German and six for Russian. Also, many German words are compound words; for instance the nominal phrase “table tennis” is written in a single word as “Tischtennis”. By exploiting the character-level similarities between “Tischtennis” and “Tennis”, our model does not represent the two words as completely different words.
Finally, we observe that on the English Rare Words dataset (RW), our approach outperforms the baselines while it does not on the English WS353 dataset. This is due to the fact that words in the English WS353 dataset are common words for which good vectors can be obtained without exploiting subword information. When evaluating on less frequent words, we see that using similarities at the character level between words can help learning good word vectors.

               sg    cbow   sisg
CS  Semantic  25.7   27.6   27.5
CS  Syntactic 52.8   55.0   77.8
DE  Semantic  66.5   66.8   62.3
DE  Syntactic 44.5   45.0   56.4
EN  Semantic  78.5   78.2   77.8
EN  Syntactic 70.1   69.9   74.9
IT  Semantic  52.3   54.7   52.3
IT  Syntactic 51.5   51.8   62.7

Table 2: Accuracy of our model and baselines on word analogy tasks for Czech, German, English and Italian. We report results for semantic and syntactic analogies separately.
5.2 Word analogy tasks
We now evaluate our approach on word analogy questions, of the form A is to B as C is to D, where D must be predicted by the models. We use the datasets introduced by Mikolov et al. (2013a) for English, by Svoboda and Brychcin (2016) for Czech, by Köper et al. (2015) for German and by Berardi et al. (2015) for Italian. Some questions contain words that do not appear in our training corpus, and we thus excluded these questions from the evaluation.
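Prediction of D is done with the usual vector-offset rule (a minimal sketch; excluding the three query words from the candidates follows common practice and is an assumption on our part):

import numpy as np

def predict_analogy(a, b, c, words, vectors):
    """Predict D such that A is to B as C is to D.
    words: vocabulary list; vectors: (V, d) matrix of L2-normalized word vectors."""
    idx = {w: i for i, w in enumerate(words)}
    target = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    target = target / np.linalg.norm(target)
    scores = vectors @ target            # cosine similarities (rows are normalized)
    for w in (a, b, c):                  # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]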
We report accuracy for the different models in Table 2. We observe that morphological information significantly improves the syntactic tasks; our approach outperforms the baselines. In contrast, it does not help for semantic questions, and even degrades the performance for German and Italian. Note that this is tightly related to the choice of the length of character n-grams that we consider. We show in Sec. 5.5 that when the size of the n-grams is chosen optimally, the semantic analogies degrade less.

Citations
Proceedings ArticleDOI

Deep contextualized word representations

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).
Journal ArticleDOI

Recent Trends in Deep Learning Based Natural Language Processing [Review Article]

TL;DR: This paper reviews significant deep learning related models and methods that have been employed for numerous NLP tasks and provides a walk-through of their evolution.
Posted Content

Cross-lingual Language Model Pretraining.

TL;DR: This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Proceedings Article

Word translation without parallel data

TL;DR: It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
References
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
Journal ArticleDOI

Learning representations by back-propagating errors

TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.
Journal ArticleDOI

Indexing by Latent Semantic Analysis

TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Proceedings ArticleDOI

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.