
Enriching Word Vectors with Subword Information
Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov
Facebook AI Research
{bojanowski,egrave,ajoulin,tmikolov}@fb.com
(The first two authors contributed equally.)
Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and makes it possible to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
1 Introduction
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988). These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on the left and two words on the right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.

In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.
Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017. Action Editor: Hinrich Schütze.
Submission batch: 9/2016; Revision batch: 12/2016; Published 6/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

2 Related work
Morphological word representations. In recent years, many methods have been proposed to incorporate morphological information into word representations. To model rare words better, Alexandrescu and Kirchhoff (2006) introduced factored neural language models, where words are represented as sets of features. These features might include morphological information, and this technique was successfully applied to morphologically rich languages, such as Turkish (Sak et al., 2010). Recently, several works have proposed different composition functions to derive representations of words from morphemes (Lazaridou et al., 2013; Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014). These approaches rely on a morphological decomposition of words, while ours does not. Similarly, Chen et al. (2015) introduced a method to jointly learn embeddings for Chinese words and characters. Cui et al. (2015) proposed to constrain morphologically similar words to have similar representations. Soricut and Och (2015) described a method to learn vector representations of morphological transformations, which makes it possible to obtain representations for unseen words by applying these rules. Word representations trained on morphologically annotated data were introduced by Cotterell and Schütze (2015). Closest to our approach, Schütze (1993) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations. Very recently, Wieting et al. (2016) also proposed to represent words using character n-gram count vectors. However, the objective function used to learn these representations is based on paraphrase pairs, while our model can be trained on any text corpus.
Character level features for NLP. Another area of research closely related to our work is character-level models for natural language processing. These models discard the segmentation into words and aim at learning language representations directly from characters. A first class of such models are recurrent neural networks, applied to language modeling (Mikolov et al., 2012; Sutskever et al., 2011; Graves, 2013; Bojanowski et al., 2015), text normalization (Chrupała, 2014), part-of-speech tagging (Ling et al., 2015) and parsing (Ballesteros et al., 2015). Another family of models are convolutional neural networks trained on characters, which were applied to part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (dos Santos and Gatti, 2014), text classification (Zhang et al., 2015) and language modeling (Kim et al., 2016). Sperr et al. (2013) introduced a language model based on restricted Boltzmann machines, in which words are encoded as a set of character n-grams. Finally, recent works in machine translation have proposed using subword units to obtain representations of rare words (Sennrich et al., 2016; Luong and Manning, 2016).
3 Model
In this section, we propose our model to learn word representations while taking into account morphology. We model morphology by considering subword units, and represent words by the sum of their character n-grams. We will begin by presenting the general framework that we use to train word vectors, then present our subword model, and finally describe how we handle the dictionary of character n-grams.
3.1 General model
We start by briefly reviewing the continuous skipgram model introduced by Mikolov et al. (2013b), from which our model is derived. Given a word vocabulary of size W, where a word is identified by its index w \in \{1, ..., W\}, the goal is to learn a vectorial representation for each word w. Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context. More formally, given a large training corpus represented as a sequence of words w_1, ..., w_T, the objective of the skipgram model is to maximize the following log-likelihood:

    \sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t),

where the context C_t is the set of indices of words surrounding word w_t. The probability of observing a context word w_c given w_t will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function s which maps pairs of (word, context) to scores in \mathbb{R}.

One possible choice to define the probability of a context word is the softmax:

    p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.

However, such a model is not adapted to our case as it implies that, given a word w_t, we only predict one context word w_c.

The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:

    \log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in N_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),

where N_{t,c} is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function \ell : x \mapsto \log(1 + e^{-x}), we can re-write the objective as:

    \sum_{t=1}^{T} \sum_{c \in C_t} \Big[ \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, n)) \Big].

A natural parameterization for the scoring function s between a word w_t and a context word w_c is to use word vectors. Let us define for each word w in the vocabulary two vectors u_w and v_w in \mathbb{R}^d. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors u_{w_t} and v_{w_c}, corresponding, respectively, to words w_t and w_c. Then the score can be computed as the scalar product between word and context vectors as s(w_t, w_c) = u_{w_t}^\top v_{w_c}. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
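To make the objective above concrete, here is a minimal NumPy sketch of the per-position negative sampling loss. This is an illustration only, not the paper's C++ implementation; the array layout and function names are our own.

import numpy as np

def logistic_loss(x):
    # l(x) = log(1 + exp(-x)), the binary logistic loss used above
    return np.log1p(np.exp(-x))

def position_loss(u, v, t_idx, c_idx, neg_idx):
    """Negative log-likelihood for one (target, context) pair.

    u: (W, d) input word vectors; v: (W, d) output (context) vectors.
    t_idx, c_idx: indices of the target word w_t and the context word w_c.
    neg_idx: indices of the negative words in N_{t,c}.
    """
    s_pos = u[t_idx] @ v[c_idx]        # s(w_t, w_c)
    s_neg = v[neg_idx] @ u[t_idx]      # s(w_t, n) for every negative n
    return logistic_loss(s_pos) + logistic_loss(-s_neg).sum()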
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function s, in order to take this information into account.

Each word w is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, allowing us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams:

    <wh, whe, her, ere, re>

and the special sequence

    <where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater than or equal to 3 and smaller than or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
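The extraction step is easy to sketch in Python (an illustrative fragment with our own function name; the reference implementation is in C++):

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` with < and > boundary symbols,
    plus the special sequence for the word itself."""
    wrapped = "<" + word + ">"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    ngrams.add(wrapped)   # the word itself, e.g. "<where>"
    return ngrams

# char_ngrams("where", 3, 3) -> {'<wh', 'whe', 'her', 'ere', 're>', '<where>'}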
Suppose that we are given a dictionary of n-grams of size G. Given a word w, let us denote by G_w \subset \{1, \ldots, G\} the set of n-grams appearing in w. We associate a vector representation z_g to each n-gram g, and we represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

    s(w, c) = \sum_{g \in G_w} z_g^\top v_c.

This simple model allows sharing representations across words, and thus makes it possible to learn reliable representations for rare words.

In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant, see http://www.isthe.com/chongo/tech/comp/fnv). We set K = 2·10^6 below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
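Putting the pieces together, a minimal sketch of the hashing trick and of the subword scoring function s(w, c) might look as follows. It assumes the char_ngrams helper sketched earlier; the FNV-1a constants are the standard 32-bit ones, and everything else (such as initializing the n-gram vectors to zero) is simplified for illustration.

import numpy as np

K = 2_000_000   # number of hash buckets for n-grams (K = 2*10^6 in the text)
d = 300         # vector dimension

def fnv1a(s):
    # standard 32-bit FNV-1a hash of a UTF-8 encoded string
    h = 2166136261
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

z = np.zeros((K, d))   # one vector per n-gram hash bucket

def word_vector(word):
    """Represent a word as the sum of the vectors of its hashed n-grams."""
    buckets = [fnv1a(g) % K for g in char_ngrams(word)]
    return z[buckets].sum(axis=0)

def score(word, v_context):
    # s(w, c) = sum over the n-grams g of w of z_g . v_c
    return word_vector(word) @ v_context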
4 Experimental setup
4.1 Baseline
In most experiments (except in Sec. 5.3), we compare our model to the C implementation of the skipgram and cbow models from the word2vec package (https://code.google.com/archive/p/word2vec).
4.2 Optimization
We solve our optimization problem by performing stochastic gradient descent on the negative log-likelihood presented before. As in the baseline skipgram model, we use a linear decay of the step size. Given a training set containing T words and a number of passes over the data equal to P, the step size at time t is equal to \gamma_0 (1 - t / (T P)), where \gamma_0 is a fixed parameter. We carry out the optimization in parallel, by resorting to Hogwild (Recht et al., 2011). All threads share parameters and update vectors in an asynchronous manner.
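As a sketch, the decay schedule amounts to a one-liner (gamma_0, T and P as defined above; the running token count is what each thread would track in practice):

def step_size(gamma_0, tokens_processed, T, P):
    """Linearly decayed step size: gamma_0 * (1 - t / (T * P))."""
    return gamma_0 * (1.0 - tokens_processed / (T * P))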
4.3 Implementation details
For both our model and the baseline experiments, we use the following parameters: the word vectors have dimension 300. For each positive example, we sample 5 negatives at random, with probability proportional to the square root of the unigram frequency. We use a context window of size c, and uniformly sample the size c between 1 and 5. In order to subsample the most frequent words, we use a rejection threshold of 10^-4 (for more details, see (Mikolov et al., 2013b)). When building the word dictionary, we keep the words that appear at least 5 times in the training set. The step size \gamma_0 is set to 0.025 for the skipgram baseline and to 0.05 for both our model and the cbow baseline. These are the default values in the word2vec package and work well for our model too.
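The two sampling choices above can be sketched as follows. The negative-sampling distribution follows the text (square root of the unigram frequency), while the exact subsampling rule is our assumption, based on the formulation of Mikolov et al. (2013b):

import numpy as np

def negative_sampling_probs(counts):
    """Probability of drawing each word as a negative example,
    proportional to the square root of its unigram frequency."""
    p = np.sqrt(np.asarray(counts, dtype=np.float64))
    return p / p.sum()

def keep_probability(rel_freq, threshold=1e-4):
    """Probability of keeping an occurrence of a frequent word when
    subsampling (assumed form: sqrt(threshold / frequency), capped at 1)."""
    return min(1.0, float(np.sqrt(threshold / rel_freq)))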
Using this setting on English data, our model with character n-grams is approximately 1.5× slower to train than the skipgram baseline. Indeed, we process 105k words/second/thread versus 145k words/second/thread for the baseline. Our model is implemented in C++, and is publicly available (https://github.com/facebookresearch/fastText).
4.4 Datasets
Except for the comparison to previous work (Sec. 5.3), we train our models on Wikipedia data (https://dumps.wikimedia.org). We downloaded Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian. We normalize the raw Wikipedia data using Matt Mahoney's pre-processing perl script (http://mattmahoney.net/dc/textdata). All the datasets are shuffled, and we train our models by doing five passes over them.
5 Results
We evaluate our model in five experiments: an evaluation of word similarity and word analogies, a comparison to state-of-the-art methods, an analysis of the effect of the size of training data and of the size of character n-grams that we consider. We will describe these experiments in detail in the following sections.
5.1 Human similarity judgement
We first evaluate the quality of our representations on the task of word similarity / relatedness. We do so by computing Spearman's rank correlation coefficient (Spearman, 1904) between human judgement and the cosine similarity between the vector representations. For German, we compare the different models on three datasets: GUR65, GUR350 and ZG222 (Gurevych, 2005; Zesch and Gurevych, 2006). For English, we use the WS353 dataset introduced by Finkelstein et al. (2001) and the rare word dataset (RW), introduced by Luong et al. (2013). We evaluate the French word vectors on the translated dataset RG65 (Joubarne and Inkpen, 2011). Spanish, Arabic and Romanian word vectors are evaluated using the datasets described in (Hassan and Mihalcea, 2009). Russian word vectors are evaluated using the HJ dataset introduced by Panchenko et al. (2016).
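The evaluation protocol boils down to a few lines (a sketch using SciPy; dataset loading and the word-to-vector lookup, here the abstract function vec, are left out):

import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def evaluate_similarity(pairs, human_scores, vec):
    """pairs: list of (word1, word2); human_scores: gold similarity judgements;
    vec: function mapping a word to its vector (null vector or n-gram sum for OOV)."""
    model_scores = [cosine(vec(w1), vec(w2)) for w1, w2 in pairs]
    return spearmanr(human_scores, model_scores).correlation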
We report results for our method and baselines for all datasets in Table 1. Some words from these datasets do not appear in our training data, and thus, we cannot obtain word representations for these words using the cbow and skipgram baselines. In order to provide comparable results, we propose by default to use null vectors for these words. Since our model exploits subword information, we can also compute valid representations for out-of-vocabulary words. We do so by taking the sum of their n-gram vectors. When OOV words are represented using null vectors we refer to our method as sisg-, and as sisg otherwise (Subword Information Skip Gram).

            sg   cbow  sisg-  sisg
AR  WS353   51    52    54     55
DE  GUR350  61    62    64     70
DE  GUR65   78    78    81     81
DE  ZG222   35    38    41     44
EN  RW      43    43    46     47
EN  WS353   72    73    71     71
ES  WS353   57    58    58     59
FR  RG65    70    69    75     75
RO  WS353   48    52    51     54
RU  HJ      59    60    60     66

Table 1: Correlation between human judgement and similarity scores on word similarity datasets. We train both our model and the word2vec baseline on normalized Wikipedia dumps. Evaluation datasets contain words that are not part of the training set, so we represent them using null vectors (sisg-). With our model, we also compute vectors for unseen words by summing the n-gram vectors (sisg).
First, by looking at Table 1, we notice that the proposed model (sisg), which uses subword information, outperforms the baselines on all datasets except the English WS353 dataset. Moreover, computing vectors for out-of-vocabulary words (sisg) is always at least as good as not doing so (sisg-). This proves the advantage of using subword information in the form of character n-grams.

Second, we observe that the effect of using character n-grams is more important for Arabic, German and Russian than for English, French or Spanish. German and Russian exhibit grammatical declensions with four cases for German and six for Russian. Also, many German words are compound words; for instance the nominal phrase “table tennis” is written in a single word as “Tischtennis”. By exploiting the character-level similarities between “Tischtennis” and “Tennis”, our model does not represent the two words as completely different words.
Finally, we observe that on the English Rare Words dataset (RW), our approach outperforms the baselines while it does not on the English WS353 dataset. This is due to the fact that words in the English WS353 dataset are common words for which good vectors can be obtained without exploiting subword information. When evaluating on less frequent words, we see that using similarities at the character level between words can help learning good word vectors.

               sg    cbow   sisg
CS  Semantic  25.7   27.6   27.5
CS  Syntactic 52.8   55.0   77.8
DE  Semantic  66.5   66.8   62.3
DE  Syntactic 44.5   45.0   56.4
EN  Semantic  78.5   78.2   77.8
EN  Syntactic 70.1   69.9   74.9
IT  Semantic  52.3   54.7   52.3
IT  Syntactic 51.5   51.8   62.7

Table 2: Accuracy of our model and baselines on word analogy tasks for Czech, German, English and Italian. We report results for semantic and syntactic analogies separately.
5.2 Word analogy tasks
We now evaluate our approach on word analogy questions, of the form A is to B as C is to D, where D must be predicted by the models. We use the datasets introduced by Mikolov et al. (2013a) for English, by Svoboda and Brychcin (2016) for Czech, by Köper et al. (2015) for German and by Berardi et al. (2015) for Italian. Some questions contain words that do not appear in our training corpus, and we thus excluded these questions from the evaluation.
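Prediction of D is done with the usual vector-offset rule (a minimal sketch; excluding the three query words from the candidates follows common practice and is an assumption on our part):

import numpy as np

def predict_analogy(a, b, c, words, vectors):
    """Predict D such that A is to B as C is to D.
    words: vocabulary list; vectors: (V, d) matrix of L2-normalized word vectors."""
    idx = {w: i for i, w in enumerate(words)}
    target = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    target = target / np.linalg.norm(target)
    scores = vectors @ target            # cosine similarities (rows are normalized)
    for w in (a, b, c):                  # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]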
We report accuracy for the different models in Table 2. We observe that morphological information significantly improves the syntactic tasks; our approach outperforms the baselines. In contrast, it does not help for semantic questions, and even degrades the performance for German and Italian. Note that this is tightly related to the choice of the length of character n-grams that we consider. We show in Sec. 5.5 that when the size of the n-grams is chosen optimally, the semantic analogies degrade less.

Citations
Proceedings ArticleDOI

Deep contextualized word representations

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).
Journal ArticleDOI

Recent Trends in Deep Learning Based Natural Language Processing [Review Article]

TL;DR: This paper reviews significant deep learning related models and methods that have been employed for numerous NLP tasks and provides a walk-through of their evolution.
Posted Content

Cross-lingual Language Model Pretraining.

TL;DR: This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Proceedings Article

Word translation without parallel data

TL;DR: It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
References
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
Journal ArticleDOI

Learning representations by back-propagating errors

TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.
Journal ArticleDOI

Indexing by Latent Semantic Analysis

TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Proceedings ArticleDOI

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.