Cross-Lingual Syntactic Transfer with Limited Resources
Mohammad Sadegh Rasooli and Michael Collins
Department of Computer Science, Columbia University
New York, NY 10027, USA
{rasooli,mcollins}@cs.columbia.edu
Abstract

We describe a simple but effective method for cross-lingual syntactic transfer of dependency parsers, in the scenario where a large amount of translation data is not available. This method makes use of three steps: 1) a method for deriving cross-lingual word clusters, which can then be used in a multilingual parser; 2) a method for transferring lexical information from a target language to source language treebanks; 3) a method for integrating these steps with the density-driven annotation projection method of Rasooli and Collins (2015). Experiments show improvements over the state-of-the-art in several languages used in previous work, in a setting where the only source of translation data is the Bible, a considerably smaller corpus than the Europarl corpus used in previous work. Results using the Europarl corpus as a source of translation data show additional improvements over the results of Rasooli and Collins (2015). We conclude with results on 38 datasets from the Universal Dependencies corpora.
1 Introduction

Creating manually-annotated syntactic treebanks is an expensive and time-consuming task. Recently there has been a great deal of interest in cross-lingual syntactic transfer, where a parsing model is trained for some language of interest, using only treebanks in other languages. There is a clear motivation for this in building parsing models for languages for which treebank data is unavailable. Methods for syntactic transfer include annotation projection methods (Hwa et al., 2005; Ganchev et al., 2009; McDonald et al., 2011; Ma and Xia, 2014; Rasooli and Collins, 2015; Lacroix et al., 2016; Agić et al., 2016), learning of delexicalized models on universal treebanks (Zeman and Resnik, 2008; McDonald et al., 2011; Täckström et al., 2013; Rosa and Zabokrtsky, 2015), treebank translation (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agić, 2016) and methods that leverage cross-lingual representations of word clusters, embeddings or dictionaries (Täckström et al., 2012; Durrett et al., 2012; Duong et al., 2015a; Zhang and Barzilay, 2015; Xiao and Guo, 2015; Guo et al., 2015; Guo et al., 2016; Ammar et al., 2016a).

* Michael Collins is on leave at Google Inc., New York.
This paper considers the problem of cross-lingual syntactic transfer with limited resources of monolingual and translation data. Specifically, we use the Bible corpus of Christodouloupoulos and Steedman (2014) as a source of translation data, and Wikipedia as a source of monolingual data. We deliberately limit ourselves to the use of Bible translation data because it is available for a very broad set of languages: the data from Christodouloupoulos and Steedman (2014) includes data from 100 languages. The Bible data contains a much smaller set of sentences (around 24,000) than other translation corpora, for example Europarl (Koehn, 2005), which has around 2 million sentences per language pair. This makes it a considerably more challenging corpus to work with. Similarly, our choice of Wikipedia as the source of monolingual data is motivated by the availability of Wikipedia data in a very broad set of languages.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 279–293, 2017. Action Editor: Yuji Matsumoto. Submission batch: 5/2016; Revision batch: 10/2016; 2/2017; Published 8/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

We introduce a set of simple but effective methods for syntactic transfer, as follows:

- We describe a method for deriving cross-lingual clusters, where words from different languages with a similar syntactic or semantic role are grouped in the same cluster. These clusters can then be used as features in a shift-reduce dependency parser.

- We describe a method for transfer of lexical information from the target language into source language treebanks, using word-to-word translation dictionaries derived from parallel corpora. Lexical features from the target language can then be integrated in parsing.

- We describe a method that integrates the above two approaches with the density-driven approach to annotation projection described by Rasooli and Collins (2015).
Experiments show that our model outperforms previous work on a set of European languages from the Google universal treebank (McDonald et al., 2013). We achieve 80.9% average unlabeled attachment score (UAS) on these languages; in comparison the work of Zhang and Barzilay (2015), Guo et al. (2016) and Ammar et al. (2016b) have a UAS of 75.4%, 76.3% and 77.8%, respectively. All of these previous works make use of the much larger Europarl (Koehn, 2005) corpus to derive lexical representations. When using Europarl data instead of the Bible, our approach gives 83.9% accuracy, a 1.7% absolute improvement over Rasooli and Collins (2015). Finally, we conduct experiments on 38 datasets (26 languages) in the Universal Dependencies v1.3 (Nivre et al., 2016) corpus. Our method has an average unlabeled dependency accuracy of 74.8% for these languages, more than 6% higher than the method of Rasooli and Collins (2015). Thirteen datasets (10 languages) have accuracies higher than 80.0%.¹
2 Background

This section gives a description of the underlying parsing models used in our experiments, the datasets used, and a baseline approach based on delexicalized parsing models.

¹ The parser code is available at https://github.com/rasoolims/YaraParser/tree/transfer.
2.1 The Parsing Model

We assume that the parsing model is a discriminative linear model, where given a sentence x, and a set of candidate parses Y(x), the output from the model is

    y*(x) = argmax_{y ∈ Y(x)} θ · φ(x, y)

where θ ∈ R^d is a parameter vector, and φ(x, y) is a feature vector for the pair (x, y). In our experiments we use the shift-reduce dependency parser of Rasooli and Tetreault (2015), which is an extension of the approach in Zhang and Nivre (2011). The parser is trained using the averaged structured perceptron (Collins, 2002).
We assume that the feature vector φ(x, y) is the concatenation of three feature vectors:

- φ^(p)(x, y) is an unlexicalized set of features. Each such feature may depend on the part-of-speech (POS) tag of words in the sentence, but does not depend on the identity of individual words in the sentence.

- φ^(c)(x, y) is a set of cluster features. These features require access to a dictionary that maps each word in the sentence to an underlying cluster identity. Clusters may, for example, be learned using the Brown clustering algorithm (Brown et al., 1992). The features may make use of cluster identities in combination with POS tags.

- φ^(l)(x, y) is a set of lexicalized features. Each such feature may depend directly on word identities in the sentence. These features may also depend on part-of-speech tags or cluster information, in conjunction with lexical information.

Appendix A has a complete description of the features used in our experiments.
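To make the three feature groups concrete, here is a minimal sketch of a sparse linear model scoring a single candidate dependency. The feature templates, token fields, and the `phi`/`score` helpers are invented for illustration; the paper's actual templates are the ones listed in its Appendix A.

```python
# Sketch: scoring a candidate dependency with a sparse linear model whose
# feature vector concatenates unlexicalized (p), cluster (c) and lexical (l)
# feature groups. Templates here are illustrative, not the paper's.
from collections import Counter

def phi(sentence, head_idx, mod_idx, clusters):
    """Features for a single candidate dependency (head -> modifier)."""
    head, mod = sentence[head_idx], sentence[mod_idx]
    feats = Counter()
    # phi^(p): POS-only features, shared across languages via the universal POS set
    feats[f"p:hpos={head['pos']}|mpos={mod['pos']}"] += 1
    # phi^(c): cluster features, usable cross-lingually given cross-lingual clusters
    feats[f"c:hclu={clusters.get(head['form'], 'NA')}|mpos={mod['pos']}"] += 1
    # phi^(l): lexicalized features on word identities
    feats[f"l:hword={head['form']}|mword={mod['form']}"] += 1
    return feats

def score(theta, feats):
    # theta . phi(x, y) for a sparse parameter vector theta (a dict)
    return sum(theta.get(name, 0.0) * value for name, value in feats.items())

sent = [{"form": "She", "pos": "PRON"}, {"form": "runs", "pos": "VERB"}]
clusters = {"runs": 17}
f = phi(sent, 1, 0, clusters)           # candidate arc: runs -> She
theta = {"p:hpos=VERB|mpos=PRON": 1.5}  # toy weights
print(score(theta, f))                  # 1.5: only the POS feature fires in theta
```

Because the model is a plain dot product, dropping a feature group (e.g. running delexicalized with only φ^(p)) just means omitting those coordinates.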
2.2 Data Assumptions

Throughout this paper we will assume that we have m source languages L_1 . . . L_m, and a single target language L_{m+1}. We assume the following data sources:

Source language treebanks. We have a treebank T_i for each language i ∈ {1 . . . m}.

Part-of-speech (POS) data. We have hand-annotated POS data for all languages L_1 . . . L_{m+1}. We assume that the data uses a universal POS set that is common across all languages.

Monolingual data. We have monolingual, raw text for each of the (m + 1) languages. We use D_i to refer to the monolingual data for the i'th language.

Translation data. We have translation data for all language pairs. We use B_{i,j} to refer to translation data for the language pair (i, j) where i, j ∈ {1 . . . (m + 1)} and i ≠ j.
In our main experiments we use the Google universal treebank (McDonald et al., 2013) as our source language treebanks² (this treebank provides universal dependency relations and POS tags), Wikipedia data as our monolingual data, and the Bible from Christodouloupoulos and Steedman (2014) as the source of our translation data. In additional experiments we use the Europarl corpus as a source of translation data, in order to measure the impact of using the smaller Bible corpus.
2.3 A Baseline Approach: Delexicalized Parsers with Self-Training

Given the data assumption of a universal POS set, the feature vectors φ^(p)(x, y) can be shared across languages. A simple approach is then to simply train a delexicalized parser using treebanks T_1 . . . T_m, using the representation φ(x, y) = φ^(p)(x, y) (see (McDonald et al., 2013; Täckström et al., 2013)).

Our baseline approach makes use of a delexicalized parser, with two refinements:
WALS properties. We use the six properties from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) to select a subset of closely related languages for each target language. These properties are shown in Table 1. The model for a target language is trained on treebank data from languages where at least 4 out of 6 WALS properties are common between the source and target language.³ This gives a slightly stronger baseline. Our experiments showed an improvement in average labeled dependency accuracy for the languages from 62.52% to 63.18%. Table 2 shows the set of source languages used for each target language. These source languages are used for all experiments in the paper.

Feature  Description
82A      Order of subject and verb
83A      Order of object and verb
85A      Order of adposition and noun phrase
86A      Order of genitive and noun
87A      Order of adjective and noun
88A      Order of demonstrative and noun

Table 1: The six properties from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) used to select the source languages for each target language in our experiments.

² We also train our best performing model on the newly released universal treebank v1.3 (Nivre et al., 2016). See §4.3 for more details.
Self-training. We use self-training (McClosky et al., 2006) to further improve parsing performance. Specifically, we first train a delexicalized model on treebanks T_1 . . . T_m; then use the resulting model to parse a dataset T_{m+1} that includes target-language sentences which have POS tags but do not have dependency structures. We finally use the automatically parsed data T′_{m+1} as the treebank data and retrain the model. This last model is trained using all features (unlexicalized, clusters, and lexicalized). Self-training in this way gives an improvement in labeled accuracy from 63.18% to 63.91%.
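The self-training loop can be sketched as follows, with `train` and `parse` as stand-ins for the real parser's training and decoding routines (the paper uses the shift-reduce parser of Rasooli and Tetreault (2015)); only the control flow is shown.

```python
# Sketch of the self-training step: train a delexicalized model on the source
# treebanks, parse raw POS-tagged target sentences with it, then retrain a
# full (lexicalized) model on the automatically parsed target data.
def self_train(source_treebanks, target_pos_sentences, train, parse):
    """train(data, lexicalized) -> model; parse(model, sent) -> tree."""
    # 1) delexicalized model on the concatenated source treebanks
    delex_model = train([t for tb in source_treebanks for t in tb],
                        lexicalized=False)
    # 2) parse target-language sentences that have POS tags but no trees
    auto_parsed = [parse(delex_model, s) for s in target_pos_sentences]
    # 3) retrain on the automatically parsed target data with all feature groups
    return train(auto_parsed, lexicalized=True)

# Toy stand-ins, just to exercise the control flow:
train = lambda data, lexicalized: {"n": len(data), "lex": lexicalized}
parse = lambda model, sent: ("tree", sent)
model = self_train([["t1", "t2"], ["t3"]], ["s1", "s2"], train, parse)
print(model)  # {'n': 2, 'lex': True}
```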
2.4 Translation Dictionaries

Our only use of the translation data B_{i,j} for i, j ∈ {1 . . . (m + 1)} is to construct a translation dictionary t(w, i, j). Here i and j are two languages, w is a word in language L_i, and the output w′ = t(w, i, j) is a word in language L_j corresponding to the most frequent translation of w into this language.

We define the function t(w, i, j) as follows: we first run the GIZA++ alignment process (Och and Ney, 2003) on the data B_{i,j}. We then keep intersected alignments between sentences in the two languages. Finally, for each word w in L_i, we define w′ = t(w, i, j) to be the target language word most frequently aligned to w in the aligned data. If a word w is never seen aligned to a target language word w′, we define t(w, i, j) = NULL.

³ There was no effort to optimize this choice; future work may consider more sophisticated sharing schemes.

Target  Sources
en      de, fr, pt, sv
de      en, fr, pt
es      fr, it, pt
fr      en, de, es, it, pt, sv
it      es, fr, pt
pt      en, de, es, fr, it, sv
sv      en, fr, pt

Table 2: The selected source languages for each target language in the Google universal treebank v2 (McDonald et al., 2013). A language is chosen as a source language if it has at least 4 out of 6 WALS properties in common with the target language.
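A minimal sketch of the dictionary construction in §2.4, assuming the intersected word alignments have already been produced (e.g. by GIZA++) and are given directly as index pairs; `build_dictionary` is a name invented here.

```python
# Sketch: deriving t(w, i, j) from word-aligned sentence pairs, keeping the
# most frequently aligned target word for each source word.
from collections import Counter, defaultdict

def build_dictionary(aligned_pairs):
    """aligned_pairs: list of (src_tokens, tgt_tokens, [(src_idx, tgt_idx), ...])."""
    counts = defaultdict(Counter)
    for src, tgt, alignment in aligned_pairs:
        for s_i, t_i in alignment:
            counts[src[s_i]][tgt[t_i]] += 1
    # most frequent translation per source word; unseen words map to None (NULL)
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

pairs = [
    (["the", "house"], ["das", "Haus"], [(0, 0), (1, 1)]),
    (["a", "house"], ["ein", "Haus"], [(1, 1)]),
    (["the", "dog"], ["der", "Hund"], [(0, 0), (1, 1)]),
]
t = build_dictionary(pairs)
print(t["house"])    # 'Haus'
print(t.get("cat"))  # None: never aligned, i.e. t(w, i, j) = NULL
```

Using only intersected alignments keeps the dictionary high-precision, which matters when the Bible provides so few sentence pairs.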
3 Our Approach

We now describe an approach that gives significant improvements over the baseline. §3.1 describes a method for deriving cross-lingual clusters, allowing us to add cluster features φ^(c)(x, y) to the model. §3.2 describes a method for adding lexical features φ^(l)(x, y) to the model. §3.3 describes a method for integrating the approach with the density-driven approach of Rasooli and Collins (2015). Finally, §4 describes experiments. We show that each of the above steps leads to improvements in accuracy.
3.1 Learning Cross-Lingual Clusters

We now describe a method for learning cross-lingual clusters. This follows previous work on cross-lingual clustering algorithms (Täckström et al., 2012). A clustering is a function C(w) that maps each word w in a vocabulary to a cluster C(w) ∈ {1 . . . K}, where K is the number of clusters. A hierarchical clustering is a function C(w, l) that maps a word w together with an integer l to a cluster at level l in the hierarchy. As one example, the Brown clustering algorithm (Brown et al., 1992) gives a hierarchical clustering. The level l allows cluster features at different levels of granularity.
A cross-lingual hierarchical clustering is a function C(w, l) where the clusters are shared across the (m + 1) languages of interest. That is, the word w can be from any of the (m + 1) languages. Ideally, a cross-lingual clustering should put words across different languages which have a similar syntactic and/or semantic role in the same cluster. There is a clear motivation for cross-lingual clustering in the parsing context. We can use the cluster-based features φ^(c)(x, y) on the source language treebanks T_1 . . . T_m, and these features will now generalize beyond these treebanks to the target language L_{m+1}.

Inputs: 1) Monolingual texts D_i for i = 1 . . . (m + 1); 2) a function t(w, i, j) that translates a word w ∈ L_i to w′ ∈ L_j; and 3) a parameter α such that 0 < α < 1.

Algorithm:
  D = {}
  for i = 1 to m + 1 do
    for each sentence s ∈ D_i do
      for p = 1 to |s| do
        Sample ā ∈ [0, 1)
        if ā ≥ α then
          continue
        Sample j ∼ unif{1, ..., m + 1} \ {i}
        w′ = t(s_p, i, j)
        if w′ ≠ NULL then
          Set s_p = w′
      D = D ∪ {s}
  Use the algorithm of Stratos et al. (2015) on D to learn a clustering C.

Output: The clustering C.

Figure 1: An algorithm for learning a cross-lingual clustering. In our experiments we used the parameter value α = 0.3.
We learn a cross-lingual clustering by leveraging the monolingual data sets D_1 . . . D_{m+1}, together with the translation dictionaries t(w, i, j) learned from the translation data. Figure 1 shows the algorithm that learns a cross-lingual clustering. The algorithm first prepares a multilingual corpus, as follows: for each sentence s in the monolingual data D_i, for each word in s, with probability α, we replace the word with its translation into some randomly chosen language. Once this data is created, we can easily obtain a cross-lingual clustering. The intuition behind this method is that by creating the cross-lingual data in this way, we bias the clustering algorithm towards putting words that are translations of each other in the same cluster.
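The corpus-mixing step of Figure 1 might be implemented as in the following sketch; `mix_corpora` and its arguments are hypothetical names, and the clustering step itself (Stratos et al., 2015) is not shown.

```python
# Sketch of the corpus-mixing step in Figure 1: with probability alpha,
# replace each token with its translation into a randomly chosen other
# language; the mixed corpus is then fed to a clustering algorithm.
import random

def mix_corpora(corpora, translate, alpha=0.3, seed=0):
    """corpora: dict lang -> list of sentences (lists of tokens);
    translate(w, i, j): most-frequent translation of w from lang i to j, or None."""
    rng = random.Random(seed)
    langs = list(corpora)
    mixed = []
    for i in langs:
        for sent in corpora[i]:
            sent = list(sent)  # copy; the input corpora are left untouched
            for p, w in enumerate(sent):
                if rng.random() >= alpha:  # keep the original token w.p. 1 - alpha
                    continue
                j = rng.choice([l for l in langs if l != i])
                w2 = translate(w, i, j)
                if w2 is not None:         # None stands in for NULL
                    sent[p] = w2
            mixed.append(sent)
    return mixed  # input to the clustering algorithm

corpora = {"en": [["house"]], "de": [["Haus"]]}
translate = lambda w, i, j: {"house": "Haus", "Haus": "house"}.get(w)
print(mix_corpora(corpora, translate, alpha=1.0))  # alpha=1.0 forces replacement
# [['Haus'], ['house']]
```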
3.2 Treebank Lexicalization

We now describe how to introduce lexical representations φ^(l)(x, y) to the model. Our approach is simple: we take the treebank data T_1 . . . T_m for the m source languages, together with the translation lexicons t(w, i, m + 1). For any word w in the source treebank data, we can look up its translation t(w, i, m + 1) in the lexicon, and add this translated form to the underlying sentence. Features can now consider lexical identities derived in this way. In many cases the resulting translation will be the NULL word, leading to the absence of lexical features. However, the representations φ^(p)(x, y) and φ^(c)(x, y) still apply in this case, so the model is robust to some words having a NULL translation.
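A sketch of this lexicalization step, assuming a simple token representation; the `tword` field holding the translated form is a name invented for this sketch.

```python
# Sketch: adding target-language translations to source-treebank tokens so
# that lexical features can fire on target-language word forms.
def lexicalize_treebank(treebank, source_lang, target_lang, translate):
    out = []
    for sentence in treebank:
        new_sent = []
        for tok in sentence:
            w2 = translate(tok["form"], source_lang, target_lang)
            # w2 may be None (NULL); POS and cluster features still apply then
            new_sent.append({**tok, "tword": w2})
        out.append(new_sent)
    return out

tb = [[{"form": "house", "pos": "NOUN"}]]
translate = lambda w, i, j: {"house": "Haus"}.get(w)
print(lexicalize_treebank(tb, "en", "de", translate))
# [[{'form': 'house', 'pos': 'NOUN', 'tword': 'Haus'}]]
```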
3.3 Integration with the Density-Driven Projection Method of Rasooli and Collins (2015)

In this section we describe a method for integrating our approach with the cross-lingual transfer method of Rasooli and Collins (2015), which makes use of density-driven projections.

In annotation projection methods (Hwa et al., 2005; McDonald et al., 2011), it is assumed that we have translation data B_{i,j} for a source and target language, and that we have a dependency parser in the source language L_i. The translation data consists of pairs (e, f) where e is a source language sentence, and f is a target language sentence. A method such as GIZA++ is used to derive an alignment between the words in e and f, for each sentence pair; the source language parser is used to parse e. Each dependency in e is then potentially transferred through the alignments to create a dependency in the target sentence f. Once dependencies have been transferred in this way, a dependency parser can be trained on the dependencies in the target language.

The density-driven approach of Rasooli and Collins (2015) makes use of various definitions of "density" of the projected dependencies. For example, P_100 is the set of projected structures where the projected dependencies form a full projective parse tree for the sentence; P_80 is the set of projected structures where at least 80% of the words in the projected structure are a modifier in some dependency. An iterative training process is used, where the parsing algorithm is first trained on the set T_100 of complete structures, and where progressively less dense structures are introduced in learning.
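The density buckets can be sketched as follows, under the simplifying assumption that density is the fraction of words attached as a modifier in the projected structure (the projectivity and completeness checks of the full method are omitted).

```python
# Sketch: bucketing projected dependency structures by "density", i.e. the
# percentage of words that are a modifier in some projected dependency.
def density(n_words, projected_heads):
    """projected_heads: dict modifier_index -> head_index for projected arcs."""
    return 100.0 * len(projected_heads) / n_words

def bucket(sentences, threshold):
    """Sentences whose projected-arc density meets the threshold (e.g. 100 or 80)."""
    return [s for s in sentences if density(s["n"], s["heads"]) >= threshold]

sents = [
    {"n": 2, "heads": {0: 1, 1: -1}},       # every word attached: density 100
    {"n": 4, "heads": {0: 1, 2: 1, 3: 2}},  # 3 of 4 words attached: density 75
]
print(len(bucket(sents, 100)), len(bucket(sents, 80)))  # 1 1
```

Training then starts from the densest bucket and progressively admits sparser ones.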
We integrate our approach with the density-driven approach of Rasooli and Collins (2015) as follows: consider the treebanks T_1 . . . T_m created using the lexicalization method of §3.2. We add all trees in these treebanks to the set P_100 of full trees used to initialize the method of Rasooli and Collins (2015). In addition we make use of the representations φ^(p), φ^(c) and φ^(l) throughout the learning process.
4 Experiments

This section first describes the experimental settings, then reports results.

4.1 Data and Tools

Data. In the first set of experiments, we consider 7 European languages studied in several pieces of previous work (Ma and Xia, 2014; Zhang and Barzilay, 2015; Guo et al., 2016; Ammar et al., 2016a; Lacroix et al., 2016). More specifically, we use the 7 European languages in the Google universal treebank (v.2; standard data) (McDonald et al., 2013). As in previous work, gold part-of-speech tags are used for evaluation. We use the concatenation of the treebank training sentences, Wikipedia data and the Bible monolingual sentences as our monolingual raw text. Table 3 shows statistics for the monolingual data. We use the Bible from Christodouloupoulos and Steedman (2014), which includes data for 100 languages, as the source of translations. We also conduct experiments with the Europarl data (both with the original set and a subset of it with the same size as the Bible) to study the effects of translation data size and domain shift. The statistics for the translation data are shown in Table 4.

In a second set of experiments, we run experiments on 38 datasets (26 languages) in the more recent Universal Dependencies v1.3 corpus (Nivre et al., 2016). The full set of languages we use is listed in Table 9.⁴

⁴ We excluded languages that are not completely present in the Bible of Christodouloupoulos and Steedman (2014) (An-

References

Philipp Koehn et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation.
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation.
Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective Self-Training for Parsing.