Cross-Lingual Syntactic Transfer with Limited Resources
Mohammad Sadegh Rasooli and Michael Collins
Department of Computer Science, Columbia University
New York, NY 10027, USA
{rasooli,mcollins}@cs.columbia.edu
Abstract

We describe a simple but effective method for cross-lingual syntactic transfer of dependency parsers, in the scenario where a large amount of translation data is not available. This method makes use of three steps: 1) a method for deriving cross-lingual word clusters, which can then be used in a multilingual parser; 2) a method for transferring lexical information from a target language to source language treebanks; 3) a method for integrating these steps with the density-driven annotation projection method of Rasooli and Collins (2015). Experiments show improvements over the state-of-the-art in several languages used in previous work, in a setting where the only source of translation data is the Bible, a considerably smaller corpus than the Europarl corpus used in previous work. Results using the Europarl corpus as a source of translation data show additional improvements over the results of Rasooli and Collins (2015). We conclude with results on 38 datasets from the Universal Dependencies corpora.
1 Introduction

Creating manually-annotated syntactic treebanks is an expensive and time-consuming task. Recently there has been a great deal of interest in cross-lingual syntactic transfer, where a parsing model is trained for some language of interest, using only treebanks in other languages. There is a clear motivation for this in building parsing models for languages for which treebank data is unavailable. Methods for syntactic transfer include annotation projection methods (Hwa et al., 2005; Ganchev et al., 2009; McDonald et al., 2011; Ma and Xia, 2014; Rasooli and Collins, 2015; Lacroix et al., 2016; Agić et al., 2016), learning of delexicalized models on universal treebanks (Zeman and Resnik, 2008; McDonald et al., 2011; Täckström et al., 2013; Rosa and Zabokrtsky, 2015), treebank translation (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agić, 2016) and methods that leverage cross-lingual representations of word clusters, embeddings or dictionaries (Täckström et al., 2012; Durrett et al., 2012; Duong et al., 2015a; Zhang and Barzilay, 2015; Xiao and Guo, 2015; Guo et al., 2015; Guo et al., 2016; Ammar et al., 2016a).

* Michael Collins is on leave at Google Inc., New York.
This paper considers the problem of cross-lingual syntactic transfer with limited resources of monolingual and translation data. Specifically, we use the Bible corpus of Christodouloupoulos and Steedman (2014) as a source of translation data, and Wikipedia as a source of monolingual data. We deliberately limit ourselves to the use of Bible translation data because it is available for a very broad set of languages: the data from Christodouloupoulos and Steedman (2014) includes data from 100 languages. The Bible data contains a much smaller set of sentences (around 24,000) than other translation corpora, for example Europarl (Koehn, 2005), which has around 2 million sentences per language pair. This makes it a considerably more challenging corpus to work with. Similarly, our choice of Wikipedia as the source of monolingual data is motivated by the availability of Wikipedia data in a very broad set of languages.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 279–293, 2017. Action Editor: Yuji Matsumoto. Submission batch: 5/2016; Revision batch: 10/2016; 2/2017; Published 8/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

We introduce a set of simple but effective methods for syntactic transfer, as follows:

- We describe a method for deriving cross-lingual clusters, where words from different languages with a similar syntactic or semantic role are grouped in the same cluster. These clusters can then be used as features in a shift-reduce dependency parser.

- We describe a method for transfer of lexical information from the target language into source language treebanks, using word-to-word translation dictionaries derived from parallel corpora. Lexical features from the target language can then be integrated in parsing.

- We describe a method that integrates the above two approaches with the density-driven approach to annotation projection described by Rasooli and Collins (2015).
Experiments show that our model outperforms previous work on a set of European languages from the Google universal treebank (McDonald et al., 2013). We achieve 80.9% average unlabeled attachment score (UAS) on these languages; in comparison the work of Zhang and Barzilay (2015), Guo et al. (2016) and Ammar et al. (2016b) have a UAS of 75.4%, 76.3% and 77.8%, respectively. All of these previous works make use of the much larger Europarl (Koehn, 2005) corpus to derive lexical representations. When using Europarl data instead of the Bible, our approach gives 83.9% accuracy, a 1.7% absolute improvement over Rasooli and Collins (2015). Finally, we conduct experiments on 38 datasets (26 languages) in the Universal Dependencies v1.3 (Nivre et al., 2016) corpus. Our method has an average unlabeled dependency accuracy of 74.8% for these languages, more than 6% higher than the method of Rasooli and Collins (2015). Thirteen datasets (10 languages) have accuracies higher than 80.0%.¹
2 Background

This section gives a description of the underlying parsing models used in our experiments, the datasets used, and a baseline approach based on delexicalized parsing models.

¹ The parser code is available at https://github.com/rasoolims/YaraParser/tree/transfer.
2.1 The Parsing Model

We assume that the parsing model is a discriminative linear model, where given a sentence x, and a set of candidate parses Y(x), the output from the model is

    y*(x) = argmax_{y ∈ Y(x)} θ · φ(x, y)

where θ ∈ R^d is a parameter vector, and φ(x, y) is a feature vector for the pair (x, y). In our experiments we use the shift-reduce dependency parser of Rasooli and Tetreault (2015), which is an extension of the approach in Zhang and Nivre (2011). The parser is trained using the averaged structured perceptron (Collins, 2002).
We assume that the feature vector φ(x, y) is the concatenation of three feature vectors:

- φ^(p)(x, y) is an unlexicalized set of features. Each such feature may depend on the part-of-speech (POS) tag of words in the sentence, but does not depend on the identity of individual words in the sentence.

- φ^(c)(x, y) is a set of cluster features. These features require access to a dictionary that maps each word in the sentence to an underlying cluster identity. Clusters may, for example, be learned using the Brown clustering algorithm (Brown et al., 1992). The features may make use of cluster identities in combination with POS tags.

- φ^(l)(x, y) is a set of lexicalized features. Each such feature may depend directly on word identities in the sentence. These features may also depend on part-of-speech tags or cluster information, in conjunction with lexical information.

Appendix A has a complete description of the features used in our experiments.
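To make the three feature groups concrete, here is a minimal sketch of a sparse linear model scoring a single candidate dependency. The feature templates, token fields, and the `phi`/`score` helpers are invented for illustration; the paper's actual templates are the ones listed in its Appendix A.

```python
# Sketch: scoring a candidate dependency with a sparse linear model whose
# feature vector concatenates unlexicalized (p), cluster (c) and lexical (l)
# feature groups. Templates here are illustrative, not the paper's.
from collections import Counter

def phi(sentence, head_idx, mod_idx, clusters):
    """Features for a single candidate dependency (head -> modifier)."""
    head, mod = sentence[head_idx], sentence[mod_idx]
    feats = Counter()
    # phi^(p): POS-only features, shared across languages via the universal POS set
    feats[f"p:hpos={head['pos']}|mpos={mod['pos']}"] += 1
    # phi^(c): cluster features, usable cross-lingually given cross-lingual clusters
    feats[f"c:hclu={clusters.get(head['form'], 'NA')}|mpos={mod['pos']}"] += 1
    # phi^(l): lexicalized features on word identities
    feats[f"l:hword={head['form']}|mword={mod['form']}"] += 1
    return feats

def score(theta, feats):
    # theta . phi(x, y) for a sparse parameter vector theta (a dict)
    return sum(theta.get(name, 0.0) * value for name, value in feats.items())

sent = [{"form": "She", "pos": "PRON"}, {"form": "runs", "pos": "VERB"}]
clusters = {"runs": 17}
f = phi(sent, 1, 0, clusters)           # candidate arc: runs -> She
theta = {"p:hpos=VERB|mpos=PRON": 1.5}  # toy weights
print(score(theta, f))                  # 1.5: only the POS feature fires in theta
```

Because the model is a plain dot product, dropping a feature group (e.g. running delexicalized with only φ^(p)) just means omitting those coordinates.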
2.2 Data Assumptions

Throughout this paper we will assume that we have m source languages L_1 . . . L_m, and a single target language L_{m+1}. We assume the following data sources:

Source language treebanks. We have a treebank T_i for each language i ∈ {1 . . . m}.

Part-of-speech (POS) data. We have hand-annotated POS data for all languages L_1 . . . L_{m+1}. We assume that the data uses a universal POS set that is common across all languages.

Monolingual data. We have monolingual, raw text for each of the (m + 1) languages. We use D_i to refer to the monolingual data for the i'th language.

Translation data. We have translation data for all language pairs. We use B_{i,j} to refer to translation data for the language pair (i, j) where i, j ∈ {1 . . . (m + 1)} and i ≠ j.
In our main experiments we use the Google universal treebank (McDonald et al., 2013) as our source language treebanks² (this treebank provides universal dependency relations and POS tags), Wikipedia data as our monolingual data, and the Bible from Christodouloupoulos and Steedman (2014) as the source of our translation data. In additional experiments we use the Europarl corpus as a source of translation data, in order to measure the impact of using the smaller Bible corpus.
2.3 A Baseline Approach: Delexicalized Parsers with Self-Training

Given the data assumption of a universal POS set, the feature vectors φ^(p)(x, y) can be shared across languages. A simple approach is then to simply train a delexicalized parser using treebanks T_1 . . . T_m, using the representation φ(x, y) = φ^(p)(x, y) (see (McDonald et al., 2013; Täckström et al., 2013)).

Our baseline approach makes use of a delexicalized parser, with two refinements:
WALS properties. We use the six properties from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) to select a subset of closely related languages for each target language. These properties are shown in Table 1. The model for a target language is trained on treebank data from languages where at least 4 out of 6 WALS properties are common between the source and target language.³ This gives a slightly stronger baseline. Our experiments showed an improvement in average labeled dependency accuracy for the languages from 62.52% to 63.18%. Table 2 shows the set of source languages used for each target language. These source languages are used for all experiments in the paper.

Feature  Description
82A      Order of subject and verb
83A      Order of object and verb
85A      Order of adposition and noun phrase
86A      Order of genitive and noun
87A      Order of adjective and noun
88A      Order of demonstrative and noun

Table 1: The six properties from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) used to select the source languages for each target language in our experiments.

² We also train our best performing model on the newly released universal treebank v1.3 (Nivre et al., 2016). See §4.3 for more details.
Self-training. We use self-training (McClosky et al., 2006) to further improve parsing performance. Specifically, we first train a delexicalized model on treebanks T_1 . . . T_m; then use the resulting model to parse a dataset T_{m+1} that includes target-language sentences which have POS tags but do not have dependency structures. We finally use the automatically parsed data T′_{m+1} as the treebank data and retrain the model. This last model is trained using all features (unlexicalized, clusters, and lexicalized). Self-training in this way gives an improvement in labeled accuracy from 63.18% to 63.91%.
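The self-training loop can be sketched as follows, with `train` and `parse` as stand-ins for the real parser's training and decoding routines (the paper uses the shift-reduce parser of Rasooli and Tetreault (2015)); only the control flow is shown.

```python
# Sketch of the self-training step: train a delexicalized model on the source
# treebanks, parse raw POS-tagged target sentences with it, then retrain a
# full (lexicalized) model on the automatically parsed target data.
def self_train(source_treebanks, target_pos_sentences, train, parse):
    """train(data, lexicalized) -> model; parse(model, sent) -> tree."""
    # 1) delexicalized model on the concatenated source treebanks
    delex_model = train([t for tb in source_treebanks for t in tb],
                        lexicalized=False)
    # 2) parse target-language sentences that have POS tags but no trees
    auto_parsed = [parse(delex_model, s) for s in target_pos_sentences]
    # 3) retrain on the automatically parsed target data with all feature groups
    return train(auto_parsed, lexicalized=True)

# Toy stand-ins, just to exercise the control flow:
train = lambda data, lexicalized: {"n": len(data), "lex": lexicalized}
parse = lambda model, sent: ("tree", sent)
model = self_train([["t1", "t2"], ["t3"]], ["s1", "s2"], train, parse)
print(model)  # {'n': 2, 'lex': True}
```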
2.4 Translation Dictionaries

Our only use of the translation data B_{i,j} for i, j ∈ {1 . . . (m + 1)} is to construct a translation dictionary t(w, i, j). Here i and j are two languages, w is a word in language L_i, and the output w′ = t(w, i, j) is a word in language L_j corresponding to the most frequent translation of w into this language.

We define the function t(w, i, j) as follows: we first run the GIZA++ alignment process (Och and Ney, 2003) on the data B_{i,j}. We then keep intersected alignments between sentences in the two languages. Finally, for each word w in L_i, we define w′ = t(w, i, j) to be the target language word most frequently aligned to w in the aligned data. If a word w is never seen aligned to a target language word w′, we define t(w, i, j) = NULL.

³ There was no effort to optimize this choice; future work may consider more sophisticated sharing schemes.

Target  Sources
en      de, fr, pt, sv
de      en, fr, pt
es      fr, it, pt
fr      en, de, es, it, pt, sv
it      es, fr, pt
pt      en, de, es, fr, it, sv
sv      en, fr, pt

Table 2: The selected source languages for each target language in the Google universal treebank v2 (McDonald et al., 2013). A language is chosen as a source language if it has at least 4 out of 6 WALS properties in common with the target language.
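A minimal sketch of the dictionary construction in §2.4, assuming the intersected word alignments have already been produced (e.g. by GIZA++) and are given directly as index pairs; `build_dictionary` is a name invented here.

```python
# Sketch: deriving t(w, i, j) from word-aligned sentence pairs, keeping the
# most frequently aligned target word for each source word.
from collections import Counter, defaultdict

def build_dictionary(aligned_pairs):
    """aligned_pairs: list of (src_tokens, tgt_tokens, [(src_idx, tgt_idx), ...])."""
    counts = defaultdict(Counter)
    for src, tgt, alignment in aligned_pairs:
        for s_i, t_i in alignment:
            counts[src[s_i]][tgt[t_i]] += 1
    # most frequent translation per source word; unseen words map to None (NULL)
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

pairs = [
    (["the", "house"], ["das", "Haus"], [(0, 0), (1, 1)]),
    (["a", "house"], ["ein", "Haus"], [(1, 1)]),
    (["the", "dog"], ["der", "Hund"], [(0, 0), (1, 1)]),
]
t = build_dictionary(pairs)
print(t["house"])    # 'Haus'
print(t.get("cat"))  # None: never aligned, i.e. t(w, i, j) = NULL
```

Using only intersected alignments keeps the dictionary high-precision, which matters when the Bible provides so few sentence pairs.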
3 Our Approach

We now describe an approach that gives significant improvements over the baseline. §3.1 describes a method for deriving cross-lingual clusters, allowing us to add cluster features φ^(c)(x, y) to the model. §3.2 describes a method for adding lexical features φ^(l)(x, y) to the model. §3.3 describes a method for integrating the approach with the density-driven approach of Rasooli and Collins (2015). Finally, §4 describes experiments. We show that each of the above steps leads to improvements in accuracy.
3.1 Learning Cross-Lingual Clusters

We now describe a method for learning cross-lingual clusters. This follows previous work on cross-lingual clustering algorithms (Täckström et al., 2012). A clustering is a function C(w) that maps each word w in a vocabulary to a cluster C(w) ∈ {1 . . . K}, where K is the number of clusters. A hierarchical clustering is a function C(w, l) that maps a word w together with an integer l to a cluster at level l in the hierarchy. As one example, the Brown clustering algorithm (Brown et al., 1992) gives a hierarchical clustering. The level l allows cluster features at different levels of granularity.
A cross-lingual hierarchical clustering is a function C(w, l) where the clusters are shared across the (m + 1) languages of interest. That is, the word w can be from any of the (m + 1) languages. Ideally, a cross-lingual clustering should put words across different languages which have a similar syntactic and/or semantic role in the same cluster. There is a clear motivation for cross-lingual clustering in the parsing context. We can use the cluster-based features φ^(c)(x, y) on the source language treebanks T_1 . . . T_m, and these features will now generalize beyond these treebanks to the target language L_{m+1}.

Inputs: 1) Monolingual texts D_i for i = 1 . . . (m + 1); 2) a function t(w, i, j) that translates a word w ∈ L_i to w′ ∈ L_j; and 3) a parameter α such that 0 < α < 1.

Algorithm:
  D = {}
  for i = 1 to m + 1 do
    for each sentence s ∈ D_i do
      for p = 1 to |s| do
        Sample ā ∈ [0, 1)
        if ā ≥ α then
          continue
        Sample j ∼ unif{1, ..., m + 1} \ {i}
        w′ = t(s_p, i, j)
        if w′ ≠ NULL then
          Set s_p = w′
      D = D ∪ {s}
  Use the algorithm of Stratos et al. (2015) on D to learn a clustering C.

Output: The clustering C.

Figure 1: An algorithm for learning a cross-lingual clustering. In our experiments we used the parameter value α = 0.3.
We learn a cross-lingual clustering by leveraging the monolingual data sets D_1 . . . D_{m+1}, together with the translation dictionaries t(w, i, j) learned from the translation data. Figure 1 shows the algorithm that learns a cross-lingual clustering. The algorithm first prepares a multilingual corpus, as follows: for each sentence s in the monolingual data D_i, for each word in s, with probability α, we replace the word with its translation into some randomly chosen language. Once this data is created, we can easily obtain a cross-lingual clustering. The intuition behind this method is that by creating the cross-lingual data in this way, we bias the clustering algorithm towards putting words that are translations of each other in the same cluster.
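The corpus-mixing step of Figure 1 might be implemented as in the following sketch; `mix_corpora` and its arguments are hypothetical names, and the clustering step itself (Stratos et al., 2015) is not shown.

```python
# Sketch of the corpus-mixing step in Figure 1: with probability alpha,
# replace each token with its translation into a randomly chosen other
# language; the mixed corpus is then fed to a clustering algorithm.
import random

def mix_corpora(corpora, translate, alpha=0.3, seed=0):
    """corpora: dict lang -> list of sentences (lists of tokens);
    translate(w, i, j): most-frequent translation of w from lang i to j, or None."""
    rng = random.Random(seed)
    langs = list(corpora)
    mixed = []
    for i in langs:
        for sent in corpora[i]:
            sent = list(sent)  # copy; the input corpora are left untouched
            for p, w in enumerate(sent):
                if rng.random() >= alpha:  # keep the original token w.p. 1 - alpha
                    continue
                j = rng.choice([l for l in langs if l != i])
                w2 = translate(w, i, j)
                if w2 is not None:         # None stands in for NULL
                    sent[p] = w2
            mixed.append(sent)
    return mixed  # input to the clustering algorithm

corpora = {"en": [["house"]], "de": [["Haus"]]}
translate = lambda w, i, j: {"house": "Haus", "Haus": "house"}.get(w)
print(mix_corpora(corpora, translate, alpha=1.0))  # alpha=1.0 forces replacement
# [['Haus'], ['house']]
```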
3.2 Treebank Lexicalization

We now describe how to introduce lexical representations φ^(l)(x, y) to the model. Our approach is simple: we take the treebank data T_1 . . . T_m for the m source languages, together with the translation lexicons t(w, i, m + 1). For any word w in the source treebank data, we can look up its translation t(w, i, m + 1) in the lexicon, and add this translated form to the underlying sentence. Features can now consider lexical identities derived in this way. In many cases the resulting translation will be the NULL word, leading to the absence of lexical features. However, the representations φ^(p)(x, y) and φ^(c)(x, y) still apply in this case, so the model is robust to some words having a NULL translation.
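A sketch of this lexicalization step, assuming a simple token representation; the `tword` field holding the translated form is a name invented for this sketch.

```python
# Sketch: adding target-language translations to source-treebank tokens so
# that lexical features can fire on target-language word forms.
def lexicalize_treebank(treebank, source_lang, target_lang, translate):
    out = []
    for sentence in treebank:
        new_sent = []
        for tok in sentence:
            w2 = translate(tok["form"], source_lang, target_lang)
            # w2 may be None (NULL); POS and cluster features still apply then
            new_sent.append({**tok, "tword": w2})
        out.append(new_sent)
    return out

tb = [[{"form": "house", "pos": "NOUN"}]]
translate = lambda w, i, j: {"house": "Haus"}.get(w)
print(lexicalize_treebank(tb, "en", "de", translate))
# [[{'form': 'house', 'pos': 'NOUN', 'tword': 'Haus'}]]
```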
3.3 Integration with the Density-Driven Projection Method of Rasooli and Collins (2015)

In this section we describe a method for integrating our approach with the cross-lingual transfer method of Rasooli and Collins (2015), which makes use of density-driven projections.

In annotation projection methods (Hwa et al., 2005; McDonald et al., 2011), it is assumed that we have translation data B_{i,j} for a source and target language, and that we have a dependency parser in the source language L_i. The translation data consists of pairs (e, f) where e is a source language sentence, and f is a target language sentence. A method such as GIZA++ is used to derive an alignment between the words in e and f, for each sentence pair; the source language parser is used to parse e. Each dependency in e is then potentially transferred through the alignments to create a dependency in the target sentence f. Once dependencies have been transferred in this way, a dependency parser can be trained on the dependencies in the target language.

The density-driven approach of Rasooli and Collins (2015) makes use of various definitions of "density" of the projected dependencies. For example, P_100 is the set of projected structures where the projected dependencies form a full projective parse tree for the sentence; P_80 is the set of projected structures where at least 80% of the words in the projected structure are a modifier in some dependency. An iterative training process is used, where the parsing algorithm is first trained on the set T_100 of complete structures, and where progressively less dense structures are introduced in learning.
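The density buckets can be sketched as follows, under the simplifying assumption that density is the fraction of words attached as a modifier in the projected structure (the projectivity and completeness checks of the full method are omitted).

```python
# Sketch: bucketing projected dependency structures by "density", i.e. the
# percentage of words that are a modifier in some projected dependency.
def density(n_words, projected_heads):
    """projected_heads: dict modifier_index -> head_index for projected arcs."""
    return 100.0 * len(projected_heads) / n_words

def bucket(sentences, threshold):
    """Sentences whose projected-arc density meets the threshold (e.g. 100 or 80)."""
    return [s for s in sentences if density(s["n"], s["heads"]) >= threshold]

sents = [
    {"n": 2, "heads": {0: 1, 1: -1}},       # every word attached: density 100
    {"n": 4, "heads": {0: 1, 2: 1, 3: 2}},  # 3 of 4 words attached: density 75
]
print(len(bucket(sents, 100)), len(bucket(sents, 80)))  # 1 1
```

Training then starts from the densest bucket and progressively admits sparser ones.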
We integrate our approach with the density-driven approach of Rasooli and Collins (2015) as follows: consider the treebanks T_1 . . . T_m created using the lexicalization method of §3.2. We add all trees in these treebanks to the set P_100 of full trees used to initialize the method of Rasooli and Collins (2015). In addition we make use of the representations φ^(p), φ^(c) and φ^(l) throughout the learning process.
4 Experiments

This section first describes the experimental settings, then reports results.

4.1 Data and Tools

Data. In the first set of experiments, we consider 7 European languages studied in several pieces of previous work (Ma and Xia, 2014; Zhang and Barzilay, 2015; Guo et al., 2016; Ammar et al., 2016a; Lacroix et al., 2016). More specifically, we use the 7 European languages in the Google universal treebank (v.2; standard data) (McDonald et al., 2013). As in previous work, gold part-of-speech tags are used for evaluation. We use the concatenation of the treebank training sentences, Wikipedia data and the Bible monolingual sentences as our monolingual raw text. Table 3 shows statistics for the monolingual data. We use the Bible from Christodouloupoulos and Steedman (2014), which includes data for 100 languages, as the source of translations. We also conduct experiments with the Europarl data (both with the original set and a subset of it with the same size as the Bible) to study the effects of translation data size and domain shift. The statistics for the translation data are shown in Table 4.

In a second set of experiments, we run experiments on 38 datasets (26 languages) in the more recent Universal Dependencies v1.3 corpus (Nivre et al., 2016). The full set of languages we use is listed in Table 9.⁴

⁴ We excluded languages that are not completely present in the Bible of Christodouloupoulos and Steedman (2014) (An-

References

Philipp Koehn et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation.
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation.
Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective Self-Training for Parsing.