Cross-language information retrieval (CLIR) is an active sub-domain of information retrieval (IR). Like IR, CLIR is centered on the search for documents and for information contained within those documents. Unlike IR, CLIR must reconcile queries and documents that are written in different languages. The usual solution to this mismatch involves translating the query and/or the documents before performing the search. Translation is therefore a pivotal activity for CLIR engines. Over the last 15 years, the CLIR community has developed a wide range of techniques and models supporting free text translation. This article presents an overview of those techniques, with a special emphasis on recent developments.

Translation techniques in cross-language information retrieval

The distributional hypothesis of Harris (1954), according to which the meaning of words is evidenced by the contexts they occur in, has motivated several effective techniques for obtaining vector space semantic representations of words using unannotated text corpora. This paper argues that lexico-semantic content should additionally be invariant across languages and proposes a simple technique based on canonical correlation analysis (CCA) for incorporating multilingual evidence into vectors generated monolingually. We evaluate the resulting word representations on standard lexical semantic evaluation tasks and show that our method produces substantially better semantic representations than monolingual techniques.
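The CCA step can be illustrated with a minimal NumPy sketch. It assumes that `X` and `Y` hold monolingual vectors for translation pairs, with row i of `X` translating row i of `Y`. The function name `fit_cca`, the ridge term `reg`, and the whitening-plus-SVD formulation are illustrative choices, not details taken from the paper.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-8):
    """Learn top-k canonical projections for paired rows of X and Y.

    Returns (A, B, corr, mean_x, mean_y). A and B map centered vectors
    into the shared space; corr holds the canonical correlations.
    """
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    n = X.shape[0]

    # Covariances, with a small ridge term (an assumption) for stability.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)

    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return A, B, s[:k], mean_x, mean_y
```

Once `A` is learned from the translation pairs, any vector in the first language can be projected with `(v - mean_x) @ A`, including words that have no translation pair.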

/pdf/improving-vector-space-word-representations-using-m5i6cwpztu.pdf

Improving Vector Space Word Representations Using Multilingual Correlation

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

http://www.umiacs.umd.edu/%7Eresnik/pubs/resnik_smith.pdf

The Web as a parallel corpus

We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.
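As a reading aid, the objective described above is commonly written as follows. The symbols are assumptions introduced here, not notation from the abstract: $\mathcal{L}(\theta)$ is the (possibly penalized) marginal log-likelihood, $Z$ the latent variables, $\phi$ the constraint features, and $b$ their bounds.

```latex
\begin{align*}
  J_{\mathcal{Q}}(\theta) &= \mathcal{L}(\theta)
      - \min_{q \in \mathcal{Q}} \mathrm{KL}\bigl(q(Z) \,\|\, p_\theta(Z \mid X)\bigr), \\
  \mathcal{Q} &= \bigl\{\, q(Z) : \mathbb{E}_q[\phi(X, Z)] \le b \,\bigr\}.
\end{align*}
```

The constraint set $\mathcal{Q}$ restricts only the expectations of $\phi$ under $q$, which is what "constraints hold in expectation" refers to.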

/pdf/posterior-regularization-for-structured-latent-variable-3zlde8om9l.pdf

Posterior Regularization for Structured Latent Variable Models

We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.

/pdf/universal-dependency-annotation-for-multilingual-parsing-17ew644wxd.pdf

Universal Dependency Annotation for Multilingual Parsing

Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor intensive and time-consuming process, and it is difficult to find linguistically annotated text in sufficient quantities. In this article, we explore using parallel text to help solve the problem of creating syntactic annotation in more languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. We discuss our background assumptions, describe an initial study on the “projectability” of syntactic relations, and then present two experiments in which stochastic parsers are developed with minimal human intervention via projection from English.
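The projection step can be sketched in a few lines of Python. This is a deliberately simplified illustration that handles only one-to-one word alignments. The function name `project_dependencies`, the edge format, and the example sentence pair are hypothetical and do not come from the article.

```python
def project_dependencies(src_edges, alignment):
    """Copy dependency edges from a source sentence to its translation.

    src_edges: list of (head_index, dependent_index, label) on the source side.
    alignment: dict mapping source token index -> target token index
               (one-to-one only; unaligned tokens are simply absent).
    An edge is projected only when both of its endpoints are aligned.
    """
    projected = []
    for head, dep, label in src_edges:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep], label))
    return projected


# English: 0=I 1=see 2=the 3=dog    Spanish: 0=Veo 1=el 2=perro
english_edges = [(1, 0, "nsubj"), (1, 3, "obj"), (3, 2, "det")]
alignment = {1: 0, 2: 1, 3: 2}  # "I" has no Spanish counterpart here
spanish_edges = project_dependencies(english_edges, alignment)
```

The subject edge is dropped because "I" is unaligned, which is one source of the noise in the projected annotations.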

/pdf/bootstrapping-parsers-via-syntactic-projection-across-58ag6g7mes.pdf

Bootstrapping parsers via syntactic projection across parallel texts

The University of Maryland participated in the CLEF 2000 multilingual task, submitting three official runs that explored the impact of applying language-independent stemming techniques to dictionary-based cross-language information retrieval. The paper begins by describing a cross-language information retrieval architecture based on balanced document translation. A four-stage backoff strategy for improving the coverage of dictionary-based translation techniques is then introduced, and an implementation based on automatically trained statistical stemming is presented. Results indicate that competitive performance can be achieved using four-stage backoff translation in conjunction with freely available bilingual dictionaries, but that the usefulness of the statistical stemming algorithms that were tried varies considerably across the three languages to which they were applied.
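The abstract does not spell out the four stages. The sketch below assumes they back off from matching surface forms to matching stems, first on the term side and then on the dictionary side. The suffix-stripping `toy_stem` is a hypothetical stand-in for the trained statistical stemmer, and the dictionary entries are invented for illustration.

```python
def toy_stem(word):
    """Crude suffix stripper, standing in for a trained statistical stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def backoff_translate(term, dictionary, stem=toy_stem):
    """Return (translations, stage), where stage 0 means no match.

    Assumed stages: 1) surface term vs. surface entries,
    2) stemmed term vs. surface entries, 3) surface term vs. stemmed
    entries, 4) stemmed term vs. stemmed entries.
    """
    if term in dictionary:
        return dictionary[term], 1
    stemmed_term = stem(term)
    if stemmed_term in dictionary:
        return dictionary[stemmed_term], 2

    # Index the dictionary by stemmed entry (rebuilt per call for brevity).
    stemmed_entries = {}
    for entry, translations in dictionary.items():
        stemmed_entries.setdefault(stem(entry), []).extend(translations)

    if term in stemmed_entries:
        return stemmed_entries[term], 3
    if stemmed_term in stemmed_entries:
        return stemmed_entries[stemmed_term], 4
    return [], 0


bilingual = {"house": ["maison"], "walking": ["marcher"], "dogs": ["chiens"]}
```

Each later stage trades precision for coverage, so a real system would precompute the stemmed index once rather than rebuilding it for every term.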

CLEF Experiments at Maryland: Statistical Stemming and Backoff Translation

In current state of the art statistical MT systems, word choice in the target language is governed implicitly by a combination of "phrase" selection and language modeling. In contrast, the state of the art in word sense disambiguation takes advantage of a wide array of features, both locally and at the document level. This technical report describes our initial efforts to employ the power of WSD techniques in helping to guide a state of the art statistical MT system toward better word choices. We briefly discuss the principles underlying our approach as contrasted with another recent attempt to integrate WSD with statistical MT (Carpuat and Wu, 2005) that yielded negative results. We then describe our approach, which leads to a small improvement in translation performance over a state of the art phrase-based statistical MT system. Qualitative analysis of translation output suggests there are still significant opportunities to improve performance further.

/pdf/using-wsd-techniques-for-lexical-selection-in-statistical-4w010b3wtt.pdf

Using WSD Techniques for Lexical Selection in Statistical Machine Translation

We describe the University of Maryland's supervised sense tagger, which participated in the SENSEVAL-2 lexical sample evaluations for English, Spanish, and Swedish; we also present unofficial results for Basque. We designed a highly modular combination of language-independent feature extraction and supervised learning using support vector machines in order to permit rapid ramp-up, language independence, and capability for future expansion.

Supervised Sense Tagging using Support Vector Machines

We describe our construction of lexical resources, creation of tools, building of an aligned parallel corpus, and an approach to automatic treebank creation based on projecting English syntactic dependency information across a parallel corpus, which we have been developing using Spanish data.

/pdf/spanish-language-processing-at-university-of-maryland-3htjmofi5d.pdf

Clara I. Cabezas

Papers

Bootstrapping parsers via syntactic projection across parallel texts

CLEF Experiments at Maryland: Statistical Stemming and Backoff Translation

Using WSD Techniques for Lexical Selection in Statistical Machine Translation

Supervised Sense Tagging using Support Vector Machines

Spanish Language Processing at University of Maryland: Building Infrastructure for Multilingual Applications