We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

Efficient Estimation of Word Representations in Vector Space

An intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions. The system can be implemented using any of a number of different platforms, such as the web, email, smartphone, and the like, or any combination thereof. In one embodiment, the system is based on sets of interrelated domains and tasks, and employs additional functionality powered by external services with which the system can interact.

Intelligent Automated Assistant

An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search. We also compare how synthetic data compares to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT’14 English-German test set.

/pdf/understanding-back-translation-at-scale-2a1jm02ac2.pdf

Understanding Back-Translation at Scale

While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts, and show that they produce markedly different embeddings. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.

/pdf/dependency-based-word-embeddings-2k6rwfyv8g.pdf

Dependency-Based Word Embeddings
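
To make the contrast between linear and dependency-based contexts in the abstract above concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the hand-written parse, and the `word/label` context notation (with `-1` marking the inverse relation) are assumptions chosen for illustration.

```python
def linear_contexts(tokens, window=2):
    """Pair each word with its neighbours inside a fixed-size window."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((word, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def dependency_contexts(tokens, arcs):
    """Pair each word with its syntactic neighbours.

    `arcs` holds (head_index, label, modifier_index) triples, as a
    dependency parser might produce them (hand-written below).
    """
    pairs = []
    for head, label, mod in arcs:
        # The head sees its modifier together with the relation label.
        pairs.append((tokens[head], f"{tokens[mod]}/{label}"))
        # The modifier sees its head with the inverse relation.
        pairs.append((tokens[mod], f"{tokens[head]}/{label}-1"))
    return pairs


# Hypothetical parse of a four-word sentence (assumed, not parser output).
tokens = ["australian", "scientist", "discovers", "star"]
arcs = [(1, "amod", 0), (2, "nsubj", 1), (2, "dobj", 3)]

linear = linear_contexts(tokens, window=2)
dependency = dependency_contexts(tokens, arcs)
```

With a window of 2, "discovers" picks up "australian" as a linear context even though the two words are not directly related; in the dependency version its contexts are only its subject and object. Either list of (word, context) pairs could then be fed to a skip-gram trainer.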

In this work we present a framework for the recognition of natural scene text. Our framework does not require any human-labelled data, and performs word recognition on the whole image holistically, departing from the character based recognition systems of the past. The deep neural network models at the centre of this framework are trained solely on data produced by a synthetic text generation engine -- synthetic data that is highly realistic and sufficient to replace real data, giving us infinite amounts of training data. This excess of data exposes new possibilities for word recognition models, and here we consider three models, each one "reading" words in a different way: via 90k-way dictionary encoding, character sequence encoding, and bag-of-N-grams encoding. In the scenarios of language based and completely unconstrained text recognition we greatly improve upon state-of-the-art performance on standard datasets, using our fast, simple machinery and requiring zero data-acquisition costs.

/pdf/synthetic-data-and-artificial-neural-networks-for-natural-zydv8cdgxf.pdf

Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

Systems, methods, and computer program products for machine translation are provided. In some implementations a system is provided. The system includes a language model including a collection of n-grams from a corpus, each n-gram having a corresponding relative frequency in the corpus and an order n corresponding to a number of tokens in the n-gram, each n-gram corresponding to a backoff n-gram having an order of n-1 and a collection of backoff scores, each backoff score associated with an n-gram, the backoff score determined as a function of a backoff factor and a relative frequency of a corresponding backoff n-gram in the corpus.

/pdf/large-language-models-in-machine-translation-3wqubjuzrp.pdf

Large Language Models in Machine Translation
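
The backoff scoring described in the abstract above can be sketched roughly as follows. This is an illustrative reading, not the claimed system: the function names, the backoff factor of 0.4, and the choice to normalise an n-gram count by its context count are assumptions.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def backoff_score(ngram, counts, total_tokens, backoff_factor=0.4):
    """Score the last token of `ngram` given the tokens before it.

    A seen n-gram gets its relative frequency. An unseen n-gram gets
    backoff_factor times the score of its (n-1)-gram suffix.
    The result is a score, not a normalised probability.
    """
    ngram = tuple(ngram)
    if len(ngram) == 1:
        return counts[ngram] / total_tokens
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return backoff_factor * backoff_score(
        ngram[1:], counts, total_tokens, backoff_factor
    )


# Toy corpus (hypothetical data for illustration only).
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_order=3)

seen = backoff_score(("the", "cat"), counts, len(tokens))
unseen = backoff_score(("sat", "the", "mat"), counts, len(tokens))
```

Here the bigram "the cat" was observed, so it is scored directly. The trigram "sat the mat" was not, so its score is the bigram score for "the mat" scaled by the backoff factor.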

Systems, methods, and apparatus for accessing distributed models in automated machine processing, including using large language models in machine translation, speech recognition and other applications.

Encoding and adaptive, scalable accessing of distributed models

In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora. We introduce a modification of the exchange clustering algorithm with improved efficiency for certain partially class-based models and a distributed version of this algorithm to efficiently obtain automatic word classifications for large vocabularies (>1 million words) using such large training corpora (>30 billion tokens). The resulting clusterings are then used in training partially class-based language models. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.

Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation

We present a novel context pattern induction method for information extraction, specifically named entity extraction. Using this method, we extended several classes of seed entity lists into much larger high-precision lists. Using token membership in these extended lists as additional features, we improved the accuracy of a conditional random field-based named entity tagger. In contrast, features derived from the seed lists decreased extractor accuracy.

/pdf/a-context-pattern-induction-method-for-named-entity-4woix41y5s.pdf

A Context Pattern Induction Method for Named Entity Extraction

Systems, methods, and apparatuses, including computer program products, are provided for machine translation using information retrieval techniques. In general, in one implementation, a method is provided. The method includes providing a received input segment as a query to a search engine, the search engine searching an index of one or more collections of documents, receiving one or more candidate segments in response to the query, determining a similarity of each candidate segment to the received input segment, and for one or more candidate segments having a determined similarity that exceeds a threshold similarity, providing a translated target segment corresponding to the respective candidate segment.
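
The retrieval steps in the abstract above (query, candidate segments, similarity check, threshold) might look like the following sketch. Everything here is a stand-in: the in-memory `TRANSLATION_INDEX`, the word-overlap `search` function, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all assumptions, not details from the source.

```python
from difflib import SequenceMatcher

# Hypothetical toy index: stored source segments and their translations.
TRANSLATION_INDEX = {
    "where is the train station": "wo ist der Bahnhof",
    "how much does this cost": "wie viel kostet das",
    "good morning": "guten Morgen",
}


def search(query, index, limit=3):
    """Stand-in for a search engine: segments sharing a word with the query."""
    query_words = set(query.lower().split())
    hits = [seg for seg in index if query_words & set(seg.split())]
    return hits[:limit]


def similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def translate_by_retrieval(segment, index, threshold=0.8):
    """Return (translation, score) pairs for candidates above the threshold."""
    results = []
    for candidate in search(segment, index):
        score = similarity(segment, candidate)
        if score > threshold:
            results.append((index[candidate], score))
    # Best match first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


matches = translate_by_retrieval("Where is the train station?", TRANSLATION_INDEX)
no_match = translate_by_retrieval("the weather is nice today", TRANSLATION_INDEX)
```

The first query differs from a stored segment only in case and punctuation, so its stored translation is returned. The second query retrieves a candidate through shared words, but the candidate falls below the threshold and nothing is returned.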

Thorsten Brants

Papers

Large Language Models in Machine Translation

Encoding and adaptive, scalable accessing of distributed models

Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation

A Context Pattern Induction Method for Named Entity Extraction

Machine translation using information retrieval