Author

Eric Malmi

Bio: Eric Malmi is an academic researcher from Google. The author has contributed to research in topics: Sentence & Language model. The author has an h-index of 15 and has co-authored 56 publications receiving 651 citations. Previous affiliations of Eric Malmi include Helsinki Institute for Information Technology and Idiap Research Institute.


Papers
Proceedings ArticleDOI
03 Sep 2019
TL;DR: LaserTagger is proposed - a sequence tagging approach that casts text generation as a text editing task, and it is shown that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
Abstract: We propose LaserTagger - a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
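The following sketch illustrates the text-editing view described above: each source token receives an edit tag (keep it, delete it, optionally insert a phrase before it), and the target text is reconstructed by realizing the tags. The tag encoding and the fusion example are illustrative assumptions, not the released LaserTagger implementation.

```python
# Minimal sketch of text generation as text editing: each source token gets a
# tag (KEEP or DELETE, optionally with a phrase to insert before it), and the
# target is reconstructed by realizing the tags over the source tokens.

from typing import List, Tuple

# A tag is (base_op, added_phrase); the phrase is inserted before the token.
Tag = Tuple[str, str]

def realize(tokens: List[str], tags: List[Tag]) -> str:
    """Apply edit tags to source tokens and return the reconstructed text."""
    out: List[str] = []
    for token, (op, phrase) in zip(tokens, tags):
        if phrase:                 # phrase insertion happens before the token
            out.append(phrase)
        if op == "KEEP":
            out.append(token)
        # op == "DELETE": the token is dropped
    return " ".join(out)

# Sentence fusion example:
# "He is tired . He stays home ." -> "He is tired , so he stays home ."
tokens = ["He", "is", "tired", ".", "He", "stays", "home", "."]
tags = [("KEEP", ""), ("KEEP", ""), ("KEEP", ""), ("DELETE", ""),
        ("DELETE", ", so he"), ("KEEP", ""), ("KEEP", ""), ("KEEP", "")]
print(realize(tokens, tags))  # He is tired , so he stays home .
```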

161 citations

Posted Content
TL;DR: The predictability of user demographics from the list of a user's apps is studied, along with the effect of training set size and the number of apps on predictability; both factors are shown to have a large impact on prediction accuracy.
Abstract: Understanding the demographics of app users is crucial, for example, for app developers, who wish to target their advertisements more effectively. Our work addresses this need by studying the predictability of user demographics based on the list of a user's apps which is readily available to many app developers. We extend previous work on the problem on three frontiers: (1) We predict new demographics (age, race, and income) and analyze the most informative apps for four demographic attributes included in our analysis. The most predictable attribute is gender (82.3 % accuracy), whereas the hardest to predict is income (60.3 % accuracy). (2) We compare several dimensionality reduction methods for high-dimensional app data, finding out that an unsupervised method yields superior results compared to aggregating the apps at the app category level, but the best results are obtained simply by the raw list of apps. (3) We look into the effect of the training set size and the number of apps on the predictability and show that both of these factors have a large impact on the prediction accuracy. The predictability increases, or in other words, a user's privacy decreases, the more apps the user has used, but somewhat surprisingly, after 100 apps, the prediction accuracy starts to decrease.
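As a rough illustration of the setup described above, the sketch below represents each user by a binary indicator vector over the raw list of their apps and trains a standard classifier for one demographic attribute. The toy data and the choice of logistic regression are assumptions for illustration; the paper compares several classifiers and dimensionality reduction methods.

```python
# Hedged sketch: predict a demographic attribute from the raw list of a user's
# apps, encoded as a sparse binary indicator vector.

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression

# Toy data: each row is a user's app list, with a gender label.
app_lists = [
    ["maps", "chess", "news"],
    ["fitness", "recipes", "photo_editor"],
    ["chess", "stocks", "news"],
    ["photo_editor", "recipes", "shopping"],
]
gender = ["m", "f", "m", "f"]

binarizer = MultiLabelBinarizer()          # apps -> binary indicator features
X = binarizer.fit_transform(app_lists)

clf = LogisticRegression().fit(X, gender)

new_user = binarizer.transform([["news", "stocks", "chess"]])
print(clf.predict(new_user))               # e.g. ['m']
```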

75 citations

Posted Content
TL;DR: It is demonstrated that performing a single fine-tuning step on cLang-8 with off-the-shelf language models yields further accuracy improvements over an already top-performing gT5 model for English.
Abstract: This paper presents a simple recipe to train state-of-the-art multilingual Grammatical Error Correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second ingredient is to use large-scale multilingual language models (up to 11B parameters). Once fine-tuned on language-specific supervised sets we surpass the previous state-of-the-art results on GEC benchmarks in four languages: English, Czech, German and Russian. Having established a new set of baselines for GEC, we make our results easily reproducible and accessible by releasing a cLang-8 dataset. It is produced by using our best model, which we call gT5, to clean the targets of a widely used yet noisy lang-8 dataset. cLang-8 greatly simplifies typical GEC training pipelines composed of multiple fine-tuning stages -- we demonstrate that performing a single fine-tuning step on cLang-8 with the off-the-shelf language models yields further accuracy improvements over an already top-performing gT5 model for English.
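The sketch below illustrates the text-to-text framing behind this recipe: an ungrammatical source sentence is mapped to its corrected target, and a pretrained multilingual seq2seq model is fine-tuned on such pairs. The checkpoint name, toy example, and single-step training loop are assumptions; the paper fine-tunes much larger models (up to 11B parameters) on the released cLang-8 data.

```python
# Illustrative sketch of GEC as text-to-text fine-tuning on (noisy, corrected)
# sentence pairs, using a small multilingual seq2seq checkpoint as a stand-in.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"            # stand-in for the larger gT5 models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# One (source, target) pair in the cLang-8 style: noisy text and its correction.
source = "She go to school yesterday ."
target = "She went to school yesterday ."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
loss.backward()
optimizer.step()
print(float(loss))
```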

55 citations

Proceedings ArticleDOI
13 Aug 2016
TL;DR: A prediction model is developed to identify the next line of existing lyrics from a set of candidate next lines, and this model is then used to combine lines from existing songs, producing lyrics with rhyme and meaning.
Abstract: Writing rap lyrics requires both creativity to construct a meaningful, interesting story and lyrical skills to produce complex rhyme patterns, which form the cornerstone of good flow. We present a rap lyrics generation method that captures both of these aspects. First, we develop a prediction model to identify the next line of existing lyrics from a set of candidate next lines. This model is based on two machine-learning techniques: the RankSVM algorithm and a deep neural network model with a novel structure. Results show that the prediction model can identify the true next line among 299 randomly selected lines with an accuracy of 17%, i.e., over 50 times more likely than by random. Second, we employ the prediction model to combine lines from existing songs, producing lyrics with rhyme and a meaning. An evaluation of the produced lyrics shows that in terms of quantitative rhyme density, the method outperforms the best human rappers by 21%. The rap lyrics generator has been deployed as an online tool called DeepBeat, and the performance of the tool has been assessed by analyzing its usage logs. This analysis shows that machine-learned rankings correlate with user preferences.
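The toy sketch below illustrates the ranking formulation only: given the current line and a set of candidate next lines, score each candidate and pick the best. A crude end-rhyme heuristic stands in for the paper's RankSVM and neural ranking models and their much richer feature set.

```python
# Toy sketch of next-line ranking: score candidate lines against the current
# line and pick the argmax. The score here is a crude end-rhyme heuristic
# (longest shared vowel-sequence suffix), not the paper's learned ranker.

def vowel_suffix(line: str) -> str:
    """Vowels of the line's last word, used as a rough rhyme signature."""
    last_word = line.lower().split()[-1]
    return "".join(ch for ch in last_word if ch in "aeiouy")

def rhyme_score(line: str, candidate: str) -> int:
    """Length of the common suffix of the two vowel signatures."""
    a, b = vowel_suffix(line), vowel_suffix(candidate)
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

current = "I keep it moving never resting on my laurels"
candidates = [
    "counting all my blessings while they stuck in quarrels",
    "the weather outside is sunny today",
    "I wrote another verse about the morals",
]
best = max(candidates, key=lambda c: rhyme_score(current, c))
print(best)
```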

55 citations

Posted Content
TL;DR: A method for automatically generating fusion examples from raw text is proposed, yielding DiscoFuse, a large-scale dataset for discourse-based sentence fusion; a sequence-to-sequence model trained on DiscoFuse is shown to improve performance on WebSplit when viewed as a sentence fusion task.
Abstract: Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically-generating fusion examples from raw text and present DiscoFuse, a large scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text, and decomposing the text into two independent sentences. We apply our approach on two document collections: Wikipedia and Sports articles, yielding 60 million fusion examples annotated with discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic as well as human evaluation. Finally, we conduct transfer learning experiments with WebSplit, a recent dataset for text simplification. We show that pretraining on DiscoFuse substantially improves performance on WebSplit when viewed as a sentence fusion task.
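The sketch below illustrates the example-generation idea in miniature: find a sentence containing a discourse connective, split it into two independent sentences, and keep the original sentence as the fusion target. The single connective rule is a simplified assumption; the paper authors a far richer rule set covering many discourse phenomena.

```python
# Rough sketch of fusion-example generation: split a sentence on a discourse
# connective into two independent sentences and keep the original as the
# fusion target. One regex rule only, for illustration.

import re

CONNECTIVES = ["because", "although", "but"]

def make_fusion_example(sentence: str):
    """Return ((sent1, sent2), fused_target) or None if no rule applies."""
    for conn in CONNECTIVES:
        m = re.search(rf"\b{conn}\b", sentence, flags=re.IGNORECASE)
        if m:
            left = sentence[:m.start()].strip().rstrip(",")
            right = sentence[m.end():].strip().rstrip(".")
            if left and right:
                sent1 = left[0].upper() + left[1:] + "."
                sent2 = right[0].upper() + right[1:] + "."
                return (sent1, sent2), sentence
    return None

example = make_fusion_example(
    "The match was postponed because the pitch was flooded."
)
print(example)
# (('The match was postponed.', 'The pitch was flooded.'),
#  'The match was postponed because the pitch was flooded.')
```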

39 citations


Cited by
Christopher M. Bishop
01 Jan 2006
TL;DR: This book covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, approximate inference, and methods for combining models.
Abstract: Contents: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.

10,141 citations

Proceedings ArticleDOI
12 Oct 2013
TL;DR: A novel location recommendation framework based on the temporal properties of user movement observed in a real-world LBSN dataset is introduced; the results show the significance of temporal patterns in explaining user behavior and demonstrate their power to improve location recommendation performance.
Abstract: Location-based social networks (LBSNs) have attracted an inordinate number of users and greatly enriched the urban experience in recent years. The availability of spatial, temporal and social information in online LBSNs offers an unprecedented opportunity to study various aspects of human behavior, and enable a variety of location-based services such as location recommendation. Previous work studied spatial and social influences on location recommendation in LBSNs. Due to the strong correlations between a user's check-in time and the corresponding check-in location, recommender systems designed for location recommendation inevitably need to consider temporal effects. In this paper, we introduce a novel location recommendation framework, based on the temporal properties of user movement observed from a real-world LBSN dataset. The experimental results exhibit the significance of temporal patterns in explaining user behavior, and demonstrate their power to improve location recommendation performance.

496 citations

Journal ArticleDOI
TL;DR: A Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2, and RoBERTa checkpoints is developed and an extensive empirical study on the utility of initializing the model, both encoder and decoder, with these checkpoints is conducted.
Abstract: Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By warm-starting from the publicly released checkpoints, NLP practitioners have pushed the ...
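A minimal sketch of the warm-starting idea, using the Hugging Face EncoderDecoderModel API for illustration: both the encoder and the decoder are initialized from a publicly released BERT checkpoint, and the resulting model would then be fine-tuned on the downstream generation task. The checkpoint names and configuration values are assumptions, not the paper's exact setup.

```python
# Hedged sketch of warm-starting a seq2seq model from pretrained checkpoints:
# encoder and decoder are both initialized from BERT (cross-attention weights
# in the decoder are added fresh) and the model is then fine-tuned.

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"   # BERT2BERT-style initialization
)

# Generation needs a few decoding-related config values set explicitly.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("An example input sentence.", return_tensors="pt")
# Before fine-tuning the output is not meaningful; this only shows the wiring.
output_ids = model.generate(inputs.input_ids, max_length=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```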

350 citations

01 Jan 1999
TL;DR: A volume in Springer's Statistics and Computing series (series editor: W.K. Härdle), which publishes monographs and advanced texts on statistical computing and statistical packages.

228 citations

Proceedings ArticleDOI
20 Dec 2022
TL;DR: The authors propose Self-Instruct, a framework for improving the instruction-following capabilities of pre-trained language models by bootstrapping off their own generations: the pipeline generates instructions, inputs, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model.
Abstract: Large “instruction-tuned” language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. We introduce Self-Instruct, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. Applying our method to the vanilla GPT3, we demonstrate a 33% absolute improvement over the original model on Super-NaturalInstructions, on par with the performance of InstructGPT-001, which was trained with private user data and human annotations. For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with Self-Instruct outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT-001. Self-Instruct provides an almost annotation-free method for aligning pre-trained language models with instructions, and we release our large synthetic dataset to facilitate future studies on instruction tuning.
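The sketch below outlines one bootstrapping step of the pipeline described above: sample seed instructions as demonstrations, prompt a language model for new instruction candidates, and filter out candidates that are too similar to the existing pool. The generate_from_lm function is a placeholder for a real LM call, and difflib's similarity ratio stands in for the ROUGE-L filtering used in the paper.

```python
# Schematic sketch of one Self-Instruct bootstrapping step: prompt an LM with
# sampled seed instructions, then keep only sufficiently novel candidates.

import random
from difflib import SequenceMatcher

def generate_from_lm(prompt: str) -> list[str]:
    """Placeholder for sampling new instruction candidates from an LM."""
    raise NotImplementedError("hook up a real language model here")

def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Drop candidates that closely match an existing instruction."""
    return any(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() > threshold
        for existing in pool
    )

def self_instruct_step(pool: list[str], num_demos: int = 6) -> list[str]:
    """One bootstrapping step: sample demos, generate, filter, grow the pool."""
    demos = random.sample(pool, k=min(num_demos, len(pool)))
    prompt = "Come up with new tasks:\n" + "\n".join(f"- {d}" for d in demos)
    for candidate in generate_from_lm(prompt):
        if candidate and not too_similar(candidate, pool):
            pool.append(candidate)
    return pool
```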

201 citations