Author

Andrey Kutuzov

Bio: Andrey Kutuzov is an academic researcher from University of Oslo. The author has contributed to research in topics: Word embedding & Semantic similarity. The author has an h-index of 12, co-authored 74 publications receiving 840 citations. Previous affiliations of Andrey Kutuzov include National Research University – Higher School of Economics & Mail.Ru Group.

Papers
Proceedings Article
09 Jun 2018
TL;DR: This paper surveys the current state of academic research related to diachronic word embeddings and semantic shifts detection, and proposes several axes along which these methods can be compared, and outlines the main challenges before this emerging subfield of NLP.
Abstract: Recent years have witnessed a surge of publications aimed at tracing temporal changes in lexical semantics using distributional methods, particularly prediction-based word embedding models. However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing. In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shifts detection. We start with discussing the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges before this emerging subfield of NLP, as well as prospects and possible applications.

191 citations

Book Chapter
08 May 2017
TL;DR: An emerging shared repository of large-text resources for creating word vectors, including pre-processed corpora and pre-trained vectors for a range of frameworks and configurations is described, to facilitate reuse, rapid experimentation, and replicability of results.
Abstract: This paper describes an emerging shared repository of large-text resources for creating word vectors, including pre-processed corpora and pre-trained vectors for a range of frameworks and configurations. This will facilitate reuse, rapid experimentation, and replicability of results.

135 citations

Posted Content
TL;DR: A survey of the current state of academic research related to diachronic word embeddings and semantic shifts detection can be found in this article, where the authors discuss the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models.
Abstract: Recent years have witnessed a surge of publications aimed at tracing temporal changes in lexical semantics using distributional methods, particularly prediction-based word embedding models. However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing. In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shifts detection. We start with discussing the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges before this emerging subfield of NLP, as well as prospects and possible applications.

124 citations

Book ChapterDOI
07 Apr 2016
TL;DR: The paper presents a free and open source toolkit, WebVectors, which provides all the necessary routines for organizing online access to querying trained models via a modern web interface, and describes two demo installations of the toolkit.
Abstract: The paper presents a free and open source toolkit whose aim is to quickly deploy web services handling distributed vector models of semantics. It fills the gap between training such models (many tools are already available for this) and dissemination of the results to the general public. Our toolkit, WebVectors, provides all the necessary routines for organizing online access to querying trained models via a modern web interface. We also describe two demo installations of the toolkit, featuring several efficient models for English, Russian and Norwegian.

102 citations

Posted Content
TL;DR: It was found that Continuous Skip-gram and Continuous Bag-of-words models, previously successfully applied to English material, can be used for semantic modeling of Russian as well, and it is shown that texts in the Russian National Corpus provide excellent training material for such models, outperforming other, much larger corpora.
Abstract: Distributed vector representations for natural language vocabulary get a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of the Russian Semantic Similarity Evaluation track, where our models took from the 2nd to the 5th position, depending on the task. We introduce the tools and corpora used, comment on the nature of the shared task and describe the achieved results. It was found that Continuous Skip-gram and Continuous Bag-of-words models, previously successfully applied to English material, can be used for semantic modeling of Russian as well. Moreover, we show that texts in the Russian National Corpus (RNC) provide excellent training material for such models, outperforming other, much larger corpora. This is especially true for semantic relatedness tasks (although stacking models trained on larger corpora on top of RNC models improves performance even more). High-quality semantic vectors learned in such a way can be used in a variety of linguistic tasks and promise an exciting field for further study.

40 citations


Cited by
Journal ArticleDOI
01 Apr 1988-Nature
TL;DR: In this paper, a sedimentological core and petrographic characterisation of samples from eleven boreholes from the Lower Carboniferous of Bowland Basin (Northwest England) is presented.
Abstract: Deposits of clastic carbonate-dominated (calciclastic) sedimentary slope systems in the rock record have been identified mostly as linearly-consistent carbonate apron deposits, even though most ancient clastic carbonate slope deposits fit the submarine fan systems better. Calciclastic submarine fans are consequently rarely described and are poorly understood. Subsequently, very little is known especially in mud-dominated calciclastic submarine fan systems. Presented in this study are a sedimentological core and petrographic characterisation of samples from eleven boreholes from the Lower Carboniferous of Bowland Basin (Northwest England) that reveals a >250 m thick calciturbidite complex deposited in a calciclastic submarine fan setting. Seven facies are recognised from core and thin section characterisation and are grouped into three carbonate turbidite sequences. They include: 1) Calciturbidites, comprising mostly of high- to low-density, wavy-laminated bioclast-rich facies; 2) low-density densite mudstones which are characterised by planar laminated and unlaminated mud-dominated facies; and 3) Calcidebrites which are muddy or hyper-concentrated debris-flow deposits occurring as poorly-sorted, chaotic, mud-supported floatstones. These

9,929 citations

Journal ArticleDOI
TL;DR: The recent collection of articles, The Art of Translation [Masterstvo perevoda], follows publication of two other books on the subject, The Problems of Translation of Creative Writing [Voprosy khudozhestvennogo perevoda] and another work also bearing the title The Art of Translation.
Abstract: The recent collection of articles, The Art of Translation [Masterstvo perevoda], follows publication of two other books on the subject, The Problems of Translation of Creative Writing [Voprosy khudozhestvennogo perevoda] and another work also bearing the title The Art of Translation.

913 citations

17 Dec 2010
TL;DR: The authors survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000, using a corpus of digitized texts containing about 4% of all books ever printed.
Abstract: An article, published in Science, on one of the first analytical uses of Google Books, based on n-grams (Google Ngrams). We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can ...

735 citations

01 Jan 2011
TL;DR: This volume collects three decades of articles by distinguished linguist Joan Bybee, which essentially argue for the importance of frequency of use as a factor in the analysis and explanation of language structure.
Abstract: This volume collects three decades of articles by distinguished linguist Joan Bybee. Her articles essentially argue for the importance of frequency of use as a factor in the analysis and explanation of language structure. Her work has been very influential for a broad range of researchers in linguistics, particularly in discourse analysis, corpus linguistics, phonology, phonetics, and historical linguistics.

584 citations