Open Access Proceedings Article
LanideNN: Multilingual Language Identification on Character Window
Tom Kocmi, Ondřej Bojar
pp. 927–936
TL;DR: The authors propose a method for textual language identification in which languages can change arbitrarily and the goal is to identify the spans of each language. The method is based on bidirectional recurrent neural networks and performs well in monolingual and multilingual language identification tasks.

Abstract:
In language identification, a common first step in natural language processing, we want to automatically determine the language of some input text. Monolingual language identification assumes that the given document is written in one language. In multilingual language identification, the document is usually in two or three languages and we only want their names. We aim one step further and propose a method for textual language identification where languages can change arbitrarily and the goal is to identify the spans of each of the languages. Our method is based on Bidirectional Recurrent Neural Networks and it performs well in monolingual and multilingual language identification tasks on six datasets covering 131 languages. The method maintains its accuracy on short documents and across domains, so it is ideal for off-the-shelf use without preparing training data.
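As a minimal illustration of the task format described in the abstract (not the authors' code), the per-character language labels that such a model emits can be collapsed into contiguous spans; the label names and span convention here are assumptions for the sketch:

```python
def labels_to_spans(labels):
    """Collapse per-character language labels into (language, start, end) spans.

    `end` is exclusive, so spans tile the input exactly.
    """
    spans = []
    start = 0
    for i in range(1, len(labels) + 1):
        # close the current span when the label changes or the input ends
        if i == len(labels) or labels[i] != labels[start]:
            spans.append((labels[start], start, i))
            start = i
    return spans

# a toy document: 5 English chars, then 4 German, then 3 English
labels = ["en"] * 5 + ["de"] * 4 + ["en"] * 3
print(labels_to_spans(labels))  # [('en', 0, 5), ('de', 5, 9), ('en', 9, 12)]
```

This post-processing step is what turns a per-character classifier into the span-identification output the paper targets.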
Citations
Proceedings Article
Deep Models for Arabic Dialect Identification on Benchmarked Data
TL;DR: The experimental results show that variants of (attention-based) bidirectional recurrent neural networks achieve the best accuracy on the AOC task, significantly outperforming all competitive baselines.
Proceedings Article
Sparse Traditional Models Outperform Dense Neural Networks: the Curious Case of Discriminating between Similar Languages
TL;DR: The authors presented the results of their participation in the VarDial 4 shared task on discriminating closely related languages, comparing simple traditional models based on linear support vector machines (SVMs) with a neural network (NN).
Proceedings Article
A Dataset and Classifier for Recognizing Social Media English
TL;DR: It is found that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier.
Proceedings Article
A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
TL;DR: The authors proposed a fine-grained multilingual language identification model that provides a language code for every token in a sentence, including codemixed text containing multiple languages, by using a feed-forward network with a simple globally constrained decoder.
Comparing Approaches to Dravidian Language Identification
TL;DR: The authors used a Naive Bayes classifier with adaptive language models, which has been shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model, which is widely regarded as the state of the art in a number of NLP tasks.
References
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
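The update rule this summary refers to can be sketched in a few lines; the scalar form below and the toy quadratic objective are simplifications for illustration, with the default hyperparameters from the paper:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter.

    m and v are exponential moving averages of the gradient and its square
    (the "adaptive estimates of lower-order moments"); the 1/(1 - b**t)
    factors correct their bias toward zero early in training.
    """
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(x) = x**2 (gradient 2x), starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
print(round(x, 3))
```

Note that the effective step size is bounded by roughly `lr` regardless of the gradient scale, which is what makes the method robust across problems.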
Journal Article
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
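A hedged sketch of a single LSTM step, to make the "constant error carousel" concrete: the cell state is updated additively through gates, so gradients can flow across many time steps. The weight layout (all four gates stacked in one matrix) and the toy dimensions are assumptions of this sketch, not the paper's formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step with input (i), forget (f), output (o) gates and
    candidate cell input (g). The additive update c = f*c + i*g is the
    constant error carousel that preserves error flow over long lags."""
    z = W @ np.concatenate([x, h]) + b     # all four gate pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c + i * g                      # gated, additive memory update
    h = o * np.tanh(c)                     # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):                         # run a short random sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.shape)  # (4,)
```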
Journal Article
Dropout: a simple way to prevent neural networks from overfitting
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
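The technique itself fits in a few lines; the sketch below uses the "inverted" variant (rescaling at train time so inference is a no-op), which is a common modern convention rather than the paper's original test-time rescaling:

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: at train time, zero each unit with probability p
    and rescale survivors by 1/(1-p) so the expected activation is
    unchanged; at test time, return x untouched."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, 0.5, rng)
print(round(float(y.mean()), 1))  # mean stays near 1.0 in expectation
```

Randomly dropping units prevents co-adaptation: no unit can rely on any particular other unit being present, which is the regularization effect the summary describes.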
Proceedings Article
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
Journal Article
Finding Structure in Time
TL;DR: A proposal along these lines, first described by Jordan (1986), is developed: recurrent links provide networks with a dynamic memory, and a method for representing lexical categories and the type/token distinction is suggested.
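The recurrent link the summary describes can be sketched as a simple Elman-style step, where the previous hidden state is fed back as input; the dimensions and initialization below are illustrative assumptions:

```python
import numpy as np

def elman_step(x, h, W_xh, W_hh, b):
    # the recurrent term W_hh @ h feeds the previous state back in,
    # giving the network a dynamic memory of the sequence so far
    return np.tanh(W_xh @ x + W_hh @ h + b)

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3)) * 0.1   # input-to-hidden weights
W_hh = rng.standard_normal((4, 4)) * 0.1   # recurrent (context) weights
b = np.zeros(4)
h = np.zeros(4)                            # initial context is empty
for _ in range(3):                         # process a short random sequence
    h = elman_step(rng.standard_normal(3), h, W_xh, W_hh, b)
print(h.shape)  # (4,)
```

The LSTM cited above can be seen as this step with the hidden state replaced by a gated, additively updated memory.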