Normalizing SMS: are Two Metaphors Better than One ?

doi:10.3115/1599081.1599137

Proceedings Article•DOI•

Catherine Kobus, François Yvon¹, Géraldine Damnati•Institutions (1)

Centre national de la recherche scientifique¹

18 Aug 2008-pp 441-448

TL;DR: This paper presents a comparative study of two systems that normalize the orthography of French SMS messages: one drawing inspiration from the Machine Translation task, the other using techniques commonly used in automatic speech recognition devices.

Abstract: Electronic written texts used in computer-mediated interactions (e-mails, blogs, chats, etc.) present major deviations from the norm of the language. This paper presents a comparative study of systems aiming at normalizing the orthography of French SMS messages: after discussing the linguistic peculiarities of these messages, and possible approaches to their automatic normalization, we present, evaluate and contrast two systems, one drawing inspiration from the Machine Translation task; the other using techniques that are commonly used in automatic speech recognition devices. Combining both approaches, our best normalization system achieves about 11% Word Error Rate on a test set of about 3000 unseen messages.
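
The abstract reports its headline result as a Word Error Rate (WER). As a rough illustration only, the sketch below computes WER in the conventional way: the word-level edit distance (substitutions, insertions, deletions) between a system output and a reference, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example strings are assumptions made for this sketch; the paper's actual scoring setup is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of four reference words.
print(word_error_rate("je suis en retard", "je suis en retar"))  # 0.25
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.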

Citations

Proceedings Article•

Named Entity Recognition in Tweets: An Experimental Study

Alan Ritter¹, Sam Clark¹, Oren Etzioni¹•Institutions (1)

University of Washington¹

27 Jul 2011

TL;DR: The novel T-ner system doubles F1 score compared with the Stanford NER system, and leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision.

Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-ner system doubles F1 score compared with the Stanford NER system. T-ner leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp

1,351 citations

Cites background from "Normalizing SMS: are Two Metaphors ..."

...Like SMS (Kobus et al., 2008), tweets are particularly terse and difficult (See Table 1)....
[...]

Journal Article•DOI•

Language and the Internet

Jean Aitchison

01 Sep 2002-Literary and Linguistic Computing

764 citations

Proceedings Article•

A Broad-Coverage Normalization System for Social Media Language

Fei Liu¹, Fuliang Weng¹, Xiao Jiang¹•Institutions (1)

Bosch¹

08 Jul 2012

TL;DR: A cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity is proposed.

Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and message-level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a 10% absolute increase compared to state-of-the-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.
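
The coverage criterion described in this abstract (the correct word should appear within the system's top n output candidates) can be made concrete with a small sketch. The function name `top_n_coverage`, the data layout, and the toy candidate lists are hypothetical and chosen only for illustration; they are not taken from the cited system.

```python
from typing import Sequence


def top_n_coverage(
    ranked_candidates: Sequence[Sequence[str]],
    gold: Sequence[str],
    n: int,
) -> float:
    """Fraction of tokens whose gold word is among the first n candidates.

    ranked_candidates[i] is the system's best-first candidate list for
    the i-th nonstandard token; gold[i] is its correct normalization.
    """
    if len(ranked_candidates) != len(gold):
        raise ValueError("need exactly one candidate list per gold word")
    if not gold:
        raise ValueError("need at least one token to evaluate")
    hits = sum(
        1
        for candidates, correct in zip(ranked_candidates, gold)
        if correct in candidates[:n]
    )
    return hits / len(gold)


# Hypothetical candidate lists for the tokens "2moro", "gr8", "b4".
candidates = [
    ["tomorrow", "tomato"],
    ["grate", "great"],
    ["bore", "beef"],
]
gold = ["tomorrow", "great", "before"]

print(top_n_coverage(candidates, gold, n=1))  # one of three tokens covered
print(top_n_coverage(candidates, gold, n=2))  # two of three tokens covered
```

Coverage under this definition can only stay the same or grow as n increases, which is why results are usually reported at several values of n.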

187 citations

Cites methods from "Normalizing SMS: are Two Metaphors ..."

...We aim for a robust text normalization system with “broad coverage”, i.e., for any user-created nonstandard token, the system should be able to restore the correct word within its top n candidates (n = 1, 3, 10...)....
[...]
...(Kobus et al., 2008) showed that using a statistical MT system in combination with an analogy of the ASR system improved performance in French SMS normalization....
[...]

Patent•

Systems and methods for multi-user multi-lingual communications

Gabriel Leydon, Francois Orsini, Nikhil Bojja, Shailen Karur

03 Jun 2014

TL;DR: Systems and methods for multi-lingual communication are described that translate text between two or more languages, reward users for submitting corrections to inaccurate or erroneous translations, and assess the accuracy of translations.

Abstract: Various embodiments described herein facilitate multi-lingual communications. The systems and methods of some embodiments enable multi-lingual communications through different modes of communication including, for example, Internet-based chat, e-mail, text-based mobile phone communications, postings to online forums, postings to online social media services, and the like. Certain embodiments implement communication systems and methods that translate text between two or more languages. Users of the systems and methods may be incentivized to submit corrections for inaccurate or erroneous translations, and may receive a reward for these submissions. Systems and methods for assessing the accuracy of translations are described.

168 citations

Journal Article•DOI•

SMS spam filtering

Sarah Jane Delany¹, Mark Buckley¹, Derek Greene²•Institutions (2)

Dublin Institute of Technology¹, University College Dublin²

01 Aug 2012-Expert Systems With Applications

TL;DR: The need for content-based SMS spam filtering is motivated and the issues with data collection and availability for furthering research in this area are discussed, a large corpus of SMS spam is analyzed, and some initial benchmark results are provided.

Highlights:
- We motivate the need for content-based SMS spam filtering.
- We discuss similarities/differences between email and SMS spam filtering.
- We review recent research in SMS spam filtering.
- We analyse recent SMS spam messages and make a dataset available.
- Early days, no consensus yet on best techniques but significant challenges exist.

Abstract: Mobile or SMS spam is a real and growing problem primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However it poses its own specific challenges. This paper motivates work on filtering SMS spam and reviews recent developments in SMS spam filtering. The paper also discusses the issues with data collection and availability for furthering research in this area, analyses a large corpus of SMS spam, and provides some initial benchmark results.

164 citations

Cites background from "Normalizing SMS: are Two Metaphors ..."

...…use an idiosyncratic language subset with abbreviations, phonetic contractions, bad punctuation, emoticons, etc., which is different to the more traditional written language more typically used in emails (Kobus et al., 2008; Ling, 2005)....
[...]

References

Proceedings Article•DOI•

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni¹, Salim Roukos¹, Todd Ward¹, Wei-Jing Zhu¹•Institutions (1)

IBM¹

06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

21,126 citations

Proceedings Article•DOI•

Moses: Open Source Toolkit for Statistical Machine Translation

Philipp Koehn¹, Hieu Hoang¹, Alexandra Birch¹, Chris Callison-Burch¹, Marcello Federico, Nicola Bertoldi, Brooke Cowan², Wade Shen², C. Corbett Moran², Richard Zens³, Chris Dyer⁴, Ondrej Bojar⁵, Alexandra Elena Constantin⁶, Evan Herbst⁷•Institutions (7)

University of Edinburgh¹, Massachusetts Institute of Technology², RWTH Aachen University³, University of Maryland, College Park⁴, Charles University in Prague⁵, Williams College⁶, Cornell University⁷

25 Jun 2007

TL;DR: An open-source toolkit for statistical machine translation whose novel contributions are support for linguistically motivated factors, confusion network decoding, and efficient data formats for translation models and language models.

Abstract: We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.

6,008 citations

"Normalizing SMS: are Two Metaphors ..." refers to methods in this paper

...…induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al., 2007) is used to learn the various parameters of the phrase-based model, to optimize the weight combination and to perform the…...
[...]
...Preliminary experiments suggest that using n-best list outputs from Moses instead of just the one best could buy us a small additional WER decrease....
[...]
...Giza++ (Och and Ney, 2003) is used to induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al., 2007) is used to learn the various parameters of the phrase-based model, to optimize the weight combination and to perform the translation using a multi-stack search algorithm; the SRI language model toolkit (Stolcke, 2002) is finally used to estimate statistical language models....
[...]

Proceedings Article•

SRILM – An Extensible Language Modeling Toolkit

Andreas Stolcke

01 Jan 2002

TL;DR: The functionality of the SRILM toolkit is summarized and its design and implementation is discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.

Abstract: SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.

4,904 citations

Journal Article•DOI•

A systematic comparison of various statistical alignment models

Franz Josef Och¹, Hermann Ney²•Institutions (2)

Information Sciences Institute¹, RWTH Aachen University²

01 Mar 2003-Computational Linguistics

TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.

Abstract: We present and compare various methods for computing word alignments using statistical or heuristic models. We consider the five alignment models presented in Brown, Della Pietra, Della Pietra, and Mercer (1993), the hidden Markov alignment model, smoothing techniques, and refinements. These statistical models are compared with two heuristic models based on the Dice coefficient. We present different methods for combining word alignments to perform a symmetrization of directed statistical alignment models. As evaluation criterion, we use the quality of the resulting Viterbi alignment compared to a manually produced reference alignment. We evaluate the models on the German-English Verbmobil task and the French-English Hansards task. We perform a detailed analysis of various design decisions of our statistical alignment system and evaluate these on training corpora of various sizes. An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models. In the Appendix, we present an efficient training algorithm for the alignment models presented.
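
For readers unfamiliar with the Dice-based heuristic baselines mentioned in this abstract, the sketch below scores word pairs by the Dice coefficient, dice(s, t) = 2 · C(s, t) / (C(s) + C(t)), where C(s, t) counts the sentence pairs in which both words occur and C(s), C(t) count the sentence pairs containing each word. The function name `dice_scores`, the sentence-level counting, and the toy SMS/normalized pairs are assumptions made for illustration; the sketch covers only the association score, not a full alignment procedure.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple


def dice_scores(
    sentence_pairs: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Dice coefficient for every co-occurring (source, target) word pair.

    Assumption: counts are per sentence pair, so a word repeated inside
    one sentence is counted once for that sentence.
    """
    source_count: Counter = Counter()
    target_count: Counter = Counter()
    joint_count: Counter = Counter()

    for source, target in sentence_pairs:
        source_types = set(source.split())
        target_types = set(target.split())
        source_count.update(source_types)
        target_count.update(target_types)
        joint_count.update(product(source_types, target_types))

    return {
        (s, t): 2 * joint / (source_count[s] + target_count[t])
        for (s, t), joint in joint_count.items()
    }


# Hypothetical toy corpus of SMS text paired with its normalized form.
pairs = [
    ("slt cv", "salut ça va"),
    ("slt", "salut"),
]
scores = dice_scores(pairs)
print(scores[("slt", "salut")])  # 1.0: the two words always co-occur
```

A simple heuristic aligner would then link each source word to the target word with the highest score, which is the kind of baseline the abstract contrasts with the statistical alignment models.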

4,402 citations

Journal Article•DOI•

A statistical approach to machine translation

Peter Fitzhugh Brown¹, John Cocke¹, Stephen A. Della Pietra¹, Vincent J. Della Pietra¹, F. Jelinek¹, John Lafferty¹, Robert Leroy Mercer¹, Paul S. Roossin¹•Institutions (1)

IBM¹

01 Jun 1990-Computational Linguistics

TL;DR: A statistical approach to machine translation is presented, together with its application to translation from French to English and preliminary results.

Abstract: In this paper, we present a statistical approach to machine translation. We describe the application of our approach to translation from French to English and give preliminary results.

1,860 citations

"Normalizing SMS: are Two Metaphors ..." refers to methods in this paper

...Giza++ (Och and Ney, 2003) is used to induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al....
[...]
...Giza++ (Och and Ney, 2003) is used to induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al., 2007) is used to learn the various parameters of the phrase-based model, to optimize the weight…...
[...]