Proceedings ArticleDOI

Normalizing SMS: are Two Metaphors Better than One?

TL;DR: This paper presents a comparative study of systems aiming at normalizing the orthography of French SMS messages, one drawing inspiration from the Machine Translation task; the other using techniques that are commonly used in automatic speech recognition devices.
Abstract: Electronic written texts used in computer-mediated interactions (e-mails, blogs, chats, etc.) present major deviations from the norm of the language. This paper presents a comparative study of systems aiming at normalizing the orthography of French SMS messages: after discussing the linguistic peculiarities of these messages, and possible approaches to their automatic normalization, we present, evaluate and contrast two systems, one drawing inspiration from the Machine Translation task; the other using techniques that are commonly used in automatic speech recognition devices. Combining both approaches, our best normalization system achieves about 11% Word Error Rate on a test set of about 3000 unseen messages.
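
The abstract reports its headline result as a Word Error Rate (WER). As a rough illustration of that metric, the sketch below computes WER as the word-level edit distance between a reference normalization and a system output, divided by the reference length. The whitespace tokenization and the function name `word_error_rate` are assumptions made for this example; the paper's exact scoring setup is not given on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, scoring the hypothesis `"a x c"` against the reference `"a b c d"` counts one substitution and one deletion over four reference words, giving 0.5.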


Citations
Proceedings Article
27 Jul 2011
TL;DR: The novel T-ner system doubles F1 score compared with the Stanford NER system, and leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision.
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-ner system doubles F1 score compared with the Stanford NER system. T-ner leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp

1,351 citations


Cites background from "Normalizing SMS: are Two Metaphors ..."

  • ...Like SMS (Kobus et al., 2008), tweets are particularly terse and difficult (See Table 1)....

    [...]

Journal ArticleDOI

764 citations

Proceedings Article
Fei Liu1, Fuliang Weng1, Xiao Jiang1
08 Jul 2012
TL;DR: A cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity is proposed.
Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and message-level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a 10% absolute increase compared to state-of-the-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.

187 citations


Cites methods from "Normalizing SMS: are Two Metaphors ..."

  • ...We aim for a robust text normalization system with “broad coverage”, i.e., for any user-created nonstandard token, the system should be able to restore the correct word within its top n candidates (n = 1, 3, 10...)....

    [...]

  • ...(Kobus et al., 2008) showed that using a statistical MT system in combination with an analogy of the ASR system improved performance in French SMS normalization....

    [...]
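
The "broad coverage" criterion quoted above (the correct word appears within the top n candidates) can be made concrete with a small sketch. The data layout used here, a ranked candidate list per nonstandard token and a gold mapping, and the function name `top_n_coverage` are hypothetical choices for illustration, not the citing paper's evaluation code.

```python
def top_n_coverage(candidates: dict[str, list[str]],
                   gold: dict[str, str],
                   n: int) -> float:
    """Fraction of nonstandard tokens whose gold word is in the top-n candidates.

    candidates: token -> ranked list of proposed standard words (best first).
    gold:       token -> the correct standard word.
    """
    if not gold:
        raise ValueError("gold mapping must not be empty")
    hits = sum(
        1
        for token, correct in gold.items()
        if correct in candidates.get(token, [])[:n]
    )
    return hits / len(gold)


# Toy example with made-up candidate lists.
toy_candidates = {
    "2moro": ["tomorrow", "tumor"],
    "gr8": ["grate", "great"],
    "u": ["you"],
}
toy_gold = {"2moro": "tomorrow", "gr8": "great", "u": "you"}
```

With these toy lists, coverage at n = 1 misses only `gr8`, whose correct form is ranked second, and coverage at n = 3 is complete.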

Patent
03 Jun 2014
TL;DR: Systems and methods are described that facilitate multi-lingual communications by translating text between two or more languages; users may be rewarded for submitting corrections to inaccurate or erroneous translations, and methods for assessing translation accuracy are also described.
Abstract: Various embodiments described herein facilitate multi-lingual communications. The systems and methods of some embodiments enable multi-lingual communications through different modes of communication including, for example, Internet-based chat, e-mail, text-based mobile phone communications, postings to online forums, postings to online social media services, and the like. Certain embodiments implement communication systems and methods that translate text between two or more languages. Users of the systems and methods may be incentivized to submit corrections for inaccurate or erroneous translations, and may receive a reward for these submissions. Systems and methods for assessing the accuracy of translations are described.

168 citations

Journal ArticleDOI
TL;DR: The need for content-based SMS spam filtering is motivated and the issues with data collection and availability for furthering research in this area are discussed, a large corpus of SMS spam is analyzed, and some initial benchmark results are provided.
Highlights:
  • We motivate the need for content-based SMS spam filtering.
  • We discuss similarities/differences between email and SMS spam filtering.
  • We review recent research in SMS spam filtering.
  • We analyse recent SMS spam messages and make a dataset available.
  • Early days, no consensus yet on best techniques but significant challenges exist.
Abstract: Mobile or SMS spam is a real and growing problem primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However it poses its own specific challenges. This paper motivates work on filtering SMS spam and reviews recent developments in SMS spam filtering. The paper also discusses the issues with data collection and availability for furthering research in this area, analyses a large corpus of SMS spam, and provides some initial benchmark results.

164 citations


Cites background from "Normalizing SMS: are Two Metaphors ..."

  • ...…use an idiosyncratic language subset with abbreviations, phonetic contractions, bad punctuation, emoticons, etc., which is different to the more traditional written language more typically used in emails (Kobus et al., 2008; Ling, 2005)....

    [...]

References
Proceedings ArticleDOI
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

21,126 citations

Proceedings ArticleDOI
25 Jun 2007
TL;DR: An open-source toolkit for statistical machine translation whose novel contributions are support for linguistically motivated factors, confusion network decoding, and efficient data formats for translation models and language models.
Abstract: We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.

6,008 citations


"Normalizing SMS: are Two Metaphors ..." refers methods in this paper

  • ...…induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al., 2007) is used to learn the various parameters of the phrase-based model, to optimize the weight combination and to perform the…...

    [...]

  • ...Preliminary experiments suggest that using n-best list outputs from Moses instead of just the one best could buy us an small additional WER decrease....

    [...]

  • ...Giza++ (Och and Ney, 2003) is used to induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al., 2007) is used to learn the various parameters of the phrase-based model, to optimize the weight combination and to perform the translation using a multi-stack search algorithm; the SRI language model toolkit (Stolcke, 2002) is finally used to estimate statistical language models....

    [...]

Proceedings Article
01 Jan 2002
TL;DR: The functionality of the SRILM toolkit is summarized and its design and implementation is discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Abstract: SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.

4,904 citations

Journal ArticleDOI
TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
Abstract: We present and compare various methods for computing word alignments using statistical or heuristic models. We consider the five alignment models presented in Brown, Della Pietra, Della Pietra, and Mercer (1993), the hidden Markov alignment model, smoothing techniques, and refinements. These statistical models are compared with two heuristic models based on the Dice coefficient. We present different methods for combining word alignments to perform a symmetrization of directed statistical alignment models. As evaluation criterion, we use the quality of the resulting Viterbi alignment compared to a manually produced reference alignment. We evaluate the models on the German-English Verbmobil task and the French-English Hansards task. We perform a detailed analysis of various design decisions of our statistical alignment system and evaluate these on training corpora of various sizes. An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models. In the Appendix, we present an efficient training algorithm for the alignment models presented.
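
The abstract above mentions heuristic alignment models based on the Dice coefficient. A minimal sketch of that association score follows, assuming sentence-level co-occurrence counts over whitespace-tokenized sentence pairs; the function name `dice_scores` and the toy SMS data are made up for illustration, and the step that turns scores into an actual alignment is not reproduced here.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs: list[tuple[str, str]]) -> dict[tuple[str, str], float]:
    """Dice coefficient 2*C(s,t) / (C(s) + C(t)) for co-occurring word pairs.

    C(s) and C(t) count the sentence pairs containing source word s or
    target word t; C(s,t) counts the pairs containing both.
    """
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    joint: Counter = Counter()
    for src_sentence, tgt_sentence in pairs:
        # Sets ensure each word is counted once per sentence pair.
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }


# Toy SMS / normalized pairs (invented for this example).
toy_pairs = [("slt cv", "salut ça va"), ("slt", "salut")]
toy_scores = dice_scores(toy_pairs)
```

On the toy data, `slt` and `salut` co-occur in every pair that contains either word, so their score is 1.0, while `cv` and `salut` score lower because `salut` also appears without `cv`.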

4,402 citations

Journal ArticleDOI
TL;DR: A statistical approach to machine translation is presented, its application to translation from French to English is described, and preliminary results are given.
Abstract: In this paper, we present a statistical approach to machine translation. We describe the application of our approach to translation from French to English and give preliminary results.

1,860 citations


"Normalizing SMS: are Two Metaphors ..." refers methods in this paper

  • ...Giza++ (Och and Ney, 2003) is used to induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al....

    [...]

  • ...Giza++ (Och and Ney, 2003) is used to induce, based on statistical principles (Brown et al., 1990), an automatic word alignment of SMS tokens with their normalized counterparts; Moses (Koehn et al., 2007) is used to learn the various parameters of the phrase-based model, to optimize the weight…...

    [...]