scispace - formally typeset
Search or ask a question
Author

Marco Lui

Other affiliations: NICTA
Bio: Marco Lui is an academic researcher from University of Melbourne. The author has contributed to research in topics: Language identification & Task (project management). The author has an hindex of 14, co-authored 22 publications receiving 1625 citations. Previous affiliations of Marco Lui include NICTA.

Papers
More filters
Proceedings Article
10 Jul 2012
TL;DR: It is found that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.
Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

577 citations

Proceedings Article
Timothy Baldwin1, Paul Cook1, Marco Lui1, Andrew MacKinlay2, Li Wang2 
01 Oct 2013
TL;DR: This work investigates just how linguistically noisy or otherwise text in social media text is over a range of social media sources, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which is compared to a reference corpus of edited English text.
Abstract: While various claims have been made about text in social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which we compare to a reference corpus of edited English text. We first extract out various descriptive statistics from each data type (including the distribution of languages, average sentence length and proportion of out-ofvocabulary words), and then investigate the proportion of grammatical sentences in each, based on a linguistically-motivated parser. We also investigate the relative similarity between different data types.

234 citations

Proceedings Article
02 Jun 2010
TL;DR: It is demonstrated that the task becomes increasingly difficult as the authors increase the number of languages, reduce the amount of training data and reduce the length of documents, and it is shown that it is possible to perform language identification without having to perform explicit character encoding detection.
Abstract: Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of what models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents. We also show that it is possible to perform language identification without having to perform explicit character encoding detection.

163 citations

Proceedings Article
01 Nov 2011
TL;DR: It is shown that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and a feature selection method is developed that generalizes across domains.
Abstract: We show that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and develop a feature selection method that generalizes across domains. Our results demonstrate that our method provides improvements in transductive transfer learning for language identification. We provide an implementation of the method and show that our system is faster than popular standalone language identification systems, while maintaining competitive accuracy.

161 citations

Journal ArticleDOI
TL;DR: A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.
Abstract: Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelfLI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

133 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences, finding that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective.
Abstract: An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search We also compare how synthetic data compares to genuine bitext and study various domain effects Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT’14 English-German test set

968 citations

Proceedings Article
19 Feb 2018
TL;DR: This article used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project, and introduced three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish.
Abstract: Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exists, showing very strong performance compared to previous models.

831 citations

Proceedings Article
01 Jun 2013
TL;DR: This work systematically evaluates the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy on Twitter and achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks.
Abstract: We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of these word clusters yields insights about NLP and linguistic phenomena in this genre. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English language tweets annotated using these guidelines. Tagging software, annotation guidelines, and large-scale word clusters are available at: http://www.ark.cs.cmu.edu/TweetNLP This paper describes release 0.3 of the “CMU Twitter Part-of-Speech Tagger” and annotated data. [This paper is forthcoming in Proceedings of NAACL 2013; Atlanta, GA, USA.]

780 citations

Proceedings Article
01 May 2016
TL;DR: A new major release of the OpenSubtitles collection of parallel corpora, which is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
Abstract: We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.

705 citations