Author

Marco Lui

Other affiliations: NICTA

Bio: Marco Lui is an academic researcher from the University of Melbourne. The author has contributed to research in topics: Language identification & Task (project management). The author has an h-index of 14 and has co-authored 22 publications receiving 1625 citations. Previous affiliations of Marco Lui include NICTA.

Papers

Proceedings Article•

langid.py: An Off-the-shelf Language Identification Tool

Marco Lui¹, Timothy Baldwin¹•Institutions (1)

University of Melbourne¹

10 Jul 2012

TL;DR: It is found that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

577 citations

Proceedings Article•

How Noisy Social Media Text, How Diffrnt Social Media Sources?

Timothy Baldwin¹, Paul Cook¹, Marco Lui¹, Andrew MacKinlay², Li Wang²•Institutions (2)

University of Melbourne¹, NICTA²

01 Oct 2013

TL;DR: This work investigates just how linguistically noisy or otherwise text in social media text is over a range of social media sources, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which is compared to a reference corpus of edited English text.

Abstract: While various claims have been made about text in social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which we compare to a reference corpus of edited English text. We first extract out various descriptive statistics from each data type (including the distribution of languages, average sentence length and proportion of out-of-vocabulary words), and then investigate the proportion of grammatical sentences in each, based on a linguistically-motivated parser. We also investigate the relative similarity between different data types.

234 citations

Proceedings Article•

Language Identification: The Long and the Short of the Matter

Timothy Baldwin¹, Marco Lui¹•Institutions (1)

University of Melbourne¹

02 Jun 2010

TL;DR: It is demonstrated that the task becomes increasingly difficult as the authors increase the number of languages, reduce the amount of training data and reduce the length of documents, and it is shown that it is possible to perform language identification without having to perform explicit character encoding detection.

Abstract: Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of what models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents. We also show that it is possible to perform language identification without having to perform explicit character encoding detection.

163 citations

Proceedings Article•

Cross-domain Feature Selection for Language Identification

Marco Lui¹, Timothy Baldwin¹•Institutions (1)

University of Melbourne¹

01 Nov 2011

TL;DR: It is shown that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and a feature selection method is developed that generalizes across domains.

Abstract: We show that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and develop a feature selection method that generalizes across domains. Our results demonstrate that our method provides improvements in transductive transfer learning for language identification. We provide an implementation of the method and show that our system is faster than popular standalone language identification systems, while maintaining competitive accuracy.

161 citations

Journal Article•DOI•

Automatic Language Identification in Texts: A Survey

Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

25 Aug 2019-Journal of Artificial Intelligence Research

TL;DR: A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.

Abstract: Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

133 citations

Cited by

Journal Article•DOI•

Machine learning

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Proceedings Article•DOI•

Understanding Back-Translation at Scale.

Sergey Edunov¹, Myle Ott¹, Michael Auli¹, David Grangier¹•Institutions (1)

Facebook¹

01 Jan 2018

TL;DR: This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences, finding that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective.

Abstract: An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search. We also compare how synthetic data compares to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT’14 English-German test set.

968 citations

Proceedings Article•

Learning Word Vectors for 157 Languages

Edouard Grave¹, Piotr Bojanowski¹, Prakhar Gupta², Armand Joulin¹, Tomas Mikolov¹•Institutions (2)

Facebook¹, École Polytechnique Fédérale de Lausanne²

19 Feb 2018

TL;DR: This article used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project, and introduced three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish.

Abstract: Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.

831 citations

Proceedings Article•

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

Olutobi Owoputi, Brendan O'Connor¹, Chris Dyer¹, Kevin Gimpel², Nathan Schneider, Noah A. Smith¹•Institutions (2)

Carnegie Mellon University¹, Toyota Technological Institute at Chicago²

01 Jun 2013

TL;DR: This work systematically evaluates the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy on Twitter and achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks.

Abstract: We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of these word clusters yields insights about NLP and linguistic phenomena in this genre. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English language tweets annotated using these guidelines. Tagging software, annotation guidelines, and large-scale word clusters are available at: http://www.ark.cs.cmu.edu/TweetNLP This paper describes release 0.3 of the “CMU Twitter Part-of-Speech Tagger” and annotated data. [This paper is forthcoming in Proceedings of NAACL 2013; Atlanta, GA, USA.]

780 citations

Proceedings Article•

OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles

Pierre Lison¹, Jörg Tiedemann²•Institutions (2)

University of Oslo¹, University of Helsinki²

01 May 2016

TL;DR: A new major release of the OpenSubtitles collection of parallel corpora, which is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.

Abstract: We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.

705 citations
