scispace - formally typeset
Author

Shervin Malmasi

Bio: Shervin Malmasi is an academic researcher from Brigham and Women's Hospital. The author has contributed to research in topics: Computer science & Native-language identification. The author has an h-index of 31 and has co-authored 87 publications receiving 3,549 citations. Previous affiliations of Shervin Malmasi include Harvard University & Amazon.com.

Papers published on a yearly basis

Papers
Proceedings ArticleDOI
01 Feb 2019
TL;DR: The Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.
Abstract: As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. Finally, we experiment with and compare the performance of different machine learning models on OLID.
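The three-layer hierarchical scheme described in the abstract can be sketched as follows. The label names (NOT/OFF, UNT/TIN, IND/GRP/OTH) follow the OLID annotation scheme; the classifier functions here are hypothetical toy stand-ins for illustration, not the models evaluated in the paper.

```python
def classify_hierarchically(tweet, is_offensive, is_targeted, target_type):
    """Route a tweet through OLID's three annotation layers.

    Layer A: offensive (OFF) vs. not offensive (NOT)
    Layer B: targeted insult (TIN) vs. untargeted (UNT), only for OFF tweets
    Layer C: target is an individual (IND), group (GRP), or other (OTH),
             only for TIN tweets
    """
    labels = []
    if not is_offensive(tweet):
        return ["NOT"]
    labels.append("OFF")
    if not is_targeted(tweet):
        labels.append("UNT")
        return labels
    labels.append("TIN")
    labels.append(target_type(tweet))  # one of "IND", "GRP", "OTH"
    return labels


# Toy rule-based stand-ins, purely for illustration:
OFFENSIVE_WORDS = {"idiot", "stupid"}

def toy_offensive(t):
    return any(w in t.lower().split() for w in OFFENSIVE_WORDS)

def toy_targeted(t):
    return "you" in t.lower().split()

def toy_target(t):
    return "IND"  # a real system would classify IND/GRP/OTH

print(classify_hierarchically("you are an idiot",
                              toy_offensive, toy_targeted, toy_target))
# → ['OFF', 'TIN', 'IND']
print(classify_hierarchically("nice weather today",
                              toy_offensive, toy_targeted, toy_target))
# → ['NOT']
```

The point of the hierarchy is that lower layers are only annotated (and predicted) when the layer above applies, which is why a non-offensive tweet receives only the single label NOT.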

520 citations

Proceedings ArticleDOI
01 Jun 2019
TL;DR: SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval) was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and featured three sub-tasks.
Abstract: We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and it featured three sub-tasks. In sub-task A, systems were asked to discriminate between offensive and non-offensive posts. In sub-task B, systems had to identify the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, nearly 800 teams signed up to participate in the task and 115 of them submitted results, which are presented and analyzed in this report.

498 citations

Proceedings ArticleDOI
02 Aug 2019
TL;DR: This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019, in which participants were asked to build machine translation systems for any of 18 language pairs, evaluated on a test set of news stories.
Abstract: This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

433 citations

Proceedings Article
01 Aug 2018
TL;DR: The Shared Task on Aggression Identification, organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-1) at COLING 2018, asked participants to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts.
Abstract: In this paper, we present the report and findings of the Shared Task on Aggression Identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-1) at COLING 2018. The task was to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts. For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook posts and comments each in Hindi (in both Roman and Devanagari script) and English for training and validation. For testing, two different sets - one from Facebook and another from a different social media platform - were provided. A total of 130 teams registered to participate in the task, 30 teams submitted their test runs, and 20 teams also sent system description papers, which are included in the TRAC workshop proceedings. The best system obtained a weighted F-score of 0.64 for both Hindi and English on the Facebook test sets, while the best scores on the surprise set were 0.60 and 0.50 for English and Hindi respectively. The results presented in this report depict how challenging the task is. The positive response from the community and the high levels of participation in the first edition of this shared task also highlight the interest in this topic.

346 citations

Journal ArticleDOI
TL;DR: In this paper, the problem of distinguishing general profanity from hate speech in social media is addressed using a new, specifically annotated dataset.
Abstract: In this study, we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered. Using a new dataset annotated specifically ...

242 citations


Cited by
01 Jan 1979
TL;DR: This special issue gathers recent advances in learning with shared information methods and their applications in computer vision and multimedia analysis, with an emphasis on real-world applications.
Abstract: In the real world, a realistic setting for computer vision or multimedia recognition problems is that some classes have abundant training data while many others have very little. How to use frequent classes to help learn rare classes, for which it is harder to collect training data, is therefore an open question. Learning with shared information is an emerging topic in machine learning, computer vision and multimedia analysis. Different levels of components can be shared during the concept modelling and machine learning stages, such as generic object parts, attributes, transformations, regularization parameters and training examples. Regarding specific methods, multi-task learning, transfer learning and deep learning can be seen as different strategies for sharing information. These learning with shared information methods are very effective in solving real-world large-scale problems. This special issue aims at gathering the recent advances in learning with shared information methods and their applications in computer vision and multimedia analysis. Both state-of-the-art works and literature reviews are welcome for submission. Papers addressing interesting real-world computer vision and multimedia applications are especially encouraged.
Topics of interest include, but are not limited to:
• Multi-task learning or transfer learning for large-scale computer vision and multimedia analysis
• Deep learning for large-scale computer vision and multimedia analysis
• Multi-modal approaches for large-scale computer vision and multimedia analysis
• Different sharing strategies, e.g., sharing generic object parts, attributes, transformations, regularization parameters and training examples
• Real-world computer vision and multimedia applications based on learning with shared information, e.g., event detection, object recognition, object detection, action recognition, human head pose estimation, object tracking, location-based services, semantic indexing
• New datasets and metrics to evaluate the benefit of the proposed sharing ability for the specific computer vision or multimedia problem
• Survey papers regarding the topic of learning with shared information
Authors who are unsure whether their planned submission is in scope may contact the guest editors prior to the submission deadline with an abstract, in order to receive feedback.

1,758 citations

Proceedings ArticleDOI
16 Mar 2020
TL;DR: This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

1,040 citations

Book
01 Jan 1999
TL;DR: This book provides a comprehensive overview of second language acquisition research, covering the description of learner language, external and internal explanatory factors, individual learner differences, and classroom second language acquisition.
Abstract (table of contents; each part opens with its own introduction):
Acknowledgements. Introduction.
PART ONE - BACKGROUND: 1. Second language acquisition research: an overview
PART TWO - THE DESCRIPTION OF LEARNER LANGUAGE: 2. Learner errors and error analysis. 3. Developmental patterns: order and sequence in second language acquisition. 4. Variability in learner language. 5. Pragmatic aspects of learner language
PART THREE - EXPLAINING SECOND LANGUAGE ACQUISITION: EXTERNAL FACTORS: 6. Social factors and second language acquisition. 7. Input and interaction and second language acquisition
PART FOUR - EXPLAINING SECOND LANGUAGE ACQUISITION: INTERNAL FACTORS: 8. Language transfer. 9. Cognitive accounts of second language acquisition. 10. Linguistic universals and second language acquisition
PART FIVE - EXPLAINING INDIVIDUAL DIFFERENCES IN SECOND LANGUAGE ACQUISITION: 11. Individual learner differences. 12. Learning strategies
PART SIX - CLASSROOM SECOND LANGUAGE ACQUISITION: 13. Classroom interaction and second language acquisition. 14. Formal instruction and second language acquisition
PART SEVEN - CONCLUSION: 15. Data, theory, and applications in second language acquisition research
Glossary. Bibliography. Author index. Subject index.

981 citations

Journal ArticleDOI
01 Jan 1988-System

887 citations

Posted Content
TL;DR: CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.
Abstract: Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at this https URL.

844 citations