scispace - formally typeset
Author

Shervin Malmasi

Bio: Shervin Malmasi is an academic researcher from Brigham and Women's Hospital. The author has contributed to research in topics: Computer science & Native-language identification. The author has an h-index of 31 and has co-authored 87 publications receiving 3,549 citations. Previous affiliations of Shervin Malmasi include Harvard University & Amazon.com.

Papers published on a yearly basis

Papers
Proceedings ArticleDOI
01 Feb 2019
TL;DR: The Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.
Abstract: As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. Finally, we experiment with and compare the performance of different machine learning models on OLID.
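The three-layer hierarchical scheme described in the abstract can be sketched as follows. The label names (NOT/OFF, UNT/TIN, IND/GRP/OTH) follow the OLID annotation scheme; the classifier functions here are hypothetical toy stand-ins for illustration, not the models evaluated in the paper.

```python
def classify_hierarchically(tweet, is_offensive, is_targeted, target_type):
    """Route a tweet through OLID's three annotation layers.

    Layer A: offensive (OFF) vs. not offensive (NOT)
    Layer B: targeted insult (TIN) vs. untargeted (UNT), only for OFF tweets
    Layer C: target is an individual (IND), group (GRP), or other (OTH),
             only for TIN tweets
    """
    labels = []
    if not is_offensive(tweet):
        return ["NOT"]
    labels.append("OFF")
    if not is_targeted(tweet):
        labels.append("UNT")
        return labels
    labels.append("TIN")
    labels.append(target_type(tweet))  # one of "IND", "GRP", "OTH"
    return labels


# Toy rule-based stand-ins, purely for illustration:
OFFENSIVE_WORDS = {"idiot", "stupid"}

def toy_offensive(t):
    return any(w in t.lower().split() for w in OFFENSIVE_WORDS)

def toy_targeted(t):
    return "you" in t.lower().split()

def toy_target(t):
    return "IND"  # a real system would classify IND/GRP/OTH

print(classify_hierarchically("you are an idiot",
                              toy_offensive, toy_targeted, toy_target))
# → ['OFF', 'TIN', 'IND']
print(classify_hierarchically("nice weather today",
                              toy_offensive, toy_targeted, toy_target))
# → ['NOT']
```

The point of the hierarchy is that lower layers are only annotated (and predicted) when the layer above applies, which is why a non-offensive tweet receives only the single label NOT.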

520 citations

Proceedings ArticleDOI
01 Jun 2019
TL;DR: SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval) was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and featured three sub-tasks.
Abstract: We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and it featured three sub-tasks. In sub-task A, systems were asked to discriminate between offensive and non-offensive posts. In sub-task B, systems had to identify the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, nearly 800 teams signed up to participate in the task and 115 of them submitted results, which are presented and analyzed in this report.

498 citations

Proceedings ArticleDOI
02 Aug 2019
TL;DR: This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019, in which participants were asked to build machine translation systems for any of 18 language pairs, evaluated on a test set of news stories.
Abstract: This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

433 citations

Proceedings Article
01 Aug 2018
TL;DR: The Shared Task on Aggression Identification, organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-1) at COLING 2018, asked participants to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts.
Abstract: In this paper, we present the report and findings of the Shared Task on Aggression Identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-1) at COLING 2018. The task was to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts. For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook posts and comments each in Hindi (in both Roman and Devanagari script) and English for training and validation. For testing, two different sets - one from Facebook and another from a different social media platform - were provided. A total of 130 teams registered to participate in the task, 30 teams submitted their test runs, and 20 teams also sent system description papers, which are included in the TRAC workshop proceedings. The best system obtained a weighted F-score of 0.64 for both Hindi and English on the Facebook test sets, while the best scores on the surprise set were 0.60 and 0.50 for English and Hindi respectively. The results presented in this report depict how challenging the task is. The positive response from the community and the high levels of participation in the first edition of this shared task also highlight the interest in this topic.

346 citations

Journal ArticleDOI
TL;DR: In this paper, the problem of distinguishing general profanity from hate speech in social media is addressed using a new, specifically annotated dataset.
Abstract: In this study, we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered. Using a new dataset annotated specifically ...

242 citations


Cited by
01 Jan 1979
TL;DR: This special issue gathers recent advances in learning with shared information methods and their applications in computer vision and multimedia analysis, with an emphasis on real-world applications.
Abstract: In the real world, a realistic setting for computer vision or multimedia recognition problems is that some classes have abundant training data while many others have very little. How to use frequent classes to help learn rare classes, for which it is harder to collect training data, is therefore an open question. Learning with shared information is an emerging topic in machine learning, computer vision and multimedia analysis. Different levels of components can be shared during the concept modelling and machine learning stages, such as generic object parts, attributes, transformations, regularization parameters and training examples. Regarding specific methods, multi-task learning, transfer learning and deep learning can be seen as different strategies for sharing information. These learning with shared information methods are very effective in solving real-world large-scale problems. This special issue aims at gathering the recent advances in learning with shared information methods and their applications in computer vision and multimedia analysis. Both state-of-the-art works and literature reviews are welcome for submission. Papers addressing interesting real-world computer vision and multimedia applications are especially encouraged.
Topics of interest include, but are not limited to:
• Multi-task learning or transfer learning for large-scale computer vision and multimedia analysis
• Deep learning for large-scale computer vision and multimedia analysis
• Multi-modal approaches for large-scale computer vision and multimedia analysis
• Different sharing strategies, e.g., sharing generic object parts, attributes, transformations, regularization parameters and training examples
• Real-world computer vision and multimedia applications based on learning with shared information, e.g., event detection, object recognition, object detection, action recognition, human head pose estimation, object tracking, location-based services, semantic indexing
• New datasets and metrics to evaluate the benefit of the proposed sharing ability for the specific computer vision or multimedia problem
• Survey papers regarding the topic of learning with shared information
Authors who are unsure whether their planned submission is in scope may contact the guest editors prior to the submission deadline with an abstract, in order to receive feedback.

1,758 citations

Proceedings ArticleDOI
16 Mar 2020
TL;DR: This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

1,040 citations

Book
01 Jan 1999
TL;DR: This book provides a comprehensive overview of second language acquisition research, covering the description of learner language, external and internal explanatory factors, individual learner differences, and classroom second language acquisition.
Abstract (table of contents; each part opens with its own introduction):
Acknowledgements. Introduction.
PART ONE - BACKGROUND: 1. Second language acquisition research: an overview
PART TWO - THE DESCRIPTION OF LEARNER LANGUAGE: 2. Learner errors and error analysis. 3. Developmental patterns: order and sequence in second language acquisition. 4. Variability in learner language. 5. Pragmatic aspects of learner language
PART THREE - EXPLAINING SECOND LANGUAGE ACQUISITION: EXTERNAL FACTORS: 6. Social factors and second language acquisition. 7. Input and interaction and second language acquisition
PART FOUR - EXPLAINING SECOND LANGUAGE ACQUISITION: INTERNAL FACTORS: 8. Language transfer. 9. Cognitive accounts of second language acquisition. 10. Linguistic universals and second language acquisition
PART FIVE - EXPLAINING INDIVIDUAL DIFFERENCES IN SECOND LANGUAGE ACQUISITION: 11. Individual learner differences. 12. Learning strategies
PART SIX - CLASSROOM SECOND LANGUAGE ACQUISITION: 13. Classroom interaction and second language acquisition. 14. Formal instruction and second language acquisition
PART SEVEN - CONCLUSION: 15. Data, theory, and applications in second language acquisition research
Glossary. Bibliography. Author index. Subject index.

981 citations

Journal ArticleDOI
01 Jan 1988-System

887 citations

Posted Content
TL;DR: CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.
Abstract: Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at this https URL.

844 citations