Showing papers on "Voice activity detection published in 2020"

Proceedings Article•DOI•

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Jacob Kahn¹, Morgane Riviere¹, Weiyi Zheng¹, Eugene Kharitonov¹, Qiantong Xu¹, Pierre-Emmanuel Mazaré¹, Julien Karadayi², Vitaliy Liptchinsky¹, Ronan Collobert¹, Christian Fuegen¹, Tatiana Likhomanenko¹, Gabriel Synnaeve¹, Armand Joulin¹, Abdelrahman Mohamed¹, Emmanuel Dupoux¹•Institutions (2)

Facebook¹, School for Advanced Studies in the Social Sciences²

04 May 2020

TL;DR: In this article, the authors introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision, which is derived from open-source audio books from the LibriVox project.

Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

207 citations

Proceedings Article•DOI•

Pyannote.Audio: Neural Building Blocks for Speaker Diarization

Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, Marie-Philippe Gill

01 May 2020

TL;DR: This work introduces pyannote.audio, an open-source toolkit written in Python for speaker diarization, which provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

Abstract: We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding – reaching state-of-the-art performance for most of them.

179 citations

Proceedings Article•DOI•

Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario.

Ivan Medennikov¹, Maxim Korenevsky¹, Tatiana Prisyach, Yuri Y. Khokhlov, Mariya Korenevskaya, Ivan Sorokin¹, Tatiana Timofeeva, Anton Mitrofanov¹, Andrei Andrusenko¹, Ivan Podluzhny¹, Aleksandr Laptev¹, Aleksei Romanenko²•Institutions (2)

Saint Petersburg State University of Information Technologies, Mechanics and Optics¹, Lappeenranta University of Technology²

25 Oct 2020

TL;DR: A novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame, outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) abs.

Abstract: Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame. TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) abs.
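
The abstract describes one activity probability per speaker per time frame. A minimal Python sketch of how such outputs could be turned into speaker segments is shown here. The fixed threshold and the function names `activity_segments` and `diarize` are illustrative assumptions, not the post-processing strategies investigated in the paper. Because each speaker is thresholded independently, overlapping speech falls out naturally.

```python
from typing import Dict, List, Tuple


def activity_segments(probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges where probs >= threshold.

    The threshold value is an assumption made for this sketch.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # a new active run begins
        elif not active and start is not None:
            segments.append((start, i))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(probs)))  # run reaches the last frame
    return segments


def diarize(per_speaker_probs: Dict[str, List[float]],
            threshold: float = 0.5) -> Dict[str, List[Tuple[int, int]]]:
    """Threshold each speaker independently, so segments may overlap."""
    return {spk: activity_segments(p, threshold)
            for spk, p in per_speaker_probs.items()}
```

With two hypothetical speakers `"A"` and `"B"`, both can be active on the same frame, which a clustering-based system that assigns one speaker per frame cannot express.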

141 citations

Journal Article•DOI•

A Multilingual Evaluation for Online Hate Speech Detection

Michele Corazza¹, Stefano Menini², Elena Cabrio³, Sara Tonelli², Serena Villata³•Institutions (3)

University of Bologna¹, Fondazione Bruno Kessler², French Institute for Research in Computer Science and Automation³

14 Mar 2020-ACM Transactions on Internet Technology

TL;DR: This article proposes a robust neural architecture that is shown to perform in a satisfactory way across different languages; namely, English, Italian, and German and addresses an extensive analysis of the obtained experimental results over the three languages.

Abstract: The increasing popularity of social media platforms such as Twitter and Facebook has led to a rise in the presence of hate and aggressive speech on these platforms. Despite the number of approaches recently proposed in the Natural Language Processing research area for detecting these forms of abusive language, the issue of identifying hate speech at scale is still an unsolved problem. In this article, we propose a robust neural architecture that is shown to perform in a satisfactory way across different languages; namely, English, Italian, and German. We address an extensive analysis of the obtained experimental results over the three languages to gain a better understanding of the contribution of the different components employed in the system, both from the architecture point of view (i.e., Long Short Term Memory, Gated Recurrent Unit, and bidirectional Long Short Term Memory) and from the feature selection point of view (i.e., ngrams, social network–specific features, emotion lexica, emojis, word embeddings). To address such in-depth analysis, we use three freely available datasets for hate speech detection on social media in English, Italian, and German.

107 citations

Proceedings Article•DOI•

Exploring Hate Speech Detection in Multimodal Publications

Raul Gomez, Jaume Gibert, Lluis Gomez¹, Dimosthenis Karatzas¹•Institutions (1)

Autonomous University of Barcelona¹

01 Mar 2020

TL;DR: It is found that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text.

Abstract: In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image. We gather and annotate a large scale dataset from Twitter, MMHS150K, and propose different models that jointly analyze textual and visual information for hate speech detection, comparing them with unimodal detection. We provide quantitative and qualitative results and analyze the challenges of the proposed task. We find that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text. We discuss why and open the field and the dataset for further research.

96 citations

Journal Article•DOI•

rVAD: An unsupervised segment-based robust voice activity detection method

Zheng-Hua Tan¹, Achintya Kumar Sarkar¹, Najim Dehak²•Institutions (2)

Aalborg University¹, Johns Hopkins University²

01 Jan 2020-Computer Speech & Language

TL;DR: A modified version of rVAD is presented where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation, which significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices.
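
For readers unfamiliar with the measure, spectral flatness is conventionally defined as the geometric mean of the power spectrum divided by its arithmetic mean: values near 1 indicate a noise-like (flat) spectrum, values near 0 a tonal one. The sketch below shows only that standard definition in plain Python. How rVAD applies it is not stated in this summary, and the naive DFT, the `eps` floor, and the function names are assumptions chosen to keep the example self-contained.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive O(n^2) DFT power spectrum over the non-negative frequency bins."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(frame: Sequence[float], eps: float = 1e-12) -> float:
    """Geometric mean over arithmetic mean of the power spectrum.

    eps (an assumption of this sketch) keeps log() finite for empty bins.
    """
    power = [p + eps for p in power_spectrum(frame)]
    log_mean = sum(math.log(p) for p in power) / len(power)
    arithmetic_mean = sum(power) / len(power)
    return math.exp(log_mean) / arithmetic_mean
```

A pure sinusoid yields a flatness close to 0, while a single-sample impulse (whose spectrum is flat) yields a value close to 1.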

90 citations

Journal Article•DOI•

Vulnerable community identification using hate speech detection on social media

Zewdie Mossie¹, Jenq-Haur Wang¹•Institutions (1)

National Taipei University of Technology¹

01 May 2020-Information Processing and Management

TL;DR: This paper proposes a hate speech detection approach to identify hatred against vulnerable minority groups on social media and can successfully identify the Tigre ethnic group as the highly vulnerable community in terms of hatred compared with Amhara and Oromo.

Abstract: With the rapid development in mobile computing and Web technologies, online hate speech has been increasingly spread in social network platforms since it's easy to post any opinions. Previous studies confirm that exposure to online hate speech has serious offline consequences to historically deprived communities. Thus, research on automated hate speech detection has attracted much attention. However, the role of social networks in identifying hate-related vulnerable community is not well investigated. Hate speech can affect all population groups, but some are more vulnerable to its impact than others. For example, for ethnic groups whose languages have few computational resources, it is a challenge to automatically collect and process online texts, not to mention automatic hate speech detection on social media. In this paper, we propose a hate speech detection approach to identify hatred against vulnerable minority groups on social media. Firstly, in Spark distributed processing framework, posts are automatically collected and pre-processed, and features are extracted using word n-grams and word embedding techniques such as Word2Vec. Secondly, deep learning algorithms for classification such as Gated Recurrent Unit (GRU), a variety of Recurrent Neural Networks (RNNs), are used for hate speech detection. Finally, hate words are clustered with methods such as Word2Vec to predict the potential target ethnic group for hatred. In our experiments, we use Amharic language in Ethiopia as an example. Since there was no publicly available dataset for Amharic texts, we crawled Facebook pages to prepare the corpus. Since data annotation could be biased by culture, we recruit annotators from different cultural backgrounds and achieved better inter-annotator agreement. 
In our experimental results, feature extraction using word embedding techniques such as Word2Vec performs better in both classical and deep learning-based classification algorithms for hate speech detection, among which GRU achieves the best result. Our proposed approach can successfully identify the Tigre ethnic group as the highly vulnerable community in terms of hatred compared with Amhara and Oromo. As a result, hatred vulnerable group identification is vital to protect them by applying automatic hate speech detection model to remove contents that aggravate psychological harm and physical conflicts. This can also encourage the way towards the development of policies, strategies, and tools to empower and protect vulnerable communities.

88 citations

Proceedings Article•DOI•

Demoting Racial Bias in Hate Speech Detection

Mengzhou Xia¹, Anjalie Field¹, Yulia Tsvetkov¹•Institutions (1)

Carnegie Mellon University¹

25 May 2020

TL;DR: Experimental results suggest that the adversarial training method used in this paper is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.

Abstract: In the task of hate speech detection, there exists a high correlation between African American English (AAE) and annotators’ perceptions of toxicity in current datasets. This bias in annotated training data and the tendency of machine learning models to amplify it cause AAE text to often be mislabeled as abusive/offensive/hate speech (high false positive rate) by current hate speech classifiers. Here, we use adversarial training to mitigate this bias. Experimental results on one hate speech dataset and one AAE dataset suggest that our method is able to reduce the false positive rate for AAE text with only a minimal compromise on the performance of hate speech classification.

86 citations

Journal Article•DOI•

A Framework for Hate Speech Detection Using Deep Convolutional Neural Network

Pradeep Kumar Roy¹, Asis Kumar Tripathy¹, Tapan Kumar Das¹, Xiao-Zhi Gao²•Institutions (2)

VIT University¹, University of Eastern Finland²

10 Nov 2020-IEEE Access

TL;DR: The proposed DCNN model utilises the tweet text with GloVe embedding vector to capture the tweets’ semantics with the help of convolution operation and achieved the precision, recall and F1-score value as 0.97, 0.88, and 0.92 respectively for the best case and outperformed the existing models.

Abstract: The rapid growth of Internet users led to unwanted cyber issues, including cyberbullying, hate speech, and many more. This article deals with the problems of hate speech on Twitter. Hate speech appears to be an inflammatory kind of interaction process that uses misconceptions to express a hate ideology. The hate speech focuses on various protected aspects, including gender, religion, race, and disability. Owing to hate speech, sometimes unwanted crimes are going to happen as someone or a group of people get disheartened. Hence, it is essential to monitor user’s posts and filter the hate speech related post before it is spread. However, Twitter receives more than six hundred tweets per second and about 500 million tweets per day. Manually filtering any information from such a huge incoming traffic is almost impossible. Concerning to this aspect, an automated system is developed using the Deep Convolutional Neural Network (DCNN). The proposed DCNN model utilises the tweet text with GloVe embedding vector to capture the tweets’ semantics with the help of convolution operation and achieved the precision, recall and F1-score value as 0.97, 0.88, 0.92 respectively for the best case and outperformed the existing models.
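
The three reported figures can be checked against each other, since F1 is the harmonic mean of precision and recall. The short Python sketch below performs that arithmetic; the function name `f1_score` is introduced here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported best-case precision and recall from the abstract.
print(round(f1_score(0.97, 0.88), 2))  # prints 0.92
```

The result agrees with the reported F1-score of 0.92.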

74 citations

Journal Article•DOI•

A deep neural network based multi-task learning approach to hate speech detection

Prashant Kapil¹, Asif Ekbal¹•Institutions (1)

Indian Institute of Technology Patna¹

27 Dec 2020-Knowledge Based Systems

TL;DR: A deep multi-task learning (MTL) framework is proposed to leverage useful information from multiple related classification tasks in order to improve the performance of the individual task.

Abstract: With the advent of the internet and numerous social media platforms, citizens now have enormous opportunities to express and share their opinions on various societal and political issues. This phenomenal growth of the internet, social media networks, and messaging platforms provide plenty of opportunities for building intelligent systems, but these are also being heavily misused by certain groups who often disseminate offensive, racial, and hate speeches. Hence, detecting hate speech at the right time plays a crucial role as its spread might affect social fabrics. In recent times, although a few benchmark datasets have emerged for hate speech detection, these are limited in volume and also do not follow any uniform annotation schema. In this paper, a deep multi-task learning (MTL) framework is proposed to leverage useful information from multiple related classification tasks in order to improve the performance of the individual task. The proposed multi-task model is based on the shared-private scheme that assigns shared and private layers to capture the shared-features and task-specific features from five classification tasks. Experiments on the 5 datasets show that the proposed framework attains encouraging performance in terms of macro-F1 and weighted-F1.

71 citations

Posted Content•

Deep Learning Models for Multilingual Hate Speech Detection

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, Animesh Mukherjee

14 Apr 2020-arXiv: Social and Information Networks

TL;DR: A large scale analysis of multilingual hate speech in 9 languages from 16 different sources shows that in low resource setting, simple models such as LASER embedding with logistic regression performs the best, while in high resource setting BERT based models perform better.

Abstract: Hate speech detection is a challenging problem with most of the datasets available in only one language: English. In this paper, we conduct a large scale analysis of multilingual hate speech in 9 languages from 16 different sources. We observe that in low resource setting, simple models such as LASER embedding with logistic regression performs the best, while in high resource setting BERT based models perform better. In case of zero-shot classification, languages such as Italian and Portuguese achieve good results. Our proposed framework could be used as an efficient solution for low-resource languages. These models could also act as good baselines for future multilingual hate speech detection tasks. We have made our code and experimental settings public for other researchers at this https URL.

Journal Article•DOI•

Advances in anti-spoofing: from the perspective of ASVspoof challenges

Madhu R. Kamble¹, Hardik B. Sailor², Hemant A. Patil¹, Haizhou Li³•Institutions (3)

Dhirubhai Ambani Institute of Information and Communication Technology¹, University of Sheffield², National University of Singapore³

14 Jan 2020

TL;DR: The literature review of ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc, along with recent efforts to develop countermeasures for spoof speech detection (SSD) task are presented.

Abstract: In recent years, automatic speaker verification (ASV) is used extensively for voice biometrics. This leads to an increased interest to secure these voice biometric systems for real-world applications. The ASV systems are vulnerable to various kinds of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins, and impersonation. This paper provides the literature review of ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc. Furthermore, the paper also summarizes previous studies of spoofing attacks with emphasis on SS, VC, and replay along with recent efforts to develop countermeasures for spoof speech detection (SSD) task. The limitations and challenges of SSD task are also presented. While several countermeasures were reported in the literature, they are mostly validated on a particular database; furthermore, their performance is far from perfect. The security of voice biometrics systems against spoofing attacks remains a challenging topic. This paper is based on a tutorial presented at APSIPA Annual Summit and Conference 2017 to serve as a quick start for those interested in the topic.

Journal Article•DOI•

Automatic hate speech detection using killer natural language processing optimizing ensemble deep learning approach

Zafer Al-Makhadmeh¹, Amr Tolba¹,²•Institutions (2)

King Saud University¹, Menoufia University²

01 Feb 2020-Computing

TL;DR: This paper introduces a method that uses a hybrid of natural language processing and machine learning techniques to predict hate speech from social media websites.

Abstract: Over the last decade, the increased use of social media has led to an increase in hateful activities in social networks. Hate speech is one of the most dangerous of these activities, so users have to protect themselves from these activities on YouTube, Facebook, Twitter etc. This paper introduces a method that uses a hybrid of natural language processing and machine learning techniques to predict hate speech from social media websites. After hate speech is collected, stemming, token splitting, character removal and inflection elimination are performed before the hate speech recognition process. After that, the collected data is examined using a killer natural language processing optimization ensemble deep learning approach (KNLPEDNN). This method detects hate speech on social media websites using an effective learning process that classifies the text into neutral, offensive and hate language. The performance of the system is then evaluated using overall accuracy, f-score, precision and recall metrics. The system attained minimum deviations (mean square error of 0.019, cross-entropy loss of 0.015 and logarithmic loss of 0.0238) and 98.71% accuracy.

Posted Content•

VoxLingua107: a Dataset for Spoken Language Recognition

Jörgen Valk¹, Tanel Alumäe¹•Institutions (1)

Tallinn University of Technology¹

25 Nov 2020-arXiv: Audio and Speech Processing

TL;DR: This paper generates semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages and uses the data to build language recognition models for several spoken language identification tasks.

Abstract: This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The size of the resulting training set (VoxLingua107) is 6628 hours (62 hours per language on the average) and it is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives competitive results to using hand-labeled proprietary datasets. The dataset is publicly available.

Journal Article•DOI•

Time of your hate: The challenge of time in hate speech detection on social media

Komal Florio, Valerio Basile, Marco Polignano, Pierpaolo Basile, Viviana Patti

01 Jun 2020-Applied Sciences

TL;DR: The temporal robustness of a BERT model for Italian (AlBERTo) is explored, showing how AlBERTo is highly sensitive to the temporal distance of the fine-tuning set, and how with an adequate time window, the performance increases, while requiring less annotated data than a traditional classifier.

Abstract: The availability of large annotated corpora from social media and the development of powerful classification approaches have contributed in an unprecedented way to tackle the challenge of monitoring users’ opinions and sentiments in online social platforms across time. Such linguistic data are strongly affected by events and topic discourse, and this aspect is crucial when detecting phenomena such as hate speech, especially from a diachronic perspective. We address this challenge by focusing on a real case study: the “Contro l’odio” platform for monitoring hate speech against immigrants in the Italian Twittersphere. We explored the temporal robustness of a BERT model for Italian (AlBERTo), the current benchmark on non-diachronic detection settings. We tested different training strategies to evaluate how the classification performance is affected by adding more data temporally distant from the test set and hence potentially different in terms of topic and language use. Our analysis points out the limits that a supervised classification model encounters on data that are heavily influenced by events. Our results show how AlBERTo is highly sensitive to the temporal distance of the fine-tuning set. However, with an adequate time window, the performance increases, while requiring less annotated data than a traditional classifier.

Proceedings Article•DOI•

DeepHate: Hate Speech Detection via Multi-Faceted Text Representations

Rui Cao¹, Roy Ka-Wei Lee², Tuan-Anh Hoang³•Institutions (3)

University of Electronic Science and Technology of China¹, University of Saskatchewan², Leibniz University of Hanover³

06 Jul 2020

TL;DR: DeepHate is a novel deep learning model that combines multi-faceted text representations such as word embeddings, sentiments, and topical information, to detect hate speech in online social platforms and outperforms the state-of-the-art baselines on the hate speech detection task.

Abstract: Online hate speech is an important issue that breaks the cohesiveness of online social communities and even raises public safety concerns in our societies. Motivated by this rising issue, researchers have developed many traditional machine learning and deep learning methods to detect hate speech in online social platforms automatically. However, most of these methods have only considered single type textual feature, e.g., term frequency, or using word embeddings. Such approaches neglect the other rich textual information that could be utilized to improve hate speech detection. In this paper, we propose DeepHate, a novel deep learning model that combines multi-faceted text representations such as word embeddings, sentiments, and topical information, to detect hate speech in online social platforms. We conduct extensive experiments and evaluate DeepHate on three large publicly available real-world datasets. Our experiment results show that DeepHate outperforms the state-of-the-art baselines on the hate speech detection task. We also perform case studies to provide insights into the salient features that best aid in detecting hate speech in online social platforms.

Journal Article•DOI•

Significance of Subband Features for Synthetic Speech Detection

Jichen Yang¹, Rohan Kumar Das¹, Haizhou Li¹•Institutions (1)

National University of Singapore¹

01 Jan 2020-IEEE Transactions on Information Forensics and Security

TL;DR: It is found that subband transform captures the artifacts in synthetic speech more effectively than full band transform.

Abstract: In text-to-speech or voice conversion based synthetic speech detection, it is a common practice that spectral information over the entire frequency band is used for feature representation. We propose a new method, referred to as subband transform, that characterizes the signals by subband. It is found that subband transform captures the artifacts in synthetic speech more effectively than full band transform. We propose equal subband transform, octave subband transform, and mel subband transform for three novel features, namely, constant-Q equal subband transform (CQ-EST), constant-Q octave subband transform (CQ-OST) and discrete Fourier mel subband transform (DF-MST). We evaluate the three features on the ASVspoof 2015, noisy ASVspoof 2015 and ASVspoof 2019 logical access corpora. The experiments show that the proposed CQ-EST feature achieves an average equal error rate of 0.056% on ASVspoof 2015 evaluation set. The study observes that the features based on subband transform outperform those based on full band transform under both clean and noisy conditions. In addition, the tandem detection cost function of CQ-OST can reach 0.188 on ASVspoof 2019 logical access evaluation set.

Proceedings Article•

HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection

Binny Mathew¹, Punyajoy Saha¹, Seid Muhie Yimam², Chris Biemann², Pawan Goyal¹, Animesh Mukherjee¹•Institutions (2)

Indian Institute of Technology Kharagpur¹, University of Hamburg²

18 Dec 2020

TL;DR: HateXplain is a dataset for hate speech detection in which each post is annotated from three different perspectives: the basic, commonly used 3-class classification, the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which the labeling decision (as hate, offensive or normal) is based.

Abstract: Hate speech is a challenging issue plaguing the online social media. While better models for hate speech detection are continuously being developed, there is little research on the bias and interpretability aspects of hate speech. In this paper, we introduce HateXplain, the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in our dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which their labelling decision (as hate, offensive or normal) is based. We utilize existing state-of-the-art models and observe that even models that perform very well in classification do not score high on explainability metrics like model plausibility and faithfulness. We also observe that models, which utilize the human rationales for training, perform better in reducing unintended bias towards target communities. We have made our code and dataset public for other researchers.

Proceedings Article•DOI•

Personal VAD: Speaker-Conditioned Voice Activity Detection

Ignacio Lopez Moreno, Li Wan, Quan Wang, Shaojin Ding, Shuo-Yiin Chang

01 Nov 2020

TL;DR: Personal VAD is a system that detects the voice activity of a target speaker at the frame level by training a VAD-alike neural network conditioned on the target speaker embedding or the speaker verification score.

Abstract: In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
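
The gating use case described in the abstract can be illustrated with a short Python sketch: given per-frame probabilities over the three classes, only frames where the target speaker is likely speaking are passed on to the recognizer. The class ordering, the 0.5 threshold, and the name `gate_frames` are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence

# Assumed class ordering for this sketch.
NON_SPEECH, TARGET_SPEECH, NON_TARGET_SPEECH = 0, 1, 2


def gate_frames(frame_probs: Sequence[Sequence[float]],
                threshold: float = 0.5) -> List[bool]:
    """Return True for each frame that should be forwarded to the recognizer.

    A frame passes only if its target-speaker-speech probability reaches
    the (assumed) threshold; non-speech and other speakers are dropped.
    """
    return [probs[TARGET_SPEECH] >= threshold for probs in frame_probs]
```

For three hypothetical frames dominated by non-speech, target speech, and non-target speech respectively, only the second frame would be forwarded, which is where the savings in computation and battery would come from.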

Book Chapter•DOI•

HaSpeeDe 2 @ EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task

Manuela Sanguinetti¹, Gloria Comandini², Elisa Di Nuovo³, Simona Frenda³, Marco Stranisci³, Cristina Bosco³, Tommaso Caselli, Viviana Patti³, Irene Russo•Institutions (3)

University of Cagliari¹, University of Trento², University of Turin³

01 Jan 2020

TL;DR: The Hate Speech Detection (HaSpeeDe 2) task is the second edition of a shared task on the detection of hateful content in Italian Twitter messages and is composed of a Main task (hate speech detection) and two Pilot tasks (stereotype and nominal utterance detection).

Abstract: The Hate Speech Detection (HaSpeeDe 2) task is the second edition of a shared task on the detection of hateful content in Italian Twitter messages. HaSpeeDe 2 is composed of a Main task (hate speech detection) and two Pilot tasks (stereotype and nominal utterance detection). Systems were challenged along two dimensions: (i) time, with test data coming from a different time period than the training data, and (ii) domain, with test data coming from the news domain (i.e., news headlines). Overall, 14 teams participated in the Main task; the best systems achieved a macro F1-score of 0.8088 and 0.7744 on the in-domain and out-of-domain test sets, respectively. Six teams submitted results for Pilot task 1 (stereotype detection); the best systems achieved a macro F1-score of 0.7719 and 0.7203 on the in-domain and out-of-domain test sets. We did not receive any submission for Pilot task 2.

Journal Article•DOI•

A Deep Learning Approach for Automatic Hate Speech Detection in the Saudi Twittersphere

Raghad Alshalan, Hend S. Al-Khalifa

01 Dec 2020-Applied Sciences

TL;DR: This paper aimed to investigate several neural network models based on convolutional neural network (CNN) and recurrent neural network (RNN) to detect hate speech in Arabic tweets and evaluated the recent language representation model bidirectional encoder representations from transformers (BERT) on the task of Arabic hate speech detection.

Abstract: With the rise of hate speech phenomena in the Twittersphere, significant research efforts have been undertaken in order to provide automatic solutions for detecting hate speech, varying from simple machine learning models to more complex deep neural network models. Despite this, research investigating the hate speech problem in Arabic is still limited. This paper, therefore, aimed to investigate several neural network models based on convolutional neural network (CNN) and recurrent neural network (RNN) to detect hate speech in Arabic tweets. It also evaluated the recent language representation model bidirectional encoder representations from transformers (BERT) on the task of Arabic hate speech detection. To conduct our experiments, we first built a new hate speech dataset that contained 9316 annotated tweets. Then, we conducted a set of experiments on two datasets to evaluate four models: CNN, gated recurrent units (GRU), CNN + GRU, and BERT. Our experimental results on our dataset and an out-of-domain dataset showed that the CNN model gave the best performance, with an F1-score of 0.79 and area under the receiver operating characteristic curve (AUROC) of 0.89.

HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion

Bharathi Raja Chakravarthi

01 Dec 2020

TL;DR: This paper annotated hope speech for equality, diversity and inclusion in a multilingual setting and determined the inter-annotator agreement of the dataset using Krippendorff's alpha.

Abstract: Over the past few years, systems have been developed to control online content and eliminate abusive, offensive or hate speech content. However, people in power sometimes misuse this form of censorship to obstruct the democratic right of freedom of speech. Therefore, it is imperative that research should take a positive reinforcement approach towards online content that is encouraging, positive and supportive. Until now, most studies have focused on solving this problem of negativity in the English language, though the problem is much more than just harmful content. Furthermore, it is multilingual as well. Thus, we have constructed a Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube with 28,451, 20,198 and 10,705 comments in English, Tamil and Malayalam, respectively, manually labelled as containing hope speech or not. To our knowledge, this is the first research of its kind to annotate hope speech for equality, diversity and inclusion in a multilingual setting. We determined the inter-annotator agreement of our dataset using Krippendorff's alpha. Further, we created several baselines to benchmark the resulting dataset and the results have been expressed using precision, recall and F1-score. The dataset is publicly available for the research community. We hope that this resource will spur further research on encouraging inclusive and responsive speech that reinforces positiveness.

Proceedings Article•DOI•

BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection

Jihyung Moon¹, Won Ik Cho¹, Junbum Lee¹•Institutions (1)

Seoul National University¹

01 Jul 2020

TL;DR: This work presents 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea, and provides benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks.

Abstract: Toxic comments on online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively studied for languages such as English, German, or Italian, where manually labeled corpora have been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff's alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.

Proceedings Article•DOI•

Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection

Latane Bullock¹, Hervé Bredin², Leibny Paola Garcia-Perera³•Institutions (3)

Rice University¹, Université Paris-Saclay², Johns Hopkins University³

04 May 2020

TL;DR: In this article, a neural long short-term memory (LSTM) based architecture for overlap detection is proposed, and detected overlap regions are exploited in conjunction with a frame-level speaker posterior matrix to make two-speaker assignments for overlapped frames in the resegmentation step.

Abstract: We address the problem of effectively handling overlapping speech in a diarization system. First, we detail a neural Long Short-Term Memory-based architecture for overlap detection. Secondly, detected overlap regions are exploited in conjunction with a frame-level speaker posterior matrix to make two-speaker assignments for overlapped frames in the resegmentation step. The overlap detection module achieves state-of-the-art performance on the AMI, DIHARD, and ETAPE corpora. We apply overlap-aware resegmentation on AMI, resulting in a 20% relative DER reduction over the baseline system. While this approach is by no means an end-all solution to overlap-aware diarization, it reveals promising directions for handling overlap.
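The resegmentation step described in the abstract can be illustrated with a small sketch. It assumes the overlap detector yields one boolean per frame and that the two-speaker assignment simply takes the two highest posteriors in frames flagged as overlapped; that top-two rule and the function name `assign_speakers` are assumptions made for illustration, not details given in the abstract.

```python
# Hypothetical sketch of overlap-aware assignment from a frame-level
# speaker posterior matrix. The top-two rule is an assumption.

def assign_speakers(posteriors, overlap):
    """Return a sorted list of speaker indices for every frame.

    posteriors: list of frames, each a list of per-speaker posteriors.
    overlap:    list of booleans, True where overlap was detected.
    """
    assignments = []
    for frame, is_overlap in zip(posteriors, overlap):
        # Rank speakers from most to least likely for this frame.
        ranked = sorted(range(len(frame)), key=lambda k: frame[k], reverse=True)
        # Keep two speakers in overlapped frames, otherwise one.
        keep = 2 if is_overlap else 1
        assignments.append(sorted(ranked[:keep]))
    return assignments


posteriors = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]
overlap = [False, True, True]
print(assign_speakers(posteriors, overlap))
```

Frames without detected overlap keep the usual single-speaker decision, so a baseline system's output only changes where overlap is flagged.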

Journal Article•DOI•

Automatic Hate Speech Detection using Machine Learning: A Comparative Study

Sindhu Abro, Sarang Shaikh, Zahid Hussain, Zafar Ali, Sajid A. Khan, Ghulam Mujtaba

01 Jan 2020-International Journal of Advanced Computer Science and Applications

TL;DR: This paper compares the performance of three feature engineering techniques and eight machine learning algorithms on a publicly available dataset having three distinct classes and showed that bigram features used with the support vector machine algorithm performed best, with 79% overall accuracy.

Abstract: The increasing use of social media and information sharing has given major benefits to humanity. However, this has also given rise to a variety of challenges including the spreading and sharing of hate speech messages. Thus, to solve this emerging issue in social media sites, recent studies employed a variety of feature engineering techniques and machine learning algorithms to automatically detect hate speech messages on different datasets. However, to the best of our knowledge, no study compares the variety of feature engineering techniques and machine learning algorithms to evaluate which combination performs best on a standard publicly available dataset. Hence, the aim of this paper is to compare the performance of three feature engineering techniques and eight machine learning algorithms on a publicly available dataset having three distinct classes. The experimental results showed that bigram features used with the support vector machine algorithm performed best, with 79% overall accuracy. Our study holds practical implications and can be used as a baseline study in the area of automatic hate speech detection. Moreover, the output of the different comparisons will be used as state-of-the-art techniques against which to compare future research on automated text classification.

Posted Content•

Replay and Synthetic Speech Detection with Res2net Architecture

Xu Li¹, Na Li², Chao Weng², Xunying Liu¹, Dan Su², Dong Yu², Helen Meng¹•Institutions (2)

The Chinese University of Hong Kong¹, Tencent²

28 Oct 2020-arXiv: Audio and Speech Processing

TL;DR: The Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both physical access (PA) and logical access (LA) of the ASVspoof 2019 corpus and the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios.

Abstract: Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, so-called Res2Net, to improve the anti-spoofing countermeasure's generalizability. Res2Net mainly modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and designs a residual-like connection across different channel groups. Such connection increases the possible receptive fields, resulting in multiple feature scales. This multiple scaling mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks. It also decreases the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both physical access (PA) and logical access (LA) of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features, and observe that the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both PA and LA of the ASVspoof 2019 corpus.
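The channel-group mechanism described in the abstract can be sketched in a few lines. The version below follows the split-and-connect pattern commonly associated with Res2Net: the first group passes through unchanged, and each later group is transformed after adding the previous group's output. That exact wiring is an assumption here, since the abstract only mentions a residual-like connection across groups. The function name `res2net_split_connect` is hypothetical, and the per-group transforms are plain Python callables standing in for the convolutions a real block would use.

```python
# Hypothetical sketch of the Res2Net channel-group connection on a flat
# list of channel values. Real blocks operate on feature maps and use
# convolutions; the transforms here are stand-ins.

def res2net_split_connect(channels, scale, transforms):
    """Split channels into `scale` groups and chain them hierarchically.

    channels:   flat list of channel values.
    scale:      number of channel groups.
    transforms: list of `scale - 1` callables, one per group after the first.
    """
    if len(channels) % scale != 0:
        raise ValueError("channel count must be divisible by scale")
    width = len(channels) // scale
    groups = [channels[i * width:(i + 1) * width] for i in range(scale)]

    outputs = [groups[0]]  # the first group is passed through unchanged
    prev = None
    for i in range(1, scale):
        x = groups[i]
        if prev is not None:
            # Residual-like connection: add the previous group's output.
            x = [a + b for a, b in zip(x, prev)]
        prev = transforms[i - 1](x)
        outputs.append(prev)

    # Concatenate the group outputs back into one channel list.
    return [value for group in outputs for value in group]


def double(group):
    return [2 * v for v in group]


print(res2net_split_connect([1, 2, 3, 4, 5, 6], 3, [double, double]))
```

Because later groups see the outputs of earlier ones, each additional group passes through more transforms, which is one way to read the abstract's remark about increased receptive fields and multiple feature scales.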

Journal Article•DOI•

Machine learning techniques for hate speech classification of twitter data: State-of-the-art, future challenges and research directions

Femi Emmanuel Ayo¹, Olusegun Folorunso², Friday Thomas Ibharalu², Idowu Ademola Osinuga²•Institutions (2)

McPherson University¹, Federal University of Agriculture, Abeokuta²

01 Nov 2020-Computer Science Review

TL;DR: The results showed that the developed system performs very well for automatic topic detection and categorization, with an AUC of 0.97 that compares favorably to similar methods.

Journal Article•DOI•

A new way to enhance speech signal based on compressed sensing

Houria Haneche¹, Bachir Boudraa¹, Abdeldjalil Ouahabi²•Institutions (2)

University of Science and Technology Houari Boumediene¹, French Institute of Health and Medical Research²

01 Feb 2020-Measurement

TL;DR: Comparison with recent state-of-the-art methods is performed in terms of segmental signal to noise ratio, perceptual evaluation of speech quality, and short-time objective intelligibility.

Journal Article•DOI•

Hate and offensive speech detection on Arabic social media

Safa Alsafari¹, Samira Sadaoui¹, Malek Mouhoub¹•Institutions (1)

University of Regina¹

01 Sep 2020-Online Social Networks and Media

TL;DR: This study builds a reliable Arabic textual corpus by crawling data from Twitter using four robust extraction strategies that are implemented based on four types of hate: religion, ethnicity, nationality, and gender, and labels the corpus based on a three-hierarchical annotation scheme.

Posted Content•

Cross-lingual Zero- and Few-shot Hate Speech Detection Utilising Frozen Transformer Language Models and AXEL

Lukas Stappen, Fabian Brunn¹, Björn Schuller•Institutions (1)

Technische Universität München¹

13 Apr 2020-arXiv: Computation and Language

TL;DR: A tailored architecture based on frozen, pre-trained Transformers is developed to examine cross-lingual zero-shot and few-shot learning, in addition to uni-lingual learning, on the HatEval challenge data set, demonstrating highly competitive results on the English and Spanish subsets.

Abstract: Detecting hate speech, especially in low-resource languages, is a non-trivial challenge. To tackle this, we developed a tailored architecture based on frozen, pre-trained Transformers to examine cross-lingual zero-shot and few-shot learning, in addition to uni-lingual learning, on the HatEval challenge data set. With our novel attention-based classification block AXEL, we demonstrate highly competitive results on the English and Spanish subsets. We also re-sample the English subset, enabling additional, meaningful comparisons in the future.
