
Showing papers on "Voice activity detection" published in 2020


Proceedings ArticleDOI
04 May 2020
TL;DR: In this article, the authors introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision, which is derived from open-source audio books from the LibriVox project.
Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

207 citations
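
The PER, CER and WER metrics named in the abstract above are all edit-distance error rates computed over different units (phonemes, characters, words). A minimal sketch follows, assuming whitespace tokenization for words; the function names are illustrative and are not taken from the paper's baseline code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference units
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate over raw characters."""
    return error_rate(list(ref), list(hyp))
```

PER can be obtained the same way by passing phoneme sequences to `error_rate`.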


Proceedings ArticleDOI
01 May 2020
TL;DR: This work introduces pyannote.audio, an open-source toolkit written in Python for speaker diarization, which provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
Abstract: We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding – reaching state-of-the-art performance for most of them.

179 citations


Proceedings ArticleDOI
25 Oct 2020
TL;DR: A novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame, outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) abs.
Abstract: Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame. TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) abs.

141 citations
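
The abstract above reports its gain as an absolute DER difference. A minimal sketch of the standard DER definition, and of the difference between absolute and relative reductions, is given below. The function names and every number in the usage note are illustrative assumptions, not figures from the paper.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total reference speech.

    All arguments are durations in the same unit (for example seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def absolute_reduction(baseline_der, system_der):
    """Difference in DER, in DER points."""
    return baseline_der - system_der


def relative_reduction(baseline_der, system_der):
    """Difference in DER as a fraction of the baseline DER."""
    return (baseline_der - system_der) / baseline_der
```

For example, with hypothetical values, going from a DER of 0.60 to 0.30 is a 0.30 absolute reduction but a 0.50 relative reduction.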


Journal ArticleDOI
TL;DR: This article proposes a robust neural architecture that is shown to perform in a satisfactory way across different languages; namely, English, Italian, and German and addresses an extensive analysis of the obtained experimental results over the three languages.
Abstract: The increasing popularity of social media platforms such as Twitter and Facebook has led to a rise in the presence of hate and aggressive speech on these platforms. Despite the number of approaches recently proposed in the Natural Language Processing research area for detecting these forms of abusive language, the issue of identifying hate speech at scale is still an unsolved problem. In this article, we propose a robust neural architecture that is shown to perform in a satisfactory way across different languages; namely, English, Italian, and German. We address an extensive analysis of the obtained experimental results over the three languages to gain a better understanding of the contribution of the different components employed in the system, both from the architecture point of view (i.e., Long Short Term Memory, Gated Recurrent Unit, and bidirectional Long Short Term Memory) and from the feature selection point of view (i.e., ngrams, social network–specific features, emotion lexica, emojis, word embeddings). To address such in-depth analysis, we use three freely available datasets for hate speech detection on social media in English, Italian, and German.

107 citations


Proceedings ArticleDOI
01 Mar 2020
TL;DR: It is found that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text.
Abstract: In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image. We gather and annotate a large scale dataset from Twitter, MMHS150K, and propose different models that jointly analyze textual and visual information for hate speech detection, comparing them with unimodal detection. We provide quantitative and qualitative results and analyze the challenges of the proposed task. We find that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text. We discuss why and open the field and the dataset for further research.

96 citations


Journal ArticleDOI
TL;DR: A modified version of rVAD is presented in which computationally intensive pitch extraction is replaced by a computationally efficient spectral flatness calculation. This significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low-resource devices.

90 citations
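
Spectral flatness, mentioned in the summary above, is the ratio of the geometric mean to the arithmetic mean of a frame's power spectrum. A minimal standard-library sketch of the measure follows. It is not the rVAD implementation: the naive DFT, the `eps` floor and the function names are assumptions made for illustration, and how rVAD thresholds or combines the measure is not shown.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum of a real-valued frame (bins 0..N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(power, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a peaky, tonal spectrum. `eps` avoids log(0).
    """
    power = [p + eps for p in power]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))
```

An impulse has a flat spectrum and scores close to 1, while a pure sinusoid concentrates its energy in one bin and scores close to 0.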


Journal ArticleDOI
TL;DR: This paper proposes a hate speech detection approach to identify hatred against vulnerable minority groups on social media and can successfully identify the Tigre ethnic group as the highly vulnerable community in terms of hatred compared with Amhara and Oromo.
Abstract: With the rapid development in mobile computing and Web technologies, online hate speech has been increasingly spread in social network platforms since it's easy to post any opinions. Previous studies confirm that exposure to online hate speech has serious offline consequences to historically deprived communities. Thus, research on automated hate speech detection has attracted much attention. However, the role of social networks in identifying hate-related vulnerable community is not well investigated. Hate speech can affect all population groups, but some are more vulnerable to its impact than others. For example, for ethnic groups whose languages have few computational resources, it is a challenge to automatically collect and process online texts, not to mention automatic hate speech detection on social media. In this paper, we propose a hate speech detection approach to identify hatred against vulnerable minority groups on social media. Firstly, in Spark distributed processing framework, posts are automatically collected and pre-processed, and features are extracted using word n-grams and word embedding techniques such as Word2Vec. Secondly, deep learning algorithms for classification such as Gated Recurrent Unit (GRU), a variety of Recurrent Neural Networks (RNNs), are used for hate speech detection. Finally, hate words are clustered with methods such as Word2Vec to predict the potential target ethnic group for hatred. In our experiments, we use Amharic language in Ethiopia as an example. Since there was no publicly available dataset for Amharic texts, we crawled Facebook pages to prepare the corpus. Since data annotation could be biased by culture, we recruit annotators from different cultural backgrounds and achieved better inter-annotator agreement. 
In our experimental results, feature extraction using word embedding techniques such as Word2Vec performs better in both classical and deep learning-based classification algorithms for hate speech detection, among which GRU achieves the best result. Our proposed approach can successfully identify the Tigre ethnic group as the highly vulnerable community in terms of hatred compared with Amhara and Oromo. As a result, hatred vulnerable group identification is vital to protect them by applying automatic hate speech detection model to remove contents that aggravate psychological harm and physical conflicts. This can also encourage the way towards the development of policies, strategies, and tools to empower and protect vulnerable communities.

88 citations


Proceedings ArticleDOI
25 May 2020
TL;DR: Experimental results suggest that the adversarial training method used in this paper is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
Abstract: In the task of hate speech detection, there exists a high correlation between African American English (AAE) and annotators’ perceptions of toxicity in current datasets. This bias in annotated training data and the tendency of machine learning models to amplify it cause AAE text to often be mislabeled as abusive/offensive/hate speech (high false positive rate) by current hate speech classifiers. Here, we use adversarial training to mitigate this bias. Experimental results on one hate speech dataset and one AAE dataset suggest that our method is able to reduce the false positive rate for AAE text with only a minimal compromise on the performance of hate speech classification.

86 citations


Journal ArticleDOI
TL;DR: The proposed DCNN model utilises the tweet text with GloVe embedding vector to capture the tweets’ semantics with the help of convolution operation and achieved the precision, recall and F1-score value as 0.97, 0.88, and 0.92 respectively for the best case and outperformed the existing models.
Abstract: The rapid growth of Internet users led to unwanted cyber issues, including cyberbullying, hate speech, and many more. This article deals with the problems of hate speech on Twitter. Hate speech appears to be an inflammatory kind of interaction process that uses misconceptions to express a hate ideology. The hate speech focuses on various protected aspects, including gender, religion, race, and disability. Owing to hate speech, sometimes unwanted crimes are going to happen as someone or a group of people get disheartened. Hence, it is essential to monitor user’s posts and filter the hate speech related post before it is spread. However, Twitter receives more than six hundred tweets per second and about 500 million tweets per day. Manually filtering any information from such a huge incoming traffic is almost impossible. Concerning to this aspect, an automated system is developed using the Deep Convolutional Neural Network (DCNN). The proposed DCNN model utilises the tweet text with GloVe embedding vector to capture the tweets’ semantics with the help of convolution operation and achieved the precision, recall and F1-score value as 0.97, 0.88, 0.92 respectively for the best case and outperformed the existing models.

74 citations
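
The three scores quoted in the abstract above are mutually consistent, since F1 is the harmonic mean of precision and recall. A short check, with an illustrative function name:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 0.97 and recall 0.88, as quoted in the abstract.
best_case_f1 = round(f1_score(0.97, 0.88), 2)
```

Here `best_case_f1` evaluates to 0.92, matching the reported F1-score.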


Journal ArticleDOI
TL;DR: A deep multi-task learning (MTL) framework is proposed to leverage useful information from multiple related classification tasks in order to improve the performance of the individual task.
Abstract: With the advent of the internet and numerous social media platforms, citizens now have enormous opportunities to express and share their opinions on various societal and political issues. This phenomenal growth of the internet, social media networks, and messaging platforms provide plenty of opportunities for building intelligent systems, but these are also being heavily misused by certain groups who often disseminate offensive, racial, and hate speeches. Hence, detecting hate speech at the right time plays a crucial role as its spread might affect social fabrics. In recent times, although a few benchmark datasets have emerged for hate speech detection, these are limited in volume and also do not follow any uniform annotation schema. In this paper, a deep multi-task learning (MTL) framework is proposed to leverage useful information from multiple related classification tasks in order to improve the performance of the individual task. The proposed multi-task model is based on the shared-private scheme that assigns shared and private layers to capture the shared-features and task-specific features from five classification tasks. Experiments on the 5 datasets show that the proposed framework attains encouraging performance in terms of macro-F1 and weighted-F1.

71 citations


Posted Content
TL;DR: A large scale analysis of multilingual hate speech in 9 languages from 16 different sources shows that in low resource setting, simple models such as LASER embedding with logistic regression performs the best, while in high resource setting BERT based models perform better.
Abstract: Hate speech detection is a challenging problem with most of the datasets available in only one language: English. In this paper, we conduct a large scale analysis of multilingual hate speech in 9 languages from 16 different sources. We observe that in low resource setting, simple models such as LASER embedding with logistic regression performs the best, while in high resource setting BERT based models perform better. In case of zero-shot classification, languages such as Italian and Portuguese achieve good results. Our proposed framework could be used as an efficient solution for low-resource languages. These models could also act as good baselines for future multilingual hate speech detection tasks. We have made our code and experimental settings public for other researchers.

Journal ArticleDOI
14 Jan 2020
TL;DR: The literature review of ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc, along with recent efforts to develop countermeasures for spoof speech detection (SSD) task are presented.
Abstract: In recent years, automatic speaker verification (ASV) is used extensively for voice biometrics. This leads to an increased interest to secure these voice biometric systems for real-world applications. The ASV systems are vulnerable to various kinds of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins, and impersonation. This paper provides the literature review of ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc. Furthermore, the paper also summarizes previous studies of spoofing attacks with emphasis on SS, VC, and replay along with recent efforts to develop countermeasures for spoof speech detection (SSD) task. The limitations and challenges of SSD task are also presented. While several countermeasures were reported in the literature, they are mostly validated on a particular database, furthermore, their performance is far from perfect. The security of voice biometrics systems against spoofing attacks remains a challenging topic. This paper is based on a tutorial presented at APSIPA Annual Summit and Conference 2017 to serve as a quick start for those interested in the topic.

Journal ArticleDOI
TL;DR: This paper introduces a method that uses a hybrid of natural language processing and machine learning techniques to predict hate speech from social media websites.
Abstract: Over the last decade, the increased use of social media has led to an increase in hateful activities in social networks. Hate speech is one of the most dangerous of these activities, so users have to protect themselves from these activities from YouTube, Facebook, Twitter etc. This paper introduces a method that uses a hybrid of natural language processing and machine learning techniques to predict hate speech from social media websites. After hate speech is collected, stemming, token splitting, character removal and inflection elimination are performed before the hate speech recognition process. After that, the collected data is examined using a killer natural language processing optimization ensemble deep learning approach (KNLPEDNN). This method detects hate speech on social media websites using an effective learning process that classifies the text into neutral, offensive and hate language. The performance of the system is then evaluated using overall accuracy, f-score, precision and recall metrics. The system attained minimum deviations mean square error − 0.019, Cross Entropy Loss − 0.015 and Logarithmic loss L-0.0238 and 98.71% accuracy.

Posted Content
TL;DR: This paper generates semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages and uses the data to build language recognition models for several spoken language identification tasks.
Abstract: This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The size of the resulting training set (VoxLingua107) is 6628 hours (62 hours per language on the average) and it is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives competitive results to using hand-labeled proprietary datasets. The dataset is publicly available.

Journal ArticleDOI
TL;DR: The temporal robustness of a BERT model for Italian (AlBERTo) is explored, showing how AlBERTo is highly sensitive to the temporal distance of the fine-tuning set, and how with an adequate time window, the performance increases, while requiring less annotated data than a traditional classifier.
Abstract: The availability of large annotated corpora from social media and the development of powerful classification approaches have contributed in an unprecedented way to tackle the challenge of monitoring users’ opinions and sentiments in online social platforms across time. Such linguistic data are strongly affected by events and topic discourse, and this aspect is crucial when detecting phenomena such as hate speech, especially from a diachronic perspective. We address this challenge by focusing on a real case study: the “Contro l’odio” platform for monitoring hate speech against immigrants in the Italian Twittersphere. We explored the temporal robustness of a BERT model for Italian (AlBERTo), the current benchmark on non-diachronic detection settings. We tested different training strategies to evaluate how the classification performance is affected by adding more data temporally distant from the test set and hence potentially different in terms of topic and language use. Our analysis points out the limits that a supervised classification model encounters on data that are heavily influenced by events. Our results show how AlBERTo is highly sensitive to the temporal distance of the fine-tuning set. However, with an adequate time window, the performance increases, while requiring less annotated data than a traditional classifier.

Proceedings ArticleDOI
06 Jul 2020
TL;DR: DeepHate is a novel deep learning model that combines multi-faceted text representations such as word embeddings, sentiments, and topical information, to detect hate speech in online social platforms and outperforms the state-of-the-art baselines on the hate speech detection task.
Abstract: Online hate speech is an important issue that breaks the cohesiveness of online social communities and even raises public safety concerns in our societies. Motivated by this rising issue, researchers have developed many traditional machine learning and deep learning methods to detect hate speech in online social platforms automatically. However, most of these methods have only considered single type textual feature, e.g., term frequency, or using word embeddings. Such approaches neglect the other rich textual information that could be utilized to improve hate speech detection. In this paper, we propose DeepHate, a novel deep learning model that combines multi-faceted text representations such as word embeddings, sentiments, and topical information, to detect hate speech in online social platforms. We conduct extensive experiments and evaluate DeepHate on three large publicly available real-world datasets. Our experiment results show that DeepHate outperforms the state-of-the-art baselines on the hate speech detection task. We also perform case studies to provide insights into the salient features that best aid in detecting hate speech in online social platforms.

Journal ArticleDOI
TL;DR: It is found that subband transform captures the artifacts in synthetic speech more effectively than full band transform.
Abstract: In text-to-speech or voice conversion based synthetic speech detection, it is a common practice that spectral information over the entire frequency band is used for feature representation. We propose a new method, referred to as subband transform, that characterizes the signals by subband. It is found that subband transform captures the artifacts in synthetic speech more effectively than full band transform. We propose equal subband transform, octave subband transform, and mel subband transform for three novel features, namely, constant-Q equal subband transform (CQ-EST), constant-Q octave subband transform (CQ-OST) and discrete Fourier mel subband transform (DF-MST). We evaluate the three features on the ASVspoof 2015, noisy ASVspoof 2015 and ASVspoof 2019 logical access corpora. The experiments show that the proposed CQ-EST feature achieves an average equal error rate of 0.056% on ASVspoof 2015 evaluation set. The study observes that the features based on subband transform outperform those based on full band transform under both clean and noisy conditions. In addition, the tandem detection cost function of CQ-OST can reach 0.188 on ASVspoof 2019 logical access evaluation set.

Proceedings Article
18 Dec 2020
TL;DR: HateXplain is a dataset for hate speech detection in which each post is annotated from three different perspectives: the basic, commonly used 3-class classification, the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which the labeling decision (as hate, offensive or normal) is based.
Abstract: Hate speech is a challenging issue plaguing the online social media. While better models for hate speech detection are continuously being developed, there is little research on the bias and interpretability aspects of hate speech. In this paper, we introduce HateXplain, the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in our dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which their labelling decision (as hate, offensive or normal) is based. We utilize existing state-of-the-art models and observe that even models that perform very well in classification do not score high on explainability metrics like model plausibility and faithfulness. We also observe that models, which utilize the human rationales for training, perform better in reducing unintended bias towards target communities. We have made our code and dataset public for other researchers.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: Personal VAD is a system that detects the voice activity of a target speaker at the frame level, using a VAD-alike neural network conditioned on the target speaker embedding or the speaker verification score.
Abstract: In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
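
The gating idea in the abstract above can be sketched as follows: given per-frame probabilities for the three classes, keep only the frame ranges where the target-speaker class is likely enough. This is an illustrative post-processing step, not the paper's model; the class order, the `threshold` value and the function name are assumptions.

```python
# Assumed class order for each frame's probability triple.
CLASSES = ("non_speech", "target_speech", "non_target_speech")


def target_segments(frame_probs, threshold=0.5):
    """Return (start, end) frame index pairs, end exclusive, where the
    target-speaker probability is at least `threshold`."""
    segments = []
    start = None
    for i, (_, p_target, _) in enumerate(frame_probs):
        if p_target >= threshold:
            if start is None:
                start = i  # a target-speaker run begins here
        elif start is not None:
            segments.append((start, i))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frame_probs)))  # run lasts to the final frame
    return segments
```

Only the frames inside the returned ranges would be forwarded to the recognizer; a real system would likely add smoothing to avoid rapid on/off switching.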

Book ChapterDOI
01 Jan 2020
TL;DR: The Hate Speech Detection (HaSpeeDe 2) task is the second edition of a shared task on the detection of hateful content in Italian Twitter messages and is composed of a Main task (hate speech detection) and two Pilot tasks (stereotype and nominal utterance detection).
Abstract: The Hate Speech Detection (HaSpeeDe 2) task is the second edition of a shared task on the detection of hateful content in Italian Twitter messages. HaSpeeDe 2 is composed of a Main task (hate speech detection) and two Pilot tasks (stereotype and nominal utterance detection). Systems were challenged along two dimensions: (i) time, with test data coming from a different time period than the training data, and (ii) domain, with test data coming from the news domain (i.e., news headlines). Overall, 14 teams participated in the Main task, the best systems achieved a macro F1-score of 0.8088 and 0.7744 on the in-domain and the out-of-domain test sets, respectively; 6 teams submitted their results for Pilot task 1 (stereotype detection), the best systems achieved a macro F1-score of 0.7719 and 0.7203 on in-domain and out-of-domain test sets. We did not receive any submission for Pilot task 2.

Journal ArticleDOI
TL;DR: This paper aimed to investigate several neural network models based on convolutional neural network (CNN) and recurrent neural network (RNN) to detect hate speech in Arabic tweets and evaluated the recent language representation model bidirectional encoder representations from transformers (BERT) on the task of Arabic hate speech detection.
Abstract: With the rise of hate speech phenomena in the Twittersphere, significant research efforts have been undertaken in order to provide automatic solutions for detecting hate speech, varying from simple machine learning models to more complex deep neural network models. Despite this, research works investigating hate speech problem in Arabic are still limited. This paper, therefore, aimed to investigate several neural network models based on convolutional neural network (CNN) and recurrent neural network (RNN) to detect hate speech in Arabic tweets. It also evaluated the recent language representation model bidirectional encoder representations from transformers (BERT) on the task of Arabic hate speech detection. To conduct our experiments, we firstly built a new hate speech dataset that contained 9316 annotated tweets. Then, we conducted a set of experiments on two datasets to evaluate four models: CNN, gated recurrent units (GRU), CNN + GRU, and BERT. Our experimental results in our dataset and an out-domain dataset showed that the CNN model gave the best performance, with an F1-score of 0.79 and area under the receiver operating characteristic curve (AUROC) of 0.89.

01 Dec 2020
TL;DR: This paper annotated hope speech for equality, diversity and inclusion in a multilingual setting and determined the inter-annotator agreement of the dataset using Krippendorff's alpha.
Abstract: Over the past few years, systems have been developed to control online content and eliminate abusive, offensive or hate speech content. However, people in power sometimes misuse this form of censorship to obstruct the democratic right of freedom of speech. Therefore, it is imperative that research should take a positive reinforcement approach towards online content that is encouraging, positive and supportive. Until now, most studies have focused on solving this problem of negativity in the English language, though the problem is much more than just harmful content. Furthermore, it is multilingual as well. Thus, we have constructed a Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube with 28,451, 20,198 and 10,705 comments in English, Tamil and Malayalam, respectively, manually labelled as containing hope speech or not. To our knowledge, this is the first research of its kind to annotate hope speech for equality, diversity and inclusion in a multilingual setting. We determined the inter-annotator agreement of our dataset using Krippendorff’s alpha. Further, we created several baselines to benchmark the resulting dataset and the results have been expressed using precision, recall and F1-score. The dataset is publicly available for the research community. We hope that this resource will spur further research on encouraging inclusive and responsive speech that reinforces positiveness.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work presents 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea, and provides benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks.
Abstract: Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff’s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.
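
Krippendorff's alpha, the agreement score quoted in the abstract above, compares observed disagreement with the disagreement expected by chance. A minimal sketch for nominal labels follows, built from the coincidence-matrix formulation. It is not the authors' evaluation code; the input layout (one list of labels per annotated item, missing ratings simply omitted) and the function name are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of label lists, one per annotated item.
    Items with fewer than two labels cannot be paired and are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of ratings from different annotators
        # contributes 1 / (m - 1) to the coincidence matrix.
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1 - observed / expected
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance.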

Proceedings ArticleDOI
04 May 2020
TL;DR: In this article, a neural long short-term memory (LSTM) based architecture for overlap detection is proposed, and detected overlap regions are exploited in conjunction with a frame-level speaker posterior matrix to make two-speaker assignments for overlapped frames in the resegmentation step.
Abstract: We address the problem of effectively handling overlapping speech in a diarization system. First, we detail a neural Long Short-Term Memory-based architecture for overlap detection. Secondly, detected overlap regions are exploited in conjunction with a frame-level speaker posterior matrix to make two-speaker assignments for overlapped frames in the resegmentation step. The overlap detection module achieves state-of-the-art performance on the AMI, DIHARD, and ETAPE corpora. We apply overlap-aware resegmentation on AMI, resulting in a 20% relative DER reduction over the baseline system. While this approach is by no means an end-all solution to overlap-aware diarization, it reveals promising directions for handling overlap.
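
The two-speaker assignment described above can be illustrated with a small sketch. It assumes the simplest possible rule (take the single most probable speaker for ordinary frames and the two most probable for frames flagged as overlapped); the paper's actual resegmentation procedure may differ, and the name `assign_speakers` and the list-of-lists input layout are hypothetical.

```python
def assign_speakers(posteriors, overlap):
    """Pick speaker indices per frame from a frame-level posterior matrix.

    posteriors: list of frames, each a list of per-speaker posterior values.
    overlap:    list of booleans, True where an overlap detector fired.
    Returns one sorted list of speaker indices per frame.
    """
    assignments = []
    for frame_posteriors, is_overlap in zip(posteriors, overlap):
        # Rank speakers from most to least probable for this frame.
        ranked = sorted(
            range(len(frame_posteriors)),
            key=lambda speaker: frame_posteriors[speaker],
            reverse=True,
        )
        # Overlapped frames keep the top two speakers, all others keep one.
        keep = 2 if is_overlap else 1
        assignments.append(sorted(ranked[:keep]))
    return assignments
```

With three speakers, a frame with posteriors `[0.4, 0.5, 0.1]` flagged as overlapped is assigned to speakers 0 and 1, while the same frame without the overlap flag is assigned to speaker 1 only.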

Journal ArticleDOI
TL;DR: This paper compares three feature engineering techniques and eight machine learning algorithms on a publicly available dataset having three distinct classes, and shows that bigram features used with the support vector machine algorithm performed best, with 79% overall accuracy.
Abstract: The increasing use of social media and information sharing has given major benefits to humanity. However, this has also given rise to a variety of challenges including the spreading and sharing of hate speech messages. Thus, to solve this emerging issue in social media sites, recent studies employed a variety of feature engineering techniques and machine learning algorithms to automatically detect the hate speech messages on different datasets. However, to the best of our knowledge, no study compares the variety of feature engineering techniques and machine learning algorithms to evaluate which combination performs best on a standard publicly available dataset. Hence, the aim of this paper is to compare the performance of three feature engineering techniques and eight machine learning algorithms on a publicly available dataset having three distinct classes. The experimental results showed that the bigram features, when used with the support vector machine algorithm, performed best with 79% overall accuracy. Our study holds practical implications and can be used as a baseline study in the area of automatically detecting hate speech messages. Moreover, the output of the different comparisons can serve as state-of-the-art baselines against which future research on automated text classification techniques can be compared.
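
As a rough illustration of what bigram features look like before they reach a classifier, the sketch below builds a count matrix of word bigrams. It assumes word-level bigrams and whitespace tokenization, neither of which is confirmed by the abstract, and it leaves out the support vector machine step; the function names are hypothetical.

```python
from collections import Counter


def word_bigrams(text):
    """Return the consecutive word pairs of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_count_matrix(texts):
    """Build a vocabulary of bigrams and one count row per input text."""
    vocabulary = {}  # bigram -> column index, in order of first appearance
    rows = []
    for text in texts:
        counts = Counter(word_bigrams(text))
        for bigram in counts:
            vocabulary.setdefault(bigram, len(vocabulary))
        rows.append(counts)
    # Expand each sparse Counter into a dense row aligned with the vocabulary.
    matrix = [[row.get(bigram, 0) for bigram in vocabulary] for row in rows]
    return vocabulary, matrix
```

Each row of the resulting matrix could then be passed to any classifier that accepts numeric feature vectors. Unlike single-word counts, the bigram view separates "are great" from "not great".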

Posted Content
Xu Li, Na Li, Chao Weng, Xunying Liu, Dan Su, Dong Yu, Helen Meng
TL;DR: The Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both physical access (PA) and logical access (LA) of the ASVspoof 2019 corpus and the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios.
Abstract: Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, so-called Res2Net, to improve the anti-spoofing countermeasure's generalizability. Res2Net mainly modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and designs a residual-like connection across different channel groups. Such connection increases the possible receptive fields, resulting in multiple feature scales. This multiple scaling mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks. It also decreases the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both physical access (PA) and logical access (LA) of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features, and observe that the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both PA and LA of the ASVspoof 2019 corpus.
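
The channel-group connection described above can be sketched without any deep learning framework. The code follows the formulation of the original Res2Net design as an assumption: the first group passes through unchanged, the second is transformed on its own, and every later group is added to the previous group's output before being transformed. Plain Python callables stand in for the convolutions, and `res2net_groups` is a hypothetical name.

```python
def res2net_groups(x_groups, transforms):
    """Apply Res2Net-style hierarchical connections across channel groups.

    x_groups:   list of s feature groups, each a list of floats (a stand-in
                for one slice of the feature maps).
    transforms: list of s - 1 callables, one per group after the first
                (stand-ins for the convolutions in a real block).
    """
    outputs = [x_groups[0]]  # y_1 = x_1: the first group is passed through
    previous = None
    for x, transform in zip(x_groups[1:], transforms):
        if previous is None:
            group_input = x  # y_2 = K_2(x_2)
        else:
            # y_i = K_i(x_i + y_{i-1}): reuse the previous group's output,
            # which widens the effective receptive field step by step.
            group_input = [a + b for a, b in zip(x, previous)]
        previous = transform(group_input)
        outputs.append(previous)
    return outputs
```

Because each group sees the already-transformed output of its neighbour, later groups pass through more transformations than earlier ones, which is where the multiple feature scales come from.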

Journal ArticleDOI
TL;DR: The results showed that the developed system performs very well for automatic topic detection and categorization, with an AUC of 0.97 indicating a more accurate test than similar methods.

Journal ArticleDOI
TL;DR: Comparison with recent state-of-the-art methods is performed in terms of segmental signal to noise ratio, perceptual evaluation of speech quality, and short-time objective intelligibility.

Journal ArticleDOI
TL;DR: This study builds a reliable Arabic textual corpus by crawling data from Twitter using four robust extraction strategies that are implemented based on four types of hate: religion, ethnicity, nationality, and gender, and labels the corpus based on a three-hierarchical annotation scheme.

Posted Content
TL;DR: A tailored architecture based on frozen, pre-trained Transformers is developed to examine cross-lingual zero-shot and few-shot learning, in addition to uni-lingual learning, on the HatEval challenge data set, demonstrating highly competitive results on the English and Spanish subsets.
Abstract: Detecting hate speech, especially in low-resource languages, is a non-trivial challenge. To tackle this, we developed a tailored architecture based on frozen, pre-trained Transformers to examine cross-lingual zero-shot and few-shot learning, in addition to uni-lingual learning, on the HatEval challenge data set. With our novel attention-based classification block AXEL, we demonstrate highly competitive results on the English and Spanish subsets. We also re-sample the English subset, enabling additional, meaningful comparisons in the future.