
Showing papers on "Voice activity detection published in 2019"


Journal ArticleDOI
20 Aug 2019-PLOS ONE
TL;DR: This work identifies and examines challenges faced by online automatic approaches for hate speech detection in text, and proposes a multi-view SVM approach that achieves near state-of-the-art performance, while being simpler and producing more easily interpretable decisions than neural methods.
Abstract: As online content continues to grow, so does the spread of hate speech. We identify and examine challenges faced by online automatic approaches for hate speech detection in text. Among these difficulties are subtleties in language, differing definitions of what constitutes hate speech, and the limited availability of data for training and testing these systems. Furthermore, many recent approaches suffer from an interpretability problem: it can be difficult to understand why the systems make the decisions that they do. We propose a multi-view SVM approach that achieves near state-of-the-art performance while being simpler and producing more easily interpretable decisions than neural methods. We also discuss both technical and practical challenges that remain for this task.
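To make the multi-view idea concrete, here is a minimal sketch in scikit-learn, assuming two illustrative TF-IDF "views" (word and character n-grams) with one linear SVM per view and hard voting; the paper's actual views and fusion rule may differ. Inspecting each view's decision separately is what supports the interpretability claim.

```python
# Hedged sketch of a multi-view SVM text classifier (assumed views, not the
# paper's exact configuration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier

word_view = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    LinearSVC(C=1.0),
)
char_view = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LinearSVC(C=1.0),
)

# Each view votes independently; per-view decisions can be examined on
# their own, which a single end-to-end neural model does not allow.
multi_view = VotingClassifier(
    estimators=[("words", word_view), ("chars", char_view)],
    voting="hard",
)

# texts: list[str]; labels: list[int] (0 = non-hate, 1 = hate) -- placeholders.
# multi_view.fit(texts, labels)
# predictions = multi_view.predict(["example post to score"])
```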

350 citations


Book ChapterDOI
10 Dec 2019
TL;DR: This study introduces a novel transfer learning approach based on an existing pre-trained language model called BERT (Bidirectional Encoder Representations from Transformers) and investigates the ability of BERT at capturing hateful context within social media content by using new fine-tuning methods based on transfer learning.
Abstract: The generation of hateful and toxic content by a portion of social media users is a rising phenomenon that has motivated researchers to dedicate substantial effort to the challenging task of hateful content identification. We need not only an efficient automatic hate speech detection model based on advanced machine learning and natural language processing, but also a sufficiently large amount of annotated data to train such a model. The lack of sufficient labelled hate speech data, along with existing biases, has been the main issue in this domain of research. To address these needs, in this study we introduce a novel transfer learning approach based on an existing pre-trained language model called BERT (Bidirectional Encoder Representations from Transformers). More specifically, we investigate the ability of BERT to capture hateful context within social media content by using new fine-tuning methods based on transfer learning. To evaluate our proposed approach, we use two publicly available datasets that have been annotated for racism, sexism, hate, or offensive content on Twitter. The results show that our solution obtains considerable performance on these datasets in terms of precision and recall in comparison to existing approaches. Consequently, our model can capture some biases in the data annotation and collection process and can potentially lead us to a more accurate model.
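For readers who want to try BERT fine-tuning on labelled hate speech data, here is a minimal sketch using the Hugging Face transformers API with a binary label set; the paper's fine-tuning strategies are more elaborate, and the texts and labels below are placeholders.

```python
# Hedged sketch: one gradient step of BERT fine-tuning for binary
# hate/non-hate classification (placeholder data, assumed setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["example tweet one", "example tweet two"]   # placeholder data
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```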

223 citations


Journal ArticleDOI
TL;DR: In this article, the authors focus on the long tail of hate speech and propose deep neural network structures serving as feature extractors that are particularly effective for capturing the semantics of hate speech.
Abstract: In recent years, the increasing propagation of hate speech on social media and the urgent need for effective counter-measures have drawn significant investment from governments, companies, and researchers. A large number of methods have been developed for automated hate speech detection online. These aim to classify textual content into non-hate or hate speech, in which case the method may also identify the targeting characteristics (i.e., types of hate, such as race and religion) in the hate speech. However, we notice a significant difference between the performance on the two classes (i.e., non-hate vs. hate). In this work, we argue for a focus on the latter problem for practical reasons. We show that it is a much more challenging task: our analysis of the language in the typical datasets shows that hate speech lacks unique, discriminative features and is therefore found in the 'long tail' of a dataset, where it is difficult to discover. We then propose deep neural network structures serving as feature extractors that are particularly effective for capturing the semantics of hate speech. Our methods are evaluated on the largest collection of hate speech datasets based on Twitter, and are shown to outperform the best performing method by up to 5 percentage points in macro-average F1, or 8 percentage points in the more challenging case of identifying hateful content.

196 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper presents a new multilingual multi-aspect hate speech analysis dataset and uses it to test the current state-of-the-art multilingual multitask learning approaches.
Abstract: Current research on hate speech analysis is typically oriented towards monolingual and single classification tasks. In this paper, we present a new multilingual multi-aspect hate speech analysis dataset and use it to test the current state-of-the-art multilingual multitask learning approaches. We evaluate our dataset in various classification settings, then we discuss how to leverage our annotations in order to improve hate speech detection and classification in general.

167 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: The second edition of the DIHARD challenge as discussed by the authors was designed to improve the robustness of speaker diarization systems to variation in recording equipment, noise conditions, and conversational domain.
Abstract: This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.

135 citations


Proceedings ArticleDOI
01 Aug 2019
TL;DR: This research discusses multi-label text classification for abusive language and hate speech detection, including detecting the target, category, and level of hate speech in Indonesian Twitter, using a machine learning approach with Support Vector Machine, Naive Bayes, and Random Forest Decision Tree methods.
Abstract: Hate speech and abusive language spreading on social media need to be detected automatically to avoid conflict between citizens. Moreover, hate speech has a target, category, and level that also need to be detected to help the authorities prioritize which hate speech must be addressed immediately. This research discusses multi-label text classification for abusive language and hate speech detection, including detecting the target, category, and level of hate speech in Indonesian Twitter, using a machine learning approach with Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest Decision Tree (RFDT) classifiers and Binary Relevance (BR), Label Power-set (LP), and Classifier Chains (CC) as the data transformation methods. We used several kinds of feature extraction: term frequency, orthography, and lexicon features. Our experimental results show that, in general, the RFDT classifier using LP as the transformation method gives the best accuracy with fast computation time.
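Two of the problem-transformation methods named above, Binary Relevance and Classifier Chains, are available directly in scikit-learn; the sketch below shows them with a Random Forest base classifier, reducing feature extraction to word TF-IDF for brevity (Label Power-set is provided by the separate scikit-multilearn package). The data shapes are assumptions, not the paper's setup.

```python
# Hedged sketch of multi-label problem transformation for hate speech
# labels (target, category, level), assuming a binary label matrix Y.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain

vectorizer = TfidfVectorizer()
base = RandomForestClassifier(n_estimators=100)

# Binary Relevance: one independent classifier per label.
binary_relevance = MultiOutputClassifier(base)
# Classifier Chains: each classifier also sees the previously predicted labels.
chain = ClassifierChain(base, order="random", random_state=0)

# tweets: list[str]; Y: binary matrix of shape (n_samples, n_labels) -- placeholders.
# X = vectorizer.fit_transform(tweets)
# chain.fit(X, Y); predictions = chain.predict(X)
```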

109 citations


Proceedings ArticleDOI
23 Feb 2019
TL;DR: A background on hate speech and its related detection approaches is presented, along with challenges and recommendations for the Arabic hate speech detection problem.
Abstract: In social media platforms, hate speech can be a cause of "cyber conflict", which can affect social life at both the individual and country level. Hateful and antagonistic content propagated via social networks has the potential to cause harm and suffering on an individual basis and to lead to social tension and disorder beyond cyberspace. However, social networks cannot control all the content that users post. For this reason, there is a demand for automatic detection of hate speech. This demand rises particularly when the content is written in complex languages (e.g., Arabic); Arabic text is known for its complexity and challenges and the scarcity of its resources. This paper presents a background on hate speech and its related detection approaches. In addition, recent contributions on hate speech and related anti-social behaviour topics are reviewed. Finally, challenges and recommendations for the Arabic hate speech detection problem are presented.

106 citations


Journal ArticleDOI
TL;DR: A deep weighted fusion method for audio-visual emotion recognition that accounts for cross-modal feature fusion, denoising, and redundancy removal; the fusion method shows excellent performance on the selected data set.

91 citations


Proceedings ArticleDOI
03 Nov 2019
TL;DR: The proposed framework yields a significant increase in multi-class hate speech detection, outperforming the baseline in the largest online hate speech database by an absolute 5.7% increase in Macro-F1 score and 30% in hate speech class recall.
Abstract: In this paper, we address the issue of augmenting text data in supervised Natural Language Processing problems, exemplified by deep online hate speech classification. A great challenge in this domain is that although the presence of hate speech can be deleterious to the quality of service provided by social platforms, it still comprises only a tiny fraction of the content found online, which can lead to performance deterioration due to majority class overfitting. To this end, we perform a thorough study of the application of deep learning to the hate speech detection problem: a) we propose three text-based data augmentation techniques aimed at reducing the degree of class imbalance and at maximising the amount of information we can extract from our limited resources, and b) we apply them to a selection of top-performing deep architectures and hate speech databases in order to showcase their generalisation properties. The data augmentation techniques are based on a) synonym replacement based on word embedding vector closeness, b) warping of the word tokens along the padded sequence, or c) class-conditional, recurrent neural language generation. Our proposed framework yields a significant increase in multi-class hate speech detection, outperforming the baseline in the largest online hate speech database by an absolute 5.7% increase in Macro-F1 score and 30% in hate speech class recall.
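The first augmentation technique, synonym replacement by embedding-vector closeness, can be sketched with pre-trained gensim word vectors; the vector set, replacement probability, and tokenization below are illustrative assumptions, and the warping and neural-generation techniques are not shown.

```python
# Hedged sketch of embedding-based synonym replacement for augmentation.
import random
import gensim.downloader

# Any pre-trained KeyedVectors will do; this choice is an assumption.
vectors = gensim.downloader.load("glove-wiki-gigaword-100")

def augment(tokens, p=0.2):
    """Replace each in-vocabulary token with its nearest neighbour with prob p."""
    out = []
    for tok in tokens:
        if tok in vectors and random.random() < p:
            neighbour, _score = vectors.most_similar(tok, topn=1)[0]
            out.append(neighbour)
        else:
            out.append(tok)
    return out

print(augment("this is an example sentence".split()))
```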

81 citations


Proceedings ArticleDOI
13 May 2019
TL;DR: This work systematically designs methods to quantify the bias of any model, proposes algorithms for identifying the set of words which the model stereotypes, and proposes novel methods leveraging knowledge-based generalizations for bias-free learning.
Abstract: With ever-increasing cases of hate spread on social media platforms, it is critical to design abuse detection mechanisms to proactively avoid and control such incidents. While there exist methods for hate speech detection, they stereotype words and hence suffer from inherently biased training. Bias removal has traditionally been studied for structured datasets, but we aim at bias mitigation from unstructured text data. In this paper, we make two important contributions. First, we systematically design methods to quantify the bias of any model and propose algorithms for identifying the set of words which the model stereotypes. Second, we propose novel methods leveraging knowledge-based generalizations for bias-free learning. Knowledge-based generalization provides an effective way to encode knowledge because the abstraction it provides not only generalizes content but also facilitates retraction of information from the hate speech detection classifier, thereby reducing the imbalance. We experiment with multiple knowledge generalization policies and analyze their effect on general performance and on mitigating bias. Our experiments with two real-world datasets, a Wikipedia Talk Pages dataset (WikiDetox) of size ~96k and a Twitter dataset of size ~24k, show that the use of knowledge-based generalizations results in better performance by forcing the classifier to learn from generalized content. Our methods utilize existing knowledge-bases and can easily be extended to other tasks.
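As an illustration of knowledge-based generalization, the sketch below uses WordNet hypernyms to replace stereotyped words with more general terms; the paper's knowledge bases and generalization policies are richer, and the stereotyped word set here is hypothetical.

```python
# Hedged sketch: generalize stereotyped words via WordNet hypernyms so the
# classifier learns from generalized content (illustrative policy only).
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def generalize(token):
    """Return the first hypernym lemma of the token's first sense, if any."""
    synsets = wordnet.synsets(token)
    if synsets and synsets[0].hypernyms():
        return synsets[0].hypernyms()[0].lemmas()[0].name()
    return token

stereotyped = {"dog", "snake"}  # hypothetical output of the bias-detection step
tokens = ["you", "are", "a", "snake"]
print([generalize(t) if t in stereotyped else t for t in tokens])
```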

73 citations


Proceedings ArticleDOI
03 Jan 2019
TL;DR: This paper deals with the task of identifying hate speech in code-mixed social media text using two architectures, namely a sub-word level LSTM model and a Hierarchical LSTM model with attention based on phonemic sub-words.
Abstract: With the increase in user-generated content, particularly on social media networks, the amount of hate speech is also steadily increasing. There is therefore a need to automatically detect such hateful content and curb the wrongful activities. While relevant research has been done independently on code-mixed social media texts and on hate speech detection, this paper deals with the task of identifying hate speech in code-mixed social media text. We perform experiments on an available code-mixed dataset for hate speech detection using two architectures, namely a sub-word level LSTM model and a Hierarchical LSTM model with attention based on phonemic sub-words.

Posted Content
TL;DR: In this paper, the problem of hate speech detection in multimodal publications formed by a text and an image is addressed, and different models that jointly analyze textual and visual information are compared.
Abstract: In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image. We gather and annotate a large scale dataset from Twitter, MMHS150K, and propose different models that jointly analyze textual and visual information for hate speech detection, comparing them with unimodal detection. We provide quantitative and qualitative results and analyze the challenges of the proposed task. We find that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text. We discuss why and open the field and the dataset for further research.

Posted Content
TL;DR: This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain.
Abstract: This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.

Posted Content
TL;DR: This work addresses the challenge of hate speech detection in Internet memes and attempts to use visual information to automatically detect hate speech, unlike, to the authors' knowledge, any previous work.
Abstract: This work addresses the challenge of hate speech detection in Internet memes, and attempts to use visual information to automatically detect hate speech, unlike any previous work to our knowledge. Memes are pixel-based multimedia documents that contain photos or illustrations together with phrases which, when combined, usually adopt a funny meaning. However, hate memes are also used to spread hate through social networks, so their automatic detection would help reduce their harmful societal impact. In our experiments, we built a dataset of 5,020 memes to train and evaluate a multi-layer perceptron over the visual and language representations, whether independent or fused. Our results indicate that the model can learn to detect some of the memes, but that the task is far from being solved with this simple architecture. While previous work focuses on linguistic hate speech, our experiments indicate that the visual modality can be much more informative for hate speech detection in memes than the linguistic one. The source code and models are available at this https URL.

Journal ArticleDOI
TL;DR: A mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones, and two strategies are proposed to merge source and filter information: feature and decision fusion.
Abstract: Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise. Numerous approaches have been proposed for this purpose. Some are based on features derived from the power spectral density, others exploit the periodicity of the signal. The goal of this paper is to investigate the joint use of source and filter-based features. Interestingly, a mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones. The features then serve as input to an artificial neural network-based classifier trained on a multi-condition database. Two strategies are proposed to merge source and filter information: feature fusion and decision fusion. Our experiments indicate an absolute reduction of 3% in the equal error rate when using decision fusion. The final proposed system is compared to four state-of-the-art methods on 150 minutes of data recorded in real environments. Thanks to the robustness of its source-related features, its multi-condition training, and its efficient information fusion, the proposed system yields a substantial increase in accuracy over the best state-of-the-art VAD across all conditions (24% absolute on average).
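A minimal sketch of the decision-fusion strategy: two classifiers, one on source-related and one on filter-based per-frame features, whose posteriors are averaged. Scikit-learn's MLPClassifier stands in for the paper's artificial neural network, and the multi-condition training and feature definitions are omitted.

```python
# Hedged sketch of per-frame decision fusion for VAD (assumed feature
# matrices X_source and X_filter, labels y: 0 = noise, 1 = speech).
import numpy as np
from sklearn.neural_network import MLPClassifier

src_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
filt_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)

# src_clf.fit(X_source, y); filt_clf.fit(X_filter, y)

def fuse_decisions(X_source, X_filter, threshold=0.5):
    """Average the two speech posteriors and threshold per frame."""
    p_src = src_clf.predict_proba(X_source)[:, 1]
    p_filt = filt_clf.predict_proba(X_filter)[:, 1]
    return (0.5 * (p_src + p_filt)) >= threshold
```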

Journal ArticleDOI
TL;DR: Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD.
Abstract: Recently, there has been growing use of deep neural networks in many modern speech-based systems such as speaker recognition, speech enhancement, and emotion recognition. Inspired by this success, we propose to address the task of voice activity detection (VAD) by incorporating auditory and visual modalities into an end-to-end deep neural network. We evaluate our proposed system in challenging acoustic environments including high levels of noise and transients, which are common in real-life scenarios. Our multimodal setting includes a speech signal captured by a microphone and a corresponding video signal capturing the speaker's mouth region. Under such difficult conditions, robust features need to be extracted from both modalities in order for the system to accurately distinguish between speech and noise. For this purpose, we utilize a deep residual network to extract features from the video signal, while for the audio modality we employ a variant of the WaveNet encoder for feature extraction. The features from both modalities are fused using multimodal compact bilinear pooling to form a joint representation of the speech signal. To further encode the temporal information, we feed the fused signal to a long short-term memory network, and the system is then trained in an end-to-end supervised fashion. Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD. Upon the publication of this paper, we will make the implementation of our proposed models publicly available at https://github.com/iariav/End-to-End-VAD and https://israelcohen.com .
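A simplified PyTorch sketch of the overall architecture: per-modality encoders, fusion, and an LSTM over time. Small linear encoders stand in for the ResNet and WaveNet encoders, and plain concatenation stands in for compact bilinear pooling, so this is an assumption-laden skeleton rather than the authors' model (their implementation is linked above).

```python
# Hedged sketch of a multimodal VAD skeleton (simplified encoders and fusion).
import torch
import torch.nn as nn

class MultimodalVAD(nn.Module):
    def __init__(self, audio_dim=40, video_dim=256, emb=64, hidden=128):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, emb), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, emb), nn.ReLU())
        self.lstm = nn.LSTM(2 * emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # per-frame speech/non-speech logit

    def forward(self, audio, video):      # both (batch, time, features)
        fused = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
        out, _ = self.lstm(fused)         # temporal modelling of the fused stream
        return self.head(out).squeeze(-1)

model = MultimodalVAD()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 100, 256))
```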

Proceedings ArticleDOI
01 Dec 2019
TL;DR: This paper develops machine learning (ML) algorithm-based models, as well as a Gated Recurrent Unit (GRU)-based deep neural network model, for classifying users' comments on Facebook pages, and produces the first contribution to the field of hateful speech detection in the Bengali language for social media.
Abstract: Online hateful speech detection and classification in social media for major languages other than English has recently drawn the attention of researchers. In this paper, we develop machine learning (ML) algorithm-based models, as well as a Gated Recurrent Unit (GRU)-based deep neural network model, for classifying users' comments on Facebook pages. We collected and annotated 5,126 Bengali comments and classified them into six classes: Hate Speech, Communal Attack, Inciteful, Religious Hatred, Political Comments, and Religious Comments. The produced corpus is the first contribution to the field of hateful speech detection in the Bengali language for social media. Finally, we employ several machine learning algorithms and compare their performance, attaining 52.20% accuracy with Random Forest; the GRU-based model improves on this by about 18 percentage points, reaching 70.10% accuracy.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: A multi-channel model with three versions of BERT (MC-BERT), the English, Chinese, and multilingual BERTs, is proposed for hate speech detection, along with the use of translations as additional input, obtained by translating training and test sentences into the languages required by the different BERT models.
Abstract: The growth of social networking services (SNS) has altered the way and scale of communication in cyberspace. However, the amount of online hate speech is increasing because of the anonymity and mobility such services provide. As manual hate speech detection by human annotators is both costly and time consuming, there is a need to develop algorithms for automatic recognition. Transferring knowledge by fine-tuning a pre-trained language model has been shown to be effective for improving many downstream tasks in the field of natural language processing. The Bidirectional Encoder Representations from Transformers (BERT) model is a language model that is pre-trained to learn deep bidirectional representations from a large corpus. In this paper, we propose a multi-channel model with three versions of BERT (MC-BERT), the English, Chinese, and multilingual BERTs, for hate speech detection. We also explored the use of translations as additional input by translating training and test sentences into the corresponding languages required by the different BERT models. We used three datasets in non-English languages to compare our model with previous approaches: the 2019 SemEval HatEval Spanish dataset, the 2018 GermEval shared task on the Identification of Offensive Language dataset, and the 2018 EvalIta HaSpeeDe Italian dataset. Finally, we were able to achieve state-of-the-art or comparable performance on these datasets by conducting thorough experiments.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: The FoR Dataset is introduced, which contains more than 198,000 utterances from the latest deep-learning speech synthesizers as well as real speech, and is pertinent for machine learning studies, since it can be used to train even complex deep learning models without overfitting.
Abstract: With the advancements in deep learning and other techniques, synthetic speech is getting closer to a natural-sounding voice. Some state-of-the-art technologies achieve such a high level of naturalness that even humans have difficulty distinguishing real speech from computer-generated speech. Moreover, these technologies allow a person to train a speech synthesizer with a target voice, creating a model that is able to reproduce someone's voice with high fidelity. In this paper, we introduce the FoR Dataset, which contains more than 198,000 utterances from the latest deep-learning speech synthesizers as well as real speech. This dataset can be used as a basis for several studies in speech synthesis and synthetic speech detection. Due to its large number of utterances, it is pertinent for machine learning studies, since it can be used to train even complex deep learning models without overfitting. We present several experiments using this dataset, including a deep learning classifier that reached up to 99.96% accuracy in synthetic speech detection.

Journal ArticleDOI
TL;DR: It is concluded that simultaneous recordings of the perceived sound and the corresponding EEG response may be a practical tool to assess speech intelligibility in the context of hearing aids.
Abstract: Objective: Speech signals have a remarkable ability to entrain brain activity to the rapid fluctuations of speech sounds. For instance, one can readily measure a correlation of the sound amplitude with the evoked responses of the electroencephalogram (EEG), and the strength of this correlation is indicative of whether the listener is attending to the speech. In this study we asked whether this stimulus-response correlation is also predictive of speech intelligibility. Approach: We hypothesized that when a listener fails to understand the speech in adverse hearing conditions, attention wanes and stimulus-response correlation also drops. To test this, we measure a listener's ability to detect words in noisy speech while recording their brain activity using EEG. We alter intelligibility without changing the acoustic stimulus by pairing it with congruent and incongruent visual speech. Main results: For almost all subjects we found that an improvement in speech detection coincided with an increase in correlation between the noisy speech and the EEG measured over a period of 30 min. Significance: We conclude that simultaneous recordings of the perceived sound and the corresponding EEG response may be a practical tool to assess speech intelligibility in the context of hearing aids.
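The core quantity, the stimulus-response correlation, can be sketched in a few lines: correlate the amplitude envelope of the speech with an EEG channel sampled at the same rate. Real analyses typically band-pass both signals and use multichannel decoders; this single-channel version only illustrates the idea.

```python
# Hedged sketch of a speech-envelope / EEG correlation (single channel,
# no filtering; assumes both signals share one sampling rate).
import numpy as np
from scipy.signal import hilbert

def stimulus_response_correlation(speech, eeg):
    """speech and eeg: 1-D arrays already resampled to a common rate."""
    envelope = np.abs(hilbert(speech))           # amplitude envelope
    n = min(len(envelope), len(eeg))
    return np.corrcoef(envelope[:n], eeg[:n])[0, 1]
```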

Proceedings ArticleDOI
01 Feb 2019
TL;DR: No sub-µW VAD has been reported to date, preventing the use of VADs in unobtrusive mm-scale sensor nodes; prior designs' simple decision tree or fixed neural network-based approaches also limited broader use for various acoustic event targets.
Abstract: Acoustic sensing is one of the most widely used sensing modalities to intelligently assess the environment. In particular, ultra-low power (ULP) always-on voice activity detection (VAD) is gaining attention as an enabling technology for IoT platforms. In many practical applications, acoustic events-of-interest occur infrequently. Therefore, the system power consumption is typically dominated by the always-on acoustic wakeup detector, while the remainder of the system is power-gated the vast majority of the time. A previous acoustic wakeup detector [1] consumed just 12nW but could not process voice signals (up to 4kHz bandwidth) or handle non-stationary events, which are essential qualities for a VAD. Prior VAD ICs [2], [3] demonstrated reliable performance but consumed significant power (>20 µW) and lacked an analog frontend (AFE), which further increases power. Recent analog-domain feature extraction-based VADs [4], [5] also reported µW-level power consumption, and their simple decision tree [4] or fixed neural network-based approach [5] limited broader use for various acoustic event targets. In summary, no sub-µW VAD has been reported to date, preventing the use of VADs in unobtrusive mm-scale sensor nodes.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: This work explores several novel countermeasures based on long range acoustic features that are found to be effective for spoofing attack detection, and obtains a tandem detection cost function for logical and physical access on the best combined system submitted to the ASVspoof 2019 challenge.
Abstract: Speaker verification systems in practice are vulnerable to spoofing attacks. High-quality recording and playback devices make replay attacks a real threat to speaker verification. Additionally, advances in voice conversion and speech synthesis have produced perceptually natural-sounding speech. The ASVspoof 2019 challenge is organized to study the robustness of countermeasures against such attacks, covering two common modes of attack: logical and physical access. The former deals with synthetic attacks arising from voice conversion and text-to-speech techniques, whereas the latter deals with replay attacks. In this work, we explore several novel countermeasures based on long range acoustic features that are found to be effective for spoofing attack detection. The long range features capture different aspects of long range information, as they are computed from subbands and the octave power spectrum, in contrast to the conventional computation from the linear power spectrum. These novel features are combined with other known features for improved detection of spoofing attacks. We obtain tandem detection cost functions of 0.1264 and 0.1381 (equal error rates of 4.13% and 5.95%) for logical and physical access with the best combined system submitted to the challenge.

Journal ArticleDOI
TL;DR: The proposed CMC feature outperforms the conventional constant-Q cepstral coefficient-based long-term feature obtained from the linear power spectrum after uniform resampling, depicting the usefulness of MLT for extracting salient artifacts from the octave power spectrum.
Abstract: This article focuses on extracting information from the octave power spectra of the long-term constant-Q transform (CQT) for spoofing attack detection. A novel framework based on a multi-level transform (MLT) is proposed that can capture the relevant information from octave power spectra level by level, in a multi-level manner. We then derive a novel feature, referred to as the constant-Q multi-level coefficient (CMC), based on the proposed MLT. The proposed feature is evaluated in synthetic and replay speech detection studies on the ASVspoof 2015 and ASVspoof 2017 version 2.0 databases, respectively. We find that the proposed CMC feature outperforms the conventional constant-Q cepstral coefficient-based long-term feature obtained from the linear power spectrum after uniform resampling. This depicts the usefulness of MLT for extracting salient artifacts from the octave power spectrum. Further, the proposed CMC feature performs better than other well-known state-of-the-art systems for spoofing attack detection, which showcases its importance.
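A minimal sketch of the long-term CQT front end that the CMC feature builds on, using librosa to compute a constant-Q power spectrum, average it over time, and group the bins by octave; the multi-level transform and cepstral steps themselves are not reproduced, and the audio file and bin counts below are illustrative.

```python
# Hedged sketch: long-term CQT power spectrum grouped by octave
# (assumed 7 octaves x 12 bins; not the paper's exact configuration).
import numpy as np
import librosa

y, sr = librosa.load(librosa.example("trumpet"))   # placeholder audio
bins_per_octave, n_octaves = 12, 7
C = np.abs(librosa.cqt(y, sr=sr,
                       n_bins=bins_per_octave * n_octaves,
                       bins_per_octave=bins_per_octave)) ** 2
long_term = C.mean(axis=1)                         # average over time frames

# One row of CQT power bins per octave: the input of a level-wise analysis.
octave_spectra = long_term.reshape(n_octaves, bins_per_octave)
```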

Proceedings ArticleDOI
19 Apr 2019
TL;DR: This research builds a new Indonesian hate speech dataset from Facebook and shows that the best performance is obtained by a Support Vector Machine (SVM) classifier using TF-IDF, character quad-gram, word unigram, and lexicon features, yielding an F1-score of 85%.
Abstract: Due to the growth of hate speech on social media in recent years, it is important to understand this issue. An automatic hate speech detection system is needed to help counter this problem. There have been many studies on detecting hate speech in short documents like Twitter data, but to our knowledge research on long documents is rare; we suppose the difficulty increases because the message of the text may be hidden. In this research, we explore detecting hate speech in Indonesian long documents using a machine learning approach. We build a new Indonesian hate speech dataset from Facebook. The experiments showed that the best performance is obtained by a Support Vector Machine (SVM) classifier using TF-IDF, character quad-gram, word unigram, and lexicon features, yielding an F1-score of 85%.

Journal ArticleDOI
TL;DR: This article presents a voice and acoustic activity detector that uses a mixer-based architecture and ultra-low-power neural network (NN)-based classifier that features inaudible acoustic signature detection for intentional remote silent wakeup of the system while re-using a subset of the same system components.
Abstract: This article presents a voice and acoustic activity detector that uses a mixer-based architecture and an ultra-low-power neural network (NN)-based classifier. By sequentially scanning 4 kHz of frequency bands and down-converting to below 500 Hz, feature extraction power consumption is reduced by 4×. The NN processor employs computational sprinting, enabling a 12× power reduction. The system also features inaudible acoustic signature detection for intentional remote silent wakeup of the system while re-using a subset of the same system components. The measurement results achieve 91.5%/90% speech/non-speech hit rates at 10-dB SNR with babble noise and 142-nW power consumption. Acoustic signature detection consumes 66 nW, successfully detecting a signature 10 dB below the noise level.

Posted Content
TL;DR: A neural Long Short-Term Memory-based architecture for overlap detection is detailed, which achieves state-of-the-art performance on the AMI, DIHARD, and ETAPE corpora and reveals promising directions for handling overlap.
Abstract: We address the problem of effectively handling overlapping speech in a diarization system. First, we detail a neural Long Short-Term Memory-based architecture for overlap detection. Secondly, detected overlap regions are exploited in conjunction with a frame-level speaker posterior matrix to make two-speaker assignments for overlapped frames in the resegmentation step. The overlap detection module achieves state-of-the-art performance on the AMI, DIHARD, and ETAPE corpora. We apply overlap-aware resegmentation on AMI, resulting in a 20% relative DER reduction over the baseline system. While this approach is by no means an end-all solution to overlap-aware diarization, it reveals promising directions for handling overlap.
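The two-speaker assignment step can be sketched directly from the description above: given a frame-level speaker posterior matrix and a binary overlap decision per frame, assign the top two speakers in overlapped frames and the single best speaker elsewhere. The LSTM overlap detector itself is not shown.

```python
# Hedged sketch of overlap-aware speaker assignment from a posterior matrix.
import numpy as np

def assign_speakers(posteriors, overlap):
    """posteriors: (n_frames, n_speakers); overlap: bool array (n_frames,)."""
    top2 = np.argsort(posteriors, axis=1)[:, -2:]        # two most likely speakers
    return [tuple(t) if ov else (t[-1],)                 # two labels if overlapped
            for t, ov in zip(top2, overlap)]

post = np.array([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]])
print(assign_speakers(post, np.array([False, True])))    # [(0,), (0, 1)]
```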

Proceedings ArticleDOI
01 Aug 2019
TL;DR: The process of developing a dataset that can be used to build a hate speech detection model is presented, along with basic preprocessing and a preliminary study using machine learning.
Abstract: During the 2019 election period in Indonesia, many hate speech and cyberbullying cases occurred on social media platforms, including Twitter. The government tried to filter every negative content spread during this period. However, detecting hate speech is not an easy task. This paper presents the process of developing a dataset that can be used to build a hate speech detection model. More than 1 million tweets were successfully collected using the Twitter API. Basic preprocessing and a preliminary study using machine learning were implemented. The Latent Dirichlet Allocation (LDA) algorithm was used to extract the topic of each tweet to see whether these topics can be associated with debate themes. Pretrained sentiment analysis was also applied to the dataset to generate a polarity score for each tweet. Of the 83,752 tweets included in the analysis step, the numbers of positive and negative tweets are almost the same.
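A minimal sketch of the topic-extraction step, assuming scikit-learn's LDA implementation; the paper's preprocessing and choice of topic count for the Indonesian tweets may differ, and the corpus below is a placeholder.

```python
# Hedged sketch: per-tweet topic extraction with LDA (placeholder corpus,
# assumed topic count).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["contoh tweet satu", "contoh tweet dua"]   # placeholder corpus
vectorizer = CountVectorizer(max_features=10000)
X = vectorizer.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(X)            # per-tweet topic distribution
dominant_topic = doc_topics.argmax(axis=1)   # topic to associate with each tweet
```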

Journal ArticleDOI
TL;DR: This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments, selected from the Google AudioSet dataset.
Abstract: Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work studies the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The first is the training of two different neural networks, one for speech detection and another for music detection. The second approach consists of training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional, and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech.
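A minimal Keras sketch of the best-performing hybrid: convolutional layers over a mel-spectrogram followed by an LSTM stage, with two sigmoid outputs for joint speech and music detection. The layer sizes and input resolution are illustrative, not those of the paper.

```python
# Hedged sketch of a convolutional-LSTM hybrid for joint speech/music
# detection on mel-spectrograms (assumed dimensions and layer sizes).
from tensorflow.keras import layers, models

n_frames, n_mels = 500, 64               # a 10-s segment as a mel-spectrogram
inputs = layers.Input(shape=(n_frames, n_mels, 1))
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Reshape((n_frames // 4, (n_mels // 4) * 64))(x)  # time steps for the LSTM
x = layers.LSTM(64)(x)
outputs = layers.Dense(2, activation="sigmoid")(x)          # [speech, music]

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```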

Posted Content
TL;DR: This paper develops automated text analytics methods capable of jointly learning a single representation of hate from several smaller, unrelated data sets, enabling an interpretable two-dimensional text visualization, called the Map of Hate, that separates different types of hate speech and explains what makes text harmful.
Abstract: In today's society more and more people are connected to the Internet, and its information and communication technologies have become an essential part of our everyday life. Unfortunately, the flip side of this increased connectivity to social media and other online content is cyber-bullying and cyber-hatred, among other harmful and anti-social behaviors. Models based on machine learning and natural language processing provide a way to detect this hate speech in web text in order to make discussion forums and other media and platforms safer. The main difficulty, however, is annotating a sufficiently large number of examples to train these models. In this paper, we report on developing automated text analytics methods capable of jointly learning a single representation of hate from several smaller, unrelated data sets. We train and test our methods on a total of 37,520 English tweets that have been annotated to differentiate harmless messages from racist or sexist contexts in the first detection task, and from hateful or offensive contents in the second detection task. Our most sophisticated method combines a deep neural network architecture with transfer learning. It is capable of creating word and sentence embeddings that are specific to these tasks while also embedding the meaning of generic hate speech. Its prediction correctness is a macro-averaged F1 of 78% and 72% in the first and second task, respectively. This method enables generating an interpretable two-dimensional text visualization, called the Map of Hate, that is capable of separating different types of hate speech and explaining what makes text harmful. These methods and insights hold potential not only for safer social media, but also for reducing the need to expose human moderators and annotators to distressing online messaging.
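A two-dimensional text map in the spirit of the Map of Hate can be sketched by projecting sentence embeddings to 2-D with t-SNE and colouring by class; the paper's embeddings come from its transfer-learned network, while random vectors stand in here.

```python
# Hedged sketch: 2-D visualization of (placeholder) sentence embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(200, 128)        # placeholder sentence embeddings
classes = np.random.randint(0, 3, size=200)   # e.g. harmless / racist / sexist

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=classes, s=8)
plt.title("2-D map of tweet embeddings")
plt.show()
```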

Posted Content
TL;DR: This article presented a new multilingual multi-aspect hate speech analysis dataset and used it to test the current state-of-the-art multilingual multitask learning approaches, and discussed how to leverage their annotations in order to improve hate speech detection and classification.
Abstract: Current research on hate speech analysis is typically oriented towards monolingual and single classification tasks. In this paper, we present a new multilingual multi-aspect hate speech analysis dataset and use it to test the current state-of-the-art multilingual multitask learning approaches. We evaluate our dataset in various classification settings, then we discuss how to leverage our annotations in order to improve hate speech detection and classification in general.