Author

Dillon Knox

Bio: Dillon Knox is an academic researcher from the University of Southern California. The author has contributed to research in topics including Emotional expression and Film genre. The author has an h-index of 1 and has co-authored 5 publications receiving 1 citation.

Papers
Journal ArticleDOI
08 Apr 2021-PLOS ONE
TL;DR: In this article, supervised neural network models with various pooling mechanisms were used to predict a film's genre from its soundtrack, comparing handcrafted music information retrieval (MIR) features against VGGish audio embedding features.
Abstract: Film music varies tremendously across genre in order to bring about different responses in an audience. For instance, composers may evoke passion in a romantic scene with lush string passages or inspire fear throughout horror films with inharmonious drones. This study investigates such phenomena through a quantitative evaluation of music that is associated with different film genres. We construct supervised neural network models with various pooling mechanisms to predict a film's genre from its soundtrack. We use these models to compare handcrafted music information retrieval (MIR) features against VGGish audio embedding features, finding similar performance with the top-performing architectures. We examine the best-performing MIR feature model through permutation feature importance (PFI), determining that mel-frequency cepstral coefficient (MFCC) and tonal features are most indicative of musical differences between genres. We investigate the interaction between musical and visual features with a cross-modal analysis, and do not find compelling evidence that music characteristic of a certain genre implies low-level visual features associated with that genre. Furthermore, we provide software code to replicate this study at https://github.com/usc-sail/mica-music-in-media. This work adds to our understanding of music's use in multi-modal contexts and offers the potential for future inquiry into human affective experiences.
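A minimal sketch of the permutation feature importance (PFI) step the abstract describes, applied to a toy genre classifier over handcrafted MIR features. The feature names, data, and classifier below are illustrative stand-ins, not the authors' pipeline (their code is at the linked repository):

```python
# Permutation feature importance (PFI) sketch for a toy genre classifier
# over handcrafted MIR features. Features, data, and model are placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
feature_names = ["mfcc_mean", "mfcc_var", "tonal_centroid", "spectral_flux"]
X = rng.normal(size=(500, len(feature_names)))   # placeholder MIR feature matrix
y = rng.integers(0, 4, size=500)                 # placeholder genre labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_tr, y_tr)
base = f1_score(y_te, clf.predict(X_te), average="macro")

# PFI: shuffle one feature column at a time and measure the score drop.
for j, name in enumerate(feature_names):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drop = base - f1_score(y_te, clf.predict(X_perm), average="macro")
    print(f"{name}: importance ~ {drop:.3f}")
```

A larger score drop after shuffling a feature indicates the classifier relies on it more heavily, which is how MFCC and tonal features were identified as most indicative in the study.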

8 citations

Proceedings ArticleDOI
28 Jun 2021
TL;DR: In this paper, an ensemble-based convolutional neural network (CNN) model trained using various loss functions for tagging musical genres from audio is presented, and the effect of different loss functions and resampling strategies on prediction performance is investigated.
Abstract: Given the ever-increasing volume of music created and released every day, it has never been more important to study automatic music tagging. In this paper, we present an ensemble-based convolutional neural network (CNN) model trained using various loss functions for tagging musical genres from audio. We investigate the effect of different loss functions and resampling strategies on prediction performance, finding that using focal loss improves overall performance on the MTG-Jamendo dataset: an imbalanced, multi-label dataset with over 18,000 songs in the public domain, containing 57 labels. Additionally, we report results from varying the receptive field on our base classifier, a CNN-based architecture trained using Mel spectrograms, which also results in a model performance boost and state-of-the-art performance on the Jamendo dataset. We conclude that the choice of the loss function is paramount for improving on existing methods in music tagging, particularly in the presence of class imbalance.
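A minimal PyTorch sketch of a multi-label focal loss of the kind the abstract credits with the performance gain; the gamma and alpha values, tensor shapes, and label count are assumptions for illustration, not the paper's exact configuration:

```python
# Multi-label (binary) focal loss sketch in PyTorch. Hyperparameters and
# shapes are illustrative, not the paper's settings.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits, targets: (batch, n_tags); targets in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # prob. of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class weighting
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # down-weights easy examples

# Example: 8 songs, 57 tags (the MTG-Jamendo label count).
logits = torch.randn(8, 57)
targets = torch.randint(0, 2, (8, 57)).float()
print(focal_loss(logits, targets))
```

The (1 - p_t) ** gamma factor shrinks the contribution of well-classified examples, which is why focal loss tends to help on imbalanced, multi-label tag sets.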

3 citations

Posted Content
TL;DR: In this article, a neural-network-based denoiser is used as a pre-processor in the ASR pipeline to counter adversarial attacks, which attempt to force misclassification by adding small perturbations to the original speech signal.
Abstract: In this paper we investigate speech denoising as a defense against adversarial attacks on automatic speech recognition (ASR) systems. Adversarial attacks attempt to force misclassification by adding small perturbations to the original speech signal. We propose to counteract this by employing a neural-network based denoiser as a pre-processor in the ASR pipeline. The denoiser is independent of the downstream ASR model, and thus can be rapidly deployed in existing systems. We found that training the denoiser using a perceptually motivated loss function resulted in increased adversarial robustness without compromising ASR performance on benign samples. Our defense was evaluated (as a part of the DARPA GARD program) on the 'Kenansville' attack strategy across a range of attack strengths and speech samples. An average improvement in Word Error Rate (WER) of about 7.7% was observed over the undefended model at 20 dB signal-to-noise ratio (SNR) attack strength.
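A rough sketch of the defense's general structure, assuming placeholder components: a denoiser cleans the waveform before it reaches an unchanged downstream recognizer. The `Denoiser` and `asr_model` below are hypothetical and are not the system evaluated under the DARPA GARD program:

```python
# Sketch of a denoiser used as a model-agnostic pre-processor in front of
# an ASR system. Both components here are toy placeholders.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy 1-D convolutional denoiser operating on raw waveforms."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=9, padding=4),
        )

    def forward(self, wav):                    # wav: (batch, samples)
        return self.net(wav.unsqueeze(1)).squeeze(1)

def defended_transcribe(wav, denoiser, asr_model):
    """Denoise first, then pass the cleaned audio to the unchanged ASR model."""
    with torch.no_grad():
        cleaned = denoiser(wav)
    return asr_model(cleaned)                  # downstream ASR is untouched

# Usage (asr_model is any callable taking a waveform batch):
# text = defended_transcribe(adversarial_wav, Denoiser(), my_asr_model)
```

Because the denoiser sits in front of the recognizer and does not require retraining it, such a defense can be dropped into existing pipelines, which matches the deployment argument made in the abstract.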

Cited by
Proceedings ArticleDOI
07 Oct 2022
TL;DR: This work shows that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies, and that restricting the domain of the pre-training dataset to music allows training with smaller batch sizes.
Abstract: In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affects the adequacy of the resulting audio embeddings for downstream tasks. We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done in an efficient manner with models containing less than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs. Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of representations learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art in unsupervised learning -- and in some cases, supervised learning -- for music understanding. We also corroborate that, while achieving state-of-the-art performance on many tasks, supervised learning can cause models to specialize to the supervised information provided, somewhat compromising a model's generality.
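A minimal sketch of the kind of downstream evaluation the abstract describes, in which a frozen pre-trained encoder produces embeddings and only a lightweight probe is trained (no fine-tuning or reparameterization); `embed_clips` is a hypothetical placeholder for whichever pre-trained model is being compared:

```python
# Frozen-embedding evaluation sketch: embeddings from a (placeholder)
# pre-trained encoder feed a lightweight probe classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def embed_clips(clips, dim=512):
    """Placeholder: a frozen pre-trained encoder would map each clip to a vector."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(clips), dim))

def probe(train_clips, train_labels, test_clips, test_labels):
    X_tr, X_te = embed_clips(train_clips), embed_clips(test_clips)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)   # encoder stays frozen
    return accuracy_score(test_labels, clf.predict(X_te))

# Toy usage with placeholder clips and random labels.
train_clips, test_clips = [None] * 80, [None] * 20
train_labels = np.random.default_rng(1).integers(0, 4, 80)
test_labels = np.random.default_rng(2).integers(0, 4, 20)
print(probe(train_clips, train_labels, test_clips, test_labels))
```

Keeping the encoder frozen and training only the probe is what makes the approach cheap enough for industry-scale catalogs, as the abstract argues.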

7 citations

Journal ArticleDOI
TL;DR: The proposed methods and models can distinguish music signals and generate music of different genres, achieving higher discrimination accuracy than the traditional restricted Boltzmann machine method.
Abstract: This research explores the application of intelligent music recognition technology in music teaching. Based on Long Short-Term Memory (LSTM) networks, an algorithm model is designed and implemented that can distinguish various music signals and generate music in various genres. First, by analyzing applications of machine learning and deep learning in the field of music, the algorithm model is designed to realize intelligent music generation, providing a theoretical basis for related research. Then, the music style discrimination and generation model is tested on a large body of music data. The experimental results show that the variation in training results is smallest when the model has 4 hidden layers with 1,024, 512, 256, and 128 neurons respectively. Using the designed algorithm model, the classification accuracy for jazz, classical, rock, country, and disco exceeds 60%; jazz is classified best, at 77.5%. Moreover, compared with the traditional algorithm, the frequency distribution of the music generated by the designed algorithm is almost consistent with the spectrum of the original music. Therefore, the proposed methods and models can distinguish music signals and generate different music, with higher discrimination accuracy than the traditional restricted Boltzmann machine method.
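An illustrative stacked-LSTM classifier mirroring the reported layer widths (1,024, 512, 256, and 128 neurons) and the five genres mentioned; this is a sketch under those assumptions, not the paper's implementation:

```python
# Stacked-LSTM genre classifier sketch with the layer widths reported in
# the abstract. Feature size and data are illustrative placeholders.
import torch
import torch.nn as nn

class StackedLSTMClassifier(nn.Module):
    def __init__(self, n_features=128, n_genres=5, widths=(1024, 512, 256, 128)):
        super().__init__()
        layers, in_dim = [], n_features
        for w in widths:
            layers.append(nn.LSTM(in_dim, w, batch_first=True))
            in_dim = w
        self.lstms = nn.ModuleList(layers)
        self.head = nn.Linear(in_dim, n_genres)   # jazz, classical, rock, country, disco

    def forward(self, x):                          # x: (batch, time, n_features)
        for lstm in self.lstms:
            x, _ = lstm(x)
        return self.head(x[:, -1])                 # classify from the last time step

logits = StackedLSTMClassifier()(torch.randn(2, 100, 128))
print(logits.shape)                                # torch.Size([2, 5])
```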

5 citations

Journal ArticleDOI
TL;DR: In this article, an audience evaluation model based on a multilayer perceptron genetic neural network algorithm is proposed for the data-processing stage of evaluating the effect of symphony performances.
Abstract: Traditional symphony performances require a large amount of data for effect evaluation to ensure the authenticity and stability of the results, but processing audience evaluation data involves problems such as high computational dimensionality and low data relevance. To address this, this article studies an audience evaluation model based on a multilayer perceptron genetic neural network algorithm for the data-processing stage of evaluating symphony performance effects. Multilayer perceptrons are combined to collect the audience's evaluation data, and the genetic neural network algorithm is used for comprehensive analysis, realizing multivariate analysis and objective evaluation of all vocal data from the performance process and its effects according to the different characteristics and expressions in the audience evaluations, so that changes can be analyzed and evaluated accurately. The experimental results show that the proposed evaluation model supports real-time quantitative evaluation and is at least 23.1% more accurate than the mainstream evaluation method based on data post-processing with optimized iterative algorithms at its core; its scope of application is also wider, giving it practical significance for real-time quantitative evaluation of the effect of symphony performances.
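A rough sketch of the general idea of combining a multilayer perceptron with a genetic search, here used to evolve MLP hyperparameters on placeholder data; the fitness function, GA settings, and data are all illustrative assumptions, since the abstract does not specify them:

```python
# Toy genetic search over MLP hidden-layer sizes. Data, fitness, and GA
# settings are placeholders for illustration only.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                      # placeholder evaluation features
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

def fitness(hidden):                                 # higher cross-validated R^2 is better
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=500, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

population = [tuple(int(v) for v in rng.integers(8, 64, size=2)) for _ in range(6)]
for _ in range(3):                                   # a few generations
    parents = sorted(population, key=fitness, reverse=True)[:2]
    children = [(p[0], q[1]) for p in parents for q in parents]                 # crossover
    mutants = [tuple(max(4, g + int(rng.integers(-8, 9))) for g in c) for c in children]
    population = parents + mutants                   # elitism + mutated offspring
print("best hidden sizes:", max(population, key=fitness))
```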

3 citations

Journal ArticleDOI
20 Jan 2022-PLOS ONE
TL;DR: The Ethio Kiñits Model (EKM), based on VGG, is presented for Kiñit classification, and EKM was found to have the best accuracy (95.00%) as well as the fastest training time.
Abstract: In this paper, we create EMIR, the first-ever Music Information Retrieval dataset for Ethiopian music. EMIR is freely available for research purposes and contains 600 sample recordings of Orthodox Tewahedo chants, traditional Azmari songs and contemporary Ethiopian secular music. Each sample is classified by five expert judges into one of four well-known Ethiopian Kiñits, Tizita, Bati, Ambassel and Anchihoye. Each Kiñit uses its own pentatonic scale and also has its own stylistic characteristics. Thus, Kiñit classification needs to combine scale identification with genre recognition. After describing the dataset, we present the Ethio Kiñits Model (EKM), based on VGG, for classifying the EMIR clips. In Experiment 1, we investigated whether Filterbank, Mel-spectrogram, Chroma, or Mel-frequency Cepstral coefficient (MFCC) features work best for Kiñit classification using EKM. MFCC was found to be superior and was therefore adopted for Experiment 2, where the performance of EKM models using MFCC was compared using three different audio sample lengths. 3s length gave the best results. In Experiment 3, EKM and four existing models were compared on the EMIR dataset: AlexNet, ResNet50, VGG16 and LSTM. EKM was found to have the best accuracy (95.00%) as well as the fastest training time. However, the performance of VGG16 (93.00%) was found not to be significantly worse (P < 0.01). We hope this work will encourage others to explore Ethiopian music and to experiment with other models for Kiñit classification.
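An illustrative MFCC front end (the feature the paper found best) feeding a small VGG-style classifier over the four Kiñits; the librosa parameters and network below are assumptions, not the EKM architecture:

```python
# MFCC extraction plus a small VGG-style CNN classifier for the four Kiñits.
# Parameters and architecture are illustrative, not EKM.
import librosa
import torch
import torch.nn as nn

def mfcc_features(path, sr=22050, duration=3.0, n_mfcc=20):
    """Load a 3 s clip (the best length in Experiment 2) and compute MFCCs."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)

class SmallVGG(nn.Module):
    def __init__(self, n_classes=4):                          # Tizita, Bati, Ambassel, Anchihoye
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                                     # x: (batch, 1, n_mfcc, frames)
        return self.head(self.features(x).flatten(1))

logits = SmallVGG()(torch.randn(2, 1, 20, 130))
print(logits.shape)                                           # torch.Size([2, 4])
```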

2 citations

Journal ArticleDOI
TL;DR: A music-teaching visualization system is designed for visualizing the emotions of musical works, which helps students understand the works and improves the teaching effect.
Abstract: The study aims to overcome the shortcomings of traditional music teaching systems, which cannot analyze the emotions of musical works and offer little support for aesthetic teaching. First, the relevant theories of emotional teaching are expounded and the important roles of emotional and aesthetic teaching in shaping students' personalities are described. Second, a music emotion classification model based on a deep neural network (DNN) is proposed; through model training it can accurately classify music emotions. Finally, according to emotional teaching theory and the DNN-based model, a music-teaching visualization system is designed for visualizing emotions, which helps students understand musical works and improves the teaching effect. The results show that: (1) the designed teaching system has five parts, namely the audio input layer, emotion classification layer, virtual role perception layer, emotion expression layer, and output layer. The system classifies the emotions of the current input audio and maps them to virtual characters for emotional expression; the emotions are then shown to students through the display layer, visualizing the emotions of musical works so that students can intuitively feel the emotional elements in them. (2) The accuracy of the DNN-based music emotion classification model is more than 3.4% higher than that of other models, giving better performance. The study provides important technical support for upgrading teaching systems and improving the quality of aesthetic music teaching.
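A toy sketch of the pipeline the abstract outlines, in which a DNN classifies the emotion of input audio features and the predicted label drives a simple expression mapping for display; the label set, feature size, and network are illustrative assumptions:

```python
# Toy emotion-classification-to-visualization pipeline. Labels, feature
# dimension, and network are placeholders, not the paper's system.
import torch
import torch.nn as nn

EMOTIONS = ["happy", "sad", "calm", "tense"]                  # assumed label set
EXPRESSIONS = {"happy": "smile", "sad": "frown", "calm": "neutral", "tense": "alert"}

classifier = nn.Sequential(                                   # placeholder DNN
    nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, len(EMOTIONS))
)

def visualize(audio_features):
    """Emotion classification layer -> virtual-character expression layer."""
    with torch.no_grad():
        label = EMOTIONS[classifier(audio_features).argmax().item()]
    return f"display: {EXPRESSIONS[label]}"

print(visualize(torch.randn(128)))
```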

2 citations