Contextual Joint Factor Acoustic Embeddings

doi:10.1109/SLT48900.2021.9383592

Open Access Proceedings Article

Yanpei Shi, +1 more

pp. 750-757

TLDR

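Two unsupervised approaches to generating acoustic embeddings by modelling acoustic context are proposed. One is a contextual joint factor synthesis encoder, where the encoder in an encoder/decoder framework is trained to extract joint factors from surrounding audio frames to best generate the target output. The other is a contextual joint factor analysis encoder, where the encoder is trained to analyse joint factors from the source signal that correlate best with the neighbouring audio.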

Abstract:

Embedding acoustic information into fixed-length representations is of interest for a wide range of applications in speech and audio technology. Two novel unsupervised approaches to generating acoustic embeddings by modelling acoustic context are proposed. The first approach is a contextual joint factor synthesis encoder, where the encoder in an encoder/decoder framework is trained to extract joint factors from surrounding audio frames to best generate the target output. The second approach is a contextual joint factor analysis encoder, where the encoder is trained to analyse joint factors from the source signal that correlate best with the neighbouring audio. To evaluate the effectiveness of our approaches compared to prior work, two tasks, phone classification and speaker recognition, are conducted and tested on different TIMIT data sets. Experimental results show that one of the proposed approaches outperforms phone classification baselines, yielding a classification accuracy of 74.1%. When additional out-of-domain data is used for training, a further improvement of 3% can be obtained for both the phone classification and speaker recognition tasks.
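The two training set-ups described in the abstract differ mainly in which part of the signal is the encoder input and which is the prediction target. The following is a minimal sketch of that pairing only, not the authors' implementation: the function names (`context_triples`, `synthesis_pair`, `analysis_pair`), the window lengths, and the use of plain lists for feature frames are all assumptions made for illustration, and the encoder/decoder network itself is not shown.

```python
from typing import List, Tuple

Frame = List[float]  # one acoustic feature vector (hypothetical representation)
Triple = Tuple[List[Frame], List[Frame], List[Frame]]


def context_triples(frames: List[Frame], centre_len: int = 3,
                    context_len: int = 2) -> List[Triple]:
    """Slide a window over the frames and split it into
    (left context, centre segment, right context).
    Window lengths are illustrative assumptions, not values from the paper."""
    span = context_len + centre_len + context_len
    triples = []
    for start in range(len(frames) - span + 1):
        mid = start + context_len
        left = frames[start:mid]
        centre = frames[mid:mid + centre_len]
        right = frames[mid + centre_len:start + span]
        triples.append((left, centre, right))
    return triples


def synthesis_pair(triple: Triple) -> Tuple[List[Frame], List[Frame]]:
    """Synthesis-style pairing: the encoder sees the surrounding frames
    and the decoder is asked to generate the centre segment."""
    left, centre, right = triple
    return left + right, centre


def analysis_pair(triple: Triple) -> Tuple[List[Frame], List[Frame]]:
    """Analysis-style pairing: the encoder sees the centre segment
    and the decoder is asked to predict the neighbouring frames."""
    left, centre, right = triple
    return centre, left + right


if __name__ == "__main__":
    # Ten one-dimensional dummy frames stand in for real acoustic features.
    dummy = [[float(i)] for i in range(10)]
    first = context_triples(dummy)[0]
    print(synthesis_pair(first))
    print(analysis_pair(first))
```

In either pairing, the embedding would be taken from the encoder output once a model has been trained on such pairs; how the paper defines the windows and the network is not stated in the abstract.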
