Home
/
Authors
/
Rosanna Milner

Author

Rosanna Milner

Bio: Rosanna Milner is an academic researcher from University of Sheffield. The author has contributed to research in topics: Speaker diarisation & Speaker recognition. The author has an hindex of 6, co-authored 12 publications receiving 102 citations.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

[...]

Oscar Saz¹, Mortaza Doulaty¹, Salil Deena¹, Rosanna Milner¹, Raymond W. M. Ng¹, Madina Hasan¹, Yulan Liu¹, Thomas Hain¹ - Show less +4 more•Institutions (1)

University of Sheffield¹

21 Dec 2015-arXiv: Computation and Language

TL;DR: In this article, the authors describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows.

...read moreread less

Abstract: We describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows. Transcription was one of four tasks proposed in the MGB challenge, with the aim of advancing the state of the art of automatic speech recognition, speaker diarisation and automatic alignment of subtitles for broadcast media. Four topics are investigated in this work: Data selection techniques for training with unreliable data, automatic speech segmentation of broadcast media shows, acoustic modelling and adaptation in highly variable environments, and language modelling of multi-genre shows. The final system operates in multiple passes, using an initial unadapted decoding stage to refine segmentation, followed by three adapted passes: a hybrid DNN pass with input features normalised by speaker-based cepstral normalisation, another hybrid stage with input features normalised by speaker feature-MLLR transformations, and finally a bottleneck-based tandem stage with noise and speaker factorisation. The combination of these three system outputs provides a final error rate of 27.5% on the official development set, consisting of 47 multi-genre shows.

...read moreread less

25 citations

Proceedings Article•DOI•

Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition.

[...]

Md. Asif Jalal, Rosanna Milner¹, Thomas Hain¹•Institutions (1)

University of Sheffield¹

25 Oct 2020

TL;DR: A convolution- based model and a long-shortterm memory-based model are applied to investigate theories of speech emotion on computational models and it is seen that the vowel sounds are more important for defining emotion acoustic cues than the consonants, and the model can assign word importance based on acoustic context.

...read moreread less

Abstract: Speech emotion recognition is essential for obtaining emotional intelligence which affects the understanding of context and meaning of speech. Harmonically structured vowel and consonant sounds add indexical and linguistic cues in spoken information. Previous research argued whether vowel sound cues were more important in carrying the emotional context from a psychological and linguistic point of view. Other research also claimed that emotion information could exist in small overlapping acoustic cues. However, these claims are not corroborated in computational speech emotion recognition systems. In this research, a convolution-based model and a long-shortterm memory-based model, both using attention, are applied to investigate these theories of speech emotion on computational models. The role of acoustic context and word importance is demonstrated for the task of speech emotion recognition. The IEMOCAP corpus is evaluated by the proposed models, and 80.1% unweighted accuracy is achieved on pure acoustic data which is higher than current state-of-the-art models on this task. The phones and words are mapped to the attention vectors and it is seen that the vowel sounds are more important for defining emotion acoustic cues than the consonants, and the model can assign word importance based on acoustic context.

...read moreread less

21 citations

Proceedings Article•DOI•

DNN-based speaker clustering for speaker diarisation

[...]

Rosanna Milner¹, Thomas Hain¹•Institutions (1)

University of Sheffield¹

08 Sep 2016

TL;DR: This paper presents a novel semi-supervised method of speaker clustering based on a deep neural network (DNN) model, which achieves a diarisation error rate of 14.8%, compared to a baseline of 19.9%.

...read moreread less

Abstract: Speaker diarisation, the task of answering "who spoke when?", is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These represent the separation of speech and nonspeech, the splitting into speaker homogeneous speech segments, followed by grouping together those which belong to the same speaker. This paper is concerned with speaker clustering, which is typically performed by bottom-up clustering using the Bayesian information criterion (BIC). We present a novel semi-supervised method of speaker clustering based on a deep neural network (DNN) model. A speaker separation DNN trained on independent data is used to iteratively relabel the test data set. This is achieved by reconfiguration of the output layer, combined with fine tuning in each iteration. A stopping criterion involving posteriors as confidence scores is investigated. Results are shown on a meeting task (RT07) for single distant microphones and compared with standard diarisation approaches. The new method achieves a diarisation error rate (DER) of 14.8%, compared to a baseline of 19.9%.

...read moreread less

18 citations

Proceedings Article•DOI•

A Cross-Corpus Study on Speech Emotion Recognition

[...]

Rosanna Milner¹, Asif Jalal¹, Raymond W. M. Ng, Thomas Hain¹•Institutions (1)

University of Sheffield¹

01 Dec 2019

TL;DR: Out-of-domain models, followed by adapting to the missing dataset, and domain adversarial training (DAT) are shown to be more suitable to generalising to emotions across datasets.

...read moreread less

Abstract: For speech emotion datasets, it has been difficult to acquire large quantities of reliable data and acted emotions may be over the top compared to less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether information learnt from acted emotions is useful for detecting natural emotions. Cross-corpus research has mostly considered cross-lingual and even cross-age datasets, and difficulties arise from different methods of annotating emotions causing a drop in performance. To be consistent, four adult English datasets covering acted, elicited and natural emotions are considered. A state-of-the-art model is proposed to accurately investigate the degradation of performance. The system involves a bi-directional LSTM with an attention mechanism to classify emotions across datasets. Experiments study the effects of training models in a cross-corpus and multi-domain fashion and results show the transfer of information is not successful. Out-of-domain models, followed by adapting to the missing dataset, and domain adversarial training (DAT) are shown to be more suitable to generalising to emotions across datasets. This shows positive information transfer from acted datasets to those with more natural emotions and the benefits from training on different corpora.

...read moreread less

18 citations

Proceedings Article•DOI•

The 2015 sheffield system for longitudinal diarisation of broadcast media

[...]

Rosanna Milner¹, Oscar Saz¹, Salil Deena¹, Mortaza Doulaty¹, Raymond W. M. Ng¹, Thomas Hain¹ - Show less +2 more•Institutions (1)

University of Sheffield¹

01 Dec 2015

TL;DR: The University of Sheffield system consists of three main stages: speech activity detection using DNNs with novel adaptation and decoding methods; speaker segmentation and clustering, with adaptation of the DNN-based clustering models; and finally speaker linking to match speakers across shows.

...read moreread less

Abstract: Speaker diarisation is the task of answering "who spoke when" within a multi-speaker audio recording. Diarisation of broadcast media typically operates on individual television shows, and is a particularly difficult task, due to a high number of speakers and challenging background conditions. Using prior knowledge, such as that from previous shows in a series, can improve performance. Longitudinal diarisation allows to use knowledge from previous audio files to improve performance, but requires finding matching speakers across consecutive files. This paper describes the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge. The challenge required longitudinal diarisation of data from BBC archives, under very constrained resource settings. Our system consists of three main stages: speech activity detection using DNNs with novel adaptation and decoding methods; speaker segmentation and clustering, with adaptation of the DNN-based clustering models; and finally speaker linking to match speakers across shows. The final result on the development set of 19 shows from five different television series provided a Diarisation Error Rate of 50.77% in the diarisation and linking task.

...read moreread less

16 citations

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines.

[...]

Neville Ryant¹, Kenneth Church², Christopher Cieri¹, Alejandrina Cristia³, Jun Du⁴, Sriram Ganapathy⁵, Mark Liberman¹ - Show less +3 more•Institutions (5)

University of Pennsylvania¹, Baidu², Centre national de la recherche scientifique³, University of Science and Technology of China⁴, Indian Institute of Science⁵

15 Sep 2019

TL;DR: The second edition of the DIHARD challenge as discussed by the authors was designed to improve the robustness of speaker diarization systems to variation in recording equipment, noise conditions, and conversational domain.

...read moreread less

Abstract: This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.

...read moreread less

135 citations

Proceedings Article•DOI•

The MGB challenge: Evaluating multi-genre broadcast media recognition

[...]

Peter Bell¹, Mark J. F. Gales², Thomas Hain³, Jonathan Kilgour¹, Pierre Lanchantin², Xunying Liu², A McParland, Steve Renals¹, Oscar Saz³, Mirjam Wester¹, Philip C. Woodland² - Show less +7 more•Institutions (3)

University of Edinburgh¹, University of Cambridge², University of Sheffield³

01 Dec 2015

TL;DR: An evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings at ASRU 2015 is described, and the results obtained are summarized.

...read moreread less

Abstract: This paper describes the Multi-Genre Broadcast (MGB) Challenge at ASRU 2015, an evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings. The challenge training data covered the whole range of seven weeks BBC TV output across four channels, resulting in about 1,600 hours of broadcast audio. In addition several hundred million words of BBC subtitle text was provided for language modelling. A novel aspect of the evaluation was the exploration of speech recognition and speaker diarization in a longitudinal setting — i.e. recognition of several episodes of the same show, and speaker diarization across these episodes, linking speakers. The longitudinal tasks also offered the opportunity for systems to make use of supplied metadata including show title, genre tag, and date/time of transmission. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.

...read moreread less

135 citations

Posted Content•

Speaker Recognition Based on Deep Learning: An Overview

[...]

Zhongxin Bai¹, Xiao-Lei Zhang¹•Institutions (1)

Northwestern Polytechnical University¹

02 Dec 2020-arXiv: Audio and Speech Processing

TL;DR: Several major subtasks of speaker recognition are reviewed, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods.

...read moreread less

Abstract: Speaker recognition is a task of identifying persons from their voices. Recently, deep learning has dramatically revolutionized speaker recognition. However, there is lack of comprehensive reviews on the exciting progress. In this paper, we review several major subtasks of speaker recognition, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods. Because the major advantage of deep learning over conventional methods is its representation ability, which is able to produce highly abstract embedding features from utterances, we first pay close attention to deep-learning-based speaker feature extraction, including the inputs, network structures, temporal pooling strategies, and objective functions respectively, which are the fundamental components of many speaker recognition subtasks. Then, we make an overview of speaker diarization, with an emphasis of recent supervised, end-to-end, and online diarization. Finally, we survey robust speaker recognition from the perspectives of domain adaptation and speech enhancement, which are two major approaches of dealing with domain mismatch and noise problems. Popular and recently released corpora are listed at the end of the paper.

...read moreread less

104 citations

Journal Article•DOI•

Speaker recognition based on deep learning: An overview

[...]

Zhongxin Bai¹, Xiao-Lei Zhang¹•Institutions (1)

Northwestern Polytechnical University¹

17 Mar 2021-Neural Networks

TL;DR: In this article, the authors review several major subtasks of speaker recognition, including speaker verification, identification, diarization, and robust speaker recognition with a focus on deep learning-based methods.

...read moreread less

89 citations

Posted Content•

The Third DIHARD Diarization Challenge

[...]

Neville Ryant¹, Prachi Singh², Venkat Krishnamohan², Rajat Varma, Kenneth Church³, Christopher Cieri¹, Jun Du⁴, Sriram Ganapathy², Mark Liberman¹ - Show less +5 more•Institutions (4)

University of Pennsylvania¹, Indian Institute of Science², Baidu³, University of Science and Technology of China⁴

02 Dec 2020-arXiv: Audio and Speech Processing

TL;DR: The third DIHARD challenge, the third in a series of speaker diarization challenges intended to improve the robustness of diarized systems to variation in recording equipment, noise conditions, and conversational domain, is introduced.

...read moreread less

Abstract: DIHARD III was the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variability in recording equipment, noise conditions, and conversational domain. Speaker diarization was evaluated under two speech activity conditions (diarization from a reference speech activity vs. diarization from scratch) and 11 diverse domains. The domains span a range of recording conditions and interaction types, including read audio-books, meeting speech, clinical interviews, web videos, and, for the first time, conversational telephone speech. A total of 30 organizations (forming 21teams) from industry and academia submitted 499 valid system outputs. The evaluation results indicate that speaker diarization has improved markedly since DIHARD I, particularly for two-party interactions, but that for many domains (e.g., web video) the problem remains far from solved.

...read moreread less

67 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

Collapse