JASA/123
Phonological Feature-based Speech Recognition System for
Pronunciation Training in Non-native Language Learning
Vipul Arora,1 Aditi Lahiri,1,a) and Henning Reetz2
1 Faculty of Linguistics, Philology and Phonetics, University of Oxford, U.K.
2 Goethe University, Frankfurt am Main, Germany
(Dated: 27 October 2017)
Feature-based pronunciation training system
We address the question of whether phonological features can be used effectively in an automatic speech recognition (ASR) system for pronunciation training in non-native language (L2) learning. Computer-aided pronunciation training (CAPT) consists of two essential tasks: detecting mispronunciations and providing corrective feedback, usually on the basis of either full words or phonemes. Phonemes, however, can be further decomposed into phonological features, which in turn define groups of phonemes. A phonological feature-based ASR system allows us to perform a sub-phonemic analysis at the feature level, providing more effective feedback for reaching the acoustic goal and perceptual constancy. Furthermore, phonological features provide a structured way of analysing the types of errors a learner makes, and can readily convey which pronunciations need improvement. This paper presents our implementation of such an ASR system using deep neural networks as the acoustic model, and its use for detecting mispronunciations, analysing errors and rendering corrective feedback. Quantitative as well as qualitative evaluations are carried out for German and Italian learners of English. In addition to achieving high accuracy in mispronunciation detection, our system also provides accurate diagnosis of errors.
I. INTRODUCTION
Learning a new language (L2) is common in the modern era of globalisation. Adults often experience difficulties in learning and even perceiving new sounds that are not present in their native language (L1). Meanwhile, automatic speech recognition (ASR) technology has made tremendous progress in recent times and has become a useful tool for assisting L2 learners, an application commonly known as computer-aided language learning (CALL). An essential component of CALL systems is computer-aided pronunciation training (CAPT), where the system can detect mispronunciations in the learner’s utterances and can also provide corrective feedback to the learner. These systems are all based on whole phonemes. In contrast, this work highlights the utility of phonological features (which make up individual phonemes) in CALL applications. We propose a CAPT system that uses features not only to detect and analyse mispronunciations in learners’ utterances, but also to render corrective feedback through which they can efficiently improve their articulation to reach acoustic targets. Further, phonological features can also be used to find patterns of mispronunciation for a particular speaker, which can be useful for designing that speaker’s course around the types of mistakes that occur. The proposed system uses an ASR system that consists of deep neural networks (DNNs) in the acoustic front-end and a hidden Markov model (HMM). The DNNs learn to estimate phonological features from the speech signal. These features are then mapped to phonemes for the tasks of speech recognition and mispronunciation detection. The estimated phonological features are then used to construct corrective feedback for the phonemes or groups of phonemes that are mispronounced.
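As a rough illustration of the feature-to-phoneme mapping described above, the following minimal Python sketch scores each phoneme by how well a vector of per-feature probabilities matches its feature specification. The three-feature inventory, the binary specifications, the independence assumption between features, and the names `phoneme_scores` and `best_phoneme` are all hypothetical choices made for illustration; they are not the feature set, DNN outputs, or HMM decoding used in this paper.

```python
import math

# Hypothetical, highly simplified feature inventory (illustrative only).
FEATURES = ["voice", "nasal", "labial"]

# Hypothetical binary feature specifications for three consonants.
PHONEME_FEATURES = {
    "p": {"voice": 0, "nasal": 0, "labial": 1},
    "b": {"voice": 1, "nasal": 0, "labial": 1},
    "m": {"voice": 1, "nasal": 1, "labial": 1},
}


def phoneme_scores(feature_probs):
    """Return a log-score per phoneme for one frame.

    feature_probs maps each feature name to an estimated probability that
    the feature is present (assumed here to come from some acoustic model).
    Features are treated as independent, which is a simplifying assumption.
    """
    eps = 1e-9  # guards against log(0)
    scores = {}
    for phoneme, spec in PHONEME_FEATURES.items():
        total = 0.0
        for feat in FEATURES:
            p = feature_probs[feat]
            # Use p if the phoneme has the feature, otherwise 1 - p.
            total += math.log(max(p if spec[feat] else 1.0 - p, eps))
        scores[phoneme] = total
    return scores


def best_phoneme(feature_probs):
    """Pick the phoneme whose specification best matches the features."""
    scores = phoneme_scores(feature_probs)
    return max(scores, key=scores.get)


print(best_phoneme({"voice": 0.9, "nasal": 0.1, "labial": 0.8}))
```

In a full system the per-frame scores would feed a sequence model rather than being maximised frame by frame; the sketch only shows how feature estimates can be related to phoneme identities.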
The main characteristics of this work are:
• A DNN-based acoustic model to extract phonological features from the speech signal
• An ASR system using phonological features to recognise and analyse learners’ speech
• A mispronunciation detector
• Analysis of mispronunciations based on phonological features
• Rendering of feedback in terms of phonological features
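To make the last two items more concrete, the sketch below compares the feature specification of a target phoneme with that of the phoneme actually produced, turns the differences into a short hint, and tallies which features go wrong most often for a speaker. The binary feature table, the wording of the hints, and the function names `feature_errors`, `feedback` and `error_profile` are assumptions introduced here for illustration only; the system's actual feature set and feedback design are presented in the later sections.

```python
from collections import Counter

# Hypothetical binary feature specifications (illustrative only).
PHONEME_FEATURES = {
    "s": {"voice": 0, "strident": 1, "continuant": 1},
    "z": {"voice": 1, "strident": 1, "continuant": 1},
    "t": {"voice": 0, "strident": 0, "continuant": 0},
    "d": {"voice": 1, "strident": 0, "continuant": 0},
}


def feature_errors(target, produced):
    """Return (feature, expected_value) pairs where produced differs from target."""
    t, p = PHONEME_FEATURES[target], PHONEME_FEATURES[produced]
    return [(feat, t[feat]) for feat in t if t[feat] != p[feat]]


def feedback(target, produced):
    """Build a short feature-level hint for one target/produced pair."""
    errors = feature_errors(target, produced)
    if not errors:
        return f"/{target}/ was pronounced correctly."
    hints = [("add " if value else "remove ") + feat for feat, value in errors]
    return f"For /{target}/ (heard /{produced}/): " + ", ".join(hints) + "."


def error_profile(pairs):
    """Count how often each feature is wrong across (target, produced) pairs."""
    counts = Counter()
    for target, produced in pairs:
        for feat, _ in feature_errors(target, produced):
            counts[feat] += 1
    return counts


print(feedback("z", "s"))
print(error_profile([("z", "s"), ("d", "t"), ("d", "s")]).most_common())
```

A profile dominated by a single feature would suggest that one articulatory habit underlies several phoneme-level errors, which is the kind of pattern a feature-based analysis is meant to expose.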
The paper is organized as follows: Sec. II discusses the relevant previous literature. The ASR framework used for implementing the proposed system is described in Sec. III. Secs. IV and V provide details of the proposed system for detecting mispronunciations and rendering feature-based corrective feedback, respectively, along with experimental evaluation. Sec. VI concludes and discusses future directions.