Phonological feature-based speech recognition system for pronunciation training in non-native language learning

doi:10.1121/1.5017834

JASA/123


Vipul Arora,¹ Aditi Lahiri,¹ᵃ⁾ and Henning Reetz²

¹Faculty of Linguistics, Philology and Phonetics, University of Oxford, U.K.
²Goethe University, Frankfurt am Main, Germany

(Dated: 27 October 2017)


Feature-based pronunciation training system

We address the question of whether phonological features can be used effectively in an automatic speech recognition (ASR) system for pronunciation training in non-native language (L2) learning. Computer-aided pronunciation training (CAPT) consists of two essential tasks: detecting mispronunciations and providing corrective feedback. Both are usually performed on the basis of either whole words or phonemes. Phonemes, however, can be further decomposed into phonological features, which in turn define groups of phonemes. A phonological feature-based ASR system allows us to perform a sub-phonemic analysis at the feature level, providing more effective feedback for reaching the acoustic goal and perceptual constancy. Furthermore, phonological features provide a structured way of analysing the types of errors a learner makes, and can readily convey which pronunciations need improvement. This paper presents our implementation of such an ASR system, using deep neural networks as the acoustic model, and its use for detecting mispronunciations, analysing errors and rendering corrective feedback. Quantitative as well as qualitative evaluations are carried out for German and Italian learners of English. In addition to achieving high accuracy in mispronunciation detection, our system also provides an accurate diagnosis of errors.
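As a small illustration of what a sub-phonemic analysis means, the sketch below represents a few English consonants as bundles of binary phonological features. It lists the features on which a target and a produced phoneme differ, and collects the group of phonemes that share a given feature value. The feature names and values in `FEATURES` are a simplified, hypothetical inventory chosen for illustration; they are not the feature set used in this work.

```python
# Hypothetical, simplified binary feature bundles (1 = feature present).
# Illustration only; not the feature inventory used in this paper.
FEATURES = {
    "th": {"voiced": 0, "continuant": 1, "strident": 0},
    "dh": {"voiced": 1, "continuant": 1, "strident": 0},
    "s":  {"voiced": 0, "continuant": 1, "strident": 1},
    "z":  {"voiced": 1, "continuant": 1, "strident": 1},
    "t":  {"voiced": 0, "continuant": 0, "strident": 0},
    "d":  {"voiced": 1, "continuant": 0, "strident": 0},
}


def feature_differences(target: str, produced: str) -> dict[str, tuple[int, int]]:
    """Return {feature: (target value, produced value)} for every mismatching feature."""
    t, p = FEATURES[target], FEATURES[produced]
    return {f: (t[f], p[f]) for f in t if t[f] != p[f]}


def phoneme_group(feature: str, value: int) -> list[str]:
    """Return, sorted, all phonemes whose bundle has `value` for `feature`."""
    return sorted(ph for ph, bundle in FEATURES.items() if bundle[feature] == value)


# Saying [z] for /dh/ (e.g. "zis" for "this") differs in a single feature.
print(feature_differences("dh", "z"))  # {'strident': (0, 1)}
print(phoneme_group("voiced", 1))      # ['d', 'dh', 'z']
```

A phoneme-level comparison would only report that /dh/ was replaced by [z]; the feature-level comparison isolates the one property (here [strident]) that the learner needs to change.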


Keywords: Phonological features; mispronunciation detection; automatic speech recognition

a) Corresponding author: aditi.lahiri@ling-phil.ox.ac.uk


I. INTRODUCTION

Learning a new language (L2) is common in the modern era of globalisation. Adults often experience difficulties in learning and even perceiving new sounds that are not present in their native language (L1). At the same time, automatic speech recognition (ASR) technology has made tremendous progress in recent years and has become a useful tool for assisting L2 learners, an application commonly known as computer-aided language learning (CALL). An essential component of CALL systems is computer-aided pronunciation training (CAPT), in which the system detects mispronunciations in the learner's utterances and can also provide corrective feedback to the learner. These systems are generally based on whole phonemes. In contrast, this work highlights the utility of phonological features (the units that make up individual phonemes) in CALL applications. We propose a CAPT system that uses features not only to detect and analyse mispronunciations in learners' utterances, but also to render corrective feedback through which learners can efficiently improve their articulation and reach acoustic targets. Phonological features can further be used to find a particular speaker's patterns of mispronunciation, which can be useful for designing that learner's course around the types of mistakes that occur. The proposed system is built on an ASR system consisting of deep neural networks (DNNs) in the acoustic front-end and a hidden Markov model (HMM). The DNNs learn to estimate phonological features from the speech signal. These features are then mapped to phonemes for speech recognition and mispronunciation detection. The estimated phonological features are also used to construct corrective feedback for the phonemes, or groups of phonemes, that are mispronounced.
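The flow just described (DNN feature estimates, mapping to phonemes, detection and feature-level feedback) can be illustrated for a single segment. The sketch below is not the authors' implementation: it assumes a hypothetical feature inventory, treats the DNN outputs as independent per-feature probabilities, scores each candidate phoneme by the summed log-probability of its feature values, and omits the HMM and time alignment entirely. The names `FEATURES`, `phoneme_log_score` and `diagnose`, and the 0.5 threshold, are illustrative assumptions.

```python
import math

# Hypothetical binary feature bundles (1 = feature present); illustration only,
# not the feature inventory used by the system described in this paper.
FEATURES = {
    "th": {"voiced": 0, "continuant": 1, "strident": 0},
    "dh": {"voiced": 1, "continuant": 1, "strident": 0},
    "s":  {"voiced": 0, "continuant": 1, "strident": 1},
    "z":  {"voiced": 1, "continuant": 1, "strident": 1},
    "t":  {"voiced": 0, "continuant": 0, "strident": 0},
}


def phoneme_log_score(phoneme, posteriors):
    """Sum of log-probabilities of the phoneme's feature values, assuming
    (as a simplification) that features are conditionally independent."""
    score = 0.0
    for feature, value in FEATURES[phoneme].items():
        p = posteriors[feature] if value == 1 else 1.0 - posteriors[feature]
        score += math.log(max(p, 1e-10))  # floor avoids log(0)
    return score


def diagnose(canonical, posteriors, threshold=0.5):
    """Recognise the segment, flag a mispronunciation if it differs from the
    canonical phoneme, and list the canonical features whose estimated value
    (posterior thresholded at `threshold`) disagrees with the target value."""
    recognised = max(FEATURES, key=lambda ph: phoneme_log_score(ph, posteriors))
    to_fix = [
        feature
        for feature, value in FEATURES[canonical].items()
        if (posteriors[feature] >= threshold) != bool(value)
    ]
    return {
        "canonical": canonical,
        "recognised": recognised,
        "mispronounced": recognised != canonical,
        "features_to_fix": to_fix,
    }


# Hypothetical DNN outputs for one segment where /th/ was expected.
print(diagnose("th", {"voiced": 0.1, "continuant": 0.9, "strident": 0.8}))
# {'canonical': 'th', 'recognised': 's', 'mispronounced': True, 'features_to_fix': ['strident']}
```

Because the diagnosis checks each estimated feature against the canonical value separately, the feedback can name the single feature at fault ([strident] here) rather than only reporting that /th/ was recognised as [s].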

The main characteristics of this work are:

• A DNN-based acoustic model that extracts phonological features from the speech signal
• An ASR system that uses phonological features to recognise and analyse learners' speech
• A mispronunciation detector
• Analysis of mispronunciations based on phonological features
• Rendering of feedback in terms of phonological features
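The error analysis and feature-based feedback listed above can be sketched as an aggregation over a learner's segments: each (canonical, produced) phoneme pair is decomposed into feature mismatches, which are counted by feature and direction, and the most frequent mismatches are turned into short messages. The phoneme pairs, feature inventory and message wording are hypothetical; in a real system the produced phonemes would come from the recogniser rather than being given by hand.

```python
from collections import Counter

# Hypothetical binary feature bundles (1 = feature present); illustration only.
FEATURES = {
    "th": {"voiced": 0, "continuant": 1, "strident": 0},
    "dh": {"voiced": 1, "continuant": 1, "strident": 0},
    "s":  {"voiced": 0, "continuant": 1, "strident": 1},
    "z":  {"voiced": 1, "continuant": 1, "strident": 1},
    "t":  {"voiced": 0, "continuant": 0, "strident": 0},
    "d":  {"voiced": 1, "continuant": 0, "strident": 0},
}


def feature_error_profile(pairs):
    """Count feature mismatches keyed by (feature, target value, produced value)."""
    counts = Counter()
    for canonical, produced in pairs:
        target, actual = FEATURES[canonical], FEATURES[produced]
        for feature in target:
            if target[feature] != actual[feature]:
                counts[(feature, target[feature], actual[feature])] += 1
    return counts


def render_feedback(counts, top=2):
    """Turn the `top` most frequent mismatches into short feature-level messages."""
    sign = {0: "-", 1: "+"}
    return [
        f"[{sign[t]}{f}] realised as [{sign[a]}{f}] in {n} segment(s)"
        for (f, t, a), n in counts.most_common(top)
    ]


# Hypothetical learner data: (canonical, produced) phoneme pairs.
profile = feature_error_profile(
    [("th", "s"), ("dh", "z"), ("th", "s"), ("d", "t"), ("dh", "d")]
)
print(render_feedback(profile, top=1))
# ['[-strident] realised as [+strident] in 3 segment(s)']
```

Grouping errors by feature rather than by phoneme pair shows that the substitutions /th/→[s] and /dh/→[z] share one underlying problem, which is the kind of pattern that could guide the design of a learner's course.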

The paper is organised as follows: Sec. II reviews relevant previous literature. The ASR framework used for implementing the proposed system is described in Sec. III. Secs. IV and V describe the proposed system for detecting mispronunciations and for rendering feature-based corrective feedback, respectively, along with experimental evaluations. Sec. VI concludes the paper and discusses future directions.
