
Phonological Feature-based Speech Recognition System for Pronunciation Training in Non-native Language Learning

Vipul Arora,1 Aditi Lahiri,1,a) and Henning Reetz2

1) Faculty of Linguistics, Philology and Phonetics, University of Oxford, U.K.
2) Goethe University, Frankfurt am Main, Germany

(Dated: 27 October 2017)

We address the question whether phonological features can be used effectively in an automatic speech recognition (ASR) system for pronunciation training in non-native language (L2) learning. Computer-aided pronunciation training (CAPT) consists of two essential tasks: detecting mispronunciations and providing corrective feedback, usually either on the basis of full words or phonemes. Phonemes, however, can be further disassembled into phonological features, which in turn define groups of phonemes. A phonological feature-based ASR system allows us to perform a sub-phonemic analysis at the feature level, providing more effective feedback to reach the acoustic goal and perceptual constancy. Furthermore, phonological features provide a structured way of analysing the types of errors a learner makes, and can readily convey which pronunciations need improvement. This paper presents our implementation of such an ASR system using deep neural networks as the acoustic model, and its use for detecting mispronunciations, analysing errors and rendering corrective feedback. Quantitative as well as qualitative evaluations are carried out for German and Italian learners of English. In addition to achieving high accuracy of mispronunciation detection, our system also provides accurate diagnosis of errors.
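
To make the decomposition of phonemes into features concrete, the following minimal sketch (in Python, with a small illustrative feature inventory rather than the feature system used in this work) shows how a phoneme can be represented as a bundle of phonological features, and how a single feature picks out a whole group of phonemes.

# Minimal sketch: phonemes as bundles of phonological features.
# The feature inventory below is illustrative only, not the authors' system.

PHONEME_FEATURES = {
    "p": {"labial", "plosive", "voiceless"},
    "b": {"labial", "plosive", "voiced"},
    "m": {"labial", "nasal", "voiced"},
    "t": {"coronal", "plosive", "voiceless"},
    "d": {"coronal", "plosive", "voiced"},
    "k": {"dorsal", "plosive", "voiceless"},
}

def phonemes_with_feature(feature):
    """Return the group of phonemes that share a given feature."""
    return {ph for ph, feats in PHONEME_FEATURES.items() if feature in feats}

if __name__ == "__main__":
    print(phonemes_with_feature("labial"))     # {'p', 'b', 'm'}
    print(phonemes_with_feature("voiceless"))  # {'p', 't', 'k'}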

Keywords: Phonological features; mispronunciation detection; automatic speech recognition
a) aditi.lahiri@ling-phil.ox.ac.uk; Corresponding author.

I. INTRODUCTION
Learning a new language (L2) is common in the modern era of globalisation. Adults often experience difficulties in learning and even perceiving new sounds that are not present in their native language (L1). On the other hand, automatic speech recognition (ASR) technology has made tremendous progress in recent times, becoming a useful tool for assisting L2 learners, an application commonly known as computer-aided language learning (CALL). An essential component of CALL systems is computer-aided pronunciation training (CAPT), where the system can detect mispronunciations in the learner's utterances and can also provide corrective feedback to the learner. These systems are all based on whole phonemes. In contrast, this work highlights the utility of phonological features (which make up individual phonemes) in CALL applications. We propose a CAPT system that uses features not only to detect and analyse mispronunciations in learners' utterances, but also to render corrective feedback through which learners can efficiently improve their articulation to reach acoustic targets. Further, phonological features can also be used to find patterns of mispronunciations of a particular speaker, which can be useful for designing his or her course based on the types of mistakes that occur.

The proposed system uses an automatic speech recognition system that consists of deep neural networks (DNNs) in the acoustic front-end and a hidden Markov model (HMM). The DNNs learn to estimate phonological features from the speech signal. These features are then mapped to phonemes for the tasks of speech recognition and mispronunciation detection. The estimated phonological features are then used to construct corrective feedback for the phonemes or groups of phonemes that are mispronounced.
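
The acoustic front-end can be pictured as a network that maps each acoustic frame (with some temporal context) to a posterior probability for every phonological feature. The sketch below, in Python with PyTorch, is only an illustration of that idea: the layer sizes, input representation and feature labels are placeholders, and the actual architecture, training procedure and HMM decoding stage used in this work are described later in the paper.

# Sketch of a feature-estimating acoustic model: a small multi-layer
# perceptron that maps one acoustic frame (plus context) to posterior
# probabilities for each phonological feature.  Layer sizes, the input
# representation and the feature inventory are placeholders, not the
# configuration reported in the paper.
import torch
import torch.nn as nn

FEATURES = ["labial", "coronal", "dorsal", "plosive", "nasal", "voiced"]

class FeatureEstimator(nn.Module):
    def __init__(self, n_acoustic=39 * 11, n_hidden=512, n_features=len(FEATURES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_acoustic, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_features), nn.Sigmoid(),  # one posterior per feature
        )

    def forward(self, frames):
        return self.net(frames)

if __name__ == "__main__":
    model = FeatureEstimator()
    frames = torch.randn(100, 39 * 11)  # 100 frames of stacked acoustic features
    posteriors = model(frames)          # shape: (100, 6), values in [0, 1]
    print(posteriors.shape)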
The main characteristics of this work are:

• A DNN-based acoustic model to extract phonological features from the speech signal
• An ASR system using phonological features to recognise and analyse learners' speech
• A mispronunciation detector
• Analysis of mispronunciations based on phonological features
• Rendering feedback in terms of phonological features (a sketch of such feature-level diagnosis and feedback follows this list)
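
The sketch referred to in the list above illustrates, with hypothetical feature names and a simple posterior threshold, how comparing the feature bundle expected for a target phoneme with the features detected in the learner's speech can both flag a mispronunciation and indicate which feature needs correcting; the actual detection and feedback procedures of the proposed system are given in Secs. IV and V.

# Sketch of feature-level mispronunciation diagnosis and feedback.
# Feature names, the phoneme table and the feedback wording are
# illustrative placeholders, not the authors' implementation.

PHONEME_FEATURES = {
    "t": {"coronal", "plosive", "voiceless"},
    "d": {"coronal", "plosive", "voiced"},
    "th": {"coronal", "fricative", "voiceless"},  # as in English "thin"
}
ALL_FEATURES = ["coronal", "plosive", "fricative", "voiceless", "voiced"]

def diagnose(target_phoneme, feature_posteriors, threshold=0.5):
    """Compare the target phoneme's feature bundle with the features
    detected in the learner's speech and list the mismatches."""
    detected = {f for f, p in zip(ALL_FEATURES, feature_posteriors) if p >= threshold}
    expected = PHONEME_FEATURES[target_phoneme]
    missing = expected - detected
    extra = detected - expected
    return missing, extra

if __name__ == "__main__":
    # Hypothetical posteriors for a learner realising English /th/ as [t]
    posteriors = [0.9, 0.8, 0.2, 0.9, 0.1]  # coronal, plosive, fricative, voiceless, voiced
    missing, extra = diagnose("th", posteriors)
    if missing or extra:
        print(f"Mispronunciation detected for /th/: "
              f"missing {sorted(missing)}, unexpected {sorted(extra)}")
    # -> missing ['fricative'], unexpected ['plosive']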
The paper is organized as follows: Sec. II discusses the relevant previous literature. The ASR framework used for implementing the proposed system is described in Sec. III. Secs. IV and V provide details of the proposed system for detecting mispronunciations and rendering feature-based corrective feedback, respectively, along with experimental evaluations. The conclusion in Sec. VI also discusses future directions.
