Author

Alexandra Markó

Bio: Alexandra Markó is an academic researcher from Eötvös Loránd University. The author has contributed to research in the topics of speech synthesis and silent speech interfaces. The author has an h-index of 8 and has co-authored 36 publications receiving 171 citations. Previous affiliations of Alexandra Markó include the Budapest University of Technology and Economics.

Papers
Proceedings ArticleDOI
20 Aug 2017
TL;DR: The representation that used several neighboring image frames combined with a feature selection method performed best, both in the subjective listening experiments and in terms of the Normalized Mean Squared Error.
Abstract: In this paper we present our initial results in articulatory-to-acoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Despite the fact that deep learning has revolutionized several fields, so far only a few researchers have applied DNNs for this task. Here, we compare various possible feature representation approaches combined with DNN-based regression. As the input, we recorded synchronized 2D ultrasound images and speech signals. The task of the DNN was to estimate Mel-Generalized Cepstrum-based Line Spectral Pair (MGC-LSP) coefficients, which then served as input to a standard pulse-noise vocoder for speech synthesis. As the raw ultrasound images have a relatively high resolution, we experimented with various feature selection and transformation approaches to reduce the size of the feature vectors. The synthetic speech signals resulting from the various DNN configurations were evaluated both using objective measures and a subjective listening test. We found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments, and in terms of the Normalized Mean Squared Error. Our results may be useful for creating Silent Speech Interface applications in the future.

54 citations
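A minimal sketch, not the authors' code, of the DNN regression step this paper describes: stacked neighbouring ultrasound feature frames are mapped to MGC-LSP vocoder parameters with a feed-forward network trained under a mean-squared-error objective. All dimensions, layer sizes, and the training data here are illustrative assumptions.

```python
# Sketch: DNN regression from reduced ultrasound features to MGC-LSP coefficients.
import numpy as np
from tensorflow import keras

N_FRAMES = 5        # assumed number of neighbouring ultrasound frames stacked
N_FEATURES = 500    # assumed feature-vector size after feature selection
N_MGC_LSP = 25      # assumed number of MGC-LSP coefficients per speech frame

model = keras.Sequential([
    keras.layers.Input(shape=(N_FRAMES * N_FEATURES,)),
    keras.layers.Dense(1000, activation="relu"),
    keras.layers.Dense(1000, activation="relu"),
    keras.layers.Dense(N_MGC_LSP, activation="linear"),  # regression output
])
model.compile(optimizer="adam", loss="mse")  # (normalized) MSE-style objective

# Dummy arrays standing in for (ultrasound features, vocoder parameters) pairs.
x = np.random.rand(1000, N_FRAMES * N_FEATURES).astype("float32")
y = np.random.rand(1000, N_MGC_LSP).astype("float32")
model.fit(x, y, epochs=2, batch_size=64, verbose=0)
```

In the paper, the predicted MGC-LSP frames are then passed to a pulse-noise vocoder to synthesize the speech waveform; that step is outside this sketch.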

Proceedings ArticleDOI
15 Apr 2018
TL;DR: Deep neural networks are used to perform articulatory-to-acoustic conversion from ultrasound images, with an emphasis on estimating the voicing feature and the F0 curve from the ultrasound input; a correlation of 0.74 is achieved between the original and the predicted F0 curves.
Abstract: State-of-the-art silent speech interface systems apply vocoders to generate the speech signal directly from articulatory data. Most of these approaches concentrate on estimating just the spectral features of the vocoder, and use the original F0, a constant F0 or white noise as excitation. This solution is based on the assumption that the F0 curve is unpredictable from articulatory data that does not contain direct measurements of the vocal fold vibration. Here, we experimented with deep neural networks to perform articulatory-to-acoustic conversion from ultrasound images, with an emphasis on estimating the voicing feature and the F0 curve from the ultrasound input. Contrary to the common belief that F0 is unpredictable, we attained a correlation rate of 0.74 between the original and the predicted F0 curve. What is more, the listening tests revealed that our subjects could not distinguish the sentences synthesized using the DNN-estimated and the original F0 curve, and ranked them as having the same quality.

29 citations
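A hedged sketch of joint voicing and F0 estimation from ultrasound-derived features, in the spirit of the paper above: one network trunk with a sigmoid voicing head and a linear F0 head, plus the correlation measure used for evaluation. The architecture, sizes, and dummy data are assumptions, not the authors' configuration.

```python
# Sketch: one DNN trunk, two heads (voicing classification + F0 regression).
import numpy as np
from tensorflow import keras

inputs = keras.layers.Input(shape=(2500,))            # assumed ultrasound feature size
hidden = keras.layers.Dense(1000, activation="relu")(inputs)
hidden = keras.layers.Dense(1000, activation="relu")(hidden)
voicing = keras.layers.Dense(1, activation="sigmoid", name="voicing")(hidden)
f0 = keras.layers.Dense(1, activation="linear", name="f0")(hidden)

model = keras.Model(inputs, [voicing, f0])
model.compile(optimizer="adam",
              loss={"voicing": "binary_crossentropy", "f0": "mse"})

# Evaluation as in the paper: correlation between original and predicted F0.
# (The 0.74 figure refers to the authors' data, not these dummy curves.)
f0_true = np.random.rand(200)
f0_pred = np.random.rand(200)
corr = np.corrcoef(f0_true, f0_pred)[0, 1]
```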

Proceedings ArticleDOI
02 Sep 2018
TL;DR: The results show that the parallel learning of the two types of targets is indeed beneficial for both tasks, and improvements are obtained by using multi-task training of deep neural networks as a weight initialization step before task-specific training.
Abstract: Silent Speech Interface systems apply two different strategies to solve the articulatory-to-acoustic conversion task. The recognition-and-synthesis approach applies speech recognition techniques to map the articulatory data to a textual transcript, which is then converted to speech by a conventional text-to-speech system. The direct synthesis approach seeks to convert the articulatory information directly to speech synthesis (vocoder) parameters. In both cases, deep neural networks are an evident and popular choice to learn the mapping task. Recognizing that the learning of speech recognition and speech synthesis targets (acoustic model states vs. vocoder parameters) are two closely related tasks over the same ultrasound tongue image input, here we experiment with the multi-task training of deep neural networks, which seeks to solve the two tasks simultaneously. Our results show that the parallel learning of the two types of targets is indeed beneficial for both tasks. Moreover, we obtained further improvements by using multi-task training as a weight initialization step before task-specific training. Overall, we report a relative error rate reduction of about 7% in both the speech recognition and the speech synthesis tasks.

17 citations
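An illustrative sketch of the multi-task idea described above: a shared DNN trunk over the ultrasound input with a recognition head (acoustic-model states) and a synthesis head (vocoder parameters), where the shared weights can later initialize task-specific fine-tuning. All sizes are assumed, not taken from the paper.

```python
# Sketch: multi-task DNN with shared trunk and two task-specific output heads.
from tensorflow import keras

N_INPUT = 2500     # assumed ultrasound feature dimension
N_STATES = 1200    # assumed number of acoustic-model states (recognition task)
N_VOCODER = 25     # assumed number of vocoder parameters (synthesis task)

inputs = keras.layers.Input(shape=(N_INPUT,))
shared = keras.layers.Dense(1000, activation="relu")(inputs)
shared = keras.layers.Dense(1000, activation="relu")(shared)
states = keras.layers.Dense(N_STATES, activation="softmax", name="states")(shared)
vocoder = keras.layers.Dense(N_VOCODER, activation="linear", name="vocoder")(shared)

multitask = keras.Model(inputs, [states, vocoder])
multitask.compile(optimizer="adam",
                  loss={"states": "sparse_categorical_crossentropy",
                        "vocoder": "mse"})

# After multi-task training, the shared layers can serve as initialization for
# task-specific training, e.g. a synthesis-only model reusing the same weights.
synthesis_only = keras.Model(inputs, vocoder)
synthesis_only.compile(optimizer="adam", loss="mse")
```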

Journal ArticleDOI
TL;DR: The results show that one or the other, or some combination, of these various factors may play a role in the perception process in certain instances only; this suggests that other parameters, yet to be explored, are also involved in the identification of these functions.
Abstract: This study is the first attempt at detecting formal and positional characteristics of single-word simple discourse markers in a spontaneous speech sample of Hungarian. In the first part of the research, theoretical claims made in the relevant literature were tested. The data did not confirm or only partially confirmed the claims that Hungarian discourse markers (i) occur in turn-initial position and (ii) are prosodically independent, that is, are flanked by a pause on either side. In the second part, we looked at word forms both occurring as discourse markers and having syntactic functions in order to determine the features and cues which help us during speech perception to identify and distinguish between syntactic and discourse marking functions. The points of analysis were as follows: the position of the given word form in the clause, the degree of lenition in its articulation, the duration of the word form, the modulation of fundamental frequency, and the occurrence of sentence stress, if any, on the word form at hand. The results show that one or the other, or some combination, of these various factors may play a role in the perception process in certain instances only; this suggests that other parameters, yet to be explored, are also involved in the identification of these functions.

16 citations

Proceedings ArticleDOI
01 Jul 2019
TL;DR: An autoencoder neural network is trained on the ultrasound images, and the spectral speech parameters are estimated by a second DNN that uses the activations of the autoencoder's bottleneck layer as features.
Abstract: When using ultrasound video as input, Deep Neural Network-based Silent Speech Interfaces usually rely on the whole image to estimate the spectral parameters required for the speech synthesis step. Although this approach is quite straightforward, and it permits the synthesis of understandable speech, it has several disadvantages as well. Besides the inability to capture the relations between close regions (i.e. pixels) of the image, this pixel-by-pixel representation of the image is also quite uneconomical. It is easy to see that a significant part of the image is irrelevant for the spectral parameter estimation task as the information stored by the neighbouring pixels is redundant, and the neural network is quite large due to the large number of input features. To resolve these issues, in this study we train an autoencoder neural network on the ultrasound image; the estimation of the spectral speech parameters is done by a second DNN, using the activations of the bottleneck layer of the autoencoder network as features. In our experiments, the proposed method proved to be more efficient than the standard approach: the measured normalized mean squared error scores were lower, while the correlation values were higher in each case. Based on the result of a listening test, the synthesized utterances also sounded more natural to native speakers. A further advantage of our proposed approach is that, due to the (relatively) small size of the bottleneck layer, we can utilize several consecutive ultrasound images during estimation without a significant increase in the network size, while significantly increasing the accuracy of parameter estimation.

13 citations
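A sketch of the two-stage setup described above, under assumed dimensions: first an autoencoder compresses the ultrasound image, then a second DNN maps the bottleneck activations to spectral vocoder parameters. This is not the authors' code; image size, bottleneck size, and layer widths are placeholders.

```python
# Sketch: autoencoder bottleneck features feeding a second regression DNN.
from tensorflow import keras

N_PIXELS = 64 * 128   # assumed (resized) ultrasound image size, flattened
N_BOTTLENECK = 128    # assumed bottleneck layer size
N_SPECTRAL = 25       # assumed number of spectral vocoder parameters

# Stage 1: autoencoder trained to reconstruct the ultrasound image.
img_in = keras.layers.Input(shape=(N_PIXELS,))
enc = keras.layers.Dense(1000, activation="relu")(img_in)
bottleneck = keras.layers.Dense(N_BOTTLENECK, activation="relu",
                                name="bottleneck")(enc)
dec = keras.layers.Dense(1000, activation="relu")(bottleneck)
img_out = keras.layers.Dense(N_PIXELS, activation="linear")(dec)
autoencoder = keras.Model(img_in, img_out)
autoencoder.compile(optimizer="adam", loss="mse")

# Encoder that exposes the bottleneck activations as compact features.
encoder = keras.Model(img_in, bottleneck)

# Stage 2: DNN from bottleneck features to spectral parameters. Because the
# bottleneck is small, several consecutive frames could be concatenated here
# without a large network; a single frame is used for brevity.
feat_in = keras.layers.Input(shape=(N_BOTTLENECK,))
h = keras.layers.Dense(1000, activation="relu")(feat_in)
spectral = keras.layers.Dense(N_SPECTRAL, activation="linear")(h)
regressor = keras.Model(feat_in, spectral)
regressor.compile(optimizer="adam", loss="mse")
```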


Cited by
Proceedings ArticleDOI
02 May 2019
TL;DR: A system is proposed that detects a user's unvoiced utterances and recognizes their content without any vocalization, and it is confirmed that audio signals generated by the system can control existing smart speakers.
Abstract: The availability of digital devices operated by voice is expanding rapidly. However, the applications of voice interfaces are still restricted. For example, speaking in public places becomes an annoyance to the surrounding people, and secret information should not be uttered. Environmental noise may reduce the accuracy of speech recognition. To address these limitations, a system to detect a user's unvoiced utterances is proposed. From internal information observed by an ultrasonic imaging sensor attached to the underside of the jaw, the proposed system recognizes the utterance contents without the user having to vocalize. A deep neural network model is used to obtain acoustic features from a sequence of ultrasound images. We confirmed that audio signals generated by our system can control existing smart speakers. We also observed that users can adjust their oral movements to learn and improve the recognition accuracy of the system.

82 citations
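A hedged illustration of the kind of model this paper describes: a deep network that maps a sequence of ultrasound images from a sensor under the jaw to acoustic features. The specific architecture here (a per-frame CNN encoder followed by an LSTM over time) and all dimensions are assumptions for illustration only.

```python
# Sketch: sequence of ultrasound images -> per-frame acoustic features.
from tensorflow import keras

SEQ_LEN, H, W = 20, 64, 64   # assumed sequence length and image size
N_ACOUSTIC = 25              # assumed acoustic feature dimension per frame

# CNN applied independently to each ultrasound frame.
frame_encoder = keras.Sequential([
    keras.layers.Input(shape=(H, W, 1)),
    keras.layers.Conv2D(16, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    keras.layers.GlobalAveragePooling2D(),
])

seq_in = keras.layers.Input(shape=(SEQ_LEN, H, W, 1))
encoded = keras.layers.TimeDistributed(frame_encoder)(seq_in)   # per-frame features
temporal = keras.layers.LSTM(128, return_sequences=True)(encoded)
acoustic = keras.layers.TimeDistributed(
    keras.layers.Dense(N_ACOUSTIC, activation="linear"))(temporal)

model = keras.Model(seq_in, acoustic)
model.compile(optimizer="adam", loss="mse")
```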

Book ChapterDOI
17 Apr 2015

64 citations

Journal ArticleDOI
TL;DR: A number of challenges remain to be addressed in future research before SSIs can be promoted to real-world applications; if these are addressed successfully, future SSIs will improve the lives of persons with severe speech impairments by restoring their communication capabilities.
Abstract: This review summarises the status of silent speech interface (SSI) research. SSIs rely on non-acoustic biosignals generated by the human body during speech production to enable communication whenever normal verbal communication is not possible or not desirable. In this review, we focus on the first case and present the latest SSI research aimed at providing new alternative and augmentative communication methods for persons with severe speech disorders. SSIs can employ a variety of biosignals to enable silent communication, such as electrophysiological recordings of neural activity, electromyographic (EMG) recordings of vocal tract movements or the direct tracking of articulator movements using imaging techniques. Depending on the disorder, some sensing techniques may be better suited than others to capture speech-related information. For instance, EMG and imaging techniques are well suited for laryngectomised patients, whose vocal tract remains almost intact but who are unable to speak after the removal of the vocal folds, yet they fail for severely paralysed individuals. From the biosignals, SSIs decode the intended message, using automatic speech recognition or speech synthesis algorithms. Despite considerable advances in recent years, most present-day SSIs have only been validated in laboratory settings for healthy users. Thus, as discussed in this paper, a number of challenges remain to be addressed in future research before SSIs can be promoted to real-world applications. If these issues can be addressed successfully, future SSIs will improve the lives of persons with severe speech impairments by restoring their communication capabilities.

60 citations

BookDOI
25 Sep 2019
TL;DR: The Routledge Handbook of North American Languages is a one-stop reference for linguists on the topics that come up most frequently in the study of the languages of North America (including Mexico).
Abstract: The Routledge Handbook of North American Languages is a one-stop reference for linguists on those topics that come up most frequently in the study of the languages of North America (including Mexico). The handbook brings together contributors working in many different theoretical frameworks and at different stages of their careers, all of whom are well-known experts in North American languages. The volume comprises two distinct parts: the first surveys some of the phenomena most frequently discussed in the study of North American languages, and the second surveys some of the most frequently discussed language families of North America. The consistent goal of each contribution is to couch the content of the chapter in contemporary theory so that the information is maximally relevant and accessible for a wide range of audiences, including graduate students, new scholars, and senior scholars who are looking for a crash course in the topics. Empirically driven chapters provide the fundamental knowledge needed to participate in contemporary theoretical discussions of these languages, making this handbook an indispensable resource for linguistics scholars.

35 citations

Journal ArticleDOI
17 Feb 2021-Sensors
TL;DR: A survey of mouth interface technologies for speech recognition, production, and volitional control is presented, covering research on artificial mouth technologies based on various sensors, including electromyography (EMG), electroencephalography (EEG), electropalatography (EPG), electromagnetic articulography (EMA), permanent magnet articulography (PMA), gyros, images and 3-axial magnetic sensors, with a particular focus on deep learning techniques.
Abstract: Voice is one of the essential mechanisms for communicating and expressing one's intentions as a human being. There are several causes of voice inability, including disease, accident, vocal abuse, medical surgery, ageing, and environmental pollution, and the risk of voice loss continues to increase. Novel approaches to speech recognition and production need to be developed, because voice loss seriously undermines quality of life and sometimes leads to isolation from society. In this review, we survey mouth interface technologies, which are mouth-mounted devices for speech recognition, production, and volitional control, and the corresponding research to develop artificial mouth technologies based on various sensors, including electromyography (EMG), electroencephalography (EEG), electropalatography (EPG), electromagnetic articulography (EMA), permanent magnet articulography (PMA), gyros, images and 3-axial magnetic sensors, especially with deep learning techniques. We pay particular attention to deep learning technologies related to voice recognition, including visual speech recognition and silent speech interfaces, analyse their workflows, and systematize them into a taxonomy. Finally, we discuss methods to solve the communication problems of people with speaking disabilities and future research with respect to deep learning components.

31 citations