
Showing papers by "Rubén San-Segundo published in 2018"


Journal ArticleDOI
TL;DR: This work analyzes and proposes several techniques to improve the robustness of a Human Activity Recognition (HAR) system that uses accelerometer signals from different smartwatches and smartphones.

75 citations


Proceedings ArticleDOI
26 Sep 2018
TL;DR: A comparison of several feature sets across two classification algorithms for PD tremor detection finds that features automatically learned by a Convolutional Neural Network give the best performance, with the authors' handcrafted features close behind.
Abstract: Wearable sensor technology has the potential to transform the treatment of Parkinson's Disease (PD) by providing objective analysis of the frequency and severity of symptoms in everyday life. However, many challenges remain in developing a system that can robustly distinguish PD motor symptoms from normal motion. Stronger feature sets may help to improve the detection accuracy of such a system. In this work, we compare several feature sets across two classification algorithms for PD tremor detection. We find that features automatically learned by a Convolutional Neural Network (CNN) lead to the best performance, although our handcrafted features are close behind. We also find that CNNs benefit from training on data decomposed into tremor and activity spectra as opposed to raw data.
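The paper itself does not list code; as a rough illustration of the tremor/activity decomposition idea, here is a minimal Python sketch using scipy band-pass filters. The sampling rate and cutoff frequencies are assumptions (PD tremor energy is commonly reported around 3-7 Hz), not values taken from the paper.

```python
# Hypothetical sketch of decomposing an accelerometer signal into
# tremor and activity bands; all frequency values are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 50.0  # assumed accelerometer sampling rate (Hz)

def bandpass(signal, low_hz, high_hz, fs=FS, order=4):
    """Zero-phase Butterworth band-pass filter."""
    nyq = fs / 2.0
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="band")
    return filtfilt(b, a, signal)

def decompose(signal):
    """Split one accelerometer axis into tremor and activity bands.

    Assumed bands: ~3-7 Hz for PD tremor, ~0.5-3 Hz for voluntary
    activity; the paper's exact decomposition is not reproduced here.
    """
    tremor = bandpass(signal, 3.0, 7.0)
    activity = bandpass(signal, 0.5, 3.0)
    return tremor, activity

# Example: decompose one second of synthetic data with a 5 Hz "tremor"
# component superimposed on a slow 1 Hz "activity" component.
t = np.arange(0, 1.0, 1.0 / FS)
raw = np.sin(2 * np.pi * 5.0 * t) + 0.5 * np.sin(2 * np.pi * 1.0 * t)
tremor, activity = decompose(raw)
```

The two filtered streams could then be fed to the CNN in place of the raw signal, which is the comparison the abstract describes.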

13 citations


Journal ArticleDOI
TL;DR: A memetic framework is proposed for recognizing faces under adverse conditions such as facial occlusions and expression and illumination variations, together with an intelligent single-particle optimizer that combines global and local search capabilities.

9 citations


Journal ArticleDOI
TL;DR: This strategy allows out-of-vocabulary (OOV) terms to be discovered and inserted into the final transcription of a Large Vocabulary Continuous Speech Recognition (LVCSR) system, significantly improving the transcription in terms of Word Error Rate (WER).
Abstract: In this work we present Resource2Vec, a neural network embedding that represents the resources making up Linked Data (LD) corpora. A vector representation of these resources allows more advantageous processing (in computational terms), as with well-known word or document embeddings. We provide a quantitative analysis of these embeddings. Furthermore, we employ them in an Automatic Speech Recognition (ASR) task to demonstrate their functionality, designing a strategy for term discovery. This strategy permits out-of-vocabulary (OOV) terms in a Large Vocabulary Continuous Speech Recognition (LVCSR) system to be discovered and then inserted into the final transcription. First, we detect where a potential OOV term may have been uttered in the LVCSR output speech segments. Second, we carry out a candidate OOV search in several LD corpora. This search is guided by distance measurements between the transcription context around the potential-OOV speech segment and the resources of the LD corpora in Resource2Vec format, yielding a set of candidates. To rank them, we rely mainly on the phone transcription of that segment. Finally, we decide whether or not to incorporate a candidate into the final transcription. The results show that our strategy significantly improves the transcription of Spanish speech in terms of Word Error Rate (WER).
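As a rough illustration of the candidate-retrieval step described above, the sketch below ranks Linked Data resources by cosine similarity between a Resource2Vec-style resource vector and an embedding of the transcription context around a potential OOV segment. The names, the embedding dimensionality, and the toy random vectors are all hypothetical; this is not the authors' implementation.

```python
# Hypothetical sketch: rank LD resources by cosine similarity to the
# embedding of the transcription context around a potential OOV segment.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_candidates(context_vec, resource_vecs, top_k=5):
    """Return the top_k resources closest to the context embedding."""
    scored = [(name, cosine(context_vec, vec))
              for name, vec in resource_vecs.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

# Toy example with random 100-dimensional embeddings standing in for
# Resource2Vec vectors and the context embedding.
rng = np.random.default_rng(0)
resources = {f"resource_{i}": rng.normal(size=100) for i in range(1000)}
context = rng.normal(size=100)
print(rank_candidates(context, resources, top_k=3))
```

In the paper's pipeline, the candidates produced by a step like this are then re-ranked using the phone transcription of the segment before the final accept/reject decision.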

6 citations


Journal ArticleDOI
TL;DR: This project focuses on advancing, developing and improving speech and language technologies, as well as image and video technologies, for the analysis of multimedia content, adding the extraction of affective-emotional information to this analysis.
Abstract: Traditionally, textual content has been the main source for information extraction and indexing; technologies capable of extracting information from the audio and video of multimedia documents have joined later. Another major axis of analysis is the emotional and affective aspect intrinsic to human communication. This information about emotions, stances, preferences, figurative language, irony, sarcasm, etc. is fundamental and irreplaceable for a complete understanding of the content of conversations, speeches, debates, discussions, etc. This project focuses on advancing, developing and improving speech and language technologies, as well as image and video technologies, for the analysis of multimedia content, adding the extraction of affective-emotional information to this analysis. As additional steps forward, we will advance the methodologies for presenting information to the user, working on technologies for language simplification, automatic report and summary generation, emotional speech synthesis, and natural and inclusive interaction.

4 citations


Proceedings ArticleDOI
21 Nov 2018
TL;DR: Neural Embeddings (NEs) are proposed as features for phone-gram sequences and used as entries in a classical i-Vector framework to train a multi-class logistic classifier; they yield relative improvements over the baseline with both a Skip-Gram model and a GloVe model.
Abstract: Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for these phone-gram sequences, which are used as entries in a classical i-Vector framework to train a multi-class logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, using Cavg as the metric to compare systems. We propose as baseline the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences contributes up to a 24.3% relative improvement over the baseline using the Skip-Gram model, and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to a 34.1% improvement over the acoustic system alone.
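To make the embedding step concrete, here is a minimal sketch of learning Skip-Gram embeddings over phone-gram sequences with gensim's Word2Vec (sg=1 selects Skip-Gram). The toy phone-gram "utterances" and all hyperparameters are assumptions; the paper's exact units, corpus, and settings are not reproduced here.

```python
# Hypothetical sketch: Skip-Gram embeddings for phone-gram sequences.
from gensim.models import Word2Vec

# Each utterance is a sequence of context-dependent phonetic units
# (phone-grams), written here as "left-center-right" strings.
utterances = [
    ["sil-o-l", "o-l-a", "l-a-sil"],          # e.g. "hola"
    ["sil-a-d", "a-d-i", "d-i-o", "i-o-s"],   # e.g. "adios"
]

model = Word2Vec(
    sentences=utterances,
    vector_size=50,   # embedding dimensionality (assumed)
    window=3,         # neighbouring phone-grams used as context
    min_count=1,
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
)

# The learned vector for one phone-gram unit, which would then serve
# as a feature entering the i-Vector framework:
print(model.wv["o-l-a"])
```

Because the window spans neighbouring phone-grams, each vector absorbs longer-context information than the unit itself encodes, which is the property the abstract highlights.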

3 citations