Author

Takafumi Koshinaka

Bio: Takafumi Koshinaka is an academic researcher from NEC. The author has contributed to research in topics: Speaker recognition & Acoustic model. The author has an h-index of 16 and has co-authored 80 publications receiving 1,044 citations. Previous affiliations of Takafumi Koshinaka include Tokyo Institute of Technology & Yokohama City University.


Papers
Proceedings ArticleDOI
29 Mar 2018
TL;DR: Attentive statistics pooling for deep speaker embedding in text-independent speaker verification uses an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations, capturing long-term variations in speaker characteristics more effectively.
Abstract: This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal error rates (EERs) from the conventional method by 7.5% and 8.1%, respectively.

450 citations
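To make the pooling step concrete, here is a minimal PyTorch sketch of attentive statistics pooling as the abstract describes it: a small attention network scores each frame, and the normalized scores weight both the mean and the standard deviation of the frame-level features. The attention network shape and the hidden size of 128 are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentiveStatisticsPooling(nn.Module):
    """Pool frame-level features into one utterance-level vector using
    attention-weighted means and attention-weighted standard deviations."""

    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        # Small attention network producing one scalar score per frame
        # (hidden size is an illustrative assumption).
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h):
        # h: (batch, frames, feat_dim) frame-level features
        weights = torch.softmax(self.attention(h), dim=1)  # (batch, frames, 1)
        mean = torch.sum(weights * h, dim=1)               # weighted mean
        var = torch.sum(weights * h ** 2, dim=1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))              # weighted std dev
        return torch.cat([mean, std], dim=1)               # (batch, 2 * feat_dim)

# Usage: pool a batch of 3 utterances, 200 frames, 512-dim features
pool = AttentiveStatisticsPooling(512)
print(pool(torch.randn(3, 200, 512)).shape)  # torch.Size([3, 1024])
```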

Patent
Takafumi Koshinaka
27 Nov 2008
TL;DR: In this article, the authors propose a method to robustly detect pronunciation variation examples and acquire pronunciation variation rules having a high generalization property, with less effort, using an apparatus composed of a speech data storage unit, a base form pronunciation storage unit, a sub word language model generation unit, a speech recognition unit, and a difference extraction unit.
Abstract: A problem to be solved is to robustly detect a pronunciation variation example and acquire a pronunciation variation rule having a high generalization property, with less effort. The problem can be solved by a pronunciation variation rule extraction apparatus including a speech data storage unit, a base form pronunciation storage unit, a sub word language model generation unit, a speech recognition unit, and a difference extraction unit. The speech data storage unit stores speech data. The base form pronunciation storage unit stores base form pronunciation data representing base form pronunciation of the speech data. The sub word language model generation unit generates a sub word language model from the base form pronunciation data. The speech recognition unit recognizes the speech data by using the sub word language model. The difference extraction unit extracts a difference between a recognition result outputted from the speech recognition unit and the base form pronunciation data by comparing the recognition result and the base form pronunciation data.

160 citations
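The patent's difference extraction unit compares the recognition result against the base-form pronunciation. A hypothetical illustration of that comparison on phoneme sequences, using Python's difflib rather than the patented implementation (which operates on sub-word recognition results), might look like this:

```python
import difflib

def extract_variation_examples(base_form, recognized):
    """Align the base-form pronunciation with the recognition result and
    collect (base, surface) pairs wherever the two sequences differ."""
    matcher = difflib.SequenceMatcher(a=base_form, b=recognized)
    examples = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # substitution, insertion, or deletion
            examples.append((tuple(base_form[i1:i2]), tuple(recognized[j1:j2])))
    return examples

# Toy example: devoiced /u/ in /desu ka/ dropped by the recognizer
base = ["d", "e", "s", "u", "k", "a"]
reco = ["d", "e", "s", "k", "a"]
print(extract_variation_examples(base, reco))  # [(('u',), ())]
```

Pairs collected this way over many utterances can then be aggregated into candidate variation rules.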

Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, an unsupervised probabilistic linear discriminant analysis (PLDA) adaptation algorithm was proposed to learn from a small amount of unlabeled in-domain data, inspired by a prior feature-based domain adaptation technique known as correlation alignment (CORAL).
Abstract: State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) backend. The effectiveness of these components relies on the availability of a large collection of labeled training data. In practice, it is common that the domains (e.g., language, demographic) in which the system is deployed differ from those in which it was trained. To close the gap due to the domain mismatch, we propose an unsupervised PLDA adaptation algorithm to learn from a small amount of unlabeled in-domain data. The proposed method was inspired by a prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL). We refer to the model-based adaptation technique proposed in this paper as CORAL+. The efficacy of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE’16, SRE’18) datasets.

46 citations
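For context, feature-level CORAL, the technique that inspired CORAL+, aligns second-order statistics by whitening out-of-domain features with their own covariance and re-coloring them with the in-domain covariance. The NumPy sketch below shows that feature-level idea only; CORAL+ itself adapts the PLDA model parameters, which is not reproduced here.

```python
import numpy as np

def _sym_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** power) @ eigvecs.T

def coral(source, target, eps=1e-6):
    """Correlation alignment (CORAL): whiten source-domain features with
    their own covariance, then re-color them with the target covariance."""
    dim = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(dim)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(dim)
    return source @ _sym_power(cov_s, -0.5) @ _sym_power(cov_t, 0.5)

# Out-of-domain embeddings aligned to unlabeled in-domain statistics
rng = np.random.default_rng(0)
out_domain = rng.normal(size=(1000, 16))
in_domain = 2.0 * rng.normal(size=(200, 16))
aligned = coral(out_domain, in_domain)
print(np.allclose(np.cov(aligned, rowvar=False),
                  np.cov(in_domain, rowvar=False), atol=0.2))  # True
```

After the transform, the sample covariance of the aligned features matches the in-domain covariance up to the regularization term eps.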

Patent
Takafumi Koshinaka
02 Feb 2007
TL;DR: In this paper, a speech recognition dictionary compilation assisting system can create and update speech recognition dictionaries and language models efficiently so as to reduce speech recognition errors by utilizing text data available at a low cost.
Abstract: A speech recognition dictionary compilation assisting system can create and update a speech recognition dictionary and language models efficiently so as to reduce speech recognition errors by utilizing text data available at low cost. The system includes a speech recognition dictionary storage section 105, a language model storage section 106, and an acoustic model storage section 107. A virtual speech recognition processing section 102 processes analyzed text data generated by the text analyzing section 101, making reference to the recognition dictionary, language models, and acoustic models, so as to generate virtual text data resulting from speech recognition, and compares the virtual text data with the analyzed text data. The update processing section 103 updates the recognition dictionary and language models so as to reduce the differences between the two sets of text data.

35 citations
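As a rough, hypothetical illustration of the update loop this patent describes (virtually recognize analyzed text, diff it against the reference, and patch the dictionary and language model), consider the toy sketch below. The real system simulates recognition with acoustic and language models; here a word is simply marked as an error whenever it is missing from the dictionary.

```python
import difflib
from collections import Counter

def virtual_recognize(words, dictionary):
    """Toy stand-in for the virtual speech recognition step: any word
    missing from the dictionary is assumed to come out as an error."""
    return [w if w in dictionary else "<err>" for w in words]

def update_from_text(text, dictionary, unigram_counts):
    """One pass of the update loop: virtually recognize analyzed text,
    diff it against the reference, and patch the dictionary and LM counts."""
    ref = text.split()
    hyp = virtual_recognize(ref, dictionary)
    for op, i1, i2, _, _ in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op != "equal":
            for word in ref[i1:i2]:
                dictionary.add(word)       # register missing vocabulary
                unigram_counts[word] += 1  # strengthen the language model

dictionary = {"the", "weather", "is"}
counts = Counter()
update_from_text("the weather forecast is cloudy", dictionary, counts)
print(sorted(dictionary))  # ['cloudy', 'forecast', 'is', 'the', 'weather']
```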


Cited by
Book
Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, and combining models are covered in this book on machine learning.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

Patent
11 Jan 2011
TL;DR: In this article, an intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions.
Abstract: An intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions. The system can be implemented using any of a number of different platforms, such as the web, email, smartphone, and the like, or any combination thereof. In one embodiment, the system is based on sets of interrelated domains and tasks, and employs additional functionally powered by external services with which the system can interact.

1,462 citations

Proceedings ArticleDOI
14 May 2020
TL;DR: The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN-based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
Abstract: Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on different subsets of frames during each of the channel’s statistics estimation. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.

617 citations
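As one concrete piece of the architecture, the Squeeze-and-Excitation rescaling the abstract describes can be sketched in PyTorch as follows. This is a sketch of the SE idea only, not the authors' exact module, and the bottleneck size of 128 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-Excitation for 1-D feature maps: squeeze the time axis
    into a channel descriptor, then rescale channels with learned gates."""

    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, frames), e.g. the output of a TDNN layer
        s = x.mean(dim=2)               # squeeze: global average over time
        g = self.gate(s).unsqueeze(2)   # excitation: per-channel gates in (0, 1)
        return x * g                    # rescale channels by global context

# Usage: gate a (batch=2, channels=512, frames=300) TDNN feature map
print(SEBlock1d(512)(torch.randn(2, 512, 300)).shape)  # torch.Size([2, 512, 300])
```

The global temporal average is what lets the gate condition each channel on properties of the whole recording rather than a single frame.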

Patent
28 Sep 2012
TL;DR: In this article, a virtual assistant uses context information to supplement natural language or gestural input from a user, which helps to clarify the user's intent, reduces the number of candidate interpretations of the user's input, and reduces the need for the user to provide excessive clarification input.
Abstract: A virtual assistant uses context information to supplement natural language or gestural input from a user. Context helps to clarify the user's intent and to reduce the number of candidate interpretations of the user's input, and reduces the need for the user to provide excessive clarification input. Context can include any available information that is usable by the assistant to supplement explicit user input to constrain an information-processing problem and/or to personalize results. Context can be used to constrain solutions during various phases of processing, including, for example, speech recognition, natural language processing, task flow processing, and dialog generation.

593 citations

Patent
Jeongyun Heo, Hyoungjoo Kim, Jungeun Shin, Sohoon Yi, Soohyun Lee, Moonkyung Kim
22 Aug 2014
TL;DR: In this article, a mobile terminal displays the movement of an icon on the displayed wallpapers and preview screens, allowing the user to intuitively recognize the icon's location and move it effectively.
Abstract: A mobile terminal and a method of controlling a mobile terminal may be provided. The mobile terminal may include a display to display one of a plurality of wallpapers including at least one icon; and a controller to display at least two of the plurality of wallpapers and a plurality of preview screens corresponding to the plurality of wallpapers on the display upon reception of an input of moving at least one icon, the movement of the at least one icon being displayed on the displayed wallpapers and preview screens. The mobile terminal can thus display the movement of an icon on the displayed wallpapers and preview screens. Accordingly, a user may intuitively recognize the location of an icon and effectively move it.

531 citations