Author

Takafumi Koshinaka

Bio: Takafumi Koshinaka is an academic researcher from NEC. The author has contributed to research in topics: Speaker recognition & Acoustic model. The author has an h-index of 16 and has co-authored 80 publications receiving 1,044 citations. Previous affiliations of Takafumi Koshinaka include Tokyo Institute of Technology & Yokohama City University.


Papers
Proceedings ArticleDOI
29 Mar 2018
TL;DR: Attentive statistics pooling for deep speaker embedding in text-independent speaker verification uses an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations, capturing long-term variations in speaker characteristics more effectively.
Abstract: This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal error rates (EERs) from the conventional method by 7.5% and 8.1%, respectively.

450 citations
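To make the pooling step concrete, here is a minimal PyTorch sketch of attentive statistics pooling as the abstract describes it: a small attention network scores each frame, and the normalized scores weight both the mean and the standard deviation of the frame-level features. The attention network shape and the hidden size of 128 are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentiveStatisticsPooling(nn.Module):
    """Pool frame-level features into one utterance-level vector using
    attention-weighted means and attention-weighted standard deviations."""

    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        # Small attention network producing one scalar score per frame
        # (hidden size is an illustrative assumption).
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h):
        # h: (batch, frames, feat_dim) frame-level features
        weights = torch.softmax(self.attention(h), dim=1)  # (batch, frames, 1)
        mean = torch.sum(weights * h, dim=1)               # weighted mean
        var = torch.sum(weights * h ** 2, dim=1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))              # weighted std dev
        return torch.cat([mean, std], dim=1)               # (batch, 2 * feat_dim)

# Usage: pool a batch of 3 utterances, 200 frames, 512-dim features
pool = AttentiveStatisticsPooling(512)
print(pool(torch.randn(3, 200, 512)).shape)  # torch.Size([3, 1024])
```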

Patent
Takafumi Koshinaka
27 Nov 2008
TL;DR: In this article, the authors propose a method to robustly detect pronunciation variation examples and acquire pronunciation variation rules having a high generalization property, with less effort, using an apparatus composed of a speech data storage unit, a base form pronunciation storage unit, a sub word language model generation unit, a speech recognition unit, and a difference extraction unit.
Abstract: A problem to be solved is to robustly detect a pronunciation variation example and acquire a pronunciation variation rule having a high generalization property, with less effort. The problem can be solved by a pronunciation variation rule extraction apparatus including a speech data storage unit, a base form pronunciation storage unit, a sub word language model generation unit, a speech recognition unit, and a difference extraction unit. The speech data storage unit stores speech data. The base form pronunciation storage unit stores base form pronunciation data representing base form pronunciation of the speech data. The sub word language model generation unit generates a sub word language model from the base form pronunciation data. The speech recognition unit recognizes the speech data by using the sub word language model. The difference extraction unit extracts a difference between a recognition result outputted from the speech recognition unit and the base form pronunciation data by comparing the recognition result and the base form pronunciation data.

160 citations
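The patent's difference extraction unit compares the recognition result against the base-form pronunciation. A hypothetical illustration of that comparison on phoneme sequences, using Python's difflib rather than the patented implementation (which operates on sub-word recognition results), might look like this:

```python
import difflib

def extract_variation_examples(base_form, recognized):
    """Align the base-form pronunciation with the recognition result and
    collect (base, surface) pairs wherever the two sequences differ."""
    matcher = difflib.SequenceMatcher(a=base_form, b=recognized)
    examples = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # substitution, insertion, or deletion
            examples.append((tuple(base_form[i1:i2]), tuple(recognized[j1:j2])))
    return examples

# Toy example: devoiced /u/ in /desu ka/ dropped by the recognizer
base = ["d", "e", "s", "u", "k", "a"]
reco = ["d", "e", "s", "k", "a"]
print(extract_variation_examples(base, reco))  # [(('u',), ())]
```

Pairs collected this way over many utterances can then be aggregated into candidate variation rules.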

Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, an unsupervised probabilistic linear discriminant analysis (PLDA) adaptation algorithm was proposed to learn from a small amount of unlabeled in-domain data, inspired by a prior feature-based domain adaptation technique known as correlation alignment (CORAL).
Abstract: State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) backend. The effectiveness of these components relies on the availability of a large collection of labeled training data. In practice, it is common that the domains (e.g., language, demographic) in which the system is deployed differ from those in which it was trained. To close the gap due to the domain mismatch, we propose an unsupervised PLDA adaptation algorithm to learn from a small amount of unlabeled in-domain data. The proposed method was inspired by a prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL). We refer to the model-based adaptation technique proposed in this paper as CORAL+. The efficacy of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE’16, SRE’18) datasets.

46 citations
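For context, feature-level CORAL, the technique that inspired CORAL+, aligns second-order statistics by whitening out-of-domain features with their own covariance and re-coloring them with the in-domain covariance. The NumPy sketch below shows that feature-level idea only; CORAL+ itself adapts the PLDA model parameters, which is not reproduced here.

```python
import numpy as np

def _sym_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** power) @ eigvecs.T

def coral(source, target, eps=1e-6):
    """Correlation alignment (CORAL): whiten source-domain features with
    their own covariance, then re-color them with the target covariance."""
    dim = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(dim)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(dim)
    return source @ _sym_power(cov_s, -0.5) @ _sym_power(cov_t, 0.5)

# Out-of-domain embeddings aligned to unlabeled in-domain statistics
rng = np.random.default_rng(0)
out_domain = rng.normal(size=(1000, 16))
in_domain = 2.0 * rng.normal(size=(200, 16))
aligned = coral(out_domain, in_domain)
print(np.allclose(np.cov(aligned, rowvar=False),
                  np.cov(in_domain, rowvar=False), atol=0.2))  # True
```

After the transform, the sample covariance of the aligned features matches the in-domain covariance up to the regularization term eps.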

Patent
Takafumi Koshinaka
02 Feb 2007
TL;DR: In this paper, a speech recognition dictionary compilation assisting system can create and update speech recognition dictionaries and language models efficiently so as to reduce speech recognition errors by utilizing text data available at a low cost.
Abstract: A speech recognition dictionary compilation assisting system can create and update a speech recognition dictionary and language models efficiently so as to reduce speech recognition errors by utilizing text data available at low cost. The system includes a speech recognition dictionary storage section 105, a language model storage section 106, and an acoustic model storage section 107. A virtual speech recognition processing section 102 processes analyzed text data generated by the text analyzing section 101, making reference to the recognition dictionary, language models, and acoustic models, so as to generate virtual text data resulting from speech recognition, and compares the virtual text data with the analyzed text data. The update processing section 103 updates the recognition dictionary and language models so as to reduce the differences between the two sets of text data.

35 citations
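As a rough, hypothetical illustration of the update loop this patent describes (virtually recognize analyzed text, diff it against the reference, and patch the dictionary and language model), consider the toy sketch below. The real system simulates recognition with acoustic and language models; here a word is simply marked as an error whenever it is missing from the dictionary.

```python
import difflib
from collections import Counter

def virtual_recognize(words, dictionary):
    """Toy stand-in for the virtual speech recognition step: any word
    missing from the dictionary is assumed to come out as an error."""
    return [w if w in dictionary else "<err>" for w in words]

def update_from_text(text, dictionary, unigram_counts):
    """One pass of the update loop: virtually recognize analyzed text,
    diff it against the reference, and patch the dictionary and LM counts."""
    ref = text.split()
    hyp = virtual_recognize(ref, dictionary)
    for op, i1, i2, _, _ in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op != "equal":
            for word in ref[i1:i2]:
                dictionary.add(word)       # register missing vocabulary
                unigram_counts[word] += 1  # strengthen the language model

dictionary = {"the", "weather", "is"}
counts = Counter()
update_from_text("the weather forecast is cloudy", dictionary, counts)
print(sorted(dictionary))  # ['cloudy', 'forecast', 'is', 'the', 'weather']
```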


Cited by
Book
Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, and combining models are covered in this book on machine learning.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

Patent
11 Jan 2011
TL;DR: In this article, an intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions.
Abstract: An intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions. The system can be implemented using any of a number of different platforms, such as the web, email, smartphone, and the like, or any combination thereof. In one embodiment, the system is based on sets of interrelated domains and tasks, and employs additional functionally powered by external services with which the system can interact.

1,462 citations

Proceedings ArticleDOI
14 May 2020
TL;DR: The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN-based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
Abstract: Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on different subsets of frames during each of the channel’s statistics estimation. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.

617 citations
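As one concrete piece of the architecture, the Squeeze-and-Excitation rescaling the abstract describes can be sketched in PyTorch as follows. This is a sketch of the SE idea only, not the authors' exact module, and the bottleneck size of 128 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-Excitation for 1-D feature maps: squeeze the time axis
    into a channel descriptor, then rescale channels with learned gates."""

    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, frames), e.g. the output of a TDNN layer
        s = x.mean(dim=2)               # squeeze: global average over time
        g = self.gate(s).unsqueeze(2)   # excitation: per-channel gates in (0, 1)
        return x * g                    # rescale channels by global context

# Usage: gate a (batch=2, channels=512, frames=300) TDNN feature map
print(SEBlock1d(512)(torch.randn(2, 512, 300)).shape)  # torch.Size([2, 512, 300])
```

The global temporal average is what lets the gate condition each channel on properties of the whole recording rather than a single frame.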

Patent
28 Sep 2012
TL;DR: In this article, a virtual assistant uses context information to supplement natural language or gestural input from a user, which helps to clarify the user's intent, reduces the number of candidate interpretations of the user's input, and reduces the need for the user to provide excessive clarification input.
Abstract: A virtual assistant uses context information to supplement natural language or gestural input from a user. Context helps to clarify the user's intent and to reduce the number of candidate interpretations of the user's input, and reduces the need for the user to provide excessive clarification input. Context can include any available information that is usable by the assistant to supplement explicit user input to constrain an information-processing problem and/or to personalize results. Context can be used to constrain solutions during various phases of processing, including, for example, speech recognition, natural language processing, task flow processing, and dialog generation.

593 citations

Patent
Jeongyun Heo, Hyoungjoo Kim, Jungeun Shin, Sohoon Yi, Soohyun Lee, Moonkyung Kim
22 Aug 2014
TL;DR: In this article, a mobile terminal displays the movement of an icon on the displayed wallpapers and preview screens, allowing the user to intuitively recognize the icon's location and move it effectively.
Abstract: A mobile terminal and a method of controlling a mobile terminal may be provided. The mobile terminal may include a display to display one of a plurality of wallpapers including at least one icon; and a controller to display at least two of the plurality of wallpapers and a plurality of preview screens corresponding to the plurality of wallpapers on the display upon reception of an input of moving at least one icon, the movement of the at least one icon being displayed on the displayed wallpapers and preview screens. The mobile terminal can thus display the movement of an icon on the displayed wallpapers and preview screens. Accordingly, a user may intuitively recognize the location of an icon and effectively move it.

531 citations