
Showing papers by "Gilles Boulianne published in 2008"


Proceedings Article
01 Jan 2008
TL;DR: Graphics Processing Units (GPUs) are used to compute acoustic likelihoods in a speech recognition system; the GPU is 5x faster than a CPU SSE-based implementation, yielding a 35% speed-up on a large-vocabulary task.
Abstract: This paper introduces the use of Graphics Processing Units (GPUs) for computing acoustic likelihoods in a speech recognition system. In addition to their high availability, GPUs provide high computing performance at low cost. We have used an NVidia GeForce 8800GTX programmed with CUDA (Compute Unified Device Architecture), which exposes the GPU as a parallel coprocessor. The acoustic likelihoods are computed as dot products, operations for which GPUs are highly efficient. The implementation in our speech recognition system shows that the GPU is 5x faster than the CPU SSE-based implementation. This improvement led to a speed up of 35% on a large vocabulary task.

46 citations
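The abstract's key trick is that a diagonal-covariance Gaussian log-likelihood can be rewritten as a single dot product between a precomputed weight vector and an expanded feature vector [1, x, x^2], which is what makes the computation GPU-friendly. The sketch below is an illustration of that algebra only (pure Python, CPU); the function names are hypothetical and the paper's actual GPU kernel is not reproduced here.

```python
import math

def gaussian_to_weights(mean, var):
    """Precompute w such that log N(x; mean, diag(var)) = w . [1, x_i.., x_i^2..].

    Expanding -0.5 * sum((x_i - m_i)^2 / v_i) gives a constant term,
    a term linear in x_i, and a term linear in x_i^2.
    """
    d = len(mean)
    const = -0.5 * d * math.log(2.0 * math.pi)
    const -= 0.5 * sum(math.log(v) for v in var)
    const -= 0.5 * sum(m * m / v for m, v in zip(mean, var))
    lin = [m / v for m, v in zip(mean, var)]    # coefficients of x_i
    quad = [-0.5 / v for v in var]              # coefficients of x_i^2
    return [const] + lin + quad

def log_likelihood(weights, x):
    """Acoustic log-likelihood evaluated as one dot product."""
    y = [1.0] + list(x) + [xi * xi for xi in x]
    return sum(w * yi for w, yi in zip(weights, y))
```

Because every Gaussian in every mixture reduces to the same dot-product form, likelihoods for all states can be batched as one matrix-vector (or matrix-matrix) product per frame, the operation GPUs execute most efficiently.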


Proceedings ArticleDOI
12 May 2008
TL;DR: The speaker diarization process is a multistage segmentation and clustering system; one of its stages performs agglomerative clustering using state-of-the-art speaker identification (SID) methods.
Abstract: We report results on speaker diarization of French broadcast news and talk shows on current affairs. This speaker diarization process is a multistage segmentation and clustering system. One of the stages is agglomerative clustering using state-of-the-art speaker identification (SID) methods. For the GMMs used in this stage, we tried many different feature parameters, including MFCCs, Gaussianized MFCCs, Gaussianized MFCCs with cepstral mean subtraction, and Gaussianized MFCCs with cepstral mean subtraction containing only frames with high energy. We found that this last set of feature parameters gave the best results. Compared to Gaussianized MFCCs, these features reduced the diarization error rate (DER) by 12% on a development set and by 19% on a test set. We also combined clusters resulting from Gaussianized and non-Gaussianized feature sets. This cluster combination resulted in another 4% reduction in DER for both the development and the test sets. The best DER we have achieved is 15.4% on the development set, and 14.5% on the test set.

22 citations
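One of the feature normalizations compared above, cepstral mean subtraction, removes stationary channel effects by subtracting each cepstral dimension's mean over an utterance. A minimal sketch, assuming frames are given as lists of equal-length feature vectors (the function name is illustrative, not from the paper):

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over all frames from each frame.

    A constant convolutional channel adds a constant offset to cepstral
    features, so removing the mean cancels that channel effect.
    """
    n = len(frames)
    d = len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(d)]
    return [[f[i] - means[i] for i in range(d)] for f in frames]
```

After normalization, every dimension of the output has zero mean across the utterance; the paper additionally Gaussianizes the features and, in its best configuration, restricts the statistics to high-energy frames.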


Journal ArticleDOI
TL;DR: A new way of fusing audio and visual information to enhance audiovisual automatic speech recognition is presented, within the framework of a large-vocabulary recognition application for French Canadian speech, and the experimental methodology is described in detail.
Abstract: A new, simple and practical way of fusing audio and visual information to enhance audiovisual automatic speech recognition within the framework of an application of large-vocabulary speech recognition of French Canadian speech is presented, and the experimental methodology is described in detail. The visual information about mouth shape is extracted off-line using a cascade of weak classifiers and a Kalman filter, and is combined with the large-vocabulary speech recognition system of the Centre de Recherche Informatique de Montreal. The visual classification is performed by a pair-wise kernel-based linear discriminant analysis (KLDA) applied on a principal component analysis (PCA) subspace, followed by a binary combination and voting algorithm on 35 French phonetic classes. Three fusion approaches are compared: (1) standard low-level feature-based fusion, (2) decision-based fusion within the framework of the transferable belief model (an interpretation of the Dempster-Shafer evidential theory), and (3) a combination of (1) and (2). For decision-based fusion, the audio information is considered to be a precise Bayesian source, while the visual information is considered an imprecise evidential source. This treatment ensures that the visual information does not significantly degrade the audio information in situations where the audio performs well (e.g., a controlled noise-free environment). Results show significant improvement in the word error rate to a level comparable to that of more sophisticated systems. To the authors' knowledge, this work is the first to address large-vocabulary audiovisual recognition of French Canadian speech and decision-based audiovisual fusion within the transferable belief model.

6 citations
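The decision-based fusion above works within the transferable belief model, where evidence sources are combined with the unnormalized conjunctive rule: mass assigned to the empty set records the conflict between sources rather than being renormalized away, as in classical Dempster-Shafer combination. A minimal sketch, with mass functions represented as dicts from frozenset hypotheses to masses (representation and names are this sketch's assumptions, not the paper's implementation):

```python
def conjunctive_combine(m1, m2):
    """Unnormalized conjunctive rule of the transferable belief model.

    Each input maps frozenset hypotheses to masses summing to 1.
    Mass landing on frozenset() (the empty set) measures conflict.
    """
    out = {}
    for b, mb in m1.items():
        for c, mc in m2.items():
            a = b & c  # evidence for the intersection of hypotheses
            out[a] = out.get(a, 0.0) + mb * mc
    return out
```

In the paper's setting the audio source is a precise Bayesian source (all mass on singletons) while the visual source is imprecise (mass on larger sets), which is why the visual stream cannot override confident audio decisions.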


