
Showing papers by "Paul A. Viola" published in 2000


Proceedings ArticleDOI
15 Jun 2000
TL;DR: An approach for image retrieval using a very large number of highly selective features and efficient online learning based on the assumption that each image is generated by a sparse set of visual "causes" and that images which are visually similar share causes.
Abstract: We present an approach for image retrieval using a very large number of highly selective features and efficient online learning. Our approach is predicated on the assumption that each image is generated by a sparse set of visual "causes" and that images which are visually similar share causes. We propose a mechanism for computing a very large number of highly selective features which capture some aspects of this causal structure (in our implementation there are over 45,000 highly selective features). At query time a user selects a few example images, and a technique known as "boosting" is used to learn a classification function in this feature space. By construction, the boosting procedure learns a simple classifier which only relies on 20 of the features. As a result a very large database of images can be scanned rapidly, perhaps a million images per second. Finally we will describe a set of experiments performed using our retrieval system on a database of 3000 images.

504 citations
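
The query-time learning step in the abstract above can be illustrated with a minimal AdaBoost sketch over single-feature threshold stumps. This is not the paper's implementation: the stump form, the function names `train_boosted` and `predict`, and the exhaustive threshold search are assumptions made for illustration. Only the idea that each boosting round picks one feature, so the final classifier reads few features per image, comes from the abstract.

```python
import math

def stump(x, j, thr, pol):
    """Weak classifier: vote `pol` if feature j is at least thr, else -pol."""
    return pol if x[j] >= thr else -pol

def train_boosted(X, y, rounds):
    """AdaBoost with one-feature stumps (illustrative sketch).

    X: list of feature vectors, y: labels in {-1, +1}.
    Each round selects a single feature, so the final classifier
    reads at most `rounds` features per image.
    """
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n
    model = []  # entries: (feature index, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        for j in range(d):
            for thr in sorted({x[j] for x in X}):
                for pol in (1, -1):
                    # Weighted error of this candidate stump.
                    err = sum(wi for wi, x, yi in zip(w, X, y)
                              if stump(x, j, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((j, thr, pol, alpha))
        # Re-weight so that misclassified examples gain weight.
        w = [wi * math.exp(-alpha * yi * stump(x, j, thr, pol))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Sign of the alpha-weighted vote of the selected stumps."""
    score = sum(a * stump(x, j, t, p) for j, t, p, a in model)
    return 1 if score >= 0 else -1
```

Scanning a database with such a model costs one comparison per selected feature per image, which is why a classifier limited to about 20 features can be evaluated quickly.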


Proceedings ArticleDOI
15 Jun 2000
TL;DR: A probability density over the set of transforms that arose from the congealing process is developed, and it is suggested that this density over transforms may be shared by many classes, and used to develop a classifier based on only a single training example for each class.
Abstract: We define a process called congealing in which elements of a dataset (images) are brought into correspondence with each other jointly, producing a data-defined model. It is based upon minimizing the summed component-wise (pixel-wise) entropies over a continuous set of transforms on the data. One of the byproducts of this minimization is a set of transforms, one associated with each original training sample. We then demonstrate a procedure for effectively bringing test data into correspondence with the data-defined model produced in the congealing process. Subsequently, we develop a probability density over the set of transforms that arose from the congealing process. We suggest that this density over transforms may be shared by many classes, and demonstrate how using this density as "prior knowledge" can be used to develop a classifier based on only a single training example for each class.

450 citations
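
The objective named in the abstract above, the summed pixel-wise entropy of the image stack, can be written out for a toy case. The sketch below is a simplification under stated assumptions: pixels are binary, images are 1-D lists, and integer circular shifts stand in for the continuous set of transforms. The function names are hypothetical and this is not the authors' code.

```python
import math

def pixel_entropy(p):
    """Entropy in bits of a binary pixel that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def stack_entropy(images):
    """Sum of per-pixel entropies across equal-length binary images."""
    n = len(images)
    return sum(pixel_entropy(sum(column) / n) for column in zip(*images))

def shift(image, k):
    """Circularly shift a 1-D image by k pixels (assumed stand-in transform)."""
    k %= len(image)
    return image[-k:] + image[:-k] if k else list(image)

def congeal_step(images, shifts=(0, -1, 1)):
    """One coordinate-descent pass: each image in turn takes the shift that
    gives the lowest stack entropy while the other images are held fixed.
    Returns the updated images and the shift applied to each one."""
    images = [list(im) for im in images]
    applied = []
    for i in range(len(images)):
        best_k = min(shifts, key=lambda k: stack_entropy(
            images[:i] + [shift(images[i], k)] + images[i + 1:]))
        images[i] = shift(images[i], best_k)
        applied.append(best_k)
    return images, applied
```

The list of applied shifts corresponds to the per-sample transforms that the abstract describes as a byproduct of the minimization.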


Patent
07 Sep 2000
TL;DR: An audio element cache that caches audio elements for each user in a personal radio server system. Customized radio content is provided to remote listeners by storing a plurality of audio elements in a file server and retrieving a subset of them by predicting the content desired by a remote listener from that listener's user profile.
Abstract: An audio element cache is provided that is capable of caching audio elements for each user in a personal radio server system. In operation, customized radio content is provided to remote listeners in a personal radio server system by: storing a plurality of audio elements in a file server; retrieving a subset of the plurality of audio elements from the file server by predicting the content desired by a remote listener based on a user profile of the remote listener; storing the subset of the plurality of audio elements in an audio element cache; selecting audio elements to provide to a remote listener from the audio element cache; and transmitting the audio elements to the remote listener. In an embodiment, the plurality of audio elements are stored in the audio element cache when a remote listener logs on to the personal radio server system.

253 citations
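
The sequence of steps in the abstract above (store, predict from a profile, cache per user, select, transmit) might look roughly like the following. Everything concrete here is an assumption for illustration: the class name `AudioElementCache`, a profile expressed as per-genre scores, and ranking by score as the prediction rule. The patent abstract does not specify any of these.

```python
class AudioElementCache:
    """Per-user cache filled at log-on from a profile-based prediction."""

    def __init__(self, file_server, capacity):
        # file_server: dict mapping element id -> (genre, audio bytes)
        self.file_server = file_server
        self.capacity = capacity
        self.per_user = {}

    def on_login(self, user, profile):
        """Prefetch the elements predicted to be wanted by this listener.

        profile: dict mapping genre -> preference score (hypothetical form).
        """
        ranked = sorted(
            self.file_server,
            key=lambda eid: profile.get(self.file_server[eid][0], 0.0),
            reverse=True,
        )
        self.per_user[user] = {
            eid: self.file_server[eid][1] for eid in ranked[:self.capacity]
        }

    def next_element(self, user):
        """Select one cached element for transmission and drop it from the cache."""
        cache = self.per_user[user]
        eid = next(iter(cache))
        return eid, cache.pop(eid)
```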


Proceedings Article
01 Jan 2000
TL;DR: First, the data is projected into a maximally informative, low-dimensional subspace, suitable for density estimation, and the complicated stochastic relationships between the signals are modeled using a nonparametric density estimator.
Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low level, faces severe challenges, including the lack of accurate statistical models for the signals, and their high dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.

226 citations


Proceedings ArticleDOI
13 Jun 2000
TL;DR: In this article, an energy minimization formulation of the voxel occupancy problem is presented, which can be viewed as a generalization of silhouette intersection, with two advantages: it does not compute silhouettes, which are a major source of errors; and it can naturally incorporate spatial smoothness.
Abstract: Voxel occupancy is one approach for reconstructing the 3-dimensional shape of an object from multiple views. In voxel occupancy, the task is to produce a binary labeling of a set of voxels that determines which voxels are filled and which are empty. In this paper, we give an energy minimization formulation of the voxel occupancy problem. The global minimum of this energy can be rapidly computed with a single graph cut, using a result due to D. Greig et al. (1989). The energy function we minimize contains a data term and a smoothness term. The data term is a sum over the individual voxels, where the penalty for a voxel is based on the observed intensities of the pixels that intersect it. The smoothness term is the number of empty voxels adjacent to filled ones. Our formulation can be viewed as a generalization of silhouette intersection, with two advantages: we do not compute silhouettes, which are a major source of errors; and we can naturally incorporate spatial smoothness. We give experimental results showing reconstructions from both real and synthetic imagery. Reconstruction using this smoothed energy function is not much more time consuming than simple silhouette intersection; it takes about 10 seconds to reconstruct a one million voxel volume.

213 citations
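
The energy described in the abstract above, a per-voxel data term plus a smoothness term, can be evaluated on a toy 1-D row of voxels. Several things in this sketch are assumptions: the smoothness term is counted as the number of adjacent voxel pairs with different labels (one reading of the abstract's wording), the weight `lam` is a hypothetical parameter, and exhaustive search replaces the graph cut that the paper uses to find the global minimum.

```python
from itertools import product

def energy(labels, data_cost, lam):
    """Energy of a labeling of a 1-D row of voxels.

    labels: tuple of 0/1 per voxel (1 = filled, 0 = empty).
    data_cost[i][l]: penalty for giving voxel i the label l.
    lam: weight on the smoothness term (assumed parameter).
    """
    data = sum(data_cost[i][label] for i, label in enumerate(labels))
    # Count neighbouring pairs where an empty voxel touches a filled one.
    smooth = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return data + lam * smooth

def brute_force_minimum(data_cost, lam):
    """Exhaustive search over all labelings; only viable for tiny rows.
    The paper instead obtains the global minimum with a single graph cut."""
    n = len(data_cost)
    return min(product((0, 1), repeat=n),
               key=lambda labels: energy(labels, data_cost, lam))
```

With `lam = 0` each voxel simply takes its cheaper label; raising `lam` lets confident neighbours override a weakly supported voxel, which is the spatial smoothness the abstract refers to.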


Patent
07 Sep 2000
TL;DR: A method for overlapping stored audio elements in a system for providing a customized radio broadcast, in which a first audio element is divided into a plurality of audio element components and one component is mixed with a second audio element.
Abstract: A method for overlapping stored audio elements in a system for providing a customized radio broadcast. The method includes the steps of dividing a first audio element into a plurality of audio element components; selecting one of said audio element components; decompressing the selected audio element component; selecting a second audio element; decompressing the second audio element; mixing the decompressed audio element component with the decompressed second audio element to form a mixed audio element component; and compressing the mixed audio element component to form a compressed overlapping audio element component. The compressed overlapping audio element component may replace the selected audio component. The first audio element may be a song, while the second audio element may be a DJ introduction. Accordingly, the compressed overlapping audio element may be broadcast followed by the remaining components of the song audio element.

81 citations
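
The steps listed in the abstract above (divide, select a component, decompress, mix, recompress, replace) can be traced in a short sketch. The codec and sample format are assumptions: `zlib` stands in for whatever audio compression the system uses, samples are unsigned 8-bit values, and mixing is a plain average. The function name `overlap_intro` is hypothetical.

```python
import zlib

def overlap_intro(song_components, intro, index=0):
    """Mix a compressed intro into one compressed component of a song.

    song_components: list of zlib-compressed byte strings (8-bit samples),
        i.e. the first audio element already divided into components.
    intro: one zlib-compressed byte string (the second audio element).
    Returns a new component list in which component `index` is replaced by
    the mixed, recompressed audio; the other components are untouched.
    """
    song_part = zlib.decompress(song_components[index])
    intro_pcm = zlib.decompress(intro)
    # Pad the shorter signal with zeros so both have the same length.
    n = max(len(song_part), len(intro_pcm))
    a = song_part.ljust(n, b"\x00")
    b = intro_pcm.ljust(n, b"\x00")
    mixed = bytes((x + y) // 2 for x, y in zip(a, b))
    out = list(song_components)
    out[index] = zlib.compress(mixed)
    return out
```

Because only the selected component is decompressed and recompressed, the remaining components of the song can be broadcast as stored, which matches the last sentence of the abstract.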


Patent
07 Sep 2000
TL;DR: A method for generating a number audio element for playing a desired number in an audio system. Audio elements are stored for only a subset of the range of numbers, and the desired number is rounded, with an accuracy prefix representing the rounding error, until it exactly matches a stored audio element.
Abstract: A method and apparatus for generating a number audio element for playing a desired number in an audio system. Specifically, the method sets forth the steps of storing a plurality of audio elements used to represent a subset of the range of numbers; defining a plurality of match types used to determine if one or more matching audio elements exist in the subset of the range of numbers; defining a plurality of accuracy prefixes representative of the error associated with any rounding of the desired number to be played; setting the accuracy prefix to a value representing an exact match between the desired number and a number audio element in the stored subset of audio elements representative of the range of numbers; filtering the audio elements to determine if an exact match exists; if an exact match does not exist, rounding the desired number to a pre-determined level of precision to create an estimated desired number; setting the accuracy prefix to a value representing the error associated with any rounding of the desired number to be played; filtering the audio elements to determine if an exact match exists between the estimated desired number and any of the plurality of audio elements used to represent a subset of the range of numbers; and repeating the steps of filtering until such time as an exact match has been determined between the estimated desired number and any of the plurality of audio elements used to represent a subset of the range of numbers. Once an exact match is determined, the number audio element is transmitted to a remote user. The number audio element may be a stock quote or an announcement of the time. Further, the number audio element may be transmitted in telephone systems, automated teller machines, or other audio systems.

62 citations
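
The match-then-round loop in the abstract above can be sketched as follows. The specifics are assumptions, not taken from the patent: the prefix labels "exactly" and "about", the rounding schedule in `precisions`, and the function name `select_number_audio` are all hypothetical.

```python
def select_number_audio(desired, stored, precisions=(2, 1, 0, -1, -2)):
    """Pick a stored number audio element for a desired number.

    stored: set of numbers that have a recorded audio element.
    precisions: decimal places to round to, finest first (assumed schedule).
    Returns (accuracy_prefix, matched_number), or None if nothing matches.
    """
    # Start with the prefix for an exact match and filter for it.
    if desired in stored:
        return "exactly", desired
    # Otherwise round to coarser and coarser precision, switching the
    # prefix to one that signals rounding error, until a match is found.
    for places in precisions:
        estimate = round(desired, places)
        if estimate in stored:
            return "about", estimate
    return None
```

For a stock quote of 47.38 and recordings for whole numbers only, this sketch would play the equivalent of "about 47".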


Book ChapterDOI
14 Oct 2000
TL;DR: It is shown how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams, and how the method can help reduce the effect of noise in automatic speech recognition.
Abstract: Audio-based interfaces usually suffer when noise or other acoustic sources are present in the environment. For robust audio recognition, a single source must first be isolated. Existing solutions to this problem generally require special microphone configurations, and often assume prior knowledge of the spurious sources. We have developed new algorithms for segmenting streams of audio-visual information into their constituent sources by exploiting the mutual information present between audio and visual tracks. Automatic face recognition and image motion analysis methods are used to generate visual features for a particular user; empirically these features have high mutual information with audio recorded from that user. We show how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams; we also show how the method can help reduce the effect of noise in automatic speech recognition.

54 citations
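
The idea in the abstract above, associating audio with the visual track that shares the most mutual information with it, can be illustrated with a discrete estimate. This sketch swaps in a simple histogram (plug-in) estimator over already-quantized signals; the quantization, the per-region activity signals, and the function names are assumptions, and the chapter's own estimator and features are not reproduced here.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def most_related_region(audio, regions):
    """Return the name of the visual region whose quantized activity shares
    the most mutual information with the quantized audio signal.

    audio: list of discrete audio-energy levels, one per frame.
    regions: dict mapping region name -> list of discrete activity levels.
    """
    return max(regions, key=lambda name: mutual_information(audio, regions[name]))
```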


Patent
07 Sep 2000
TL;DR: A method for efficiently comparing two trinary logic representations using four data structures: VALUE (a first set of properties), KNOWN (whether the first set of properties is known), TARGET (a target set of properties), and WANT (whether the target set of properties is wanted), which are compared using bit-wise binary operations.
Abstract: A method for efficiently comparing two trinary logic representations, including the steps of creating a first data structure (a VALUE data structure) representative of a first set of properties; creating a second data structure (a KNOWN data structure) representative of whether the first set of properties is known; creating a third data structure (a TARGET data structure) representative of a target set of properties; creating a fourth data structure (a WANT data structure) representative of whether the target set of properties is wanted; and comparing the first, second, third, and fourth data structures using bit-wise binary operations to determine whether the first set of known properties are wanted as a target set of properties. In exemplary embodiments, the bit-wise binary operations are performed according to the Boolean equation: (not WANT) or (KNOWN and (not (TARGET xor VALUE))). Alternatively, the bit-wise binary operations are performed according to the Boolean equation: (not WANT) or (KNOWN and ((TARGET and VALUE) or ((not TARGET) and (not VALUE)))). These data structures may be any size computer word, including 16- and 32-bit words.

40 citations
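
The bit-wise comparison in the abstract above can be tried directly on integers. The sketch follows the expanded form of the Boolean equation (the one written with and, or, and not only), in which a wanted property matches when it is known and VALUE agrees with TARGET. The 16-bit mask reflects one of the word sizes the abstract mentions; the function names `trinary_match` and `all_match` are hypothetical.

```python
MASK = 0xFFFF  # 16-bit words, one of the sizes named in the abstract

def trinary_match(value, known, target, want, mask=MASK):
    """Bit i of the result is 1 when property i is not wanted, or when it
    is known and its value agrees with the target."""
    agree = (target & value) | (~target & ~value)
    # Python integers are unbounded, so mask the result to the word size.
    return (~want | (known & agree)) & mask

def all_match(value, known, target, want, mask=MASK):
    """True when every wanted property is known and equals its target."""
    return trinary_match(value, known, target, want, mask) == mask
```

A single word-wide expression checks every property at once, so the cost of a comparison does not grow with the number of properties packed into the word.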