
Showing papers in "Multimedia Systems in 2006"


Journal ArticleDOI
TL;DR: A novel hybrid architecture that integrates both CDN- and P2P-based streaming media distribution is proposed and analyzed, which significantly lowers the cost of CDN capacity reservation, without compromising the media quality delivered.
Abstract: To distribute video and audio data in real-time streaming mode, two different technologies --- Content Distribution Network (CDN) and Peer-to-Peer (P2P) --- have been proposed. However, both technologies have their own limitations: CDN servers are expensive to deploy and maintain, and consequently incur a cost for media providers and/or clients for server capacity reservation. On the other hand, a P2P-based architecture requires a sufficient number of seed-supplying peers to jumpstart the distribution process. Compared with a CDN server, a peer usually offers a much lower outbound streaming rate, and hence multiple peers must jointly stream media data to a requesting peer. Furthermore, it is not clear how much a peer should contribute back to the system after receiving the media data in order to sustain the overall media distribution capacity. In this paper, we propose and analyze a novel hybrid architecture that integrates both CDN- and P2P-based streaming media distribution. The architecture is highly cost-effective: it significantly lowers the cost of CDN capacity reservation without compromising the media quality delivered. In particular, we propose and compare different limited contribution policies for peers that request media data, so that the streaming capacity of each peer can be exploited on a fair and limited basis. We present: (1) an in-depth analysis of the proposed architecture under different contribution policies, and (2) extensive simulation results which validate the analysis. Our analytical and simulation results form a rigorous basis for the planning and dimensioning of the hybrid architecture.

247 citations


Journal ArticleDOI
TL;DR: In comparing a number of representations for songs, the statistics of mel-frequency cepstral coefficients were found to perform best in precision-at-20 comparisons, and it is shown that by choosing training examples intelligently, active learning requires half as many labeled examples to achieve the same accuracy as a standard scheme.
Abstract: Searching and organizing growing digital music collections requires a computational model of music similarity. This paper describes a system for performing flexible music similarity queries using SVM active learning. We evaluated the success of our system by classifying 1210 pop songs according to mood and style (from an online music guide) and by the performing artist. In comparing a number of representations for songs, we found the statistics of mel-frequency cepstral coefficients to perform best in precision-at-20 comparisons. We also show that by choosing training examples intelligently, active learning requires half as many labeled examples to achieve the same accuracy as a standard scheme.
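The evaluation metric and the active-learning query step described above can be sketched in a few lines; this is an illustrative fragment, not the paper's code, and the uncertainty rule below is the standard smallest-margin heuristic rather than the authors' exact SVM criterion.

```python
def precision_at_k(ranked_labels, k=20):
    """Fraction of the top-k retrieved items that are relevant (label == 1)."""
    top = ranked_labels[:k]
    return sum(top) / len(top)

def select_most_uncertain(scores, n):
    """Active-learning query selection: pick the n unlabeled items whose
    classifier margin |score| is smallest, i.e. closest to the boundary."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return order[:n]
```

For example, with margin scores `[0.9, -0.1, 0.4, -0.05]`, `select_most_uncertain(..., 2)` queries items 3 and 1 for labels, since they lie nearest the decision boundary.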

146 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel technique for clustering and classification of object trajectory-based video motion clips using spatiotemporal function approximations, and proposes a Mahalanobis classifier for the detection of anomalous trajectories.
Abstract: This paper proposes a novel technique for clustering and classification of object trajectory-based video motion clips using spatiotemporal function approximations. Assuming the clusters of trajectory points are distributed normally in the coefficient feature space, we propose a Mahalanobis classifier for the detection of anomalous trajectories. Motion trajectories are considered as time series and modelled using orthogonal basis function representations. We have compared three different function approximations --- least squares polynomials, Chebyshev polynomials and Fourier series obtained by Discrete Fourier Transform (DFT). Trajectory clustering is then carried out in the chosen coefficient feature space to discover patterns of similar object motions. The coefficients of the basis functions are used as input feature vectors to a Self-Organising Map which can learn similarities between object trajectories in an unsupervised manner. Encoding trajectories in this way leads to efficiency gains over existing approaches that use discrete point-based flow vectors to represent the whole trajectory. Our proposed techniques are validated on three different datasets --- Australian sign language, hand-labelled object trajectories from video surveillance footage and real-time tracking data obtained in the laboratory. Applications to event detection and motion data mining for multimedia video surveillance systems are envisaged.
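The DFT-based trajectory encoding and the Mahalanobis anomaly test can be sketched as below; a minimal numpy illustration under assumed conventions, not the authors' implementation (which also covers least-squares and Chebyshev fits and clusters with a Self-Organising Map).

```python
import numpy as np

def dft_features(traj, n_coeff=4):
    """Represent a 1-D motion trajectory (time series) by the magnitudes of
    its first n_coeff DFT coefficients: a compact, fixed-length feature vector."""
    spectrum = np.fft.rfft(np.asarray(traj, dtype=float))
    return np.abs(spectrum[:n_coeff])

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of a feature vector x from a cluster with the given
    mean and covariance; a large distance flags the trajectory as anomalous."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

A trajectory whose coefficient vector lies several Mahalanobis units from every cluster mean would be reported as anomalous.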

125 citations


Journal ArticleDOI
TL;DR: Two affine-invariant representations for motion trajectories, based on curvature scale space (CSS) and the centroid distance function (CDF), are derived; their properties facilitate the design of efficient recognition algorithms based on hidden Markov models (HMMs).
Abstract: Motion trajectories provide rich spatio-temporal information about an object's activity. The trajectory information can be obtained using a tracking algorithm on data streams available from a range of devices including motion sensors, video cameras, haptic devices, etc. Developing view-invariant activity recognition algorithms based on this high-dimensional cue is an extremely challenging task. This paper presents efficient activity recognition algorithms using novel view-invariant representations of trajectories. Towards this end, we derive two affine-invariant representations for motion trajectories based on curvature scale space (CSS) and the centroid distance function (CDF). The properties of these schemes facilitate the design of efficient recognition algorithms based on hidden Markov models (HMMs). In the CSS-based representation, maxima of curvature zero crossings at increasing levels of smoothness are extracted to mark the location and extent of concavities in the curvature. The sequences of these CSS maxima are then modeled by continuous density HMMs. For the case of the CDF, we first segment the trajectory into subtrajectories using the CDF-based representation. These subtrajectories are then represented by their Principal Component Analysis (PCA) coefficients. The sequences of these PCA coefficients from subtrajectories are then modeled by continuous density HMMs. Different classes of object motions are modeled by one continuous HMM per class, where state PDFs are represented by GMMs. Experiments using a database of around 1750 complex trajectories (obtained from the UCI-KDD data archives), subdivided into five different classes, are reported.
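The centroid distance function itself is a short computation; the sketch below is a minimal numpy version, and the max-normalisation for scale invariance is an illustrative assumption rather than the paper's exact definition.

```python
import numpy as np

def centroid_distance_function(points):
    """CDF of a 2-D trajectory: distance of each trajectory point from the
    trajectory centroid. Invariant to translation by construction; dividing
    by the maximum distance additionally gives scale invariance."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    d = np.linalg.norm(pts - centroid, axis=1)
    return d / d.max()
```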

94 citations


Journal ArticleDOI
TL;DR: The evaluation of the prototype system on 17,000 images and 7736 automatically extracted annotation words from crawled Web pages for multi-modal image retrieval has indicated that the proposed semantic model and the developed Bayesian framework are superior to a state-of-the-art peer system in the literature.
Abstract: This paper addresses the automatic image annotation problem and its application to multi-modal image retrieval. The contribution of our work is three-fold. (1) We propose a probabilistic semantic model in which the visual features and the textual words are connected via a hidden layer which constitutes the semantic concepts to be discovered, to explicitly exploit the synergy among the modalities. (2) The association of visual features and textual words is determined in a Bayesian framework such that the confidence of the association can be provided. (3) Extensive evaluation on a large-scale, visually and semantically diverse image collection crawled from the Web is reported to evaluate the prototype system based on the model. In the proposed probabilistic model, a hidden concept layer which connects the visual feature and the word layer is discovered by fitting a generative model to the training images and annotation words through an Expectation-Maximization (EM) based iterative learning procedure. The evaluation of the prototype system on 17,000 images and 7736 automatically extracted annotation words from crawled Web pages for multi-modal image retrieval has indicated that the proposed semantic model and the developed Bayesian framework are superior to a state-of-the-art peer system in the literature.

88 citations


Journal ArticleDOI
TL;DR: This work formulates the multi-camera control strategy as an online scheduling problem and proposes a solution that combines the information gathered by the wide-FOV cameras with weighted round-robin scheduling to guide the available PTZ cameras, such that each pedestrian is observed by at least one PTZ camera while in the designated area.
Abstract: We present a surveillance system, comprising wide field-of-view (FOV) passive cameras and pan/tilt/zoom (PTZ) active cameras, which automatically captures high-resolution videos of pedestrians as they move through a designated area. A wide-FOV static camera can track multiple pedestrians, while any PTZ active camera can capture high-quality videos of one pedestrian at a time. We formulate the multi-camera control strategy as an online scheduling problem and propose a solution that combines the information gathered by the wide-FOV cameras with weighted round-robin scheduling to guide the available PTZ cameras, such that each pedestrian is observed by at least one PTZ camera while in the designated area. A centerpiece of our work is the development and testing of experimental surveillance systems within a visually and behaviorally realistic virtual environment simulator. The simulator is valuable as our research would be more or less infeasible in the real world given the impediments to deploying and experimenting with appropriately complex camera sensor networks in large public spaces. In particular, we demonstrate our surveillance system in a virtual train station environment populated by autonomous, lifelike virtual pedestrians, wherein easily reconfigurable virtual cameras generate synthetic video feeds. The video streams emulate those generated by real surveillance cameras monitoring richly populated public spaces.
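The weighted round-robin policy at the core of the camera scheduler can be illustrated generically; here the tasks stand for pedestrians awaiting a PTZ camera, and the choice of weights (e.g., time remaining in the area) is an assumption for illustration, not the paper's exact cost function.

```python
def weighted_round_robin(tasks, weights, rounds):
    """Classic weighted round-robin: in each cycle, task i is served
    weights[i] times before moving on. At least one weight must be
    positive, otherwise no cycle makes progress."""
    schedule = []
    while len(schedule) < rounds:
        for task, w in zip(tasks, weights):
            schedule.extend([task] * w)
    return schedule[:rounds]
```

With weights 2:1, pedestrian 'A' receives two PTZ observation slots for every one given to 'B'.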

81 citations


Journal ArticleDOI
TL;DR: A framework for information assimilation that addresses the issues – “when”, “what” and “how” to assimilate the information obtained from different media sources in order to detect events in multimedia surveillance systems is proposed.
Abstract: Most multimedia surveillance and monitoring systems nowadays utilize multiple types of sensors to detect events of interest as and when they occur in the environment. However, due to the asynchrony among and diversity of sensors, information assimilation --- how to combine the information obtained from asynchronous and multifarious sources --- is an important and challenging research problem. In this paper, we propose a framework for information assimilation that addresses the issues of "when", "what" and "how" to assimilate the information obtained from different media sources in order to detect events in multimedia surveillance systems. The proposed framework adopts a hierarchical probabilistic assimilation approach to detect atomic and compound events. To detect an event, our framework uses not only the media streams available at the current instant but also two important properties of those streams --- first, the accumulated past history of whether they have been providing concurring or contradictory evidence, and second, the system designer's confidence in them. The experimental results show the utility of the proposed framework.

77 citations


Journal ArticleDOI
TL;DR: This paper presents a novel metric learning approach, named “regularized metric learning,” for collaborative image retrieval, which learns a distance metric by exploring the correlation between low-level image features and the log data of users' relevance judgments.
Abstract: In content-based image retrieval (CBIR), relevant images are identified based on their similarities to query images. Most CBIR algorithms are hindered by the semantic gap between the low-level image features used for computing image similarity and the high-level semantic concepts conveyed in images. One way to reduce the semantic gap is to utilize the log data of users' feedback that has been collected by CBIR systems in history, which is also called "collaborative image retrieval." In this paper, we present a novel metric learning approach, named "regularized metric learning," for collaborative image retrieval, which learns a distance metric by exploring the correlation between low-level image features and the log data of users' relevance judgments. Compared to previous research, a regularization mechanism is used in our algorithm to effectively prevent overfitting. Meanwhile, we formulate the proposed learning algorithm as a semidefinite programming problem, which can be solved very efficiently by existing software packages and is scalable to the size of the log data. An extensive set of experiments has been conducted to show that the new algorithm can substantially improve the retrieval accuracy of a baseline CBIR system using the Euclidean distance metric, even with a modest amount of log data. The experiments also indicate that the new algorithm is more effective and more efficient than two alternative algorithms which exploit log data for image retrieval.
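The paper solves a semidefinite program over full Mahalanobis matrices; as a loose illustration only, the much-simplified diagonal variant below weights each feature dimension by the inverse variance of differences between image pairs the log data marks as relevant to the same query. All names here are hypothetical, and this is not the authors' algorithm.

```python
import numpy as np

def diagonal_metric(similar_pairs):
    """Toy stand-in for metric learning from relevance logs: dimensions on
    which log-judged-similar images agree (low difference variance) get
    large weights, so they dominate the learned distance."""
    diffs = np.array([np.subtract(a, b) for a, b in similar_pairs], dtype=float)
    var = diffs.var(axis=0) + 1e-8  # regularize to avoid division by zero
    return 1.0 / var

def metric_distance(x, y, w):
    """Weighted Euclidean distance under the learned diagonal metric."""
    d = np.subtract(x, y)
    return float(np.sqrt(np.sum(w * d * d)))
```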

57 citations


Journal ArticleDOI
TL;DR: Detailed simulation-based evaluations illustrate that P2P Adaptive Layered Streaming can effectively cope with several sources of dynamics in the system, including bandwidth variations, peer participation, and partially available content at different peers.
Abstract: This paper presents the design and evaluation of an adaptive streaming mechanism from multiple senders to a single receiver in Peer-to-Peer (P2P) networks, called P2P Adaptive Layered Streaming, or PALS. PALS is a receiver-driven mechanism. It enables a receiver peer to orchestrate quality-adaptive streaming of a single, layer-encoded video stream from multiple congestion-controlled senders, and is able to support a spectrum of noninteractive streaming applications. The primary challenge in the design of a streaming mechanism from multiple senders is that the available bandwidth from individual peers is not known a priori, and could significantly change during delivery. In PALS, the receiver periodically performs quality adaptation based on the aggregate bandwidth from all senders to determine: (i) the overall quality (i.e., the number of layers) that can be collectively delivered by all senders, and more importantly (ii) the specific subset of packets that should be delivered by individual senders in order to gracefully cope with any sudden change in their bandwidth. Our detailed simulation-based evaluations illustrate that PALS can effectively cope with several sources of dynamics in the system, including bandwidth variations, peer participation, and partially available content at different peers. We also demonstrate the importance of coordination among senders and examine key design tradeoffs for the PALS mechanism.
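The receiver's two decisions above can be sketched as simple arithmetic; the constant per-layer rate and the proportional packet split below are illustrative assumptions, not the exact PALS packet-assignment rule.

```python
def deliverable_layers(sender_bw, layer_rate):
    """Receiver-side quality adaptation in the spirit of PALS: the number
    of constant-rate layers that the aggregate bandwidth of all
    congestion-controlled senders can collectively sustain."""
    return int(sum(sender_bw) // layer_rate)

def split_demand(sender_bw, total_packets):
    """Assign packets to senders in proportion to their measured bandwidth,
    giving any integer remainder to the fastest sender."""
    total_bw = sum(sender_bw)
    share = [int(total_packets * bw / total_bw) for bw in sender_bw]
    share[sender_bw.index(max(sender_bw))] += total_packets - sum(share)
    return share
```

If one sender's bandwidth drops, the next adaptation round simply recomputes both quantities, which is how sudden changes are absorbed gracefully.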

42 citations


Journal ArticleDOI
TL;DR: This paper describes a form of communication that could be used for lifelong learning as a contribution to cultural computing: a multimedia communication concept that can cope with non-verbal, emotional and Kansei information.
Abstract: In this paper we describe a form of communication that could be used for lifelong learning as a contribution to cultural computing. We call it Kansei Mediation. It is a multimedia communication concept that can cope with non-verbal, emotional and Kansei information. We introduce the distinction between the concepts of Kansei Communication and Kansei Media. We then develop a theory of communication (i.e. Kansei Mediation) as a combination of both. Based on recent results from brain research, the proposed concept of Kansei Mediation is developed and discussed. The biased preference towards consciousness in established communication theories is critically reviewed, and the relationship to pre- and unconscious brain processes is explored. There are two tenets of the Kansei Mediation communication theory: (1) communication based on connected unconsciousness, and (2) Satori as the ultimate form of experience.

42 citations


Journal ArticleDOI
TL;DR: This paper investigates audio features that have not been previously used in music-speech classification, such as the mean and variance of the discrete wavelet transform, the variance of Mel-frequency cepstral coefficients, the root mean square of a lowpass signal, and the difference of the maximum and minimum zero-crossings.
Abstract: The need to classify audio into categories such as speech or music is an important aspect of many multimedia document retrieval systems. In this paper, we investigate audio features that have not been previously used in music-speech classification, such as the mean and variance of the discrete wavelet transform, the variance of Mel-frequency cepstral coefficients, the root mean square of a lowpass signal, and the difference of the maximum and minimum zero-crossings. We then apply fuzzy C-means clustering to the problem of selecting a viable set of features that enables better classification accuracy. Three different classification frameworks have been studied: Multi-Layer Perceptron (MLP) neural networks, radial basis function (RBF) neural networks, and hidden Markov models (HMMs), and the results of each framework have been reported and compared. Our extensive experimentation has identified a subset of features that contributes most to accurate classification, and has shown that MLP networks are the most suitable classification framework for the problem at hand.
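Two of the features named above, zero-crossing counts and RMS energy, take only a few lines of numpy; a minimal sketch, with the framing of the signal and the lowpass filter omitted.

```python
import numpy as np

def zero_crossings(frame):
    """Number of sign changes within one audio frame; speech typically
    shows a wider spread of per-frame counts than music."""
    s = np.sign(frame)
    return int(np.sum(s[:-1] != s[1:]))

def rms(frame):
    """Root mean square energy of a frame."""
    return float(np.sqrt(np.mean(np.square(frame))))

def zcr_range(frames):
    """The feature described in the abstract: the difference between the
    maximum and minimum per-frame zero-crossing counts."""
    counts = [zero_crossings(f) for f in frames]
    return max(counts) - min(counts)
```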

Journal ArticleDOI
TL;DR: Investigation of the relationship between user cognitive styles and perceptual multimedia quality shows that whilst color choice is impacted by a participant's cognitive style, such Quality of Service parameters do not significantly affect perceived multimedia quality, and that users do not necessarily choose optimum presentation settings to enhance their perceived enjoyment and assimilation of multimedia informational content.
Abstract: Cognitive styles influence the way how humans process information, with previous research demonstrating that they have significant effects on student learning in multimedia environments. On the other hand, the perceptual quality of the human multimedia experience is notoriously difficult to measure. In this paper, we report the results of an empirical study, which investigated the relationship between user cognitive styles and perceptual multimedia quality, in which users had the possibility to specify their desired Quality of Service settings -- in terms of frame rates and color depth. Results show that whilst color choice is impacted by a participant's cognitive style, such Quality of Service parameters do not significantly affect perceived multimedia quality, and that users do not necessarily choose optimum presentation settings to enhance their perceived enjoyment and assimilation of multimedia informational content.

Journal ArticleDOI
TL;DR: This paper analyzes two 6-month long traces of RTSP video requests recorded at different streaming video servers of an entertainment video-on-demand provider, and shows that the traces provide evidence that the internal popularity of the majority of the most popular videos obeys a k-transformed Zipf-like distribution.
Abstract: Most proxy caches for streaming videos do not cache the entire video but only a portion of it. This is partly due to the large size of video objects. Another reason is that the popularity of different parts of a video can be different, e.g., the prefix is generally more popular. Therefore, the development of efficient cache mechanisms requires an understanding of the internal popularity characteristics of streaming videos. This paper has two major contributions. Firstly, we analyze two 6-month long traces of RTSP video requests recorded at different streaming video servers of an entertainment video-on-demand provider, and show that the traces provide evidence that the internal popularity of the majority of the most popular videos obeys a k-transformed Zipf-like distribution. Secondly, we propose a caching algorithm which exploits this empirical internal popularity distribution. We find that this algorithm has similar performance compared with fine-grained caching but requires significantly less state information.
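The shape of such a distribution can be sketched as a shifted Zipf curve; the exact k-transformation is defined in the paper, so the functional form below (popularity of segment i proportional to 1/(i+k)^alpha) is an illustrative assumption.

```python
def k_zipf_popularity(n, k=10.0, alpha=1.0):
    """Illustrative shifted Zipf-like curve over n video segments,
    normalised to sum to 1. The shift k flattens the head of a plain
    Zipf curve, reflecting that interior segments of a popular video
    are requested more evenly than pure rank-based Zipf predicts."""
    raw = [1.0 / (i + k) ** alpha for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]
```

A caching algorithm exploiting this shape would allocate proxy space per segment in proportion to these weights instead of tracking fine-grained per-segment state.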

Journal ArticleDOI
TL;DR: This paper details features of the MediaObject model for documents describing I-TV programs, which aims to minimize those limitations, and discusses the model by means of an application that is part of the I-TV prototype and an annotation tool that shows the model can be applied to other domains.
Abstract: Interactive video technology has demanded the development of standards, techniques and tools to create, deliver and present interactive content and associated metadata. The literature reports on models representing interactive-TV (I-TV) programs --- one of the main applications of interactive video technology; the limitations of such models include: strict hierarchical relationships among media objects, low levels of granularity for metadata, the inability to separate program and media object descriptions, objects not being described beyond the frame level, and a lack of integration with context-aware computing. In this paper, we detail features of our MediaObject model for documents describing I-TV programs, which aims to minimize those limitations. We discuss the model by means of an application that is part of our I-TV prototype and by means of an annotation tool that shows that the model can be applied to other domains.

Journal ArticleDOI
TL;DR: A novel paired-feature AdaBoost learning system for relevance feedback-based image retrieval, in which Bayesian classification replaces the traditional binary weak classifiers to enhance their classification power, thus producing a stronger classifier.
Abstract: Boost learning algorithm, such as AdaBoost, has been widely used in a variety of applications in multimedia and computer vision. Relevance feedback-based image retrieval has been formulated as a classification problem with a small number of training samples. Several machine learning techniques have been applied to this problem recently. In this paper, we propose a novel paired feature AdaBoost learning system for relevance feedback-based image retrieval. To facilitate density estimation in our feature learning method, we propose an ID3-like balance tree quantization method to preserve most discriminative information. By using paired feature combination, we map all training samples obtained in the relevance feedback process onto paired feature spaces and employ the AdaBoost algorithm to select a few feature pairs with best discrimination capabilities in the corresponding paired feature spaces. In the AdaBoost algorithm, we employ Bayesian classification to replace the traditional binary weak classifiers to enhance their classification power, thus producing a stronger classifier. Experimental results on content-based image retrieval (CBIR) show superior performance of the proposed system compared to some previous methods.

Journal ArticleDOI
TL;DR: A linear programming approach is derived that determines jointly for each camera the pan and tilt angle that maximizes the coverage of the space at a given sampling frequency, demonstrating the gain in visual coverage.
Abstract: Many novel multimedia, home entertainment, visual surveillance and health applications use multiple audio-visual sensors. We present a novel approach for position and pose calibration of visual sensors, i.e., cameras, in a distributed network of general purpose computing devices (GPCs). It complements our work on position calibration of audio sensors and actuators in a distributed computing platform (Raykar et al. in proceedings of ACM Multimedia `03, pp. 572---581, 2003). The approach is suitable for a wide range of possible --- even mobile --- setups since (a) synchronization is not required, (b) it works automatically, (c) only weak restrictions are imposed on the positions of the cameras, and (d) no upper limit on the number of cameras under calibration is imposed. Corresponding points across different camera images are established automatically. Cameras do not have to share one common view. Only a reasonable overlap between camera subgroups is necessary. The method has been successfully tested in numerous multi-camera environments with a varying number of cameras and has proven to be extremely accurate. Once all distributed visual sensors are calibrated, we focus on post-optimizing their poses to increase coverage of the observed space. A linear programming approach is derived that determines jointly for each camera the pan and tilt angle that maximizes the coverage of the space at a given sampling frequency. Experimental results clearly demonstrate the gain in visual coverage.

Journal ArticleDOI
TL;DR: The MPEG-21 Bitstream Syntax Description Language (BSDL) specification is used to generate high-level XML descriptions of the structure of a bitstream, and the adaptation of a scalable video stream can be realized in the XML domain, rather than on the bitstream itself.
Abstract: In order to obtain a useful multichannel publication environment, a content producer has to respect the different terminal and network characteristics of the multimedia devices of its target audience. Embedded scalable video bitstreams, together with a complementary content adaptation framework, make it possible to respond to heterogeneous usage environments. In this paper, temporally scalable H.264/MPEG-4 AVC encoded bitstreams and bitstreams encoded with the fully embedded MC-EZBC wavelet-based codec are used. The MPEG-21 Bitstream Syntax Description Language (BSDL) specification is used to generate high-level XML descriptions of the structure of a bitstream. As such, the adaptation of a scalable video stream can be realized in the XML domain, rather than on the bitstream itself. Different transformation technologies are also compared to each other. Finally, a practical setup of a video streaming use case relying on the MPEG-21 BSDL framework is discussed.
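The idea of adapting in the XML domain can be illustrated with Python's standard-library XML tools; note that the <frame> elements and the 'layer' attribute below are hypothetical stand-ins for this sketch, not the actual MPEG-21 BSDL schema.

```python
import xml.etree.ElementTree as ET

def drop_temporal_layers(bsd_xml, max_layer):
    """Adapt a (simplified, hypothetical) XML bitstream description by
    removing every <frame> element whose temporal 'layer' attribute
    exceeds max_layer. A generator stage would then rebuild the reduced
    bitstream from this pruned description."""
    root = ET.fromstring(bsd_xml)
    for parent in root.iter():
        for child in list(parent):
            if child.tag == "frame" and int(child.get("layer", 0)) > max_layer:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```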

Journal ArticleDOI
TL;DR: The design, implementation and evaluation of EVE Community Prototype, an educational virtual community aiming to meet the requirements of a Virtual Collaboration Space and to support e-learning services, and an integrated platform for Networked Virtual Environments, called EVE Platform, which supports the afore-mentioned educational community.
Abstract: This paper presents the design, implementation and evaluation of the EVE Community Prototype, an educational virtual community aiming to meet the requirements of a Virtual Collaboration Space and to support e-learning services. Furthermore, this paper describes the design and implementation of an integrated platform for Networked Virtual Environments, called the EVE Platform, which supports the aforementioned educational community. This platform supports stable event sharing and the creation of multi-user three-dimensional (3D) places, H.323-based voice-over-IP services integrated in 3D spaces, as well as multiple concurrent virtual worlds.

Journal ArticleDOI
TL;DR: As a multimedia search interface for digital libraries, strand maps appear to be promising tools to promote conceptual discovery and learning through content-based processes that promote learner engagement with relevant science knowledge.
Abstract: This article explores the use of a multimedia search interface for digital libraries based on strand maps developed by the American Association for the Advancement of Science. As semantic-spatial displays, strand maps provide a visual organization of relevant conceptual information that may promote the use of science content during digital library use. A study was conducted to compare users' cognitive processes during information seeking tasks when using a multimedia strand maps interface, versus the textual search interface currently implemented in the Digital Library for Earth System Education. Quantitative and qualitative data from think-aloud protocols revealed that students were more likely to engage with science content (e.g., analyzing the relevance of science concepts with regard to task needs) during search when using the strand maps interface compared to those using textual searching. In contrast, students using a textual search interface engaged more frequently with surface-level information (e.g., the type of a resource regardless of its science content) during search and retrieval. As a multimedia search interface for digital libraries, strand maps appear to be promising tools to promote conceptual discovery and learning through content-based processes that promote learner engagement with relevant science knowledge.

Journal ArticleDOI
TL;DR: This paper addresses the semantic integration issue of multi-media resources and learning processes with theoretical learning supports with a context-mediated approach that aims to enable semantic-based inter-operations across knowledge domains, even across the WWW and the Semantic Web.
Abstract: Internet-based e-Learning has experienced a boom and bust situation in the past 10 years [32]. To bring new forces to knowledge-oriented e-Learning, this paper addresses the semantic integration of multimedia resources, learning processes and theoretical learning supports in an integrated framework. This paper proposes a context-mediated approach that aims to enable semantic-based inter-operations across knowledge domains, even across the WWW and the Semantic Web [8]. The proposed semantic e-Learning framework enables intelligent operations on heterogeneous multimedia contents based on a generic semantic context intermediation model. This framework supports intelligent e-Learning with a knowledge network for knowledge object visualization, an enhanced Kolb's learning cycle [31] to guide learning practices, and a learning health care framework for personalized learning.

Journal ArticleDOI
TL;DR: The proposed scheme, COSMOS-7, produces rich and multi-faceted semantic content models and supports a content-based filtering approach that only analyses content relating directly to the preferred content requirements of the user.
Abstract: Part 5 of the MPEG-7 standard specifies Multimedia Description Schemes (MDS); that is, the format multimedia content models should conform to in order to ensure interoperability across multiple platforms and applications. However, the standard does not specify how the content or the associated model may be filtered. This paper proposes an MPEG-7 scheme which can be deployed for digital video content modelling and filtering. The proposed scheme, COSMOS-7, produces rich and multi-faceted semantic content models and supports a content-based filtering approach that only analyses content relating directly to the preferred content requirements of the user. We present details of the scheme, front-end systems used for content modelling and filtering and experiences with a number of users.

Journal ArticleDOI
TL;DR: This paper addresses the service composition problem for multimedia services that can be modeled as directed acyclic graphs (DAGs), formally defines the problem and proves its NP-hardness, and designs a heuristic algorithm to solve it.
Abstract: Service composition is a promising approach to multimedia service provisioning, due to its ability to dynamically produce new multimedia content and to customize the content for individual client devices. Previous research work has addressed various aspects of service composition such as composability, QoS-awareness, and load balancing. However, most of the work has focused on applications where data flow from a single source is processed by intermediate services and then delivered to a single destination. In this paper, we address the service composition problem for multimedia services that can be modeled as directed acyclic graphs (DAGs). We formally define the problem and prove its NP-hardness. We also design a heuristic algorithm to solve the problem. Our simulation results show that the algorithm is effective at finding low-cost composition solutions, and can trade off computation overhead for better results. When compared with a hop-by-hop approach for service composition, our algorithm can find composition solutions that are 10% smaller in cost, even when the hop-by-hop approach uses exhaustive searches.
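The general DAG-composition problem is NP-hard, as the paper proves; but for the single source-to-destination chain that the hop-by-hop baseline handles, minimum-cost composition reduces to a DAG shortest path, sketched below with hypothetical node names and costs.

```python
def min_cost_path(graph, source, target):
    """Minimum-cost path in a DAG by one relaxation pass over nodes in
    topological order. graph: {node: [(successor, cost), ...]}, with the
    dict keys already listed in topological order."""
    INF = float("inf")
    cost = {n: INF for n in graph}
    cost[source] = 0.0
    for n in graph:  # keys assumed topologically ordered
        for succ, c in graph[n]:
            cost.setdefault(succ, INF)
            if cost[n] + c < cost[succ]:
                cost[succ] = cost[n] + c
    return cost.get(target, INF)
```

A heuristic for the full DAG case must additionally reconcile branches that share services, which is where the NP-hardness enters.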

Journal ArticleDOI
TL;DR: A Context Expansion approach is explored to take advantage of such correlations by expanding the key regions of the queries using highly correlated environmental regions according to an image thesaurus to improve the performance of image retrieval.
Abstract: Bridging the cognitive gap in image retrieval has been an active research direction in recent years, of which a key challenge is to get enough training data to learn the mapping functions from low-level feature spaces to high-level semantics. In this paper, image regions are classified into two types: key regions representing the main semantic contents and environmental regions representing the contexts. We attempt to leverage the correlations between types of regions to improve the performance of image retrieval. A Context Expansion approach is explored to take advantage of such correlations by expanding the key regions of the queries using highly correlated environmental regions according to an image thesaurus. The thesaurus serves as both a mapping function between image low-level features and concepts and a store of the statistical correlations between different concepts. It is constructed through a data-driven approach which uses Web data (images and their surrounding textual annotations) as the training data source to learn the region concepts and to explore the statistical correlations. Experimental results on a database of 10,000 general-purpose images show the effectiveness of our proposed approach in improving both search precision (i.e., filtering irrelevant images) and recall (i.e., retrieving relevant images whose context may be varied). Several major factors that affect the performance of our approach are also studied.

Journal ArticleDOI
TL;DR: A real-time multi-camera system that collects images and videos of moving objects in such scenes, subject to task constraints, and constructs “task visibility intervals” that contain information about what can be sensed in future time intervals.
Abstract: Vision systems are increasingly being deployed to perform complex surveillance tasks. While improved algorithms are being developed to perform these tasks, it is also important that data suitable for these algorithms be acquired --- a non-trivial task in a dynamic and crowded scene viewed by multiple PTZ cameras. In this paper, we describe a real-time multi-camera system that collects images and videos of moving objects in such scenes, subject to task constraints. The system constructs "task visibility intervals" that contain information about what can be sensed in future time intervals. Constructing these intervals requires prediction of future object motion and consideration of several factors such as object occlusion and camera control parameters. Such intervals can also be combined to form multi-task intervals, during which a single camera can collect videos suitable for multiple tasks simultaneously. Experimental results are provided to illustrate the system capabilities in constructing such task visibility intervals, followed by scheduling them using a greedy algorithm.
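The scheduling step mentioned at the end of this abstract is a variant of classic greedy interval scheduling. A minimal Python sketch of that classic algorithm follows; it is illustrative only, as the paper's task visibility intervals also carry camera-control and task constraints not modelled here:

```python
def greedy_schedule(intervals):
    """Select a maximal set of non-overlapping (start, end) intervals
    by always taking the interval that finishes earliest."""
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:  # fits after the last chosen interval
            chosen.append((start, end))
            last_end = end
    return chosen
```

Sorting by finish time and greedily accepting each compatible interval is optimal for the single-camera, single-task case; the multi-camera, multi-task setting in the paper requires the richer interval construction it describes.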

Journal ArticleDOI
TL;DR: The design of a novel peer-to-peer streaming architecture called ACTIVE, which provides virtually all users with the low-latency service that before was only possible with a centralized approach, and a complete commercial scale voice chat system called AudioPeer that is powered by the ACTIVE protocol.
Abstract: Peer-to-peer (P2P) streaming is emerging as a viable communications paradigm. Recent research has focused on building efficient and optimal overlay multicast trees at the application level. A few commercial products are being implemented to provide voice services through P2P streaming platforms. However, even though many P2P protocols from the research community claim to be able to support large scale low-latency streaming, none of them have been adopted by a commercial voice system so far. This gap between advanced research prototypes and commercial implementations shows that there is a lack of a practical and scalable P2P system design that can provide low-latency service in a real implementation. After analyzing existing P2P system designs, we found two important issues that could lead to improvements. First, many existing designs that aim to build a low-latency streaming platform often make the unreasonable assumption that the processing time involved at each node is zero. However, in a real implementation, these delays can add up to a significant amount of time after just a few overlay hops and make interactive applications difficult. Second, scant attention has been paid to the fact that even in a conversation involving a large number of users, only a few of the users are actually actively speaking at a given time. We term these users, who have more critical demands for low latency, active users. In this paper, we detail the design of a novel peer-to-peer streaming architecture called ACTIVE. We then present a complete commercial scale voice chat system called AudioPeer that is powered by the ACTIVE protocol. The ACTIVE system significantly reduces the end-to-end delay experienced among active users while at the same time being capable of providing streaming services to very large multicast groups. ACTIVE uses realistic processing assumptions at each node and dynamically optimizes the streaming structure while the group of active users changes over time. Consequently, it provides virtually all users with the low-latency service that before was only possible with a centralized approach. We present results from both simulations and our real implementation, which clearly show that our ACTIVE system is a feasible approach to scalable, low-latency P2P streaming.

Journal ArticleDOI
TL;DR: A model is proposed that allows predicate-based proximity measures to be applied to continuous feature data, uncovering measures that perform better than the usual choices on content-based features.
Abstract: The selection of appropriate proximity measures is one of the crucial success factors of content-based visual information retrieval. In this area of research, proximity measures are used to estimate the similarity of media objects by the distance of feature vectors. The research focus of this work is the identification of proximity measures that perform better than the usual choices (e.g., Minkowski metrics). We evaluate a catalogue of 37 measures that are selected from various areas (psychology, sociology, economics, etc.). The evaluation is based on content-based MPEG-7 descriptions of carefully selected media collections. Unfortunately, some proximity measures are only defined on predicates (e.g., most psychological measures). One major contribution of this paper is a model that allows such measures to be applied to continuous feature data. The evaluation results uncover proximity measures that perform better than others on content-based features. Some predicate-based measures clearly outperform the frequently used distance norms. Finally, the discussion of the evaluation leads to a catalogue of mathematical terms of successful retrieval and browsing measures.
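To make the contrast between the usual distance norms and predicate-based measures concrete, the sketch below compares a Minkowski metric on continuous vectors with a Jaccard-style predicate measure applied to continuous features through a simple binarization step. The fixed threshold is an illustrative assumption; the paper's model for applying predicate measures to continuous data is more elaborate:

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def jaccard_predicate(x, y, threshold=0.5):
    """Apply a predicate-based measure (Jaccard similarity) to continuous
    features by binarizing each component against a threshold.
    The threshold and binarization rule are illustrative assumptions."""
    a = [v >= threshold for v in x]
    b = [v >= threshold for v in y]
    both = sum(1 for i, j in zip(a, b) if i and j)
    either = sum(1 for i, j in zip(a, b) if i or j)
    return both / either if either else 1.0
```

With p=2 the Minkowski metric is the familiar Euclidean distance; the Jaccard variant instead counts which feature predicates two objects share, which is the kind of measure the paper evaluates against the distance norms.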

Journal ArticleDOI
TL;DR: This paper tackles the problem of extracting semantic concepts from a large database of images effectively by mining the decisive feature patterns (DFPs), and can be generally applied to any domain of semantic concepts and low-level features.
Abstract: One major challenge in the content-based image retrieval (CBIR) and computer vision research is to bridge the so-called "semantic gap" between low-level visual features and high-level semantic concepts, that is, extracting semantic concepts from a large database of images effectively. In this paper, we tackle the problem by mining the decisive feature patterns (DFPs). Intuitively, a decisive feature pattern is a combination of low-level feature values that are unique and significant for describing a semantic concept. Interesting algorithms are developed to mine the decisive feature patterns and construct a rule base to automatically recognize semantic concepts in images. A systematic performance study on large image databases containing many semantic concepts shows that our method is more effective than some previously proposed methods. Importantly, our method can be generally applied to any domain of semantic concepts and low-level features.
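The rule base described above can be pictured as a mapping from concepts to decisive feature patterns, where a concept is recognized when every feature value in its pattern occurs in the image. The encoding below is a hypothetical sketch for illustration, not the paper's actual rule-base format:

```python
def match_concepts(feature_values, rules):
    """Return the concepts whose decisive feature pattern (a set of
    (feature, value) pairs) is fully contained in the image's
    discretized feature values."""
    present = set(feature_values)
    return sorted(c for c, pattern in rules.items() if pattern <= present)
```

For example, a rule base with `{"sky": {("color", "blue"), ("texture", "smooth")}}` would recognize "sky" in any image whose discretized features include both pairs.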

Journal ArticleDOI
TL;DR: A novel approach to clustering for image segmentation and a new object-based image retrieval method are proposed, which leads to a higher number of relevant images retrieved and to improve upon the conventional classification and retrieval methods.
Abstract: A novel approach to clustering for image segmentation and a new object-based image retrieval method are proposed. The clustering is achieved using the Fisher discriminant as an objective function. The objective function is improved by adding a spatial constraint that encourages neighboring pixels to take on the same class label. A six-dimensional feature vector is used for clustering by way of the combination of color and busyness features for each pixel. After clustering, the dominant segments in each class are chosen based on area and used to extract features for image retrieval. The color content is represented using a histogram, and Haar wavelets are used to represent the texture feature of each segment. The image retrieval is segment-based; the user can select a query segment to perform the retrieval and assign weights to the image features. The distance between two images is calculated using the distance between features of the constituent segments. Each image is ranked based on this distance with respect to the query image segment. The algorithm is applied to a pilot database of natural images and is shown to improve upon the conventional classification and retrieval methods. The proposed segmentation leads to a higher number of relevant images retrieved, 83.5% on average compared to 72.8 and 68.7% for the k-means clustering and the global retrieval methods, respectively.

Journal ArticleDOI
TL;DR: It was found that detecting events based on their volume alone returned satisfactory results; the results of applying this volume-based approach to a range of physical environments are shown.
Abstract: In this article we set out to examine whether analysis of the audio from a multimedia surveillance application can be used to augment an event detection system based on visual processing, and possibly contribute to any improvements. In processing audio information we are not concerned with identifying or classifying what type of event is detected, as our aim is to keep audio processing to a minimum in order to allow deployment on a wireless sensor network. We describe an experiment where we gathered information from a series of traditional wired microphones installed in a typical surveillance setting. We also obtained information on activities carried out from cameras located in the same area. We present the results of analysis of audio information based on the mean of the volume, the zero-crossing rate, and the frequency, and how these correlate with events detected visually. We found that detecting events based on their volume only returned satisfactory results. We show the results determined by applying this volume-based approach to a range of physical environments.
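The two audio features the article relies on, mean volume and zero-crossing rate, are simple to compute per frame, which is what makes the approach attractive for lightweight sensor nodes. A minimal sketch follows; the frame size and detection threshold are illustrative choices, not the article's settings:

```python
def frame_features(samples, frame_size=256):
    """Compute mean absolute volume and zero-crossing rate for each
    non-overlapping frame of a sample sequence."""
    features = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        volume = sum(abs(s) for s in frame) / frame_size
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ) / (frame_size - 1)
        features.append((volume, zcr))
    return features

def detect_events(features, volume_threshold):
    """Flag frame indices whose mean volume exceeds a threshold --
    the simple volume-only detector the article found satisfactory."""
    return [i for i, (vol, _) in enumerate(features) if vol > volume_threshold]
```

Because only a running sum of absolute amplitudes and a sign-change count are needed per frame, both features can be computed in a single pass with constant memory per frame.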

Journal ArticleDOI
TL;DR: This paper explores the task of automatically recovering the relative geometry between a pan-tilt-zoom camera and a network of one-bit motion detectors, and formulates and pursues the novel goal of functional calibration, a blending of geometry estimation and simple behavioral model discovery.
Abstract: Wide-area context awareness is a crucial enabling technology for next generation smart buildings and surveillance systems. It is not practical to gather this context awareness by covering the entire building with cameras. However, significant gaps in coverage caused by installing cameras in a sparse way can make it very difficult to infer the missing information. As a solution we advocate a class of hybrid perceptual systems that build a comprehensive model of activity in a large space, such as a building, by merging contextual information from a dense network of ultra-lightweight sensor nodes and video from a sparse network of cameras. In this paper we explore the task of automatically recovering the relative geometry between a pan-tilt-zoom camera and a network of one-bit motion detectors. We present results both for the recovery of geometry alone and also for the recovery of geometry jointly with simple activity models. Because we do not believe a metric calibration is necessary, or even entirely useful, for this task, we formulate and pursue the novel goal we term functional calibration. Functional calibration is a blending of geometry estimation and simple behavioral model discovery. Accordingly, results are evaluated by measuring the ability of the system to automatically foveate targets in a large, non-convex space, rather than by measuring, for example, pixel reconstruction error.