Showing papers in "Multimedia Tools and Applications in 2010"


Journal ArticleDOI
TL;DR: STIMO is a summarization technique designed to produce on-the-fly video storyboards; it produces still and moving storyboards, allows advanced user customization, and is based on a fast clustering algorithm that selects the most representative video content using the HSV frame color distribution.
Abstract: In the current Web scenario a video browsing tool that produces on-the-fly storyboards is increasingly a necessity. Video summary techniques can be helpful but, due to their long processing time, they are usually unsuitable for on-the-fly usage. Therefore, it is common to produce storyboards in advance, penalizing user customization. The lack of customization is increasingly critical, as users have different demands and might access the Web with several different networking and device technologies. In this paper we propose STIMO, a summarization technique designed to produce on-the-fly video storyboards. STIMO produces still and moving storyboards and allows advanced user customization (e.g., users can select the storyboard length and the maximum time they are willing to wait to get the storyboard). STIMO is based on a fast clustering algorithm that selects the most representative video content using the HSV frame color distribution. Experimental results show that STIMO produces storyboards with good quality and in a time that makes on-the-fly usage possible.

255 citations
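
As a rough illustration of the core idea (not STIMO itself), the sketch below clusters per-frame HSV colour histograms and keeps one representative frame per cluster; the sampling rate, histogram binning and use of k-means are illustrative assumptions, not the paper's algorithm.

```python
# Minimal sketch: cluster frames by HSV colour distribution, keep one
# representative per cluster. Clustering method and parameters are
# illustrative assumptions, not STIMO's actual fast clustering algorithm.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def hsv_histogram(frame, bins=(8, 4, 4)):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins, [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def storyboard(video_path, n_keyframes=10, sample_every=30):
    cap = cv2.VideoCapture(video_path)
    frames, feats, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:        # sub-sample frames for speed
            frames.append(frame)
            feats.append(hsv_histogram(frame))
        idx += 1
    cap.release()
    feats = np.array(feats)
    km = KMeans(n_clusters=n_keyframes, n_init=10).fit(feats)
    keys = []
    for c in range(n_keyframes):           # frame closest to each centre
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        keys.append(frames[members[np.argmin(dists)]])
    return keys
```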


Journal ArticleDOI
TL;DR: An asynchronous feature level fusion approach is proposed that creates a unified hybrid feature space out of the individual signal measurements of the multimedia content to recognize basic affective states from speech prosody and facial expressions.
Abstract: Multimedia content is composed of several streams that carry information in audio, video or textual channels. Classification and clustering of multimedia content require extraction and combination of information from these streams. The streams constituting multimedia content naturally differ in terms of scale, dynamics and temporal patterns. These differences make combining the information sources using classic combination techniques difficult. We propose an asynchronous feature level fusion approach that creates a unified hybrid feature space out of the individual signal measurements. The target space can be used for clustering or classification of the multimedia content. As a representative application, we used the proposed approach to recognize basic affective states from speech prosody and facial expressions. Experimental results over two audiovisual emotion databases with 42 and 12 subjects revealed that the performance of the proposed system is significantly higher than that of the unimodal face-based and speech-based systems, as well as synchronous feature level and decision level fusion approaches.

142 citations


Journal ArticleDOI
TL;DR: A novel emotion state transition model (ESTM) is proposed to model human emotional states and their transitions by music, and COMUS is a music-dedicated ontology in OWL, constructed by incorporating domain-specific classes for music recommendation into the Music Ontology, covering situation, mood, and musical features.
Abstract: Context-based music recommendation is one of the rapidly emerging applications of the ubiquitous era and requires multidisciplinary efforts including low level feature extraction and music classification, human emotion description and prediction, ontology-based representation and recommendation, and the establishment of connections among them. In this paper, we contributed in three distinctive ways to take into account the idea of context awareness in the music recommendation field. Firstly, we propose a novel emotion state transition model (ESTM) to model human emotional states and their transitions by music. ESTM acts like a bridge between user situation information along with his/her emotion and low-level music features. With ESTM, we can recommend the most appropriate music to the user for transitioning to the desired emotional state. Secondly, we present the context-based music recommendation (COMUS) ontology for modeling the user's musical preferences and context, and for supporting reasoning about the user's desired emotion and preferences. COMUS is a music-dedicated ontology in OWL, constructed by incorporating domain-specific classes for music recommendation into the Music Ontology, which include situation, mood, and musical features. Thirdly, for mapping low-level features to ESTM, we collected various high-dimensional music feature data and applied nonnegative matrix factorization (NMF) for their dimension reduction. We also used a support vector machine (SVM) as the emotional state transition classifier. We constructed a prototype music recommendation system based on these features and carried out various experiments to measure its performance. We report some of the experimental results.

128 citations
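
The paper's feature pipeline (NMF for dimension reduction, an SVM as the transition classifier) can be sketched with standard tools; the feature matrix, label set and hyperparameters below are placeholders, not the paper's configuration.

```python
# Hedged sketch of the pipeline reported in the paper: NMF reduces
# high-dimensional, non-negative music features; an SVM classifies the
# emotion state transition. Data and parameters are placeholders.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X = np.abs(np.random.rand(200, 512))   # placeholder non-negative music features
y = np.random.randint(0, 4, 200)       # placeholder emotion-transition labels

model = make_pipeline(NMF(n_components=20, max_iter=500), SVC(kernel="rbf"))
model.fit(X, y)
print(model.predict(X[:5]))            # predicted transitions for 5 songs
```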


Journal ArticleDOI
TL;DR: The ViSOR (Video Surveillance Online Repository) framework is presented, designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems.
Abstract: The availability of new techniques and tools for Video Surveillance and the capability of storing huge amounts of visual data acquired by hundreds of cameras every day call for a convergence between pattern recognition, computer vision and multimedia paradigms. A clear need for this convergence is shown by new research projects which attempt to exploit both ontology-based retrieval and video analysis techniques also in the field of surveillance. This paper presents the ViSOR (Video Surveillance Online Repository) framework, designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. Annotations are based on a reference ontology which has been defined integrating hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies. A new annotation classification schema is also provided, which is aimed at identifying the spatial, temporal and domain detail level used. The ViSOR web interface allows video browsing, querying by annotated concepts or by keywords, compressed video previewing, media downloading and uploading. Finally, ViSOR includes a performance evaluation desk which can be used to compare different annotations.

125 citations


Journal ArticleDOI
TL;DR: The distribution models, in particular the Zipf-like behavior of popular YouTube video files, suggest that proxy caching of popular videos can reduce network traffic and increase the scalability of the YouTube Web site; the workload characteristics provided in this work enabled the development of a workload generator to evaluate the effectiveness of this approach.
Abstract: This paper introduces a workload characterization study of the most popular short video sharing service of Web 2.0, YouTube. Based on a vast amount of data gathered in a five-month period, we analyzed characteristics of around 250,000 YouTube popular and regular videos. In particular, we collected lists of related videos for each video clip recursively and analyzed their statistical behavior. Understanding YouTube traffic and similar Web 2.0 video sharing sites is crucial to develop synthetic workload generators. Workload simulators are required for evaluating the methods addressing the problems of high bandwidth usage and scalability of Web 2.0 sites such as YouTube. The distribution models, in particular the Zipf-like behavior of popular YouTube video files, suggest that proxy caching of popular YouTube videos can reduce network traffic and increase the scalability of the YouTube Web site. The YouTube workload characteristics provided in this work enabled us to develop a workload generator to evaluate the effectiveness of this approach.

117 citations
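
A Zipf-like popularity model of this kind is straightforward to turn into a synthetic workload; the catalogue size, skew parameter and request count in the sketch below are illustrative assumptions rather than the paper's fitted values.

```python
# Sketch of a Zipf-like synthetic request workload, the behaviour the
# paper reports for popular YouTube videos. Skew and sizes are assumed.
import numpy as np

def zipf_requests(n_videos=250_000, skew=0.8, n_requests=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_videos + 1)
    probs = ranks ** -skew                 # popularity falls with rank
    probs /= probs.sum()
    return rng.choice(ranks, size=n_requests, p=probs)

reqs = zipf_requests()
top1pct = np.mean(reqs <= 2_500)           # top 1% of the catalogue
print(f"share of requests hitting the top 1% of videos: {top1pct:.2%}")
```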


Journal ArticleDOI
TL;DR: The authors report their experience in building an experimental similarity search system on a test collection of more than 50 million images, and study the performance of this technology and its evolution as the data volume grows by three orders of magnitude.
Abstract: As the number of digital images is growing fast and Content-based Image Retrieval (CBIR) is gaining in popularity, CBIR systems should leap towards Web-scale datasets. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images. The first big challenge we have been facing was obtaining a collection of images of this scale with the corresponding descriptive features. We have tackled the non-trivial process of image crawling and extraction of several MPEG-7 descriptors. The result of this effort is a test collection, the first of such scale, opened to the research community for experiments and comparisons. The second challenge was to develop indexing and searching mechanisms able to scale to the target size and to answer similarity queries in real time. We have achieved this goal by creating sophisticated centralized and distributed structures based purely on the metric space model of data. We have joined them together, which has resulted in an extremely flexible and scalable solution. In this paper, we study in detail the performance of this technology and its evolution as the data volume grows by three orders of magnitude. The results of the experiments are very encouraging and promising for future applications.

104 citations
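
The metric space model underlying such indexes can be illustrated with pivot-based pruning: precomputed distances to a pivot give a triangle-inequality lower bound that filters objects without computing their distance to the query. The data, metric and single pivot below are placeholders; real systems use many pivots and distributed structures.

```python
# Sketch of pivot-based filtering in a metric space: since
# |d(q,p) - d(o,p)| <= d(q,o), objects can be pruned from a range query
# using only distances to the pivot p, precomputed at index time.
import numpy as np

data = np.random.rand(10_000, 64)                   # placeholder descriptors
pivot = data[0]
d_to_pivot = np.linalg.norm(data - pivot, axis=1)   # precomputed once

def range_query(q, radius):
    dq_p = np.linalg.norm(q - pivot)
    # keep only objects the triangle-inequality bound cannot exclude
    candidates = np.where(np.abs(dq_p - d_to_pivot) <= radius)[0]
    dists = np.linalg.norm(data[candidates] - q, axis=1)
    return candidates[dists <= radius]

print(len(range_query(np.random.rand(64), radius=0.9)))
```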


Journal ArticleDOI
TL;DR: An image retrieval method based on color histograms of local feature regions (LFR) that is robust to some classic transformations (additive noise, affine transformation including translation, rotation and scale effects, partial visibility, etc.).
Abstract: Color histograms lack spatial information and are sensitive to intensity variation, color distortion and cropping. As a result, images with similar histograms may have totally different semantics. Region-based approaches were introduced to overcome the above limitations, but due to inaccurate segmentation, these systems may partition an object into several regions that may confuse users in selecting the proper regions. In this paper, we present a robust image retrieval method based on the color histogram of local feature regions (LFR). Firstly, stable image feature points are extracted using the multi-scale Harris-Laplace detector. Then, the significant local feature regions are ascertained adaptively according to the feature scale theory. Finally, the color histogram of the local feature regions is constructed, and the similarity between color images is computed using the color histograms of the LFRs. Experimental results show that the proposed color image retrieval is more accurate and efficient in retrieving the user-interested images. In particular, it is robust to some classic transformations (additive noise, affine transformation including translation, rotation and scale effects, partial visibility, etc.).

100 citations
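
A minimal sketch of the LFR construction follows, assuming the Harris-Laplace detector from opencv-contrib-python and an illustrative region scale factor: local regions are sized by each keypoint's characteristic scale and described with an HSV colour histogram.

```python
# Hedged sketch of the LFR idea: scale-adapted interest points define
# local regions, each described by a colour histogram. Requires
# opencv-contrib-python; the scale factor and bins are assumptions.
import cv2
import numpy as np

def lfr_histograms(image_bgr, bins=(8, 4, 4), scale_factor=3.0):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    detector = cv2.xfeatures2d.HarrisLaplaceFeatureDetector_create()
    hists = []
    for kp in detector.detect(gray):
        x, y = map(int, kp.pt)
        r = max(2, int(kp.size * scale_factor / 2))   # region from feature scale
        mask = np.zeros(gray.shape, np.uint8)
        cv2.circle(mask, (x, y), r, 255, -1)          # circular local region
        h = cv2.calcHist([hsv], [0, 1, 2], mask, bins, [0, 180, 0, 256, 0, 256])
        hists.append(cv2.normalize(h, h).flatten())
    return hists
```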


Journal ArticleDOI
TL;DR: This paper presents an effective approach to handling image repositories, providing the user with an intuitive interface for visualising and browsing large collections of pictures, based on the idea of similarity-based organisation of images.
Abstract: Next generation environments will change the way people work and live as they will provide new advances in areas ranging from remote work and education, e-commerce, and gaming to information-on-demand. In many of these applications intelligent interpretation of multimedia data such as image, video and audio resources is necessary. In this paper we present an effective approach to handling image repositories, providing the user with an intuitive interface for visualising and browsing large collections of pictures. Based on the idea of similarity-based organisation of images, where images that are visually similar are located close to each other in visualisation space, images are projected onto a sphere with which the user can interact. Rotating the sphere reveals images of different colours while tilting operations focus on brighter or darker images. Large image collections are handled through a hierarchical approach that brings up similar, previously hidden, images when zooming in on an area. Furthermore, the way images are organised can be interactively changed by the user. Our next generation browsing environment has been successfully tested on a large database of several thousand images.

77 citations


Journal ArticleDOI
TL;DR: This work explores the applicability of semantic concept detection, a method often used within video retrieval, on the domain of visual lifelogs, and applies detection of 27 everyday semantic concepts on a lifelog collection composed of 257,518 SenseCam images from 5 users.
Abstract: The Microsoft SenseCam is a small lightweight wearable camera used to passively capture photos and other sensor readings from a user's day-to-day activities. It captures on average 3,000 images in a typical day, equating to almost 1 million images per year. It can be used to aid memory by creating a personal multimedia lifelog, or visual recording of the wearer's life. However, the sheer volume of image data captured within a visual lifelog creates a number of challenges, particularly for locating relevant content. Within this work, we explore the applicability of semantic concept detection, a method often used within video retrieval, to the domain of visual lifelogs. Our concept detector models the correspondence between low-level visual features and high-level semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised machine learning. By doing so it determines the probability of a concept's presence. We apply detection of 27 everyday semantic concepts on a lifelog collection composed of 257,518 SenseCam images from 5 users. The results were evaluated on a subset of 95,907 images, to determine the accuracy for detection of each semantic concept. We conducted further analysis on the temporal consistency, co-occurrence and relationships within the detected concepts to more extensively investigate the robustness of the detectors within this domain.

67 citations
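
The concept detection step can be sketched as one probabilistic binary classifier per concept over low-level features; the features, labels and logistic-regression choice below are placeholders for the paper's supervised models.

```python
# Minimal sketch of per-concept detection: one binary classifier per
# semantic concept, yielding P(concept present) per image. All data and
# the classifier choice are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(1000, 128)             # placeholder low-level visual features
Y = np.random.randint(0, 2, (1000, 27))   # 27 everyday concepts, multi-label

detector = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
probs = detector.predict_proba(X[:1])     # probability of each concept's presence
print(probs.shape)                        # (1, 27)
```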


Journal ArticleDOI
TL;DR: A systematic survey of the state of the art MPEG-7 based multimedia ontologies is presented, and issues that hinder interoperability as well as possible directions towards their harmonisation are highlighted.
Abstract: Machine understandable metadata forms the main prerequisite for the intelligent services envisaged for a Web which, going beyond mere data exchange, provides for effective content access, sharing and reuse. MPEG-7, despite providing a comprehensive set of tools for the standardised description of audiovisual content, is largely compromised by the use of XML, which leaves the largest part of the intended semantics implicit. Aspiring to formalise MPEG-7 descriptions and enhance multimedia metadata interoperability, a number of multimedia ontologies have been proposed. Though sharing a common vision, the developed ontologies are characterised by substantial conceptual differences, reflected both in the modelling of MPEG-7 description tools as well as in the linking with domain ontologies. Delving into the principles underlying their engineering, we present a systematic survey of the state of the art MPEG-7 based multimedia ontologies, and highlight issues that hinder interoperability as well as possible directions towards their harmonisation.

62 citations


Journal ArticleDOI
TL;DR: A new method for viewpoint independent markerless gait analysis that does not require camera calibration and works with a wide range of walking directions, which makes the proposed method particularly suitable for gait identification in real surveillance scenarios where people and their behaviour need to be tracked across a set of cameras.
Abstract: Many studies have confirmed that gait analysis can be used as a new biometric. In this research, gait analysis is deployed for people identification in multi-camera surveillance scenarios. We present a new method for viewpoint independent markerless gait analysis that does not require camera calibration and works with a wide range of walking directions. These properties make the proposed method particularly suitable for gait identification in real surveillance scenarios where people and their behaviour need to be tracked across a set of cameras. Tests on 300 synthetic and real video sequences, with subjects walking freely along different walking directions, have been performed. Since the choice of the cameras' characteristics is a key point for the development of a smart surveillance system, the performance of the proposed approach is measured with respect to different video properties: spatial resolution, frame rate, data compression and image quality. The obtained results show that markerless gait analysis can be achieved without any knowledge of the camera's position or the subject's pose. The extracted gait parameters allow recognition of people walking from different views with a mean recognition rate of 92.2% and confirm that gait can be effectively used for subjects' identification in a multi-camera surveillance scenario.

Journal ArticleDOI
TL;DR: This paper presents a method to introduce temporal information for video event recognition within the bag-of-words approach; events are modeled as sequences composed of histograms of visual features, computed from each frame using the traditional BoW.
Abstract: Event recognition is a crucial task to provide high-level semantic description of the video content. The bag-of-words (BoW) approach has proven to be successful for the categorization of objects and scenes in images, but it is unable to model temporal information between consecutive frames. In this paper we present a method to introduce temporal information for video event recognition within the BoW approach. Events are modeled as a sequence composed of histograms of visual features, computed from each frame using the traditional BoW. The sequences are treated as strings (phrases) where each histogram is considered as a character. Event classification of these sequences of variable length, depending on the duration of the video clips, is performed using SVM classifiers with a string kernel that uses the Needleman-Wunsch edit distance. Experimental results, performed on two domains, soccer videos and a subset of TRECVID 2005 news videos, demonstrate the validity of the proposed approach.
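
A minimal sketch of the string-kernel idea, assuming illustrative gap/mismatch costs and kernel bandwidth: per-frame histograms are quantised to integer "characters", sequences are compared with a Needleman-Wunsch-style alignment cost, and that cost is turned into a kernel for a precomputed-kernel SVM.

```python
# Sketch of events-as-strings classification: sequences of quantised frame
# labels compared with a Needleman-Wunsch-style edit distance, converted
# to a kernel for an SVM. Costs and bandwidth are assumed values.
import numpy as np
from sklearn.svm import SVC

def needleman_wunsch(a, b, gap=1.0, mismatch=1.0):
    # classic O(len(a)*len(b)) global alignment cost (lower = more similar)
    D = np.zeros((len(a) + 1, len(b) + 1))
    D[:, 0] = np.arange(len(a) + 1) * gap
    D[0, :] = np.arange(len(b) + 1) * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else mismatch
            D[i, j] = min(D[i - 1, j - 1] + cost,
                          D[i - 1, j] + gap, D[i, j - 1] + gap)
    return D[-1, -1]

def string_kernel(seqs_a, seqs_b, gamma=0.1):
    K = np.array([[needleman_wunsch(a, b) for b in seqs_b] for a in seqs_a])
    return np.exp(-gamma * K)              # turn distance into similarity

train = [[0, 1, 1, 2], [0, 1, 2], [3, 3, 1], [3, 1, 1, 3]]  # toy label sequences
labels = [0, 0, 1, 1]
clf = SVC(kernel="precomputed").fit(string_kernel(train, train), labels)
print(clf.predict(string_kernel([[0, 1, 2, 2]], train)))
```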

Journal ArticleDOI
TL;DR: A new video quality metric called Foveated Mean Squared Error (FMSE) is proposed that takes into account a variable resolution of the HVS across the visual field, and utilizes the effect of additional spatial acuity reduction due to motion in a video sequence.
Abstract: Efficiency of a video coding process, as well as accuracy of an objective video quality evaluation, can be significantly improved by introduction of human visual system (HVS) characteristics. In this paper we analyze one of these characteristics; namely, visual acuity reduction due to foveated vision and object movements in a video sequence. We propose a new video quality metric called Foveated Mean Squared Error (FMSE) that takes into account the variable resolution of the HVS across the visual field. The highest visual acuity is at the point of fixation that falls into the fovea, an area of the retina with the highest density of photoreceptors. Visual acuity decreases rapidly for image regions that are farther from the fixation point. FMSE also utilizes the effect of additional spatial acuity reduction due to motion in a video sequence. The quality measures calculated by FMSE have shown a high correlation with experimental results obtained by subjective video quality assessment.
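
A hedged sketch of a foveation-weighted MSE follows: squared error is weighted by an acuity falloff with eccentricity from the fixation point. The falloff model and constant are illustrative assumptions, not the paper's FMSE formulation, and the motion-dependent term is omitted.

```python
# Sketch of a foveation-weighted MSE: error contributions are discounted
# as eccentricity from the fixation point grows. The 1/(1+alpha*ecc)
# weighting and alpha value are assumptions for illustration.
import numpy as np

def foveated_mse(ref, dist, fixation, alpha=0.05):
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(xs - fixation[0], ys - fixation[1])  # pixels from fixation
    weight = 1.0 / (1.0 + alpha * ecc)                  # acuity falls with ecc.
    err = (ref.astype(float) - dist.astype(float)) ** 2
    return float((weight * err).sum() / weight.sum())

ref = np.random.randint(0, 256, (240, 320))
dist = ref + np.random.normal(0, 5, ref.shape)
print(foveated_mse(ref, dist, fixation=(160, 120)))
```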

Journal ArticleDOI
TL;DR: A user evaluation in the context of the PHAROS search engine, asking people about the utility, interest and innovation of this technology in real-world use cases, is reported, demonstrating the usability of this tool for annotating large-scale databases.
Abstract: In the context of content analysis for indexing and retrieval, a method for creating automatic music mood annotation is presented. The method is based on results from psychological studies and framed into a supervised learning approach using musical features automatically extracted from the raw audio signal. We present here some of the most relevant audio features to solve this problem. A ground truth, used for training, is created using both social network information systems (wisdom of crowds) and individual experts (wisdom of the few). At the experimental level, we evaluate our approach on a database of 1,000 songs. Tests of different classification methods, configurations and optimizations have been conducted, showing that Support Vector Machines perform best for the task at hand. Moreover, we evaluate the algorithm robustness against different audio compression schemes. This fact, often neglected, is fundamental to build a system that is usable in real conditions. In addition, the integration of a fast and scalable version of this technique with the European Project PHAROS is discussed. This real world application demonstrates the usability of this tool to annotate large-scale databases. We also report on a user evaluation in the context of the PHAROS search engine, asking people about the utility, interest and innovation of this technology in real world use cases.

Journal ArticleDOI
TL;DR: A cross-layer mapping algorithm to improve the quality of transmission of H.264 (a recently-developed video coding standard of the ITU-T Video Coding Experts Group) video stream over IEEE 802.11e-based wireless networks is presented.
Abstract: The use of wireless networks has spread beyond simple data transfer to delay-sensitive and loss-tolerant multimedia applications. Over the past few years, wireless multimedia transmission across Wireless Local Area Networks (WLANs) has gained a lot of attention because of the introduction of technologies such as Bluetooth, IEEE 802.11, 3G, and WiMAX. The IEEE 802.11 WLAN has become a dominating technology due to its low cost and ease of implementation. However, transmitting video over WLANs in real time remains a challenge because it imposes strong demands on the video codec, the underlying network, and the Media Access Control (MAC) layer. This paper presents a cross-layer mapping algorithm to improve the quality of transmission of an H.264 (a recently-developed video coding standard of the ITU-T Video Coding Experts Group) video stream over IEEE 802.11e-based wireless networks. The major goals of the H.264 standard were improved rate-distortion and enhanced compression performance. Our proposed cross-layer design involves the mapping of H.264 video slices (packets) to appropriate access categories of IEEE 802.11e according to their information significance. We evaluate the performance of our proposed cross-layer design and the results obtained demonstrate its effectiveness in exploiting characteristics of the MAC and application layers to improve the video transmission quality.
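
The mapping step can be sketched as a simple priority function from packet significance to 802.11e access categories; the specific rules below are illustrative assumptions, not the paper's exact mapping.

```python
# Illustrative sketch of the cross-layer idea: assign H.264 packets to
# IEEE 802.11e access categories by information significance. The rules
# below are assumptions for illustration, not the paper's mapping.
AC_VO, AC_VI, AC_BE, AC_BK = 3, 2, 1, 0   # 802.11e access categories

def map_to_access_category(nal_type, is_idr):
    if nal_type in ("SPS", "PPS") or is_idr:
        return AC_VO      # parameter sets / IDR slices: most significant
    if nal_type == "P_SLICE":
        return AC_VI      # reference slices: video priority
    if nal_type == "B_SLICE":
        return AC_BE      # non-reference slices: best effort
    return AC_BK          # everything else: background

print(map_to_access_category("SPS", False), map_to_access_category("B_SLICE", False))
```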

Journal ArticleDOI
TL;DR: A set of semantic traits discernible by humans at a distance are introduced, outlining their psychological validity and working under the premise that similarity of the chosen gait signature implies similarity of certain semantic traits.
Abstract: In order to analyse surveillance video, we need to efficiently explore large datasets containing videos of walking humans. Effective analysis of such data relies on retrieval of video data which has been enriched using semantic annotations. A manual annotation process is time-consuming and prone to error due to subject bias; however, at surveillance-image resolution, the human walk (their gait) can be analysed automatically. We explore the content-based retrieval of videos containing walking subjects, using semantic queries. We evaluate current research in gait biometrics, unique in its effectiveness at recognising people at a distance. We introduce a set of semantic traits discernible by humans at a distance, outlining their psychological validity. Working under the premise that similarity of the chosen gait signature implies similarity of certain semantic traits, we perform a set of semantic retrieval experiments using popular Latent Semantic Analysis techniques. We perform experiments on a dataset of 2000 videos of people walking in laboratory conditions and achieve promising retrieval results for features such as Sex (mAP = 14% above random), Age (mAP = 10% above random) and Ethnicity (mAP = 9% above random).
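
A minimal sketch of LSA-style retrieval over gait signatures, with placeholder data and dimensions: the signature matrix is projected to a low-rank latent space and the gallery is ranked by cosine similarity to the query.

```python
# Sketch of Latent Semantic Analysis-style retrieval: low-rank projection
# of gait signatures, then cosine-similarity ranking. Data, feature size
# and latent dimensionality are placeholders.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

gallery = np.random.rand(2000, 300)            # placeholder gait signatures
lsa = TruncatedSVD(n_components=50).fit(gallery)
latent = lsa.transform(gallery)                # gallery in latent space

query = lsa.transform(np.random.rand(1, 300))  # a new subject's signature
ranked = np.argsort(-cosine_similarity(query, latent)[0])
print(ranked[:10])                             # indices of the ten best matches
```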

Journal ArticleDOI
TL;DR: This work presents a content-aware multi-camera selection technique that uses object- and frame-level features, compares the proposed approach with a maximum-score-based camera selection criterion, and demonstrates a significant decrease in camera flickering.
Abstract: We present a content-aware multi-camera selection technique that uses object- and frame-level features. First, objects are detected using a color-based change detector. Next, trajectory information for each object is generated using multi-frame graph matching. Finally, multiple features including size and location are used to generate an object score. At the frame level, we consider total activity, event score, number of objects and cumulative object score. These features are used to generate score information using a multivariate Gaussian distribution. The best view is selected using a Dynamic Bayesian Network (DBN), which utilizes camera network information. The DBN employs previous view information to select the current view, thus increasing resilience to frequent switching. The performance of the proposed approach is demonstrated on three multi-camera setups with semi-overlapping fields of view: a basketball game, an indoor airport surveillance scenario and a synthetic outdoor pedestrian dataset. We compare the proposed view selection approach with a maximum-score-based camera selection criterion and demonstrate a significant decrease in camera flickering. The performance of the proposed approach is also validated through subjective testing.
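
The frame-level scoring step can be sketched with a multivariate Gaussian fitted to reference frame features; the DBN smoothing across views is omitted here, and the data and naive argmax selection are placeholders.

```python
# Sketch of frame-level scoring under a multivariate Gaussian over the
# four features named in the abstract (total activity, event score,
# object count, cumulative object score). Data are placeholders and the
# DBN that smooths view switching is not shown.
import numpy as np
from scipy.stats import multivariate_normal

train = np.random.rand(500, 4)                     # placeholder frame features
mvn = multivariate_normal(mean=train.mean(0), cov=np.cov(train.T))

def frame_score(features):
    return mvn.pdf(features)                       # likelihood as view score

views = np.random.rand(3, 4)                       # one feature vector per camera
print(int(np.argmax([frame_score(v) for v in views])))  # naive best view
```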

Journal ArticleDOI
TL;DR: Experimental results have proved that the system greatly enhances users' experience, thus encouraging further research in this direction; the paper also introduces the concept of a multichannel browser, i.e., a browser that allows concurrent browsing of multiple media channels.
Abstract: Despite the great amount of work done in the last decade, retrieving information of interest from a large multimedia repository still remains an open issue. In this paper, we propose an intelligent browsing system based on a novel recommendation paradigm. Our approach combines usage patterns with low-level features and semantic descriptors in order to predict users' behavior and provide effective recommendations. The proposed paradigm is very general and can be applied to any type of multimedia data. In order to make the recommender system even more flexible, we introduce the concept of a multichannel browser, i.e., a browser that allows concurrent browsing of multiple media channels. We implemented a prototype of the proposed system and tested the effectiveness of our approach in a virtual museum scenario. Experimental results have proved that the system greatly enhances users' experience, thus encouraging further research in this direction.

Journal ArticleDOI
TL;DR: A framework for semantic annotation of soccer videos that exploits an ontology model referred to as Dynamic Pictorially Enriched Ontology, where the ontology, defined using OWL, includes both schema and data.
Abstract: In this paper we present a framework for semantic annotation of soccer videos that exploits an ontology model referred to as Dynamic Pictorially Enriched Ontology, where the ontology, defined using OWL, includes both schema and data. Visual instances are used as matching references for the visual descriptors of the entities to be annotated. Three mechanisms are included to support effective annotation: visual instance clustering, to cluster instances of similar patterns; prototype selection, to select one or more visual representatives of each cluster; and dynamic cluster updating, to update clusters and prototypes whenever new knowledge is presented to the ontology. Experimental results show the capability of performing semantic annotation of entities that exhibit a variety of complex changes in visual appearance or of events that show complex motion patterns in the same shot. SWRL rules are used to perform rule-based reasoning over both concepts and concept instances, to improve the quality of the annotation.

Journal ArticleDOI
TL;DR: This paper proposes a new method of content-based radiology medical image retrieval; experiments show that even at 48 bytes per image, the proposed descriptor demonstrates a high level of accuracy in its results.
Abstract: The rapid advances made in the field of radiology, the increased frequency with which oncological diseases appear, as well as the demand for regular medical checks, have led to the creation of a large database of radiology images in every hospital or medical center. There is now an imperative need to create an effective method for the indexing and retrieval of these images. This paper proposes a new method of content-based radiology medical image retrieval. The description of images relies on a Fuzzy Rule Based Compact Composite Descriptor (CCD), which includes global image features capturing both brightness and texture characteristics in a 1D histogram. Furthermore, the proposed descriptor includes the spatial distribution of the information it describes. The most important feature of the proposed descriptor is that its size adapts according to the storage capabilities of the application that uses it. Experiments carried out on a large group of images show that even at 48 bytes per image, the proposed descriptor demonstrates a high level of accuracy in its results. To evaluate the performance of the proposed feature, the mean average precision was used.

Journal ArticleDOI
TL;DR: This paper presents a video summarization tool and demonstrates how it can be successfully used in the domain of arthroscopic videos, taking advantage of several domain-specific aspects without losing its ability to work on general-purpose videos.
Abstract: Arthroscopic surgery is a minimally invasive procedure that uses a small camera to generate video streams, which are recorded and subsequently archived. In this paper we present a video summarization tool and demonstrate how it can be successfully used in the domain of arthroscopic videos. The proposed tool generates a keyframe-based summary, which clusters visually similar frames based on user-selected visual features and appropriate dissimilarity metrics. We discuss how this tool can be used for arthroscopic videos, taking advantage of several domain-specific aspects, without losing its ability to work on general-purpose videos. Experimental results confirm the feasibility of the proposed approach and encourage extending it to other application domains.

Journal ArticleDOI
TL;DR: A parameter called degree of abstraction is proposed, which gives the user a choice about how concisely the extracted concepts should be produced for a specified highlight duration; results are compared with the manually generated highlights of a sports television channel.
Abstract: This paper presents a novel approach towards automated highlight generation of broadcast sports video sequences from extracted events and semantic concepts. A sports video is hierarchically divided into temporal partitions, namely megaslots and slots, and semantic entities, namely concepts and events. The proposed method extracts event sequences from video and classifies each sequence into a concept by sequential association mining. The extracted concepts and the events within the concepts are selected according to their degree of importance for inclusion in the highlights. A parameter called degree of abstraction is proposed, which gives the user a choice about how concisely the extracted concepts should be produced for a specified highlight duration. We have successfully extracted highlights from recorded video of a cricket match and compared our results with the manually generated highlights produced by a sports television channel.

Journal ArticleDOI
TL;DR: An automatic tracking recovery tool is proposed, which improves the performance of any tracking algorithm whenever its results are not acceptable, built on an innovative algorithm that optimally estimates the non-linear model at an upcoming time instance based on the current non-linear models that have already been approximated.
Abstract: Detection and analysis of events from video sequences is probably one of the most important research issues in the computer vision and pattern analysis community. Before applying methods and tools for analyzing actions, behavior or events, however, we need to implement robust and reliable tracking algorithms able to automatically monitor the movements of many objects in the scene regardless of the complexity of the background, the existence of occlusions and illumination changes. Despite the recent research efforts in the field of object tracking, the main limitation of most of the existing algorithms is that they are not enriched with automatic recovery strategies able to re-initialize tracking whenever its performance severely deteriorates. This is addressed in this paper by proposing an automatic tracking recovery tool which improves the performance of any tracking algorithm whenever the results are not acceptable. For the recovery, non-linear object modeling tools are used which probabilistically assign image regions to object classes. The models are also time-varying. The first property is implemented in our case using concepts from functional analysis which allow parametrization of any arbitrary non-linear function (with some restrictions on its continuity) as a finite series of known functional components but of unknown coefficients. The second property is addressed by proposing an innovative algorithm that optimally estimates the non-linear model at an upcoming time instance based on the current non-linear models that have already been approximated. The architecture is enhanced by a decision mechanism which permits verification of the time instances at which tracking recovery should take place. Experimental results on a set of different video sequences that present complex visual phenomena (full and partial occlusions, illumination variations, complex background, etc.) are presented to demonstrate the efficiency of the proposed scheme in providing tracking in very difficult visual content conditions. Additionally, criteria are proposed to objectively evaluate the tracking performance and compare it with other strategies.

Journal ArticleDOI
TL;DR: Some new harmonic and Zipf based features for better speech emotion characterization in the valence dimension and a multi-stage classification scheme driven by a dimensional emotion model for better emotional class discrimination are proposed.
Abstract: This paper deals with speech emotion analysis within the context of increasing awareness of the wide application potential of affective computing. Unlike most works in the literature which mainly rely on classical frequency and energy based features along with a single global classifier for emotion recognition, we propose in this paper some new harmonic and Zipf based features for better speech emotion characterization in the valence dimension and a multi-stage classification scheme driven by a dimensional emotion model for better emotional class discrimination. Experimented on the Berlin dataset with 68 features and six emotion states, our approach shows its effectiveness, displaying a 68.60% classification rate and reaching a 71.52% classification rate when a gender classification is first applied. Using the DES dataset with five emotion states, our approach achieves an 81% recognition rate when the best performance in the literature to our knowledge is 76.15% on the same dataset.
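
The multi-stage scheme can be sketched as gender classification followed by gender-specific emotion classifiers; features, labels and SVM settings below are placeholders (the paper's harmonic and Zipf based features are assumed to be extracted already).

```python
# Sketch of a gender-first, two-stage classification scheme like the one
# the paper reports to improve recognition. All data and classifier
# settings are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

X = np.random.rand(400, 68)              # placeholder: 68 speech features
gender = np.random.randint(0, 2, 400)    # placeholder gender labels
emotion = np.random.randint(0, 6, 400)   # placeholder: six emotion states

gender_clf = SVC().fit(X, gender)        # stage 1: gender
emotion_clf = {g: SVC().fit(X[gender == g], emotion[gender == g])
               for g in (0, 1)}          # stage 2: per-gender emotion models

def predict(x):
    g = int(gender_clf.predict(x.reshape(1, -1))[0])
    return emotion_clf[g].predict(x.reshape(1, -1))[0]

print(predict(X[0]))
```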

Journal ArticleDOI
TL;DR: Two new segmentation techniques called projection and middle-axis point separation are proposed for CAPTCHAs with line cluttering and character warping; experimental results show the proposed techniques can achieve segmentation rates of about 75%.
Abstract: A CAPTCHA is a test designed to distinguish computer programs from human beings, in order to prevent the abuse of networked resources. Academic research into CAPTCHAs includes designing friendly and secure CAPTCHA systems and defeating existing CAPTCHA systems. Traditionally, defeating a CAPTCHA test requires two procedures: segmentation and recognition. Recent research shows that the problem of segmentation is much harder than recognition. In this paper, two new segmentation techniques called projection and middle-axis point separation are proposed for CAPTCHAs with line cluttering and character warping. Experimental results show the proposed techniques can achieve segmentation rates of about 75%.
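
A minimal sketch of the projection technique follows, with illustrative thresholds: foreground pixels are summed per column and characters are cut at near-empty valleys. The harder, joined cases are what middle-axis point separation addresses, and are not shown.

```python
# Sketch of projection-based segmentation: cut a binarised CAPTCHA at
# columns whose foreground count falls to (near) zero. The min_gap and
# min_width thresholds are assumptions for illustration.
import numpy as np

def projection_segment(binary, min_gap=1, min_width=3):
    # binary: 2-D array, 1 = character pixel, 0 = background
    profile = binary.sum(axis=0)           # vertical projection profile
    cuts, start = [], None
    for x, v in enumerate(profile):
        if v > min_gap and start is None:
            start = x                      # entering a character run
        elif v <= min_gap and start is not None:
            if x - start >= min_width:
                cuts.append((start, x))    # leaving a character run
            start = None
    if start is not None:
        cuts.append((start, binary.shape[1]))
    return cuts                            # (left, right) per character

img = np.zeros((20, 30), int)
img[5:15, 2:8] = 1
img[5:15, 12:18] = 1
print(projection_segment(img))             # -> [(2, 8), (12, 18)]
```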

Journal ArticleDOI
TL;DR: Simulation experiments show the adaptive SPFEC mechanism achieves high recovery performance and low end-to-end delay jitter, and offers an alternative for more efficient video streaming that will be of interest to the designers of next generation environments.
Abstract: Traditional Forward Error Correction (FEC) mechanisms can be divided into Packet level FEC (PFEC) mechanisms and Byte level FEC (BFEC) mechanisms. In the PFEC mechanism, recovering from errors in a source packet requires an entire FEC redundant packet even if only a few bits are in error. The recovery capability of the BFEC mechanism is only half of the FEC redundancy. Accordingly, an adaptive Sub-Packet FEC (SPFEC) mechanism is proposed in this paper to improve the quality of video streaming over wireless networks, simultaneously enhancing the recovery performance and reducing the end-to-end delay jitter. The SPFEC mechanism divides a packet into n sub-packets by means of the concept of a virtual packet. The SPFEC mechanism uses a checksum in each sub-packet to identify the position of the erroneous sub-packet. Simulation experiments show the adaptive SPFEC mechanism achieves high recovery performance and low end-to-end delay jitter. The SPFEC mechanism outperforms traditional FEC mechanisms in terms of packet loss rate and video Peak Signal-to-Noise Ratio (PSNR). SPFEC offers an alternative for more efficient video streaming that will be of interest to the designers of next generation environments.
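
The sub-packet idea can be sketched as splitting a packet into n virtual sub-packets, each with its own checksum, so the receiver can locate the corrupted sub-packet instead of discarding the whole packet; CRC32 and the value of n below are illustrative choices, not the paper's exact checksum.

```python
# Sketch of the sub-packet concept: per-sub-packet checksums localise the
# error position within a packet. CRC32 and n=4 are assumed choices.
import zlib

def make_subpackets(payload: bytes, n: int):
    size = -(-len(payload) // n)                 # ceil division
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
    return [(c, zlib.crc32(c)) for c in chunks]  # (data, checksum) pairs

def find_corrupted(subpackets):
    return [i for i, (c, crc) in enumerate(subpackets) if zlib.crc32(c) != crc]

sps = make_subpackets(b"example video payload bytes", n=4)
sps[2] = (b"corrupted!", sps[2][1])              # simulate a bit-error burst
print(find_corrupted(sps))                       # -> [2]
```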

Journal ArticleDOI
TL;DR: A protocol for energy-efficient adaptive listen for medium access control in WSN is proposed, which adaptively changes the slot-time (the duration of each slot in the contention window) and shows a prominent decrease in the energy consumption at the nodes compared with the existing SMAC protocol.
Abstract: Wireless Sensor Networks (WSN) have nodes that are small in size and are powered by small batteries having a very limited amount of energy. In most applications of WSN, the nodes in the network remain inactive for long periods of time, and intermittently they become active on sensing any change in the environment. The data sensed by the different nodes are sent to the sink node. In contrast to other infrastructure-based wireless networks, higher throughput, lower latency and per-node fairness in WSN are imperative, but their importance is subdued when compared to energy consumption. In this work, we regard the energy consumption of the nodes as the primary concern, and throughput and latency in the network as secondary. We have proposed a protocol for energy-efficient adaptive listen for medium access control in WSN. Our protocol adaptively changes the slot-time, which is the duration of each slot in the contention window. This correspondingly changes the cycle-time, which is the sum of the listen-time and the sleep-time of the sensors, while keeping the duty-cycle, which is the ratio between the listen-time and the cycle-time, constant. Using simulation experiments, we evaluated the performance of the proposed protocol, compared with the popular Sensor Medium Access Control (SMAC) protocol (Ye et al., IEEE/ACM Trans Netw 12(3):493–506). The results we obtained show a prominent decrease in the energy consumption at the nodes in the proposed protocol over the existing SMAC protocol, at the cost of decreasing the throughput and increasing the latency in the network. Although such an observation is not perfectly what is ideally desired, given the very limited amount of energy with which the nodes in a WSN operate, we advocate that increasing the energy efficiency of the nodes, thereby increasing the network lifetime in WSN, is a more important concern compared to throughput and latency. Additionally, similar observations relating energy efficiency, network lifetime, throughput and latency exist in many other existing protocols, including the popular SMAC protocol (Ye et al., IEEE/ACM Trans Netw 12(3):493–506).
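
The timing relationship the protocol exploits can be shown with a few lines of arithmetic: adapting the slot-time changes the listen-time and hence the cycle-time, while the duty-cycle stays constant. The slot count per listen period and the duty-cycle value below are illustrative assumptions.

```python
# Sketch of the timing relationship described in the abstract:
# cycle-time = listen-time + sleep-time, with the duty-cycle
# (listen-time / cycle-time) held constant as the slot-time adapts.
# slots_per_listen and duty_cycle are assumed example values.
def timing(slot_time_ms, slots_per_listen=64, duty_cycle=0.1):
    listen = slot_time_ms * slots_per_listen
    cycle = listen / duty_cycle          # duty_cycle = listen / cycle (fixed)
    sleep = cycle - listen
    return listen, sleep, cycle

for st in (0.5, 1.0, 2.0):               # adapting the slot-time
    print(f"slot={st} ms -> (listen, sleep, cycle) = {timing(st)} ms")
```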

Journal ArticleDOI
TL;DR: A flexible framework to generate SI at the block level in two modes is presented; it can be advantageous overall from the rate-distortion point of view, even if some rate has to be invested in low-quality Intra coding blocks, for blocks where MCI produces SI with lower correlation.
Abstract: One of the most efficient approaches to generate the side information (SI) in distributed video codecs is through motion compensated frame interpolation where the current frame is estimated based on past and future reference frames. However, this approach leads to significant spatial and temporal variations in the correlation noise between the source at the encoder and the SI at the decoder. In such scenario, it would be useful to design an architecture where the SI can be more robustly generated at the block level, avoiding the creation of SI frame regions with lower correlation, largely responsible for some coding efficiency losses. In this paper, a flexible framework to generate SI at the block level in two modes is presented: while the first mode corresponds to a motion compensated interpolation (MCI) technique, the second mode corresponds to a motion compensated quality enhancement (MCQE) technique where a low quality Intra block sent by the encoder is used to generate the SI by doing motion estimation with the help of the reference frames. The novel MCQE mode can be overall advantageous from the rate-distortion point of view, even if some rate has to be invested in the low quality Intra coding blocks, for blocks where the MCI produces SI with lower correlation. The overall solution is evaluated in terms of RD performance with improvements up to 2 dB, especially for high motion video sequences and long Group of Pictures (GOP) sizes.

Journal ArticleDOI
TL;DR: The main contribution of this paper is a discussion of the design and functioning of a fully integrated platform for multimedia adaptation and delivery, called NinSuna, able to efficiently deal with the aforementioned heterogeneity in the present-day multimedia ecosystem.
Abstract: The current multimedia landscape is characterized by a significant heterogeneity in terms of coding and delivery formats, usage environments, and user preferences. The main contribution of this paper is a discussion of the design and functioning of a fully integrated platform for multimedia adaptation and delivery, called NinSuna. This platform is able to efficiently deal with the aforementioned heterogeneity in the present-day multimedia ecosystem, thanks to the use of format-agnostic adaptation engines (i.e., engines independent of the underlying coding format) and format-agnostic packaging engines (i.e., engines independent of the underlying delivery format). Moreover, NinSuna also provides a seamless integration between metadata standards and adaptation processes. Both our format-independent adaptation and packaging techniques rely on a model for multimedia bitstreams, describing the structural, semantic, and scalability properties of these multimedia streams. News sequences were used as a test case for our platform, enabling the user to select news fragments matching his/her specific interests and usage environment characteristics.

Journal ArticleDOI
TL;DR: A fuzzy DLs-based reasoning framework is investigated, which enables the integration of scene and object classifications into a semantically consistent interpretation by capturing and utilising the underlying semantic associations.
Abstract: Recent advances in semantic image analysis have brought forth generic methodologies to support concept learning at large scale. The attained performance however is highly variable, reflecting effects related to similarities and variations in the visual manifestations of semantically distinct concepts, as much as the limitations issuing from considering semantics solely in the form of perceptual representations. Aiming to enhance performance and improve robustness, we investigate a fuzzy DLs-based reasoning framework, which enables the integration of scene and object classifications into a semantically consistent interpretation by capturing and utilising the underlying semantic associations. Evaluation with two sets of input classifiers, configured so as to vary with respect to the wealth of concepts' interrelations, outlines the potential of the proposed approach in the presence of semantically rich associations, while delineating the issues and challenges involved.