
Showing papers in "Storage and Retrieval for Image and Video Databases in 2001"


Proceedings ArticleDOI
TL;DR: This paper proposes two new sequence-matching techniques for copy detection and compares the performance with one of the existing techniques.
Abstract: Video copy detection is a complementary approach to watermarking. As opposed to watermarking, which relies on inserting a distinct pattern into the video stream, video copy detection techniques match content-based signatures to detect copies of video. Existing typical content-based copy detection schemes have relied on image matching. This paper proposes two new sequence-matching techniques for copy detection and compares the performance with one of the existing techniques. Motion, intensity and color-based signatures are compared in the context of copy detection. Results are reported on detecting copies of movie clips.
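As a rough illustration of sequence matching on content-based signatures (a minimal sketch, not the paper's actual algorithm), the code below derives a per-frame mean-intensity signature and slides a query clip's signature over a target video; the `threshold` value is an assumed tuning parameter.

```python
import numpy as np

def intensity_signature(frames):
    """Per-frame mean intensity: a simple content-based signature.
    `frames` is an array of grayscale frames with shape (n, h, w)."""
    return frames.reshape(len(frames), -1).mean(axis=1)

def find_copies(query_sig, target_sig, threshold=2.0):
    """Slide the query signature over the target signature and report
    windows whose mean absolute difference falls below `threshold`."""
    n, m = len(target_sig), len(query_sig)
    matches = []
    for start in range(n - m + 1):
        dist = np.abs(target_sig[start:start + m] - query_sig).mean()
        if dist < threshold:
            matches.append((start, dist))
    return matches
```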

281 citations


Journal Article
TL;DR: This paper will present the architecture of a public automated evaluation service for still images, sound and video, and detail and justify the choice of evaluation profiles, that is the series of tests applied to different types of watermarking schemes.
Abstract: One of the main problems, which darkens the future of digital watermarking technologies, is the lack of detailed evaluation of existing marking schemes. This lack of benchmarking of current algorithms is blatant and confuses rights holders as well as software and hardware manufacturers, and prevents them from using the solution appropriate to their needs. Indeed, basing long-lived protection schemes on badly tested watermarking technology does not make sense. In this paper we will present the architecture of a public automated evaluation service we have developed for still images, sound and video. We will detail and justify our choice of evaluation profiles, that is, the series of tests applied to different types of watermarking schemes. These evaluation profiles allow us to measure the reliability of a marking scheme at different levels, from low to very high. Besides the known StirMark transformations, we will also detail new tests that will be included in this platform. One of them is intended to measure the real size of the key space. Indeed, if one is not careful, two different watermarking keys may produce interfering watermarks and, as a consequence, the actual space of keys is much smaller than it appears. Another set of tests is related to audio data and addresses the usual equalisation and normalisation, but also time stretching and pitch shifting. Finally, we propose a set of tests for fingerprinting applications. This includes: averaging of copies with different fingerprints, random exchange of parts between different copies, and comparison between copies with selection of the most/least frequently used position differences.
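A toy illustration of the key-space concern described above (purely an assumed spread-spectrum setup, not the platform's real test code): watermarks generated from different keys should be nearly orthogonal; counting only the keys whose patterns do not interfere with previously accepted ones estimates the effective key space.

```python
import numpy as np

def watermark(key, length=4096):
    """Pseudorandom +/-1 pattern derived from a key (illustrative
    spread-spectrum watermark; an assumption for this sketch)."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=length)

def effective_key_count(keys, interference=0.1):
    """Count keys whose watermarks stay nearly orthogonal to all
    previously accepted ones; high correlation means two keys would
    produce interfering watermarks, shrinking the real key space."""
    accepted = []
    for k in keys:
        w = watermark(k)
        if all(abs(np.dot(w, a)) / len(w) < interference for a in accepted):
            accepted.append(w)
    return len(accepted)

print(effective_key_count(range(100)))
```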

98 citations


Proceedings ArticleDOI
Rainer Lienhart1
TL;DR: In this article, the authors presented the first robust and reliable dissolve detection system, which achieved a detection rate of 69 percent while reducing the false alarm rate to an acceptable level of 68 percent on a test video set.
Abstract: Automatic shot boundary detection has been an active research area for nearly a decade and has led to high-performance detection algorithms for hard cuts, fades and wipes. Reliable dissolve detection, however, is still an unsolved problem. In this paper, we present the first robust and reliable dissolve detection system. A detection rate of 69 percent was achieved while reducing the false alarm rate to an acceptable level of 68 percent on a test video set for which the best previously reported detection and false alarm rates had been 57 percent and 185 percent, respectively. In addition, the temporal extent of the dissolves is estimated by a multi-resolution detection approach. The three core ideas of our novel approach are, firstly, the creation of a dissolve synthesizer capable of creating in principle an infinite number of dissolve examples of any duration from a database of raw video footage; secondly, two new features for capturing the characteristics of dissolves; and thirdly, the exploitation of machine learning ideas for reliable object detection, such as the bootstrap method to improve the set of non-dissolve examples and the search at multiple resolutions, as well as the usage of machine learning algorithms such as neural networks, support vector machines and linear vector quantizers.
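A dissolve is conventionally modeled as a linear cross-fade between two shots, so a dissolve synthesizer like the one the abstract describes can be sketched as simple alpha blending (the exact synthesis used in the paper may differ):

```python
import numpy as np

def synthesize_dissolve(clip_a, clip_b, duration):
    """Create a synthetic dissolve of `duration` frames by cross-fading
    the tail of clip_a into the head of clip_b (alpha blending).
    Assumes both clips hold at least `duration` uint8 frames."""
    frames = []
    for t in range(duration):
        alpha = (t + 1) / (duration + 1)          # ramps from 0 toward 1
        frame = (1 - alpha) * clip_a[-duration + t].astype(np.float64) \
                + alpha * clip_b[t].astype(np.float64)
        frames.append(frame.astype(np.uint8))
    return np.stack(frames)
```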

94 citations


Proceedings ArticleDOI
TL;DR: A framework for event detection and summary generation in football broadcast video is proposed, with both deterministic and probabilistic approaches to the detection of plays and an audio-based hierarchical summarization method.
Abstract: We propose a framework for event detection and summary generation in football broadcast video. First, we formulate summarization as a play detection problem, with a play being defined as the most basic segment of time during which the ball is in play. Then we propose both deterministic and probabilistic approaches to the detection of the plays. The detected plays are concatenated to generate a compact, time-compressed summary of the original video. Such a summary is complete in the sense that it contains every meaningful action of the underlying game, and it also serves as a much better starting point for higher-level summarization and other analyses than the original video does. Based on the summary, we also propose an audio-based hierarchical summarization method. Experimental results show the proposed methods work very well on consumer-grade platforms.

69 citations


Proceedings ArticleDOI
TL;DR: A video annotation tool, VideoAnn, to annotate semantic labels associated with video shots to summarize video content, and a video transmission system, Universal Tuner, for wireless video streaming.
Abstract: We have designed and implemented a video semantic summarization system, which includes an MPEG-7 compliant annotation interface, a semantic summarization middleware, a real-time MPEG-1/2 video transcoder on PCs, and an application interface on color/black-and-white Palm-OS PDAs. We designed a video annotation tool, VideoAnn, to annotate semantic labels associated with video shots. Videos are first segmented into shots based on their visual-audio characteristics. They are played back using an interactive interface, which facilitates and speeds up the annotation process. Users can annotate the video content with the units of temporal shots or spatial regions. The annotated results are stored in the MPEG-7 XML format. We also designed and implemented a video transmission system, Universal Tuner, for wireless video streaming. This system transcodes MPEG-1/2 videos or live TV broadcast video for black-and-white or indexed-color Palm OS devices. In our system, the complexity of multimedia compression and decompression algorithms is adaptively partitioned between the encoder and decoder. On the client end, users can access the summarized video based on their preferences, time and keywords, as well as the transmission bandwidth and the remaining battery power of the pervasive devices.

68 citations


Proceedings ArticleDOI
Milind Naphade1, Ching-Yung Lin1, John R. Smith1, Belle L. Tseng1, Sankar Basu1 
TL;DR: A video annotation tool that has been developed for the purpose of annotating generic video sequences in the context of a recent video-TREC benchmarking exercise is described, and it is shown how an active learning strategy can potentially be implemented in this context to further improve the performance of the annotation tool.
Abstract: A model-based approach to video retrieval requires ground-truth data for training the models. This leads to the development of video annotation tools that allow users to annotate each shot in the video sequence as well as to identify and label scenes, events, and objects by applying the labels at the shot level. The annotation tool considered here also allows the user to associate object labels with an individual region in a key-frame image. However, the abundance of video data and the diversity of labels make annotation a difficult and overly expensive task. To combat this problem, we formulate the task of annotation in the framework of supervised training with partially labeled data by viewing it as an exercise in active learning. In this scenario, one first trains a classifier with a small set of labeled data, and subsequently updates the classifier by selecting the most informative, or most uncertain, subset of the available data set. Consequently, propagation of labels to yet-unlabeled data is automatically achieved as well. The purpose of this paper is primarily twofold. The first is to describe a video annotation tool that has been developed for the purpose of annotating generic video sequences in the context of a recent video-TREC benchmarking exercise. The tool is semi-automatic in that it automatically propagates labels to similar shots, requiring the user only to confirm or reject the propagated labels. The second purpose is to show how an active learning strategy can potentially be implemented in this context to further improve the performance of the annotation tool. While many versions of active learning could be thought of, we specifically report results on experiments with support vector machine classifiers with polynomial kernels.
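The abstract's uncertainty-based selection with a polynomial-kernel SVM can be sketched in a few lines: train on the labeled shots, then query the pool samples closest to the decision boundary. This is a minimal binary-classification sketch, not the authors' exact system; `batch` is an assumed parameter.

```python
import numpy as np
from sklearn.svm import SVC

def active_learning_round(X_labeled, y_labeled, X_pool, batch=10):
    """Train a polynomial-kernel SVM on labeled shot features, then pick
    the pool samples with the smallest decision margin (most uncertain)
    for the annotator to label next. Assumes two classes."""
    clf = SVC(kernel="poly", degree=2)
    clf.fit(X_labeled, y_labeled)
    margins = np.abs(clf.decision_function(X_pool))
    query_idx = np.argsort(margins)[:batch]   # smallest margin first
    return clf, query_idx
```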

55 citations


Proceedings ArticleDOI
TL;DR: It is shown that R-trees can be used to index the multidimensional features so that search will be efficient and scalable to a large database.
Abstract: Our goal is to enable queries about the motion of objects in a video sequence. Tracking objects in video is a difficult task, involving signal analysis, estimation and often semantic information particular to the targets. That is not our focus; rather, we assume that tracking is done, and turn to the task of representing the motion for query. The position of an object over time results in a motion trajectory, i.e., a sequence of locations. We propose a novel representation of trajectories: we use the path and speed curves as the motion representation. The path curve records the position of the object while the speed curve records the magnitude of its velocity. This separates positional information from temporal information, since position may be more important in specifying a trajectory than the actual velocity of a trajectory. Velocity can be recovered from our representation. We derive a local geometric description of the curves invariant under scaling and rigid motion. We adopt a warping method in matching so that it is robust to variation in feature vectors. We show that R-trees can be used to index the multidimensional features so that search will be efficient and scalable to a large database.
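One plausible way to split a trajectory into the path and speed curves described above (a sketch under the assumption of positions sampled at a fixed frame rate; the paper's exact parameterization may differ):

```python
import numpy as np

def path_and_speed(trajectory):
    """Split a trajectory, an (n, 2) array of (x, y) positions, into a
    path curve (positions indexed by cumulative arc length, discarding
    timing) and a speed curve (frame-to-frame velocity magnitude)."""
    traj = np.asarray(trajectory, dtype=float)
    velocity = np.diff(traj, axis=0)
    speed = np.linalg.norm(velocity, axis=1)            # speed curve
    arc_length = np.concatenate([[0.0], np.cumsum(speed)])
    path_curve = np.column_stack([arc_length, traj[:, 0], traj[:, 1]])
    return path_curve, speed
```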

46 citations


Proceedings ArticleDOI
TL;DR: Using very simple rules depending on the type of sport, this work is able to provide highlights by skipping over the uninteresting parts of the video and identifying interesting events characterized, for instance, by falling or rising edges in the activity domain.
Abstract: We present a technique for rapidly generating highlights of sports videos using temporal patterns of motion activity extracted in the compressed domain. The basic hypothesis of this work is that temporal patterns of motion activity are related to the grammar of the sports video. We present experimental verification of this hypothesis. By using very simple rules depending on the type of sport, we are thus able to provide highlights by skipping over the uninteresting parts of the video and identifying interesting events characterized, for instance, by falling or rising edges in the activity domain. Moreover, the compressed-domain extraction of motion activity intensity is much simpler than color-based summarization calculations. Other compressed-domain features or more complex rules can be used to further improve the accuracy.
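The rising/falling edges mentioned above can be located by thresholding the per-segment activity signal and looking for sign changes; this is a minimal sketch with an assumed scalar activity series, not the paper's rule set.

```python
import numpy as np

def activity_edges(activity, threshold):
    """Find rising and falling edges in a motion-activity signal by
    thresholding it and detecting transitions; such edges mark
    candidate highlight boundaries (e.g. the start of a play)."""
    active = (np.asarray(activity) > threshold).astype(int)
    change = np.diff(active)
    rising = np.flatnonzero(change == 1) + 1
    falling = np.flatnonzero(change == -1) + 1
    return rising, falling
```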

36 citations


Proceedings ArticleDOI
TL;DR: In this paper, a wavelet-based salient point extraction algorithm is proposed; extracting the color and texture information at the locations given by these points provides significantly improved results in terms of retrieval accuracy, computational complexity and storage space of feature vectors as compared to global feature approaches.
Abstract: Content-based Image Retrieval (CBIR) has become one of the most active research areas in the past few years. Most research attention has focused on indexing techniques based on global feature distributions. However, these global distributions have limited discriminating power because they are unable to capture local image information. Applying global Gabor texture features greatly improves retrieval accuracy, but such features are computationally complex. In this paper, we present a wavelet-based salient point extraction algorithm. We show that extracting the color and texture information at the locations given by these points provides significantly improved results in terms of retrieval accuracy, computational complexity and storage space of feature vectors as compared to the global feature approaches.
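A greatly simplified sketch of wavelet-based saliency (the paper's algorithm tracks coefficients across scales; here we merely take the largest coarse-level detail coefficients, with `num_points`, `wavelet` and `level` as assumed parameters):

```python
import numpy as np
import pywt

def salient_points(gray_image, num_points=50, wavelet="haar", level=3):
    """Pick candidate salient points as the locations of the largest
    detail coefficients at the coarsest wavelet level, mapped back to
    approximate image coordinates."""
    coeffs = pywt.wavedec2(gray_image, wavelet, level=level)
    ch, cv, cd = coeffs[1]                      # coarsest detail bands
    energy = ch**2 + cv**2 + cd**2
    top = np.argsort(energy.ravel())[::-1][:num_points]
    rows, cols = np.unravel_index(top, energy.shape)
    scale = 2 ** level                          # coarse-to-pixel mapping
    return list(zip(rows * scale, cols * scale))
```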

35 citations


Proceedings ArticleDOI
Li Zhao1, Wei Qi2, Yi-Jin Wang1, Shiqiang Yang1, Hong-Jiang Zhang2 
TL;DR: In this paper, the authors presented an effective approach to video scene segmentation based on probabilistic model merging, where they regard the shots in a video sequence as hidden state variables and use probabilistic clustering to get the best clustering performance.
Abstract: For more efficient organization, browsing, and retrieval of digital video, it is important to extract video structure information at both the scene and shot levels. This paper presents an effective approach to video scene segmentation based on probabilistic model merging. In our proposed method, we regard the shots in a video sequence as hidden state variables and use probabilistic clustering to get the best clustering performance. The experimental results show that our method produces reasonable clustering results based on the visual content. A project named HomeVideo is introduced to show the application of the proposed method to personal video material management.

30 citations


Proceedings ArticleDOI
TL;DR: This work proposes a method to automatically generate video summaries for long videos by segmenting the video into small, coherent segments and ranking the resulting segments, scoring each segment based on word frequency analysis of speech transcripts.
Abstract: Compact representations of video data can enable efficient video browsing. Such representations provide the user with information about the content of the particular sequence being examined while preserving the essential message. We propose a method to automatically generate video summaries for long videos. Our video summarization approach involves two main tasks: first, segmenting the video into small, coherent segments and, second, ranking the resulting segments. Our proposed algorithm scores segments based on word frequency analysis of speech transcripts. A summary is then generated by selecting the segments with the highest score-to-duration ratios and concatenating them. We have designed and performed a user study to evaluate the quality of the generated summaries. Based on statistical analysis of the user study results, we compare our proposed algorithm with a random segment selection scheme. Finally, we discuss various issues that arise in the evaluation of automatically generated video summaries.
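The scoring step lends itself to a short sketch: score each transcript segment by summed corpus-wide word frequency, then rank by score-to-duration ratio as the abstract describes. The stopword list and tie-breaking are assumptions, not the paper's specifics.

```python
from collections import Counter

def rank_segments(segments, stopwords=frozenset({"the", "a", "and", "of", "to"})):
    """Rank (transcript_text, duration_sec) segments by the ratio of
    word-frequency score to duration; highest ratios come first."""
    corpus = Counter(w for text, _ in segments
                     for w in text.lower().split() if w not in stopwords)
    ranked = []
    for text, duration in segments:
        score = sum(corpus[w] for w in text.lower().split()
                    if w not in stopwords)
        ranked.append((score / max(duration, 1e-6), text))
    return sorted(ranked, reverse=True)
```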

Proceedings ArticleDOI
TL;DR: In this article, an adaptation of k-means clustering using a non-Euclidean similarity metric is applied to discover the natural patterns of the data in the low-level feature space; the cluster prototype is designed to summarize the cluster in a manner that is suited for quick human comprehension of its components.
Abstract: Humans tend to use high-level semantic concepts when querying and browsing multimedia databases; there is thus a need for systems that extract these concepts and make annotations available for the multimedia data. The system presented in this paper satisfies this need by automatically generating semantic concepts for images from their low-level visual features. The proposed system is built in two stages. First, an adaptation of k-means clustering using a non-Euclidean similarity metric is applied to discover the natural patterns of the data in the low-level feature space; the cluster prototype is designed to summarize the cluster in a manner that is suited for quick human comprehension of its components. Second, statistics measuring the variation within each cluster are used to derive a set of mappings between the most significant low-level features and the most frequent keywords of the corresponding cluster. The set of derived rules could further be used to capture the semantic content and index new untagged images added to the image database. The attachment of semantic concepts to images also gives the system the advantage of handling queries expressed in terms of keywords, thus reducing the semantic gap between the user's conceptualization of a query and the query that is actually specified to the system. While the suggested scheme works with any kind of low-level features, our implementation and description of the system is centered on the use of image color information. Experiments using a 2100-image database are presented to show the efficacy of the proposed system.
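As an illustration of k-means under a non-Euclidean metric (here cosine similarity, i.e. spherical k-means; the paper's actual metric may differ):

```python
import numpy as np

def kmeans_cosine(X, k, iters=20, seed=0):
    """k-means variant using cosine similarity: assign each feature
    vector to the most similar prototype, then recompute prototypes as
    renormalized cluster means."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = Xn[rng.choice(len(Xn), k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(Xn @ centers.T, axis=1)   # cosine similarity
        for j in range(k):
            members = Xn[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return labels, centers
```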

Journal Article
TL;DR: In this paper, ground penetrating radar and thermal IR sensors were used to identify non-metallic landmines under different soil and water content conditions; of the soil properties affecting detection, the most important is water content, since it directly influences the three other properties.
Abstract: Land mines are a major problem in many areas of the world. In spite of the fact that many different types of landmine sensors have been developed, the detection of non-metallic land mines remains very difficult. Most landmine detection sensors are affected by soil properties such as water content, temperature, electrical conductivity and dielectric constant. The most important of these is water content, since it directly influences the three other properties. In this study, ground penetrating radar and thermal IR sensors were used to identify non-metallic landmines under different soil and water content conditions.

Proceedings ArticleDOI
TL;DR: A summarization system for processing incoming video, extracting and analyzing closed caption text, determining the boundaries of program segments as well as commercial breaks and extracting a program summary from a complete broadcast to enable video transparency is presented.
Abstract: Today, consumers are facing an ever-increasing amount of television programming. The problem, however, is that the content of video programs is opaque. The existing video watching options for consumers are either to watch the whole video, fast-forward to try to find the relevant portion, or use electronic program guides to get additional information. In this paper we present a summarization system for processing incoming video, extracting and analyzing closed caption text, determining the boundaries of program segments as well as commercial breaks, and extracting a program summary from a complete broadcast to enable video transparency. The system consists of: a transcript extractor, program type classifier, cue extractor, knowledge database, temporal database, inference engine, and summarizer. The main topics that will be discussed are video summaries, video categorization and retrieval tools.

Proceedings ArticleDOI
TL;DR: A set of automatically extractable, known and novel, descriptors of motion activity based on different hypotheses about subjective perception ofmotion activity are presented and it is found that the MPEG-7 motion activity descriptor is one of the best in overall performance over the test set.
Abstract: We present a psycho-visual and analytical framework for automatic measurement of motion activity in video sequences. We construct a test set of video segments by carefully selecting segments from the MPEG-7 video test set. We construct a ground truth based on subjective tests with naive subjects. We find that the subjects agree reasonably on the motion activity of video segments, which makes the ground truth reliable. We present a set of automatically extractable, known and novel, descriptors of motion activity based on different hypotheses about subjective perception of motion activity. We show that all the descriptors perform well against the ground truth. We find that the MPEG-7 motion activity descriptor, based on the variance of motion vector magnitudes, is one of the best in overall performance over the test set.
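Following the abstract's description of the MPEG-7 descriptor, a minimal sketch computes the spread of motion vector magnitudes and quantizes it to an activity level; the quantizer thresholds below are illustrative assumptions, not the normative MPEG-7 values.

```python
import numpy as np

def motion_activity(motion_vectors):
    """Motion activity from macroblock motion vectors (any array
    reshapeable to (-1, 2)): standard deviation of vector magnitudes,
    quantized to a 1-5 activity level with assumed thresholds."""
    mags = np.linalg.norm(np.asarray(motion_vectors).reshape(-1, 2), axis=1)
    sigma = mags.std()
    thresholds = [1.0, 3.0, 6.0, 10.0]      # assumed quantizer bins
    return sigma, 1 + sum(sigma > t for t in thresholds)
```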

Proceedings ArticleDOI
TL;DR: This paper proposes a new relevance feedback approach with progressive learning capability based on a Bayesian classifier; it treats positive and negative feedback examples with different strategies and can utilize previous users' feedback information to help the current query.
Abstract: As an effective solution to content-based image retrieval problems, relevance feedback has received much attention over the past few years. In this paper, we propose a new relevance feedback approach with progressive learning capability. It is based on a Bayesian classifier and treats positive and negative feedback examples with different strategies. It can utilize previous users' feedback information to help the current query. Experimental results show that our algorithm achieves high accuracy and effectiveness on real-world image collections.

Proceedings ArticleDOI
TL;DR: The novel ability of the approach to use the information content in multiple media coupled with a strong emphasis on temporal similarity differentiates it from the state-of-the-art in content-based retrieval.
Abstract: A necessary capability for content-based retrieval is to support the paradigm of query by example. In the past, there have been several attempts to use low-level features for video retrieval. None of these approaches, however, uses the multimedia information content of the video. We present an algorithm for matching multimodal patterns for the purpose of content-based video retrieval. The novel ability of our approach to use the information content in multiple media, coupled with a strong emphasis on temporal similarity, differentiates it from the state of the art in content-based retrieval. At the core of the pattern matching scheme is a dynamic programming algorithm, which leads to a significant improvement in performance. By coupling the use of audio with video, this algorithm can be applied to grouping shots based on audio-visual similarity. This is much more effective in constructing scenes from shots than using only visual content.
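The classic dynamic-programming alignment for temporally ordered feature sequences is dynamic time warping; the sketch below illustrates the general idea (the paper's actual recursion and features may differ).

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping between two feature sequences (lists of
    vectors): temporally similar patterns get a low alignment cost."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(seq_a[i - 1])
                               - np.asarray(seq_b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```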

Proceedings ArticleDOI
TL;DR: Investigating the relative effectiveness of several types of global shape feature, both singly and in combination, suggests that a wide variety of shape feature combinations can provide adequate discriminating power for effective shape retrieval in multi-component image collections such as trademark registries.
Abstract: Many different kinds of features have been used as the basis for shape retrieval from image databases. This paper investigates the relative effectiveness of several types of global shape feature, both singly and in combination. The features compared include well-established descriptors such as Fourier coefficients and moment invariants, as well as recently-proposed measures of triangularity and ellipticity. Experiments were conducted within the framework of the ARTISAN shape retrieval system, and retrieval effectiveness assessed on a database of over 10,000 images, using 24 queries and associated ground truth supplied by the UK Patent Office. Our experiments revealed only minor differences in retrieval effectiveness between different measures, suggesting that a wide variety of shape feature combinations can provide adequate discriminating power for effective shape retrieval in multi-component image collections such as trademark registries. Marked differences between measures were observed for some individual queries, suggesting that there could be considerable scope for improving retrieval effectiveness by providing users with an improved framework for searching multi-dimensional feature space.
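One of the well-established descriptor families named above, moment invariants, is readily computed with OpenCV; this is a generic sketch, not ARTISAN's implementation.

```python
import cv2
import numpy as np

def hu_moment_features(binary_mask):
    """Compute the seven Hu moment invariants of a binary shape mask,
    log-scaled so the components have comparable magnitudes."""
    moments = cv2.moments(binary_mask.astype(np.uint8), binaryImage=True)
    hu = cv2.HuMoments(moments).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
```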

Proceedings ArticleDOI
TL;DR: Novel classification algorithms are presented for distinguishing photo-like images from graphical images, true photos from merely photo-like but artificial images, and presentation slides from comics.
Abstract: Numerous research works on the extraction of low-level features from images and videos have been published. However, only recently has the focus shifted to exploiting low-level features to classify images and videos automatically into semantically meaningful and broad categories. In this paper, novel classification algorithms are presented for three broad, general-purpose categories. In detail, we present algorithms for distinguishing photo-like images from graphical images, true photos from merely photo-like but artificial images, and presentation slides from comics. On a large image database, our classification algorithm achieved an accuracy of 97.3% in separating photo-like images from graphical images. In the subset of photo-like images, true photos could be separated from ray-traced/rendered images with an accuracy of 87.3%, while the subset of graphical images was successfully partitioned into presentation slides and comics with an accuracy of 93.2%.

Proceedings ArticleDOI
TL;DR: The algorithm developed in this paper differs from traditional techniques for logo detection and classification, which are applicable either to well-structured general text documents or to specialized trademark logo databases, where logos appear isolated on a clear background and where their detection and classification is not disturbed by the surrounding visual detail.
Abstract: This paper presents a novel approach to detecting and classifying a trademark logo in frames of a sport video. In view of the fact that we attempt to detect and recognize a logo in a natural scene, the algorithm developed in this paper differs from traditional techniques for logo detection and classification, which are applicable either to well-structured general text documents (e.g. invoices, memos, bank cheques) or to specialized trademark logo databases, where logos appear isolated on a clear background and where their detection and classification is not disturbed by the surrounding visual detail. Although the development of our algorithm is still in its early phase, experimental results obtained so far on a set of soccer TV broadcasts are very encouraging.

Keywords: trademark logo detection and recognition, video indexing, video content analysis

1. INTRODUCTION. Measuring the statistics of the appearance of a trademark logo in a video is of great importance in the marketing and sponsoring sector. These statistics take into account the number N of appearances of a certain logo in a video, the positions L of the logo in a frame, and the duration T of each logo appearance. Based on these three parameters one can compute the likelihood that a logo was noticed by a wide TV audience at home. As such, the parameters N, L and T can be used to determine the optimal position to place a logo in a scene that is to be broadcast and to determine the advertising price per position. Finally, these parameters can serve as feedback for a sponsor to check whether the visibility of a logo in a video justifies its sponsorship engagement: the visibility of a logo can be seen as an indication of the expected revenue for the sponsor.

Figure 1 (not reproduced): examples of frames taken from a soccer video containing logos to be detected and classified: (a) Champions League logo in the middle of the stadium, (b) ABN-AMRO bank logo on a player's shirt, (c) EuroCard, AMSTEL BEER and Canon logos on the boards surrounding the playing field.
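The N, L, T statistics defined in the introduction are easy to aggregate once per-frame detections exist; the sketch below assumes a hypothetical `detections` mapping from consecutive frame indices to detected (logo, bounding box) pairs, which is not part of the paper.

```python
def logo_statistics(detections, fps=25.0):
    """Aggregate per-frame logo detections into N (appearance count),
    L (list of positions) and T (total on-screen seconds) per logo.
    `detections`: {frame_index: [(logo_name, bounding_box), ...]},
    with consecutive frame indices assumed."""
    stats = {}
    prev = set()
    for frame in sorted(detections):
        current = set()
        for name, box in detections[frame]:
            s = stats.setdefault(name, {"N": 0, "L": [], "T": 0.0})
            s["L"].append(box)
            s["T"] += 1.0 / fps
            if name not in prev:
                s["N"] += 1          # a new, uninterrupted appearance
            current.add(name)
        prev = current
    return stats
```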

Journal Article
TL;DR: In this paper, the authors investigated the influence of soil texture and water content on surface soil temperatures above antitank mines buried at 15 cm depth and away from them, and found that the thermal signature of an antitank mine strongly depends on the complex interaction between soil texture, water content, and geographical location.

Abstract: The objective of this study is to expand our exploration of the effects of the soil environment on landmine detection by investigating the influence of soil texture and water content on surface soil temperatures above antitank mines buried at 15 cm depth and away from them. Temperature distributions in July were calculated for six soil textures under the climatic conditions of Kuwait and Sarajevo. We evaluated the temperature distributions in typical dry and wet soil profiles. The simulated temperature differences varied from 0.22-0.63 degrees Celsius in Kuwait to 0.16-0.37 degrees Celsius in Sarajevo. Temperature differences were, with one exception, larger in the wet than in the dry soils, which suggests that soil watering may help improve thermal signatures. A major finding of this study is that the thermal signature of an antitank mine strongly depends on the complex interaction between soil texture, water content, and geographical location. It is very difficult to predict the exact time or even the approximate hour of the appearance or nonappearance of a thermal signature. Therefore, this modeling study indicates that the use of a thermal sensor in a real minefield for instantaneous mine detection carries a high risk. On the other hand, if a given area can be monitored constantly with a thermal sensor for twelve hours or longer, the thermal signature will be detected if the signal-to-noise ratio of the mine environment allows it. Field experiments are needed to validate the results of this modeling study.

Proceedings ArticleDOI
TL;DR: There is a substantial body of research on computer methods for managing collections of images and videos, but there is little evidence that this research has had important impact on any community yet.
Abstract: There is a substantial body of research on computer methods for managing collections of images and videos. There is little evidence that this research has had important impact on any community yet. I use an invitation to speak on a topic on which I am not expert to air some opinions about evaluating image retrieval research. In my opinion, there is little to be gained in measuring current solutions with reference collections, because these solutions differ so widely from user needs that the exercise becomes empty. The user studies literature is not well enough read by the image retrieval community. As a result, we tend to study somewhat artificial problems. A study of the user needs literature suggests that we will need to solve deep problems to produce useful solutions to image retrieval problems, but that there may be a need for a number of technologies that can be built in practice. I believe we should concentrate on these issues, rather than on measuring the performance of current systems.

Proceedings ArticleDOI
TL;DR: This paper presents an image retrieval method based on region shape similarity, which first segments images into primitive regions and then combines some of the primitive regions to generate meaningful composite shapes, which are used as semantic units of the images during the similarity assessment process.
Abstract: This paper presents an image retrieval method based on region shape similarity. In our approach, we first segment images into primitive regions and then combine some of the primitive regions to generate meaningful composite shapes, which are used as semantic units of the images during the similarity assessment process. We employ three global shape features and a set of normalized Fourier descriptors to characterize each meaningful shape. All these features are invariant under similarity transformations. Finally, we measure the similarity between two images by finding the most similar pair of shapes in the two images. Our approach has demonstrated good performance in our retrieval experiments on clipart images.
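Normalized Fourier descriptors of a closed boundary can be sketched as follows (a standard construction; the paper's normalization details are assumptions here): dropping the DC term gives translation invariance, dividing by the first harmonic gives scale invariance, and keeping magnitudes discards rotation and starting point.

```python
import numpy as np

def fourier_descriptors(contour, num_coeffs=16):
    """Normalized Fourier descriptors of a closed contour given as an
    (n, 2) array of boundary points."""
    z = contour[:, 0] + 1j * contour[:, 1]   # complex boundary signal
    mags = np.abs(np.fft.fft(z))
    return mags[1:num_coeffs + 1] / (mags[1] + 1e-12)
```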

Proceedings ArticleDOI
TL;DR: In this paper, a knowledge-based approach for video content classification is proposed, which is motivated by the fact that videos are rich in semantic contents, which can best be interpreted and analyzed by human experts.
Abstract: A framework for video content classification using a knowledge-based approach is herein proposed. This approach is motivated by the fact that videos are rich in semantic content, which can best be interpreted and analyzed by human experts. We demonstrate the concept by implementing a prototype video classification system using the rule-based programming language CLIPS 6.05. Knowledge for video classification is encoded as a set of rules in the rule base. The left-hand sides of rules contain high-level and low-level features, while the right-hand sides of rules contain intermediate results or conclusions. Our current implementation includes features computed from motion, color, and text extracted from video frames. Our current rule set allows us to classify input video into one of five classes: news, weather reporting, commercial, basketball and football. We use MYCIN's inexact reasoning method for combining evidence and for handling the uncertainties in the features and in the classification results. We obtained good results in a preliminary experiment, demonstrating the validity of the proposed approach.
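MYCIN's evidence-combination rule, referenced above, is compact enough to show directly; this is the standard certainty-factor formula, independent of the authors' CLIPS rule base.

```python
def combine_cf(cf1, cf2):
    """MYCIN's rule for combining two certainty factors from
    independent pieces of evidence (values in [-1, 1])."""
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

# Two rules each suggesting "basketball" with moderate certainty:
print(combine_cf(0.6, 0.5))   # -> 0.8
```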

Proceedings ArticleDOI
TL;DR: An approach is presented which classifies a segmented video object based on its appearance by matching curvature features of the contours of its views to a database containing preprocessed views of prototypical objects, using a modified curvature scale space technique.
Abstract: The recognition of objects that appear in a video sequence is an essential aspect of any video content analysis system. We present an approach which classifies a segmented video object based on its appearance in successive video frames. The classification is performed by matching curvature features of the contours of these object views to a database containing preprocessed views of prototypical objects, using a modified curvature scale space technique. By integrating the results over a number of successive frames and by using the modified curvature scale space technique as an efficient representation of object contours, our approach enables robust, tolerant and rapid classification of video objects.
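A bare-bones curvature scale space sketch (the standard construction, not the authors' modified variant): smooth the contour with Gaussians of increasing width and record where the curvature changes sign at each scale.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def css_zero_crossings(contour, sigmas=(1, 2, 4, 8, 16)):
    """For a closed contour given as an (n, 2) array, return the
    curvature zero-crossing positions at each smoothing scale."""
    x = contour[:, 0].astype(float)
    y = contour[:, 1].astype(float)
    crossings = {}
    for s in sigmas:
        xs = gaussian_filter1d(x, s, mode="wrap")
        ys = gaussian_filter1d(y, s, mode="wrap")
        dx, dy = np.gradient(xs), np.gradient(ys)
        ddx, ddy = np.gradient(dx), np.gradient(dy)
        curvature = dx * ddy - dy * ddx      # numerator; sign suffices
        crossings[s] = np.flatnonzero(np.diff(np.sign(curvature)) != 0)
    return crossings
```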

Proceedings ArticleDOI
TL;DR: To provide users with an overview of medical video content at various levels of abstraction which can be used for more efficient database browsing and access, a hierarchical video summarization strategy has been developed and is presented in this paper.
Abstract: To provide users with an overview of medical video content at various levels of abstraction which can be used for more efficient database browsing and access, a hierarchical video summarization strategy has been developed and is presented in this paper. To generate an overview, the key frames of a video are preprocessed to extract special frames (black frames, slides, clip art, sketch drawings) and special regions (faces, skin or blood-red areas). A shot grouping method is then applied to merge the spatially or temporally related shots into groups. The visual features and knowledge from the video shots are integrated to assign the groups into predefined semantic categories. Based on the video groups and their semantic categories, video summaries for different levels are constructed by group merging, hierarchical group clustering and semantic category selection. Based on this strategy, a user can select the layer of the summary to access. The higher the layer, the more concise the video summary; the lower the layer, the greater the detail contained in the summary.

Proceedings ArticleDOI
TL;DR: Logistic regression is introduced to model the dependence between image features and the relevance that is implicitly defined by user feedback, and the diagnostics that are an integral part of the regression procedure can be harnessed for adaptive feature selection by removing features that have low predictive power.
Abstract: We introduce logistic regression to model the dependence between image-features and the relevance that is implicitly defined by user-feedback. We assume that while browsing, the user can single out images as either examples or counter-examples of the sort of picture he is looking for. Based on this information, the system will construct logistic regression models that generalize this relevance probability to all images in the database. This information is then used to iteratively bias the next sample from the database. Furthermore, the diagnostics that are an integral part of the regression procedure can be harnessed for adaptive feature selection by removing features that have low predictive power.
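A minimal sketch of this feedback loop, with the caveat that the coefficient-magnitude filter below is a stand-in for the paper's actual regression diagnostics: fit a logistic model on marked examples and counter-examples, prune weak features, and score the database with relevance probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relevance_scores(X_examples, y_feedback, X_database, keep=0.5):
    """Fit logistic regression on user feedback (y=1 examples, y=0
    counter-examples), keep only the strongest features, then return
    relevance probabilities for the whole database."""
    clf = LogisticRegression(max_iter=1000).fit(X_examples, y_feedback)
    weights = np.abs(clf.coef_[0])
    selected = weights >= np.quantile(weights, 1 - keep)  # top features
    clf = LogisticRegression(max_iter=1000).fit(
        X_examples[:, selected], y_feedback)
    return clf.predict_proba(X_database[:, selected])[:, 1]
```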

Proceedings ArticleDOI
TL;DR: In this article, a tree-structured key frame hierarchy is proposed for video summarization and retrieval by introducing a notion of fidelity, which describes how well the key frames at one level are represented by the parent key frame relative to the other children of the parent.
Abstract: Recently, the huge amount of video data available in digital form has given users more ubiquitous access to visual information than ever. To efficiently manage such a huge amount of video data, we need tools such as video summarization and search. In this paper, we propose a novel scheme allowing for both scalable hierarchical video summarization and efficient retrieval by introducing a notion of fidelity. The notion of fidelity in the tree-structured key frame hierarchy describes how well the key frames at one level are represented by the parent key frame, relative to the other children of the parent. The experimental results demonstrate the feasibility of our scheme.
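One plausible reading of the fidelity measure, offered only as an assumption since the abstract does not give the formula: take the worst-case feature distance from a parent key frame to any of its children, so a lower value means the parent is a more faithful summary node.

```python
import numpy as np

def fidelity(parent_feat, child_feats):
    """Worst-case (maximum) feature distance from a parent key frame to
    its children in the key frame tree; an assumed interpretation of
    the paper's fidelity measure, lower meaning more faithful."""
    return max(np.linalg.norm(np.asarray(c) - np.asarray(parent_feat))
               for c in child_feats)
```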

Proceedings ArticleDOI
TL;DR: A technique for video summarization that uses motion descriptors computed in the compressed domain to speed up conventional color-based video summarization techniques by reducing the number of segments on which computationally more expensive color-based computation is needed.
Abstract: We describe a technique for video summarization that uses motion descriptors computed in the compressed domain to speed up conventional color-based video summarization techniques. The basic hypothesis of the work is that the intensity of motion activity of a video segment is a direct indication of its 'summarizability.' We present experimental verification of this hypothesis. We are thus able to quickly identify easy-to-summarize segments of a video sequence, since they have a low intensity of motion activity. Moreover, the compressed-domain extraction of motion activity intensity is much simpler than the color-based calculations. We are able to easily summarize these segments by simply choosing a key frame at random from each low-activity segment. We can then apply conventional color-based summarization techniques to the remaining segments. We are thus able to speed up color-based summarization techniques by reducing the number of segments on which computationally more expensive color-based computation is needed.

Proceedings ArticleDOI
TL;DR: The idea of the MPEG-7 descriptors presented within this paper is to provide, by means of MPEG-7 metadata, transcoding hints to a transcoder regarding motion and encoding difficulty, so that these hints can be used at the transcoder to preserve visual quality in terms of PSNR.
Abstract: This paper concerns the extraction methodology and usage of MPEG-7 metadata for video transcoding. The idea of the MPEG-7 descriptors presented within this paper is to provide, by means of MPEG-7 metadata, transcoding hints to a transcoder regarding motion and encoding difficulty. These transcoding hints can be used at the transcoder (1) to preserve the visual quality in terms of PSNR, (2) to modify the GOP structure for efficient storage and retrieval for fast video browsing, while (3) reducing the overall computational complexity significantly.