Maaike H. T. de Boer
Other affiliations: Radboud University Nijmegen
Bio: Maaike H. T. de Boer is an academic researcher from the Netherlands Organisation for Applied Scientific Research. The author has contributed to research on topics including TRECVID and decision support systems, has an h-index of 7, and has co-authored 23 publications receiving 154 citations. Previous affiliations of Maaike H. T. de Boer include Radboud University Nijmegen.
06 Jun 2016
TL;DR: A state-of-the-art framework for event search that needs neither exemplar videos nor textual metadata in the search corpus; its core is a large, pre-built bank of concept detectors that describes the content of a video in terms of object, scene, action and activity concepts.
Abstract: Complex video event detection without visual examples is a very challenging problem in multimedia retrieval. We present a state-of-the-art framework for event search that requires neither exemplar videos nor textual metadata in the search corpus. To perform event search given only query words, the core of our framework is a large, pre-built bank of concept detectors that describes the content of a video in terms of object, scene, action and activity concepts. Leveraging such knowledge can effectively narrow the semantic gap between the textual query and the visual content of videos. Beyond the large concept bank, this paper focuses on two challenges that strongly affect retrieval performance as the size of the concept bank increases: (1) how to choose the right concepts in the concept bank to accurately represent the query; and (2) if noisy concepts are inevitably chosen, how to minimize their influence. We share our insights on these problems, which paves the way for a practical system that achieved the best performance in NIST TRECVID 2015.
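The concept-bank idea above can be sketched in a few lines: pick the detectors that match the query words and rank videos by their averaged detector responses, with no exemplar videos involved. This is a minimal illustration, not the paper's system; the concept names and scores below are invented for the example.

```python
import numpy as np

def rank_videos(query_concepts, concept_scores, concept_names):
    """Rank videos by the mean response of the detectors selected for the
    query; no exemplar videos are needed (zero-example retrieval)."""
    idx = [concept_names.index(c) for c in query_concepts if c in concept_names]
    video_scores = concept_scores[:, idx].mean(axis=1)
    return np.argsort(-video_scores)

# Toy concept bank: rows = videos, columns = detector confidences.
concept_names = ["dog", "cake", "parade", "crowd"]
concept_scores = np.array([
    [0.9, 0.1, 0.0, 0.2],   # video 0: mostly "dog"
    [0.1, 0.8, 0.1, 0.3],   # video 1: mostly "cake"
    [0.0, 0.2, 0.9, 0.8],   # video 2: "parade" + "crowd"
])
ranking = rank_videos(["parade", "crowd"], concept_scores, concept_names)
print(ranking)  # [2 1 0]
```

As the abstract notes, the hard part in practice is the selection step itself: which columns of the concept bank actually represent the query, and how to limit the damage when noisy ones are chosen.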
TL;DR: Query expansion using the common knowledge bases ConceptNet and Wikipedia is compared to an expert description of the topic for content-based retrieval of complex events; the results show that query expansion can improve performance over no expansion when the main noun of the query cannot be matched to a concept detector.
Abstract: A common approach in content-based video information retrieval is to perform automatic shot annotation with semantic labels using pre-trained classifiers. The visual vocabulary of state-of-the-art automatic annotation systems is limited to a few thousand concepts, which creates a semantic gap between the semantic labels and the natural language query. One of the methods to bridge this semantic gap is to expand the original user query using knowledge bases. Both common knowledge bases, such as Wikipedia, and expert knowledge bases, such as a manually created ontology, can be used for this. Expert knowledge bases yield the highest performance but are only available in closed domains, since only there can all necessary information, including structure and disambiguation, be made available in a knowledge base. Common knowledge bases are often used in the open domain because they cover a great deal of general information. In this research, query expansion using the common knowledge bases ConceptNet and Wikipedia is compared to an expert description of the topic, applied to content-based information retrieval of complex events. We run experiments on the Test Set of TRECVID MED 2014. Results show that (1) query expansion can improve performance compared to no query expansion when the main noun of the query cannot be matched to a concept detector; (2) query expansion using expert knowledge is not necessarily better than query expansion using common knowledge; (3) ConceptNet performs slightly better than Wikipedia; and (4) late fusion can slightly improve performance. To conclude, query expansion has potential in complex event detection.
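The expansion-then-match step described above can be sketched as follows. The `RELATED` table is a hypothetical stand-in for a ConceptNet or Wikipedia lookup, and the detector vocabulary is invented for illustration; the point is that expansion only helps insofar as the expanded terms land on concepts a detector exists for.

```python
# Minimal sketch of knowledge-base query expansion. RELATED stands in for
# a ConceptNet/Wikipedia lookup; DETECTOR_VOCAB lists the concepts for
# which a pre-trained detector exists.
RELATED = {
    "birthday": ["cake", "balloon", "party"],
    "party": ["crowd", "dancing"],
}
DETECTOR_VOCAB = {"cake", "balloon", "crowd", "car"}

def expand_query(query_words):
    expanded = set(query_words)
    for w in query_words:
        expanded.update(RELATED.get(w, []))
    # Keep only terms that can be matched to a concept detector.
    return sorted(expanded & DETECTOR_VOCAB)

print(expand_query(["birthday", "party"]))  # ['balloon', 'cake', 'crowd']
```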
22 Feb 2018
TL;DR: This thesis aims to assist an analyst working on video stream data by providing a search capability that handles ad-hoc textual queries, i.e. queries that include concepts or events that are not pre-trained. It proposes an incremental word2vec (i-w2v) method for query-to-concept mapping and an adaptive relevance feedback method (ARF) that has higher visual search effectiveness than k-NN-based methods on video-level annotations and methods based on concept-level annotations.
Abstract: In the modern world, networked sensor technology makes it possible to capture the world around us in real time. In the security domain, cameras are an important source of information. Cameras in public places, bodycams, drones and recordings with smartphones are used for real-time monitoring of the environment to prevent crime (monitoring case) and/or for the investigation and retrieval of crimes, for example in evidence forensics (forensic case). In both cases it is necessary to obtain the right information quickly, without having to search through the data manually. Currently, many algorithms are available to index a video with pre-trained concepts, such as people, objects and actions. These algorithms require a sufficiently large and representative set of examples (training data) to recognize the concept. This training data is, however, not always available. In this thesis, we aim to assist an analyst in their work on video stream data by providing a search capability that handles ad-hoc textual queries, i.e. queries that include concepts or events that are not pre-trained. We use the security domain as inspiration for our work, but the analyst can be working in any application domain that uses video stream data, or even indexed data. Additionally, we only consider the technical aspects of the search capability, not the legal, ethical or privacy issues related to video stream data. We focus on the retrieval of high-level events, such as birthday parties. We assume that such an event can be composed of smaller pre-trained concepts, such as a group of people, a cake and decorations, and relations between those concepts, to capture the essence of that unseen event (decompositionality assumption). Additionally, we hold the open-world assumption, i.e. the system does not have complete world knowledge.
Although current state-of-the-art systems are able to detect an increasingly large number of concepts, this number still falls far behind the near-infinite number of possible (textual) queries that a system needs to be able to handle. In our aim to assist the analyst, we focus on improving visual search effectiveness (i.e. performance) through semantic query-to-concept mapping: the mapping from the user query to the set of pre-trained concepts. We use the TRECVID Multimedia Event Detection benchmark, as it contains high-level events inspired by the security domain. In this thesis, we show that the main improvements can be achieved by a combination of (i) query-to-concept mapping based on semantic word embeddings (+12%), (ii) exploiting user feedback (+26%) and (iii) fusion of different modalities (data sources) (+17%). First, we propose an incremental word2vec (i-w2v) method, which uses word2vec trained on GoogleNews items as a semantic embedding model and incrementally adds concepts to the set of selected concepts for a query in order to deal with query drift. This method improves performance in terms of MAP compared to the state-of-the-art word2vec method and knowledge-based techniques. In combination with a state-of-the-art video event retrieval pipeline, we achieve top performance on the TRECVID MED benchmark for the zero-example task (MED14Test results). This improvement is, however, dependent on the availability of the concepts in the Concept Bank: without concepts related to or occurring in the event, we cannot detect the event. We thus need a properly composed Concept Bank to index videos well. Second, we propose an Adaptive Relevance Feedback interpretation method named ARF that not only achieves high retrieval performance, but is also theoretically founded, via the Rocchio algorithm from the text retrieval field.
This algorithm is adjusted to the event retrieval domain so that the weights for the concepts are changed based on positive and negative annotations on videos. The ARF method has higher visual search effectiveness than k-NN-based methods on video-level annotations and methods based on concept-level annotations. Third, we propose blind late fusion methods that build on state-of-the-art methods such as average fusion or fusion based on probabilities. In particular, the combination of a Joint Ratio (ratio of probabilities) and Extreme Ratio (ratio of minimum and maximum) method (JRER) achieves high performance when the detectors are reliable, i.e. trained with enough examples. This method is applicable not only to the video retrieval field, but also to sensor fusion in general. Although future work can be done in the direction of implicit query-to-concept mapping through deep learning methods, smartly combining the concepts, and the usage of spatial and temporal information, we have shown that our proposed methods can improve visual search effectiveness through semantic query-to-concept mapping, which brings us a step closer to a search capability that handles ad-hoc textual queries for analysts.
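The ARF interpretation of relevance feedback follows the classic Rocchio update from text retrieval: concept weights move toward the mean of positively annotated videos and away from the mean of negatively annotated ones. A minimal sketch of that update (the alpha, beta and gamma values are the traditional text-retrieval defaults, not necessarily those used in the thesis):

```python
import numpy as np

def rocchio_update(w, pos, neg, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update of concept weights from annotated videos.
    pos/neg: arrays of concept-score vectors for videos the analyst
    marked relevant / non-relevant."""
    w = alpha * np.asarray(w, dtype=float)
    if len(pos):
        w += beta * np.mean(pos, axis=0)
    if len(neg):
        w -= gamma * np.mean(neg, axis=0)
    return np.clip(w, 0.0, None)  # keep concept weights non-negative

w0 = np.array([0.5, 0.5])
pos = np.array([[0.9, 0.1]])   # relevant video: first concept fires
neg = np.array([[0.0, 0.8]])   # non-relevant video: second concept fires
w1 = rocchio_update(w0, pos, neg)
print(w1)  # weight of concept 0 rises, concept 1 falls
```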
TL;DR: The Semantic Event Retrieval System is presented, which shows the importance of high-level concepts in a vocabulary for the retrieval of complex and generic high-level events and uses a novel concept selection method (i-w2v) based on semantic embeddings.
Abstract: Searching in digital video data for high-level events, such as a parade or a car accident, is challenging when the query is textual and lacks visual example images or videos. Current research in deep neural networks is highly beneficial for the retrieval of high-level events using visual examples, but without examples it is still hard to (1) determine which concepts are useful to pre-train (Vocabulary challenge) and (2) which pre-trained concept detectors are relevant for a certain unseen high-level event (Concept Selection challenge). In our article, we present our Semantic Event Retrieval System which (1) shows the importance of high-level concepts in a vocabulary for the retrieval of complex and generic high-level events and (2) uses a novel concept selection method (i-w2v) based on semantic embeddings. Our experiments on the international TRECVID Multimedia Event Detection benchmark show that a diverse vocabulary including high-level concepts improves performance on the retrieval of high-level events in videos and that our novel method outperforms a knowledge-based concept selection method.
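The i-w2v selection strategy can be sketched as a greedy loop: rank concepts by embedding similarity to the query and keep adding them only while the aggregate of the selected set moves closer to the query, which guards against query drift. This is a simplified reading of the method, with 2-D toy vectors standing in for real word2vec embeddings:

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def incremental_select(query_vec, concepts):
    """Greedy i-w2v-style selection sketch: visit concepts in order of
    similarity to the query and keep one only if it raises the similarity
    between the query and the mean vector of the selected set."""
    order = sorted(concepts, key=lambda c: -cos(query_vec, concepts[c]))
    selected, best = [], -1.0
    for name in order:
        trial = selected + [name]
        sim = cos(query_vec, np.mean([concepts[c] for c in trial], axis=0))
        if sim <= best:
            break  # adding more would drift away from the query
        selected, best = trial, sim
    return selected

# Toy 2-D embeddings standing in for word2vec vectors.
concepts = {
    "parade": np.array([0.9, 0.1]),
    "crowd":  np.array([0.8, -0.1]),
    "dog":    np.array([-0.5, 0.9]),
}
selected = incremental_select(np.array([1.0, 0.0]), concepts)
print(selected)  # ['parade', 'crowd']
```

Note how "dog" is rejected: including it would pull the mean vector away from the query, which is exactly the drift the incremental stopping criterion is meant to prevent.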
TL;DR: In this paper, a set of modular design patterns for hybrid, neuro-symbolic AI systems is proposed to describe the architecture of a very large number of hybrid systems by composing only a small set of elementary patterns as building blocks.
Abstract: The unification of statistical (data-driven) and symbolic (knowledge-driven) methods is widely recognized as one of the key challenges of modern AI. Recent years have seen a large number of publications on such hybrid neuro-symbolic AI systems. That rapidly growing literature is highly diverse, mostly empirical, and is lacking a unifying view of the large variety of these hybrid systems. In this paper, we analyze a large body of recent literature and we propose a set of modular design patterns for such hybrid, neuro-symbolic systems. We are able to describe the architecture of a very large number of hybrid systems by composing only a small set of elementary patterns as building blocks. The main contributions of this paper are: 1) a taxonomically organised vocabulary to describe both processes and data structures used in hybrid systems; 2) a set of 15+ design patterns for hybrid AI systems organized in a set of elementary patterns and a set of compositional patterns; 3) an application of these design patterns in two realistic use-cases for hybrid AI systems. Our patterns reveal similarities between systems that were not recognized until now. Finally, our design patterns extend and refine Kautz’s earlier attempt at categorizing neuro-symbolic architectures.
15 Oct 2015
TL;DR: In this article, Where-CNN is used to learn a feature representation in which matching views are near one another and mismatched views are far apart, achieving significant improvements over traditional hand-crafted features and existing deep features learned from other large-scale databases.
Abstract: The recent availability of geo-tagged images and rich geospatial data has inspired a number of algorithms for image-based geolocalization. Most approaches predict the location of a query image by matching it to ground-level images with known locations (e.g., street-view data). However, most of the Earth does not have ground-level reference photos available. Fortunately, more complete coverage is provided by oblique aerial or bird's-eye imagery. In this work, we localize a ground-level query image by matching it to a reference database of aerial imagery. We use publicly available data to build a dataset of 78K aligned cross-view image pairs. The primary challenge for this task is that traditional computer vision approaches cannot handle the wide baseline and appearance variation of these cross-view pairs. We use our dataset to learn a feature representation in which matching views are near one another and mismatched views are far apart. Our proposed approach, Where-CNN, is inspired by the success of deep learning in face verification and achieves significant improvements over traditional hand-crafted features and existing deep features learned from other large-scale databases. We show the effectiveness of Where-CNN in finding matches between street-view and aerial-view imagery and demonstrate the ability of our learned features to generalize to novel locations.
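The training signal behind this kind of cross-view matching is typically a pairwise contrastive objective: matched ground/aerial pairs are pulled together while mismatched pairs are pushed at least a margin apart. A generic sketch of such an objective (not the paper's exact loss; the toy embeddings are invented):

```python
import numpy as np

def contrastive_loss(d_match, d_mismatch, margin=1.0):
    """Generic pairwise contrastive objective: matched cross-view pairs
    are pulled together, mismatched pairs pushed at least `margin` apart."""
    return d_match**2 + max(0.0, margin - d_mismatch)**2

# Toy embeddings: a ground-level view, its aerial match, and a distractor.
ground = np.array([0.1, 0.9])
aerial_match = np.array([0.2, 0.8])
aerial_other = np.array([0.9, 0.1])
loss = contrastive_loss(np.linalg.norm(ground - aerial_match),
                        np.linalg.norm(ground - aerial_other))
print(loss)  # small: the match is close and the distractor beyond the margin
```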
TL;DR: This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications – bringing out similarities and differences.
Abstract: With the ever-increasing size of the web, extracting relevant information from the Internet with a query formed by a few keywords has become a big challenge. Query expansion (QE) plays a crucial role in improving searches on the Internet. Here, the user's initial query is reformulated by adding additional meaningful terms with similar significance. QE, as part of information retrieval (IR), has long attracted researchers' attention. It has become very influential in the fields of personalized social documents, question answering, cross-language IR, information filtering and multimedia IR. Research in QE has gained further prominence because of IR-dedicated conferences such as TREC (Text REtrieval Conference) and CLEF (Conference and Labs of the Evaluation Forum). This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications, bringing out similarities and differences.
15 Jun 2019
TL;DR: In this paper, a dual deep encoding network is proposed to encode videos and queries into powerful dense representations of their own, achieving state-of-the-art performance for zero-example video retrieval.
Abstract: This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos with ad-hoc queries described in natural language text, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and end-to-end. As experiments on three benchmarks (MSR-VTT and the TRECVID 2016 and 2017 Ad-hoc Video Search tasks) show, the proposed solution establishes a new state of the art for zero-example video retrieval.
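Once both modalities are encoded into the same dense space, retrieval reduces to cosine similarity between the query vector and each video vector, with no concept vocabulary in between. A minimal sketch with invented 2-D embeddings standing in for the dual encoder's outputs:

```python
import numpy as np

def retrieve(query_emb, video_embs):
    """Rank videos by cosine similarity between the dense query embedding
    and dense video embeddings (concept-free matching)."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return np.argsort(-(v @ q), kind="stable")

# Toy 2-D embeddings; a real dual encoder would produce these vectors.
video_embs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query_emb = np.array([0.7, 0.7])
order = retrieve(query_emb, video_embs)
print(order[0])  # 1: the video whose embedding points along the query
```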
15 Oct 2018
TL;DR: The proposed model uses a language-temporal attention network to learn word attention based on the temporal context information in the video, so it can automatically select "what words to listen to" when localizing the desired moment.
Abstract: In this paper, we address the temporal moment localization issue, namely, localizing a video moment described by a natural language query in an untrimmed video. This is a general yet challenging vision-language task since it requires not only the localization of moments, but also the multimodal comprehension of textual-temporal information (e.g., "first" and "leaving") that helps to distinguish the desired moment from the others, especially those with the similar visual content. While existing studies treat a given language query as a single unit, we propose to decompose it into two components: the relevant cue related to the desired moment localization and the irrelevant one meaningless to the localization. This allows us to flexibly adapt to arbitrary queries in an end-to-end framework. In our proposed model, a language-temporal attention network is utilized to learn the word attention based on the temporal context information in the video. Therefore, our model can automatically select "what words to listen to" for localizing the desired moment. We evaluate the proposed model on two public benchmark datasets: DiDeMo and Charades-STA. The experimental results verify its superiority over several state-of-the-art methods.
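The word-attention step can be sketched as scoring each word embedding against a temporal-context vector and softmax-normalizing the scores. The toy features below are invented so that a temporal cue word such as "first" receives the highest weight; a real system would learn both the features and the scoring:

```python
import numpy as np

def word_attention(word_feats, temporal_context):
    """Score each word embedding against a temporal-context vector and
    softmax-normalize, so localization-relevant words get higher weight."""
    scores = word_feats @ temporal_context
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Invented toy features: the second dimension encodes "temporal cue-ness".
words = ["the", "person", "first", "leaves"]
word_feats = np.array([
    [0.0, 0.1],
    [0.3, 0.2],
    [0.1, 0.9],   # "first": strong temporal cue
    [0.2, 0.8],
])
att = word_attention(word_feats, np.array([0.1, 1.0]))
print(words[int(np.argmax(att))])  # first
```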