Author

Roger Zimmermann

Bio: Roger Zimmermann is an academic researcher from the National University of Singapore. The author has contributed to research in topics: Mobile device & Server. The author has an h-index of 41 and has co-authored 374 publications receiving 6236 citations. Previous affiliations of Roger Zimmermann include University of Pennsylvania & University of Southern California.


Papers
Proceedings Article
01 Jun 2018
TL;DR: A deep neural framework is proposed, termed conversational memory network, which leverages contextual information from the conversation history to recognize utterance-level emotions in dyadic conversational videos.
Abstract: Emotion recognition in conversations is crucial for the development of empathetic machines. Present methods mostly ignore the role of inter-speaker dependency relations while classifying emotions in conversations. In this paper, we address recognizing utterance-level emotions in dyadic conversational videos. We propose a deep neural framework, termed conversational memory network, which leverages contextual information from the conversation history. The framework takes a multimodal approach comprising audio, visual and textual features with gated recurrent units to model past utterances of each speaker into memories. Such memories are then merged using attention-based hops to capture inter-speaker dependencies. Experiments show an accuracy improvement of 3-4% over the state of the art.

308 citations

Journal Article
TL;DR: This survey provides an overview of the bitrate adaptation algorithms proposed for HTTP adaptive streaming over the last several years; the HAS standards do not mandate any particular algorithm, leaving it to system builders to innovate and implement their own method.
Abstract: In this survey, we present state-of-the-art bitrate adaptation algorithms for HTTP adaptive streaming (HAS). As a key distinction from other streaming approaches, the bitrate adaptation algorithms in HAS are chiefly executed at each client, i.e., in a distributed manner. The objective of these algorithms is to ensure a high quality of experience (QoE) for viewers in the presence of bandwidth fluctuations due to factors like signal strength, network congestion, network reconvergence events, etc. While such fluctuations are common in the public Internet, they can also occur in home networks or even managed networks where there is often admission control and QoS tools. Bitrate adaptation algorithms may take factors like bandwidth estimations, playback buffer fullness, device features, viewer preferences, and content features into account, albeit with different weights. Since the viewer’s QoE needs to be determined in real-time during playback, objective metrics are generally used including number of buffer stalls, duration of startup delay, frequency and amount of quality oscillations, and video instability. By design, the standards for HAS do not mandate any particular adaptation algorithm, leaving it to system builders to innovate and implement their own method. This survey provides an overview of the different methods proposed over the last several years.

289 citations

Proceedings Article
01 Jan 2018
TL;DR: Interactive COnversational memory Network (ICON) is proposed, a multimodal emotion detection framework that extracts multimodal features from conversational videos and hierarchically models the self- and inter-speaker emotional influences into global memories to aid in predicting the emotional orientation of utterance-videos.
Abstract: Emotion recognition in conversations is crucial for building empathetic machines. Present works in this domain do not explicitly consider the inter-personal influences that thrive in the emotional dynamics of dialogues. To this end, we propose Interactive COnversational memory Network (ICON), a multimodal emotion detection framework that extracts multimodal features from conversational videos and hierarchically models the self- and inter-speaker emotional influences into global memories. Such memories generate contextual summaries which aid in predicting the emotional orientation of utterance-videos. Our model outperforms state-of-the-art networks on multiple classification and regression tasks in two benchmark datasets.

251 citations

Proceedings Article
20 Apr 2016
TL;DR: A dynamic video stream processing scheme is proposed to meet the requirements of real-time information processing and decision making, and the potential to enable a multi-target tracking function using a simpler single-target tracking algorithm is explored.
Abstract: The recent rapid development of urbanization and the Internet of Things (IoT) encourages more and more research on the Smart City, in which computing devices are widely distributed and a huge amount of dynamic real-time data is collected and processed. Although a vast volume of dynamic data is available for extracting new living patterns and making urban plans, efficient data processing and instant decision making are still key issues, especially in emergency situations requiring a quick response with low latency. Fog Computing, as the extension of Cloud Computing, enables computing tasks to be accomplished directly at the edge of the network and is characterized by low latency and real-time computing. However, it is non-trivial to coordinate highly heterogeneous Fog Computing nodes to function as a homogeneous platform. In this paper, taking urban traffic surveillance as a case study, a dynamic video stream processing scheme is proposed to meet the requirements of real-time information processing and decision making. Furthermore, we have explored the potential to enable a multi-target tracking function using a simpler single-target tracking algorithm. A prototype is built and the performance is evaluated. The experimental results show that our scheme is a promising solution for smart urban surveillance applications.

164 citations

Journal Article
TL;DR: A new photo aesthetics evaluation framework is proposed, focusing on learning the image descriptors that characterize local and global structural aesthetics from multiple visual channels, which significantly outperforms state-of-the-art algorithms in photo aesthetics prediction.
Abstract: Photo aesthetic quality evaluation is a fundamental yet under-addressed task in the computer vision and image processing fields. Conventional approaches are frustrated by the following two drawbacks. First, both the local and global spatial arrangements of image regions play an important role in photo aesthetics. However, existing rules, e.g., visual balance, heuristically define which spatial distribution among the salient regions of a photo is aesthetically pleasing. Second, it is difficult to adjust visual cues from multiple channels automatically in photo aesthetics assessment. To solve these problems, we propose a new photo aesthetics evaluation framework, focusing on learning the image descriptors that characterize local and global structural aesthetics from multiple visual channels. In particular, to describe the spatial structure of the image local regions, we construct graphlets (small-sized connected graphs) by connecting spatially adjacent atomic regions. Since spatially adjacent graphlets distribute closely in their feature space, we project them onto a manifold and subsequently propose an embedding algorithm. The embedding algorithm encodes the photo global spatial layout into graphlets. Simultaneously, the importance of graphlets from multiple visual channels is dynamically adjusted. Finally, these post-embedding graphlets are integrated for photo aesthetics evaluation using a probabilistic model. Experimental results show that: 1) the visualized graphlets explicitly capture the aesthetically arranged atomic regions; 2) the proposed approach generalizes and improves four prominent aesthetic rules; and 3) our approach significantly outperforms state-of-the-art algorithms in photo aesthetics prediction.

160 citations

