
Showing papers presented at "ACM SIGMM Conference on Multimedia Systems in 2021"


Proceedings ArticleDOI
24 Jun 2021
TL;DR: Meshroom as discussed by the authors is a photogrammetry pipeline for reconstructing 3D scenes from a set of unordered images; its node-graph architecture allows the user to customize the different pipelines to adjust them to domain-specific needs.
Abstract: This paper introduces the Meshroom software and its underlying 3D computer vision framework AliceVision. This solution provides a photogrammetry pipeline to reconstruct 3D scenes from a set of unordered images. It also features other pipelines for fusing multi-bracketing low dynamic range images into high dynamic range, stitching multiple images into a panorama and estimating the motion of a moving camera. Meshroom's node-graph architecture allows the user to customize the different pipelines to adjust them to their domain specific needs. The user can interactively add other processing nodes to modify a pipeline, export intermediate data to analyze the result of the algorithms and easily compare the outputs given by different sets of parameters. The software package is released in open source and relies on open file formats. These features enable researchers to conveniently run the pipelines, access and visualize the data at each step, thus promoting the sharing and the reproducibility of the results.

77 citations


Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this article, the authors present a VR communication framework that enables remote communication in virtual environments with real-time photorealistic user representation based on RGBD cameras and web browser clients, deployed on common off-the-shelf hardware devices.
Abstract: Tools and platforms that enable remote communication and collaboration provide a strong contribution to societal challenges. Virtual meetings and conferencing, in particular, can help to reduce commutes and lower our ecological footprint, and can alleviate physical distancing measures in case of global pandemics. In this paper, we outline how to bridge the gap between common video conferencing systems and emerging social VR platforms to allow immersive communication in Virtual Reality (VR). We present a novel VR communication framework that enables remote communication in virtual environments with real-time photorealistic user representation based on colour-and-depth (RGBD) cameras and web browser clients, deployed on common off-the-shelf hardware devices. The paper's main contribution is threefold: (a) a new VR communication framework, (b) a novel approach for real-time depth data transmitting as a 2D grayscale for 3D user representation, including a central MCU-based approach for this new format and (c) a technical evaluation of the system with respect to processing delay, CPU and GPU usage.
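
The depth-as-2D-grayscale idea above can be illustrated in a few lines: quantize each metric depth map from the RGBD camera into an 8-bit image so it can ride an ordinary 2D video codec, and dequantize on the receiving client. A minimal sketch under an assumed working range; it is not the authors' actual format.

```python
import numpy as np

NEAR_MM, FAR_MM = 500, 5000  # assumed working range of the RGBD camera, in millimetres

def depth_to_grayscale(depth_mm: np.ndarray) -> np.ndarray:
    """Quantize a metric depth map into an 8-bit grayscale frame for a 2D video codec."""
    clipped = np.clip(depth_mm, NEAR_MM, FAR_MM)
    normalized = (clipped - NEAR_MM) / (FAR_MM - NEAR_MM)
    return (normalized * 255).astype(np.uint8)

def grayscale_to_depth(gray: np.ndarray) -> np.ndarray:
    """Recover an approximate depth map on the receiving client."""
    return gray.astype(np.float32) / 255 * (FAR_MM - NEAR_MM) + NEAR_MM

# Round-trip error is bounded by the quantization step, here about 17.6 mm.
depth = np.random.uniform(NEAR_MM, FAR_MM, (480, 640)).astype(np.float32)
print(np.abs(grayscale_to_depth(depth_to_grayscale(depth)) - depth).max())
```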

14 citations


Proceedings ArticleDOI
15 Jul 2021
TL;DR: CrossRoI is presented, a resource-efficient system that enables real-time video analytics at scale by harnessing the videos' content associations and redundancy across a fleet of cameras to drastically reduce the communication and computation costs.
Abstract: Video cameras are pervasively deployed at city scale for public good or community safety (e.g., traffic monitoring or suspected person tracking). However, analyzing large-scale video feeds in real time is data intensive and poses severe challenges to today's network and computation systems. We present CrossRoI, a resource-efficient system that enables real-time video analytics at scale by harnessing the videos' content associations and redundancy across a fleet of cameras. CrossRoI exploits the intrinsic physical correlations of cross-camera viewing fields to drastically reduce the communication and computation costs. CrossRoI removes the redundant appearances of the same objects in multiple cameras without harming the comprehensive coverage of the scene. CrossRoI operates in two phases - an offline phase to establish cross-camera correlations, and an efficient online phase for real-time video inference. Experiments on real-world video feeds show that CrossRoI achieves 42% ~ 65% reduction in network overhead and 25% ~ 34% reduction in response delay in real-time video analytics applications with more than 99% query accuracy, when compared to baseline methods. If integrated with state-of-the-art frame filtering systems, the performance gains of CrossRoI reach 50% ~ 80% (network overhead) and 33% ~ 61% (end-to-end delay).
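
The online phase described above can be pictured as applying a per-camera RoI mask before encoding, so overlapping coverage is transmitted only once. A minimal sketch of that idea, assuming the offline phase has already produced a binary mask per camera; it is not the CrossRoI implementation.

```python
import numpy as np

def apply_roi_mask(frame: np.ndarray, roi_mask: np.ndarray) -> np.ndarray:
    """Zero out pixels outside this camera's assigned regions of interest.

    Blacked-out regions compress to almost nothing, so each camera's stream
    effectively carries only the areas not already covered by another camera.
    """
    return frame * roi_mask[..., None]          # roi_mask is HxW with values in {0, 1}

# Example: camera 1's right half overlaps camera 2's view, so it is dropped offline.
frame = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)
mask_cam1 = np.ones((1080, 1920), dtype=np.uint8)
mask_cam1[:, 960:] = 0
masked = apply_roi_mask(frame, mask_cam1)
print(f"fraction of pixels camera 1 still transmits: {mask_cam1.mean():.0%}")
```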

14 citations


Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this paper, the authors leverage the aforementioned modern networking paradigms and design network-assistance for/by HAS clients to improve HAS systems performance and CDN/network utilization.
Abstract: Video streaming has become one of the most prevalent, bandwidth-hungry, and latency-sensitive Internet applications. HTTP Adaptive Streaming (HAS) has become the dominant video delivery mechanism over the Internet. Lack of coordination among the clients and lack of awareness of the network in pure client-based adaptive video bitrate approaches have caused problems, such as sub-optimal data throughput from Content Delivery Network (CDN) or origin servers, high CDN costs, and unsatisfactory user experience. Recent studies have shown that network-assisted HAS techniques that utilize modern networking paradigms, e.g., Software Defined Networking (SDN), Network Function Virtualization (NFV), and edge computing, can significantly improve HAS system performance. In this doctoral study, we leverage the aforementioned modern networking paradigms and design network assistance for/by HAS clients to improve HAS system performance and CDN/network utilization. We present four fundamental research questions to target different challenges in devising a network-assisted HAS system.

13 citations


Proceedings ArticleDOI
15 Jul 2021
TL;DR: Li et al. as mentioned in this paper proposed CEVAS, a Cloud-Edge collaborative Video Analytics system empowered by fine-grained Serverless pipelines, which builds flexible serverless-based infrastructures to facilitate fine-grained and adaptive partitioning of cloud-edge workloads for multiple concurrent query pipelines.
Abstract: The ever-growing deployment scale of surveillance cameras and the users' increasing appetite for real-time queries have urged online video analytics. Synergizing the virtually unlimited cloud resources with agile edge processing would deliver an ideal online video analytics system; yet, given the complex interaction and dependency within and across video query pipelines, it is easier said than done. This paper starts with a measurement study to acquire a deep understanding of video query pipelines on real-world camera streams. We identify the potentials and practical challenges towards cloud-edge collaborative video analytics. We then argue that the newly emerged serverless computing paradigm is the key to achieving fine-grained resource partitioning with minimum dependency. We accordingly propose CEVAS, a Cloud-Edge collaborative Video Analytics system empowered by fine-grained Serverless pipelines. It builds flexible serverless-based infrastructures to facilitate fine-grained and adaptive partitioning of cloud-edge workloads for multiple concurrent query pipelines. With the optimized design of individual modules and their integration, CEVAS achieves real-time responses to highly dynamic input workloads. We have developed a prototype of CEVAS over Amazon Web Services (AWS) and conducted extensive experiments with real-world video streams and queries. The results show that by judiciously coordinating the fine-grained serverless resources in the cloud and at the edge, CEVAS reduces the cloud expenditure of a pure cloud scheme by 86.9% and its data transfer overhead by 74.4%, and improves the analysis throughput of a pure edge scheme by up to 20.6%. Thanks to the fine-grained video content-aware forecasting, CEVAS is also more adaptive than the state-of-the-art cloud-edge collaborative scheme.
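
The cloud-edge partitioning described above can be sketched as a per-stage placement decision driven by processing and uplink-transfer costs under a latency deadline. The stage names, costs, and greedy rule below are illustrative assumptions, not CEVAS internals (which also rely on serverless functions and content-aware forecasting).

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    edge_latency_ms: float    # per-frame processing time in an edge function
    cloud_latency_ms: float   # per-frame processing time in a cloud function
    upload_kb: float          # data crossing the uplink if the stage runs in the cloud

def partition(pipeline, uplink_kbps, deadline_ms):
    """Greedy per-stage cloud/edge placement under a per-frame latency deadline."""
    placement, total_ms = {}, 0.0
    for stage in pipeline:
        transfer_ms = stage.upload_kb * 8 / uplink_kbps * 1000
        cloud_ms = stage.cloud_latency_ms + transfer_ms
        placement[stage.name] = "edge" if stage.edge_latency_ms <= cloud_ms else "cloud"
        total_ms += min(stage.edge_latency_ms, cloud_ms)
    # If the deadline cannot be met, fall back to an all-cloud placement.
    return placement if total_ms <= deadline_ms else {s.name: "cloud" for s in pipeline}

pipeline = [Stage("decode", 4, 1, 200), Stage("detect", 40, 8, 50), Stage("track", 6, 2, 5)]
print(partition(pipeline, uplink_kbps=20000, deadline_ms=60))
```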

12 citations


Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this paper, the authors present a networked, high-performance graphics system that combines dynamic, high quality, ray traced global illumination computed on a server with direct illumination and primary visibility computed on the client.
Abstract: We present a networked, high-performance graphics system that combines dynamic, high-quality, ray traced global illumination computed on a server with direct illumination and primary visibility computed on a client. This approach provides many of the image quality benefits of real-time ray tracing on low-power and legacy hardware, while maintaining a low latency response and mobile form factor. As opposed to streaming full frames from rendering servers to end clients, our system distributes the graphics pipeline over a network by computing diffuse global illumination on a remote machine. Diffuse global illumination is computed using a recent irradiance volume representation combined with a new lossless, HEVC-based, hardware-accelerated encoding, and a perceptually-motivated update scheme. Our experimental implementation streams thousands of irradiance probes per second and requires less than 50 Mbps of throughput, reducing the consumed bandwidth by 99.4% when streaming at 60 Hz compared to traditional lossless texture compression. The bandwidth reduction achieved with our approach allows higher quality and lower latency graphics than state-of-the-art remote rendering via video streaming. In addition, our split-rendering solution decouples remote computation from local rendering and so does not limit local display update rate or display resolution.

10 citations


Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this article, the authors present a dynamic point cloud dataset that depicts humans interacting in social XR settings, captured with commodity hardware as a total of 45 unique sequences covering several use cases for social XR.
Abstract: Real-time, immersive telecommunication systems are quickly becoming a reality, thanks to the advances in acquisition, transmission, and rendering technologies. Point clouds in particular serve as a promising representation in these types of systems, offering photorealistic rendering capabilities with low complexity. Further development of transmission, coding, and quality evaluation algorithms, though, is currently hindered by the lack of publicly available datasets that represent realistic scenarios of remote communication between people in real time. In this paper, we release a dynamic point cloud dataset that depicts humans interacting in social XR settings. Using commodity hardware, we capture a total of 45 unique sequences, according to several use cases for social XR. As part of our release, we provide annotated raw material, resulting point cloud sequences, and an auxiliary software toolbox to acquire, process, encode, and visualize data, suitable for real-time applications. The dataset can be accessed via the following link: https://www.dis.cwi.nl/cwipc-sxr-dataset/.

10 citations


Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this article, the authors present a 10 x 10 LF capture matrix composed of 100 cameras, each with a 1920 x 1056 resolution, and use it to record videos under real and varying illumination and scene dynamics conditions.
Abstract: We present a 4D Light Field (LF) video dataset, collected by a custom-made camera matrix, to be used for designing and testing algorithms and systems for LF video coding, processing, and streaming. Compared to existing LF datasets, ours provides LF videos, as opposed to only images, and at higher frame resolution, a higher number of viewpoints, and/or higher framerate, offering the best visual quality LF video dataset. To achieve this, we built a 10 x 10 LF capture matrix composed of 100 cameras, each with a 1920 x 1056 resolution. We used this matrix to record videos in real and varying illumination and scene dynamics conditions. The dataset contains a total of nine groups of LF videos: eight groups collected with a fixed camera matrix position and orientation recording indoor potted plants, furniture, etc., and the last group collected by rotating around an outdoor environment with roadside vehicles, pedestrians, etc. Each group of LF videos consists of 100 video streams encoded with H.265/HEVC. Scene changes vary from static to slightly dynamic to highly dynamic, providing a good level of diversity. As an example, we present the results of a depth estimation method and show that our dataset can be used for applications such as object detection, 3D modeling, and others.

10 citations


Proceedings ArticleDOI
Liyang Sun1, Tongyu Zong1, Siquan Wang1, Yong Liu1, Yao Wang1 
15 Jul 2021
TL;DR: In this article, a detailed chunk-level dynamic model was developed to characterize how video rate and playback speed jointly control the evolution of a live streaming session, and the optimal joint video rate-playback speed adaptation was studied as a non-linear optimal control problem.
Abstract: It is highly challenging to simultaneously achieve high-rate and low-latency in live video streaming. Chunk-based streaming and playback speed adaptation are two promising new trends to achieve high user Quality-of-Experience (QoE). To thoroughly understand their potentials, we develop a detailed chunk-level dynamic model that characterizes how video rate and playback speed jointly control the evolution of a live streaming session. Leveraging on the model, we first study the optimal joint video rate-playback speed adaptation as a non-linear optimal control problem. We further develop model-free joint adaptation strategies using deep reinforcement learning. Through extensive experiments, we demonstrate that our proposed joint adaptation algorithms significantly outperform rate-only adaptation algorithms and the recently proposed low-latency video streaming algorithms that separately adapt video rate and playback speed without joint optimization. In a wide-range of network conditions, the model-based and model-free algorithms can achieve close-to-optimal trade-offs tailored for users with different QoE preferences.
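
The chunk-level dynamics the paper models can be sketched as a simple recursion: each downloaded chunk adds its duration to the buffer, playback drains the buffer at the chosen speed, and live latency shrinks only while the client plays faster than real time. A toy model under simplified assumptions (constant bandwidth, fixed chunk duration), not the paper's exact formulation.

```python
def simulate_session(bitrates_kbps, speeds, bandwidth_kbps, chunk_dur=1.0):
    """Evolve client buffer (s of media) and live latency (s) chunk by chunk."""
    buffer_s, latency_s, total_stall = 0.0, 2.0, 0.0
    for rate, speed in zip(bitrates_kbps, speeds):
        download_s = rate * chunk_dur / bandwidth_kbps     # time to fetch this chunk
        playing_s = min(buffer_s / speed, download_s)      # time actually spent playing
        stall_s = download_s - playing_s                   # time spent rebuffering
        buffer_s = buffer_s - playing_s * speed + chunk_dur
        latency_s += stall_s - (speed - 1.0) * playing_s   # faster-than-1x playback catches up
        total_stall += stall_s
    return round(buffer_s, 2), round(latency_s, 2), round(total_stall, 2)

# Playing slightly faster than real time (1.05x) trades a bit of fidelity for lower latency.
print(simulate_session([3000] * 20, [1.05] * 20, bandwidth_kbps=4000))
```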

10 citations


Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this paper, the authors investigate the rate-distortion characteristics of full ultra-high definition (UHD) 360° videos and capture corresponding head movement navigation data of virtual reality (VR) headsets.
Abstract: We investigate the rate-distortion (R-D) characteristics of full ultra-high definition (UHD) 360° videos and capture corresponding head movement navigation data of virtual reality (VR) headsets. We use the navigation data to analyze how users explore the 360° look-around panorama for such content and formulate related statistical models. The developed R-D characteristics and modeling capture the spatiotemporal encoding efficiency of the content at multiple scales and can be exploited to enable higher operational efficiency in key use cases. The high quality expectations for next generation immersive media necessitate the understanding of these intrinsic navigation and content characteristics of full UHD 360° videos.
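
Rate-distortion characteristics of this kind are often summarized by fitting a parametric quality-versus-rate curve per sequence. A hedged illustration using a logarithmic model Q(R) = a + b ln(R) fitted by least squares; the operating points below are placeholders, not the paper's measurements.

```python
import numpy as np

# Placeholder (bitrate Mbps, PSNR dB) operating points for one 360-degree sequence.
rates = np.array([5, 10, 20, 40, 80], dtype=float)
psnr = np.array([34.1, 37.0, 39.6, 41.8, 43.5])

# Fit Q(R) = a + b * ln(R) by least squares on the log-rate axis.
b, a = np.polyfit(np.log(rates), psnr, deg=1)
model = lambda r: a + b * np.log(r)

# The fitted curve lets us interpolate the rate needed for a target quality.
target_q = 40.0
required_rate = np.exp((target_q - a) / b)
print(f"Q(R) = {a:.2f} + {b:.2f} ln(R); Q(40 Mbps) ~ {model(40):.1f} dB; "
      f"~{required_rate:.1f} Mbps needed for {target_q} dB")
```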

9 citations


Proceedings ArticleDOI
15 Jul 2021
TL;DR: LiveROI as discussed by the authors employs an action recognition algorithm to analyze the video content and uses the analysis results as the basis of viewport prediction; to eliminate the need for historical video/user data, it employs adaptive user preference modeling and word embedding to dynamically select the video viewport at runtime based on the user's head orientation.
Abstract: Virtual reality (VR) streaming can provide immersive video viewing experience to the end users but with huge bandwidth consumption. Recent research has adopted selective streaming to address the bandwidth challenge, which predicts and streams the user's viewport of interest with high quality and the other portions of the video with low quality. However, the existing viewport prediction mechanisms mainly target the video-on-demand (VOD) scenario relying on historical video and user trace data to build the prediction model. The community still lacks an effective viewport prediction approach to support live VR streaming, the most engaging and popular VR streaming experience. We develop a region of interest (ROI)-based viewport prediction approach, namely LiveROI, for live VR streaming. LiveROI employs an action recognition algorithm to analyze the video content and uses the analysis results as the basis of viewport prediction. To eliminate the need of historical video/user data, LiveROI employs adaptive user preference modeling and word embedding to dynamically select the video viewport at runtime based on the user head orientation. We evaluate LiveROI with 12 VR videos viewed by 48 users obtained from a public VR head movement dataset. The results show that LiveROI achieves high prediction accuracy and significant bandwidth savings with real-time processing to support live VR streaming.
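
The content-driven prediction at the core of LiveROI can be sketched as scoring candidate viewport regions by the embedding similarity between the actions recognized in them and a continuously updated user-preference vector. The toy embeddings, region names, and update rule below are illustrative assumptions, not the authors' model.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Toy word embeddings for recognized actions (a real system would use pretrained embeddings).
embed = {"dunk": np.array([0.9, 0.1, 0.2]),
         "dribble": np.array([0.7, 0.3, 0.1]),
         "crowd_cheering": np.array([0.1, 0.9, 0.4])}

def predict_viewport(region_actions, preference_vec):
    """Pick the region whose recognized action best matches the user preference."""
    scores = {region: cosine(embed[action], preference_vec)
              for region, action in region_actions.items()}
    return max(scores, key=scores.get), scores

def update_preference(preference_vec, watched_action, alpha=0.2):
    """Exponentially blend the preference toward what the user actually watched."""
    return (1 - alpha) * preference_vec + alpha * embed[watched_action]

pref = embed["dribble"].copy()
region_actions = {"tile_left": "crowd_cheering", "tile_center": "dunk"}
print(predict_viewport(region_actions, pref))
pref = update_preference(pref, "dunk")   # adapt to the viewport the user actually chose
```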

Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this article, the authors proposed four methods to improve the detection accuracy of COSMOS, which range from differential sensing and fake-or-fact checking that detect contradicting or fake captions to object-caption matching.
Abstract: The growing prevalence of visual disinformation has become an important problem to solve nowadays. Cheapfake is a new term used for the altered media generated by non-AI techniques. In their recent COSMOS work, the authors developed a self-supervised training strategy that detected whether different captions for a given image were out-of-context, meaning that even though pointing to the same object(s) in the image, the captions implied different meanings. In this paper, we propose four methods to improve the detection accuracy of COSMOS. These methods range from differential sensing and fake-or-fact checking that detect contradicting or fake captions to object-caption matching and threshold adjustment that modify the baseline algorithm for improved accuracy.

Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this paper, the authors leverage a simple and intuitive method to resolve the fundamental problem of bandwidth estimation for low latency live streaming through the use of a hybrid of an existing chunk parser and proposed filtering of downloaded chunk data.
Abstract: A growing number of users are interested in low-latency over-the-top (OTT) applications such as online video gaming, video chat, online casinos, sports betting, and live auctions. OTT applications face challenges in delivering low-latency live streams using Dynamic Adaptive Streaming over HTTP (DASH) due to large playback buffers and video segment durations. A potential solution to this issue is the use of HTTP chunked transfer encoding (CTE) with the common media application format (CMAF). This combination allows the delivery of each segment in several chunks to the client, starting before the segment is fully available in real-time. However, CTE and CMAF alone are not sufficient as they do not address other limitations and challenges at the client side, including inaccurate bandwidth measurement, latency control, and bitrate selection. In this paper, we leverage a simple and intuitive method to resolve the fundamental problem of bandwidth estimation for low-latency live streaming through the use of a hybrid of an existing chunk parser and proposed filtering of downloaded chunk data. Next, we model the playback buffer as an M/D/1/K queue to limit the playback delay. The combination of these techniques is collectively called QLive. QLive uses the relationship between the estimated bandwidth, total buffer capacity, instantaneous playback speed, and buffer occupancy to decide the playback speed and the bitrate of the representation to download. We evaluated QLive under a diverse set of scenarios and found that it controls the latency to meet the given latency requirement, with an average latency up to 21 times lower than the compared methods. The average playback speed of QLive ranges between 1.01X and 1.26X, and it plays back at 1X speed up to 97% longer than the compared algorithms, without sacrificing the quality of the video. Moreover, the proposed bandwidth estimator has a 94% accuracy and is unaffected by a spike in instantaneous playback latency, unlike the compared state-of-the-art counterparts.
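
The bandwidth-estimation idea can be sketched as follows: with chunked transfer, per-segment throughput is biased by idle gaps while the next chunk is being produced, so the client measures throughput per chunk and keeps only samples whose download time was mostly spent receiving bytes. A minimal sketch with an assumed busy-time threshold; it is not QLive's exact parser or filter.

```python
def filter_chunk_samples(chunks, min_busy_ratio=0.8):
    """Keep only chunks whose download was mostly 'busy' (actually receiving bytes).

    Each chunk is (bytes_received, wall_time_s, busy_time_s); busy_time is the time
    spent receiving data, while wall_time also includes idle waiting for the chunk.
    """
    samples = []
    for nbytes, wall_s, busy_s in chunks:
        if wall_s > 0 and busy_s / wall_s >= min_busy_ratio:
            samples.append(nbytes * 8 / busy_s)   # bits per second over busy time only
    return samples

def estimate_bandwidth(samples):
    """Harmonic mean is robust to a few optimistic outliers."""
    return len(samples) / sum(1 / s for s in samples) if samples else 0.0

chunks = [(150_000, 0.30, 0.28), (150_000, 1.10, 0.25), (160_000, 0.32, 0.30)]
print(f"{estimate_bandwidth(filter_chunk_samples(chunks)) / 1e6:.2f} Mbps")
```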

Proceedings ArticleDOI
Bo Wang1, Yuan Zhang1, Size Qian1, Zipeng Pan1, Yuhong Xie1 
24 Jun 2021
TL;DR: In this article, a hybrid receiver-side congestion control (HRCC) framework was proposed, which combines a heuristic congestion control scheme with an RL-Agent that periodically generates a gain coefficient to tune the bandwidth estimated by the heuristic scheme.
Abstract: Web real-time communication (WebRTC) employs congestion control to ensure the quality of experience (QoE). Different from congestion control schemes for TCP, WebRTC keeps a low-level playback buffer that considers excessively delayed packets as losses, which makes the congestion control for WebRTC more challenging. Existing heuristic schemes estimate the network conditions based on hand-crafted rules that may be suboptimal, leading to under-utilization or over-utilization of link capacity in many cases. On the other hand, the existing learning-based schemes train a model that acts in a large action space, which is hard to converge to a stable status and has low performance over unpredictable network conditions. In this paper, we propose a hybrid receiver-side congestion control (HRCC) framework, which combines a heuristic congestion control scheme with an RL-Agent that periodically generates a gain coefficient to tune the bandwidth estimated by the heuristic scheme. Extensive simulation experiments demonstrate that the HRCC's RL-Agent effectively tunes the bandwidth estimate of the heuristic scheme. The hybrid scheme achieves higher bandwidth utilization than the fully heuristic scheme with similar queuing delay and packet loss and outperforms the fully RL-based scheme on overall performance.
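
The hybrid structure can be captured in a few lines: a heuristic estimator produces a baseline bandwidth estimate, and an RL agent periodically emits a multiplicative gain that corrects it. The heuristic rule, gain interval, and observation vector below are placeholders, not the paper's design.

```python
class HybridController:
    """Heuristic bandwidth estimate, periodically scaled by an RL-produced gain."""

    def __init__(self, rl_policy, gain_interval_ms=400):
        self.rl_policy = rl_policy            # maps an observation vector to a gain
        self.gain_interval_ms = gain_interval_ms
        self.gain = 1.0
        self.last_gain_update = 0.0

    def heuristic_estimate(self, recv_rate_kbps, delay_gradient):
        # Placeholder delay-based rule: back off when queuing delay is growing.
        return recv_rate_kbps * (0.85 if delay_gradient > 0 else 1.05)

    def target_bitrate(self, now_ms, recv_rate_kbps, delay_gradient, loss_rate):
        if now_ms - self.last_gain_update >= self.gain_interval_ms:
            obs = [recv_rate_kbps, delay_gradient, loss_rate, self.gain]
            self.gain = float(self.rl_policy(obs))   # e.g. constrained to [0.5, 2.0]
            self.last_gain_update = now_ms
        return self.heuristic_estimate(recv_rate_kbps, delay_gradient) * self.gain

ctrl = HybridController(rl_policy=lambda obs: 1.1)   # stand-in for a trained agent
print(ctrl.target_bitrate(now_ms=500, recv_rate_kbps=2500, delay_gradient=-0.2, loss_rate=0.0))
```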

Proceedings ArticleDOI
24 Jun 2021
TL;DR: uvgVenctester as discussed by the authors is an open-source test automation framework for video encoder performance and conformance testing with the desired set of test video sequences, which includes support for the popular AVC, HEVC, VVC, VP9, and AV1 video coding formats and the state-of-the-art HM, Kvazaar, x265, VTM, VVenC, SVT-VP9, and SVT-AV1 video encoders.
Abstract: The agile and efficient development of modern video encoders calls for automated testing methodologies. This paper presents the first-of-its-kind open-source test automation framework called uvgVenctester (github.com/ultravideo/uvgVenctester) that is designed for comprehensive performance and conformance testing of video encoders with the desired set of test video sequences. Our framework comes with built-in support for the popular AVC, HEVC, VVC, VP9, and AV1 video coding formats and the state-of-the-art HM, Kvazaar, x265, VTM, VVenC, SVT-VP9, and SVT-AV1 video encoders. Furthermore, there are no technical limitations to adopting other formats or encoders. Developers can evaluate the encoder of interest under the three primary usage scenarios: 1) conformance testing of the encoded bitstream; 2) rate-distortion-complexity comparison with the other encoders; and 3) systematic exploration of encoding parameters. The framework provides commonly used analysis tools to quantify encoding quality, speed, and bitrate with a versatile set of absolute and comparative results such as Bjontegaard Delta (BD)-Rate for PSNR, SSIM, and VMAF quality metrics. The supported output formats include CSV, graphs, and comparison tables. They ensure that the results are available in human- and machine-readable formats. To the best of our knowledge, the proposed framework is currently the most comprehensive and modular open-source software toolset for video encoder benchmarking.
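
The BD-Rate numbers such a framework reports follow the standard Bjontegaard calculation: fit a low-order polynomial of log-bitrate as a function of quality for each encoder, integrate both over the overlapping quality range, and convert the average log-rate gap into a percentage. A sketch of that calculation for illustration; it is not taken from the uvgVenctester source.

```python
import numpy as np

def bd_rate(rates_ref, quality_ref, rates_test, quality_test):
    """Average bitrate difference (%) of 'test' vs 'ref' at equal quality (BD-rate)."""
    log_r_ref, log_r_test = np.log10(rates_ref), np.log10(rates_test)
    # Cubic fit of log-rate as a function of quality (PSNR, SSIM, or VMAF).
    p_ref = np.polyfit(quality_ref, log_r_ref, 3)
    p_test = np.polyfit(quality_test, log_r_test, 3)
    lo = max(min(quality_ref), min(quality_test))
    hi = min(max(quality_ref), max(quality_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_diff - 1) * 100   # negative = test encoder needs fewer bits

rates_ref = [1000, 2000, 4000, 8000]          # kbps, e.g. a reference encoder
psnr_ref = [33.9, 36.5, 38.9, 41.0]
rates_test = [900, 1800, 3600, 7200]          # kbps, e.g. the encoder under test
psnr_test = [34.2, 36.9, 39.3, 41.4]
print(f"BD-rate: {bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):.1f}%")
```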

Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this article, the authors evaluate the performance of Low-Latency HTTP Live Streaming (LL-HLS) and Low-Latency Dynamic Adaptive Streaming over HTTP (LL-DASH)-based players.
Abstract: Reducing end-to-end streaming latency is critical to HTTP-based live video streaming. There are currently two technologies in this domain: Low-Latency HTTP Live Streaming (LL-HLS) and Low-Latency Dynamic Adaptive Streaming over HTTP (LL-DASH). Many players support LL-HLS and/or LL-DASH protocols, including Apple's AVPlayer, Shaka player, HLS.js, Dash.js, and others. This paper is dedicated to the analysis of the performance of low-latency players and streaming protocols. The evaluation is based on a series of live streaming experiments, repeated using identical video content, encoders, encoding profiles, and network conditions, emulated by using traces of real-world networks. Several performance metrics, such as average stream bitrate, the amounts of downloaded media data, streaming latency, as well as buffering and stream switching statistics, are captured and reported in our experiments. These results are subsequently used to describe the observed differences in the performance of LL-HLS and LL-DASH-based players.

Proceedings ArticleDOI
24 Jun 2021
TL;DR: QPlane as discussed by the authors is an alternative toolkit for RL training of fixed wing aircraft, which is easily modifiable for different scenarios and is replicable and flexible for ease of implementation to high performance computing.
Abstract: Reinforcement Learning (RL) is a fast-growing field of research that is mostly applied in the realm of video games due to the compatibility of RL and game tasks. AI Gym has established itself as the gold standard toolkit for Reinforcement Learning research. Unfortunately, toolkits like AI Gym are highly optimized for benchmark purposes and may not always be suitable for real-world problems. Additionally, fixed-wing flight simulation has specific requirements and may need other solutions. In this paper, we propose QPlane as an alternative toolkit for RL training of fixed-wing aircraft. QPlane was developed in an effort to create an RL toolkit for fixed-wing aircraft simulation that is easily modifiable for different scenarios. QPlane is replicable and flexible for ease of implementation on high-performance computing, and is modular for quick environment and algorithm replacement. In this paper we present and discuss details of QPlane, as well as proof-of-concept results.

Proceedings ArticleDOI
Stefan Pham1, Mariana Avelino1, Daniel Silhavy1, Troung-Sinh An1, Stefan Arbanowski1 
24 Jun 2021
TL;DR: In this paper, the authors consider SAND (Server and Network Assisted DASH), CMCD (Common Media Client Data) and Streaming Quality of Experience Events, Properties and Metrics (CTA-2066) as standards to enable interoperable, standard-based streaming analytics for the predominant streaming formats MPEG-DASH and HLS.
Abstract: As OTT (over-the-top) media streaming and underlying technologies have matured, streaming analytics has become more important, especially in a heterogeneous device ecosystem, where new devices or software updates can potentially cause streaming issues. In this paper we consider SAND (Server and Network Assisted DASH), CMCD (Common Media Client Data) and Streaming Quality of Experience Events, Properties and Metrics (CTA-2066) as standards to enable interoperable, standard-based streaming analytics for the predominant streaming formats MPEG-DASH and HLS. We focus on the visualization aspect of streaming metrics in UI (user interface) dashboards.
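
As background, CMCD reporting amounts to the player attaching a small set of standardized keys (encoded bitrate, buffer length, measured throughput, session/content IDs, and so on) to each media request, typically as a CMCD query parameter, so that CDNs and analytics backends can correlate client state with delivery logs. A simplified sketch of building such a request; the key subset and values are illustrative.

```python
from urllib.parse import quote

def build_cmcd_query(metrics: dict) -> str:
    """Serialize CMCD key/value pairs into a single URL query parameter."""
    parts = []
    for key, value in sorted(metrics.items()):   # CMCD keys are sequenced alphabetically
        if isinstance(value, str):
            parts.append(f'{key}="{value}"')      # string values are quoted
        elif isinstance(value, bool):
            if value:
                parts.append(key)                  # boolean true is key-only
        else:
            parts.append(f"{key}={value}")
    return "CMCD=" + quote(",".join(parts), safe="")

metrics = {"br": 3200,                 # encoded bitrate of the requested object, kbps
           "bl": 21300,                # current buffer length, ms
           "mtp": 48000,               # measured throughput, kbps
           "sid": "6e2fb550-c457-11e9-bb97",   # session id (illustrative value)
           "cid": "movie-1234"}        # content id (illustrative value)
segment_url = "https://cdn.example.com/video/seg_42.m4s?" + build_cmcd_query(metrics)
print(segment_url)
```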

Proceedings ArticleDOI
15 Jul 2021
TL;DR: AMP as discussed by the authors is a system that ensures the authentication of media via certifying provenance by creating one or more publisher-signed manifests for a media instance uploaded by a content provider, which are stored in a database allowing fast lookup from applications such as browsers.
Abstract: Advances in graphics and machine learning have led to the general availability of easy-to-use tools for modifying and synthesizing media. The proliferation of these tools threatens to cast doubt on the veracity of all media. One approach to thwarting the flow of fake media is to detect modified or synthesized media through machine learning methods. While detection may help in the short term, we believe that it is destined to fail as the quality of fake media generation continues to improve. Soon, neither humans nor algorithms will be able to reliably distinguish fake versus real content. Thus, pipelines for assuring the source and integrity of media will be required---and increasingly relied upon. We present AMP, a system that ensures the authentication of media via certifying provenance. AMP creates one or more publisher-signed manifests for a media instance uploaded by a content provider. These manifests are stored in a database allowing fast lookup from applications such as browsers. For reference, the manifests are also registered and signed by a permissioned ledger, implemented using the Confidential Consortium Framework (CCF). CCF employs both software and hardware techniques to ensure the integrity and transparency of all registered manifests. AMP, through its use of CCF, enables a consortium of media providers to govern the service while making all its operations auditable. The authenticity of the media can be communicated to the user via visual elements in the browser, indicating that an AMP manifest has been successfully located and verified.
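
The provenance flow can be sketched in miniature: the publisher hashes the media, binds the hash and metadata into a manifest, and signs it; a verifier checks the signature against the publisher's public key and re-hashes the media. The Ed25519 key handling below illustrates the concept only and is not AMP's manifest format or its CCF integration.

```python
import hashlib, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

publisher_key = Ed25519PrivateKey.generate()
public_key = publisher_key.public_key()

def create_manifest(media_bytes: bytes, publisher: str):
    """Publisher side: bind the media hash to metadata and sign it."""
    manifest = json.dumps({"sha256": hashlib.sha256(media_bytes).hexdigest(),
                           "publisher": publisher}, sort_keys=True).encode()
    return manifest, publisher_key.sign(manifest)

def verify(media_bytes: bytes, manifest: bytes, signature: bytes) -> bool:
    """Verifier side (e.g. a browser): check the signature and that the hash matches."""
    try:
        public_key.verify(signature, manifest)
    except InvalidSignature:
        return False
    claimed = json.loads(manifest)["sha256"]
    return claimed == hashlib.sha256(media_bytes).hexdigest()

video = b"...encoded media bytes..."
manifest, sig = create_manifest(video, publisher="Example News")
print(verify(video, manifest, sig), verify(video + b"tampered", manifest, sig))
```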

Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this paper, a framework for quality-aware adaptive bitrate (ABR) streaming involving a per-session data budget constraint is proposed, with two planning strategies: one for the case where fine-grained perceptual quality information is known to the planning scheme, and another for the case where such information is not available.
Abstract: Over-the-top video (OTT) streaming accounts for the majority of traffic on cellular networks, and also places a heavy demand on users' limited monthly cellular data budgets. In contrast to much of traditional research that focuses on improving the quality, we explore a different direction---using data budget information to better manage the data usage of mobile video streaming, while minimizing the impact on users' quality of experience (QoE). Specifically, we propose a novel framework for quality-aware Adaptive Bitrate (ABR) streaming involving a per-session data budget constraint. Under the framework, we develop two planning based strategies, one for the case where fine-grained perceptual quality information is known to the planning scheme, and another for the case where such information is not available. Evaluations for a wide range of network conditions, using different videos covering a variety of content types and encodings, demonstrate that both these strategies use much less data compared to state-of-the-art ABR schemes, while still providing comparable QoE. Our proposed approach is designed to work in conjunction with existing ABR streaming workflows, enabling ease of adoption.
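
The budget-aware selection idea can be sketched as capping the chosen bitrate by the per-segment allowance implied by the remaining data budget, rather than always taking the highest sustainable rate. The bitrate ladder, safety margin, and even spreading of the budget below are assumptions, not the paper's planning algorithm.

```python
def pick_bitrate(ladder_kbps, throughput_kbps, budget_mb_left, segments_left, seg_dur_s=4):
    """Highest bitrate that both fits the throughput and keeps the session within budget."""
    # Per-segment data allowance if the remaining budget is spread evenly.
    allowance_kbps = (budget_mb_left * 8000) / (segments_left * seg_dur_s)
    cap = min(throughput_kbps * 0.9, allowance_kbps)   # 0.9 = assumed safety margin
    feasible = [r for r in ladder_kbps if r <= cap]
    return max(feasible) if feasible else min(ladder_kbps)

ladder = [400, 1200, 2400, 4800, 8000]                 # kbps
# Plenty of throughput, but the remaining budget only allows ~1600 kbps per segment.
print(pick_bitrate(ladder, throughput_kbps=9000, budget_mb_left=120, segments_left=150))
```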

Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this article, the authors propose Livelyzer, a generalized active measurement and black-box testing framework for analyzing the performance of this component in popular live streaming software and services under controlled settings.
Abstract: Over-the-top (OTT) live video traffic has grown significantly, fueled by fundamental shifts in how users consume video content (e.g., increased cord-cutting) and by improvements in camera technologies, computing power, and wireless resources. A key determining factor for the end-to-end live streaming QoE is the design of the first-mile upstream ingest path that captures and transmits the live content in real-time, from the broadcaster to the remote video server. This path often involves either a Wi-Fi or cellular component, and is likely to be bandwidth-constrained with time-varying capacity, making the task of high-quality video delivery challenging. Today, there is little understanding of the state of the art in the design of this critical path, with existing research focused mainly on the downstream distribution path, from the video server to end viewers. To shed more light on the first-mile ingest aspect of live streaming, we propose Livelyzer, a generalized active measurement and black-box testing framework for analyzing the performance of this component in popular live streaming software and services under controlled settings. We use Livelyzer to characterize the ingest behavior and performance of several live streaming platforms, identify design deficiencies that lead to poor performance, and propose best practice design recommendations to improve the same.

Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this paper, the authors show how SVC Scalable Video can be adaptated in the network in an effective way, when the Big Packet Protocol (BPP) is used.
Abstract: The essence of this work is to show how SVC scalable video can be adapted in the network in an effective way when the Big Packet Protocol (BPP) is used. This demo shows the advantages of BPP, which is a recently proposed transport protocol devised for real-time applications. We will show that in-network adaptation can be provided using this new protocol. We show how a network node can change the packets during their transmission, but still present a very usable video stream to the client. The preliminary results show that BPP is a good alternative transport for video transmission.

Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this article, the authors proposed a dataset capturing statistics of several large-scale real-world streaming events, delivering videos to different devices (TVs, desktops, mobiles, tablets, etc.), and over different networks (from 2.5G, 3G, and other early generation mobile networks to 5G and broadband).
Abstract: We propose a dataset capturing statistics of several large-scale real-world streaming events, delivering videos to different devices (TVs, desktops, mobiles, tablets, etc.), and over different networks (from 2.5G, 3G, and other early-generation mobile networks to 5G and broadband). The data we capture include network-related statistics, playback statistics (session- and player-event-level), and characteristics of the encoded streams. Such data should enable a broad range of possible applications and uses in the research community: from analysis of the effectiveness of algorithms in streaming players to studies of QoE metrics and end-to-end system optimizations. Examples of such possible studies based on the proposed datasets are also provided.

Proceedings ArticleDOI
Filip Lemic1, Jakob Struye1, Jeroen Famaey1
24 Jun 2021
TL;DR: In this article, the authors present a simulator that maps VR users' movement in virtual worlds to their movement in shared physical spaces constrained through redirected walking, and that captures a set of performance metrics characterizing the number of perceivable resets and the distances between such resets for each user.
Abstract: Full-immersive multiuser Virtual Reality (VR) setups envision supporting seamless mobility of the VR users in the virtual worlds, while simultaneously constraining them inside shared physical spaces through redirected walking. For enabling high data rate and low latency delivery of video content in such setups, the supporting wireless networks will have to utilize highly directional communication links, where these links will ideally have to “track” the mobile VR users for maintaining the Line-of-Sight (LoS) connectivity. The design decisions about the mobility patterns of the VR users in the virtual worlds will thus have a substantial effect on the mobility of these users in the physical environments, and therefore also on the performance of the underlying networks. Hence, there is a need for a tool that can provide a mapping between design decisions about the users' mobility in the virtual worlds, and their effects on the mobility in constrained physical setups. To address this issue, we have developed and in this paper present a simulator for enabling this functionality. Given a set of VR users with their virtual movement trajectories, the outline of the physical deployment environment, and a redirected walking algorithm for avoiding physical collisions, the simulator is able to derive the physical movements of the users. Based on the derived physical movements, the simulator can capture a set of performance metrics characterizing the number of perceivable resets and the distances between such resets for each user. The simulator is also able to indicate the predictability of the physical movement trajectories, which can serve as an indication of the complexity of supporting a given virtual movement pattern by the underlying networks.

Proceedings ArticleDOI
24 Jun 2021
TL;DR: REEFT-360 as discussed by the authors is a real-time emulation framework that captures tile-quality adaptation under time-varying bandwidth conditions, paired with a multi-step evaluation process that allows the calculation of MS-SSIM scores and other frame-based metrics while accounting for the user's head movements.
Abstract: With 360° video streaming, the user's field of view (a.k.a. viewport) is at all times determined by the user's current viewing direction. Since any two users are unlikely to look in the exact same direction as each other throughout the viewing of a video, the frame-by-frame video sequence displayed during a playback session is typically unique. This complicates the direct comparison of the perceived Quality of Experience (QoE) using popular metrics such as the Multiscale-Structural Similarity (MS-SSIM). Furthermore, there is an absence of light-weight emulation frameworks for tiled-based 360° video streaming that allow easy testing of different algorithm designs and tile sizes. To address these challenges, we present REEFT-360, which consists of (1) a real-time emulation framework that captures tile-quality adaptation under time-varying bandwidth conditions and (2) a multi-step evaluation process that allows the calculation of MS-SSIM scores and other frame-based metrics, while accounting for the user's head movements. Importantly, the framework allows speedy implementation and testing of alternative head-movement prediction and tile-based prefetching solutions, allows testing under a wide range of network conditions, and can be used either with a human user or head-movement traces. The developed software tool is shared with the paper. We also present proof-of-concept evaluation results that highlight the importance of including a human subject in the evaluation.

Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this paper, the authors study three different methods to produce a foveated video stream of real-time rendered graphics in a remote rendered system: (1) foveated shading as part of the rendering pipeline, (2) foveation as a post-processing step after rendering and before video encoding, and (3) foveated video encoding.
Abstract: Remote rendering systems comprise powerful servers that render graphics on behalf of low-end client devices and stream the graphics as compressed video, enabling high end gaming and Virtual Reality on those devices. One key challenge with them is the amount of bandwidth required for streaming high quality video. Humans have spatially non-uniform visual acuity: We have sharp central vision but our ability to discern details rapidly decreases with angular distance from the point of gaze. This phenomenon called foveation can be taken advantage of to reduce the need for bandwidth. In this paper, we study three different methods to produce a foveated video stream of real-time rendered graphics in a remote rendered system: 1) foveated shading as part of the rendering pipeline, 2) foveation as post processing step after rendering and before video encoding, 3) foveated video encoding. We report results from a number of experiments with these methods. They suggest that foveated rendering alone does not help save bandwidth. Instead, the two other methods decrease the resulting video bitrate significantly but they also have different quality per bit and latency profiles, which makes them desirable solutions in slightly different situations.
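
One common way to realize the foveated-encoding method (3) is a per-block quantization-parameter (QP) offset that grows with angular eccentricity from the gaze point. The eccentricity-to-QP mapping below is an invented example, not one of the paper's configurations.

```python
import numpy as np

def qp_offset_map(width_blocks, height_blocks, gaze_xy, fov_deg=110.0, max_offset=10):
    """Per-block QP offsets: 0 near the gaze point, up to max_offset in the periphery."""
    ys, xs = np.mgrid[0:height_blocks, 0:width_blocks]
    # Approximate angular eccentricity of each block centre from the gaze point.
    deg_per_block = fov_deg / width_blocks
    ecc_deg = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1]) * deg_per_block
    # Assumed mapping: full quality within ~5 degrees, linear falloff to max_offset at 30 degrees.
    offsets = np.clip((ecc_deg - 5.0) / 25.0, 0.0, 1.0) * max_offset
    return offsets.round().astype(np.int8)

# 1920x1080 frame with 64x64 blocks -> 30x17 blocks; gaze slightly left of centre.
qp = qp_offset_map(30, 17, gaze_xy=(12, 8))
print(qp[8, 12], qp[8, 29])   # near-gaze block vs far-periphery block
```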

Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this article, a content-aware playback speed control (CAPSC) algorithm is proposed for live streaming of sports content, which allows the streaming client to slow down the playback when there is a risk of stalling.
Abstract: There are two main factors that determine the viewer experience during the live streaming of sports content: latency and stalls. Latency should be low and stalls should not occur. Yet, these two factors work against each other and it is not trivial to strike the best trade-off between them. One of the best tools we have today to manage this trade-off is the adaptive playback speed control. This tool allows the streaming client to slow down the playback when there is a risk of stalling and increase the playback when there is no risk of stalling but the live latency is higher than desired. While adaptive playback generally works well, the artifacts due to the changes in the playback speed should preferably be unnoticeable to the viewers. However, this mostly depends on the portion of the audio/video content subject to the playback speed change. In this paper, we advance the state-of-the-art by developing a content-aware playback speed control (CAPSC) algorithm and demonstrate a number of examples showing its significance. We make the running code available and provide a demo page hoping that it will be a useful tool for the developers and content providers.
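
Setting the content-aware part aside, the underlying control rule can be sketched simply: slow playback down when the buffer is at risk, speed it up when latency exceeds the target, and keep the change within limits that stay unobtrusive. The thresholds below are assumptions; CAPSC's content awareness (choosing which portions of the audio/video absorb the change) is not modelled here.

```python
def playback_speed(buffer_s, latency_s, target_latency_s=3.0,
                   low_buffer_s=1.0, max_speedup=1.1, max_slowdown=0.9):
    """Pick a playback speed that balances stall risk against live latency."""
    if buffer_s < low_buffer_s:
        # Risk of stalling: slow down proportionally to how empty the buffer is.
        return max(max_slowdown, 1.0 - 0.1 * (low_buffer_s - buffer_s) / low_buffer_s)
    if latency_s > target_latency_s:
        # Behind the live edge with a healthy buffer: speed up, but stay subtle.
        return min(max_speedup, 1.0 + 0.05 * (latency_s - target_latency_s))
    return 1.0

for buf, lat in [(0.4, 3.5), (2.5, 5.0), (2.5, 2.8)]:
    print(buf, lat, round(playback_speed(buf, lat), 3))
```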

Proceedings ArticleDOI
15 Jul 2021
TL;DR: EScALation as mentioned in this paper proposes a frame sampling technique that utilizes the temporal correlation between frames and selects key frame(s) from a temporally correlated set of frames to perform bounding box detection.
Abstract: Spatio-temporal action localization aims to detect the spatial location and the start/end time of the action in a video. The state-of-the-art approach uses convolutional neural networks to extract possible bounding boxes for the action in each frame and then link bounding boxes into action tubes based on the location and the class-specific score of each bounding box. Though this approach has been successful at achieving a good localization accuracy, it is computation-intensive. High-end GPUs are usually demanded for it to achieve real-time performance. In addition, this approach does not scale well on a large number of action classes. In this work, we present a framework, EScALation, for making spatio-temporal action localization efficient and scalable. Our framework involves two main strategies. One is the frame sampling technique that utilizes the temporal correlation between frames and selects key frame(s) from a temporally correlated set of frames to perform bounding box detection. The other is the class filtering technique that exploits bounding box information to predict the action class prior to linking bounding boxes. We compare EScALation with the state-of-the-art approach on UCF101-24 and J-HMDB-21 datasets. One of our experiments shows EScALation is able to save 72.2% of the time with only 6.1% loss of mAP. In addition, we show that EScALation scales better to a large number of action classes than the state-of-the-art approach.
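
The frame-sampling strategy can be sketched as running the expensive detector only on key frames, declared whenever a simple inter-frame difference exceeds a threshold, and reusing those detections for the correlated frames in between. The difference measure and threshold are illustrative; the real system also applies class filtering before linking boxes into tubes.

```python
import numpy as np

def select_key_frames(frames, diff_threshold=12.0):
    """Return indices of frames on which to run the (expensive) bounding-box detector.

    A new key frame is declared whenever the mean absolute pixel difference from the
    previous key frame exceeds the threshold; intermediate frames reuse its detections.
    """
    key_indices = [0]
    reference = frames[0].astype(np.int16)
    for i, frame in enumerate(frames[1:], start=1):
        if np.abs(frame.astype(np.int16) - reference).mean() > diff_threshold:
            key_indices.append(i)
            reference = frame.astype(np.int16)
    return key_indices

# Synthetic clip: 30 static frames followed by 30 brighter frames -> 2 key frames.
clip = np.concatenate([np.full((30, 112, 112), 60, np.uint8),
                       np.full((30, 112, 112), 120, np.uint8)])
print(select_key_frames(clip))   # e.g. [0, 30]
```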

Proceedings ArticleDOI
24 Jun 2021
TL;DR: In this article, the authors present a visualization tool that helps assess the visual quality of a 3D representation employing various coding schemes, allowing for subjective testing by showing the differences between the selected encoding parameters.
Abstract: Recent years have seen a new uptake in immersive media and eXtended Reality (XR). And due to a global pandemic, computer-mediated communication over video conferencing tools became a new normal of everyday remote collaboration and virtual meetings. Social XR leverages XR technologies for remote communication and collaboration. But in order for XR to facilitate a high level of (social) presence and thus high-quality mediated social contact between users, we need high-quality 3D representations of users. One approach to providing detailed 3D user representations as new immersive media is to use point clouds or meshes, but these representation formats come with complexity in compression bitrate and processing time. In the example of virtual meetings, compression has to fulfill stringent requirements such as low latency and high quality. As the compression techniques for 3D immersive media steadily advance, it is important to be able to compare different compression techniques on their technical and visual merits in an easy way. The proposed demonstrator in this paper is a visualization tool that helps assess the visual quality of a 3D representation employing various coding schemes. The complete end-to-end rendering/encoding chain can be easily assessed, allowing for subjective testing by showing the differences between the selected encoding parameters. The tool presented in this demo paper offers an improved and easy visual process for the comparison of encoders of immersive media.

Proceedings ArticleDOI
24 Jun 2021
TL;DR: EvLag as discussed by the authors is a tool for adding latency to user input devices in Linux, regardless of the application being run, enabling user studies for systems and software that cannot be modified (e.g., commercial games).
Abstract: Understanding the effects of latency on interaction is important for building software, such as computer games, that perform well over a range of system configurations. Unfortunately, user studies evaluating latency must each write their own code to add latency to user input and, even worse, must limit themselves to open source applications. To address these shortcomings, this paper presents EvLag, a tool for adding latency to user input devices in Linux. EvLag provides a custom amount of latency for each device regardless of the application being run, enabling user studies for systems and software that cannot be modified (e.g., commercial games). Evaluation shows EvLag has low overhead and accurately adds the expected amount of latency to user input. In addition, EvLag can log user input events for post study analysis with several utilities provided to facilitate output event parsing.
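
Conceptually, such a tool sits between the kernel input device and applications: it grabs the real device, holds each event for the configured delay, and re-emits it through a virtual device. A rough sketch of that pattern with the python-evdev bindings, shown as an illustration of the mechanism rather than EvLag's implementation; the device path and delay are placeholders.

```python
import time
from collections import deque
from evdev import InputDevice, UInput

DELAY_S = 0.100                         # 100 ms of added input latency (placeholder)
dev = InputDevice("/dev/input/event4")  # placeholder path of the physical mouse/keyboard
dev.grab()                              # applications now only see the virtual device
ui = UInput.from_device(dev, name="delayed-input")

pending = deque()                       # (release_time, event) queue; FIFO preserves order
for event in dev.read_loop():           # blocking loop; a real tool would poll non-blockingly
    pending.append((time.monotonic() + DELAY_S, event))
    # Re-emit every queued event whose delay has elapsed.
    while pending and pending[0][0] <= time.monotonic():
        _, delayed = pending.popleft()
        ui.write(delayed.type, delayed.code, delayed.value)
    ui.syn()
```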