
Showing papers on "Temporal isolation among virtual machines" published in 2019


Journal ArticleDOI
TL;DR: Numerical results show that iRSS converges quickly enough to satisfy online resource scheduling requirements and significantly improves resource utilization while guaranteeing performance isolation between slices, compared with other benchmark algorithms.
Abstract: It is widely acknowledged that network slicing can tackle the diverse use cases and connectivity services of the forthcoming next-generation mobile networks (5G). Resource scheduling is of vital importance for improving resource-multiplexing gain among slices while meeting specific service requirements for radio access network (RAN) slicing. Unfortunately, due to performance isolation, diversified service requirements, and network dynamics (including user mobility and channel states), resource scheduling in RAN slicing is very challenging. In this paper, we propose an intelligent resource scheduling strategy (iRSS) for 5G RAN slicing. The main idea of an iRSS is to exploit a collaborative learning framework that consists of deep learning (DL) in conjunction with reinforcement learning (RL). Specifically, DL is used to perform large time-scale resource allocation, whereas RL is used to perform online resource scheduling for tackling small time-scale network dynamics, including inaccurate prediction and unexpected network states. Depending on the amount of available historical traffic data, an iRSS can flexibly adjust the significance between the prediction and online decision modules for assisting RAN in making resource scheduling decisions. Numerical results show that iRSS converges quickly enough to satisfy online resource scheduling requirements and significantly improves resource utilization while guaranteeing performance isolation between slices, compared with other benchmark algorithms.

138 citations
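To make the two-timescale idea above concrete, here is a minimal, hypothetical sketch (not the paper's algorithm): a coarse per-slice allocation driven by predicted load stands in for the large time-scale DL stage, and a small online correction stands in for the small time-scale RL stage. All names, slice labels, and numbers are illustrative.

```python
# Hypothetical two-timescale sketch; the predictor output and correction rule are assumptions.

def coarse_allocation(predicted_load, total_prbs):
    """Large time-scale step: split physical resource blocks in proportion to predicted load."""
    total = sum(predicted_load.values())
    return {s: int(total_prbs * load / total) for s, load in predicted_load.items()}

def online_correction(allocation, observed_queue, step=2):
    """Small time-scale step: shift a few PRBs from the least to the most backlogged slice."""
    overloaded = max(observed_queue, key=observed_queue.get)
    underloaded = min(observed_queue, key=observed_queue.get)
    moved = min(step, allocation[underloaded])
    allocation[overloaded] += moved
    allocation[underloaded] -= moved
    return allocation

if __name__ == "__main__":
    alloc = coarse_allocation({"eMBB": 60.0, "URLLC": 15.0, "mMTC": 25.0}, total_prbs=100)
    alloc = online_correction(alloc, {"eMBB": 5, "URLLC": 40, "mMTC": 3})
    print(alloc)   # URLLC picks up a couple of PRBs that the prediction under-provisioned
```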


Journal ArticleDOI
TL;DR: Containers, enabling lightweight environment and performance isolation, fast and flexible deployment, and fine-grained resource sharing, have gained popularity for better application management and deployment in addition to hardware virtualization, as discussed in this paper.
Abstract: Containers, enabling lightweight environment and performance isolation, fast and flexible deployment, and fine-grained resource sharing, have gained popularity in better application management and deployment in addition to hardware virtualization. They are being widely used by organizations to deploy their increasingly diverse workloads derived from modern-day applications such as web services, big data, and internet of things in either proprietary clusters or private and public cloud data centers. This has led to the emergence of container orchestration platforms, which are designed to manage the deployment of containerized applications in large-scale clusters. These systems are capable of running hundreds of thousands of jobs across thousands of machines. To do so efficiently, they must address several important challenges including scalability, fault tolerance and availability, efficient resource utilization, and request throughput maximization among others. This paper studies these management systems and proposes a taxonomy that identifies different mechanisms that can be used to meet the aforementioned challenges. The proposed classification is then applied to various state-of-the-art systems leading to the identification of open research challenges and gaps in the literature intended as future directions for researchers.

66 citations


Proceedings ArticleDOI
16 Apr 2019
TL;DR: This paper presents Fractional GPUs (FGPUs), a software-only mechanism to partition both compute and memory resources of a GPU to allow parallel execution of GPU workloads with performance isolation.
Abstract: GPUs are increasingly being used in real-time systems, such as autonomous vehicles, due to the vast performance benefits that they offer. As more and more applications use GPUs, more than one application may need to run on the same GPU in parallel. However, real-time systems also require predictable performance from each individual application, which GPUs do not fully support in a multi-tasking environment. Nvidia recently added a new feature in their latest GPU architecture that allows limited resource provisioning. This feature is provided in the form of a closed-source kernel module called the Multi-Process Service (MPS). However, MPS only provides the capability to partition the compute resources of a GPU and does not provide any mechanism to avoid inter-application conflicts within the shared memory hierarchy. In our experiments, we find that compute resource partitioning alone is not sufficient for performance isolation. In the worst case, due to interference from co-running GPU tasks, read/write transactions can observe a slowdown of more than 10x. In this paper, we present Fractional GPUs (FGPUs), a software-only mechanism to partition both compute and memory resources of a GPU to allow parallel execution of GPU workloads with performance isolation. As many details of the GPU memory hierarchy are not publicly available, we first reverse-engineer the information through various micro-benchmarks. We find that the GPU memory hierarchy is different from that of the CPU, making it well-suited for page coloring. Based on our findings, we were able to partition both the L2 cache and DRAM for multiple Nvidia GPUs. Furthermore, we show that a better strategy exists for partitioning compute resources than the one used by MPS. An FGPU combines both this strategy and memory coloring to provide superior isolation. We compare our FGPU implementation with Nvidia MPS. Compared to MPS, FGPU reduces the average variation in application runtime, in a multi-tenancy environment, from 135% to 9%. To allow multiple applications to use FGPUs seamlessly, we ported Caffe, a popular framework used for machine learning, to use our FGPU API.

38 citations
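The page-coloring idea referenced above can be illustrated with a short sketch. The bit positions and color count below are assumptions for illustration, not the values the authors reverse-engineered for any particular GPU.

```python
# Minimal page-coloring sketch under assumed parameters: a page "color" is derived from the
# physical address bits that select a cache set / memory region, and a tenant is only given
# pages whose color falls inside its assigned partition.

PAGE_SHIFT = 12          # 4 KiB pages (assumption)
COLOR_BITS = 3           # 8 colors (assumption)

def page_color(phys_addr: int) -> int:
    return (phys_addr >> PAGE_SHIFT) & ((1 << COLOR_BITS) - 1)

def pages_for_partition(free_pages, allowed_colors):
    """Filter the free list so a tenant only ever maps pages of its own colors."""
    return [p for p in free_pages if page_color(p) in allowed_colors]

free_list = [n << PAGE_SHIFT for n in range(64)]
tenant_a = pages_for_partition(free_list, allowed_colors={0, 1, 2, 3})
tenant_b = pages_for_partition(free_list, allowed_colors={4, 5, 6, 7})
assert not set(tenant_a) & set(tenant_b)   # disjoint colors => disjoint cache/DRAM slices
```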


Proceedings ArticleDOI
19 Aug 2019
TL;DR: PicNIC builds on three constructs to quickly detect isolation breakdown and to enforce it when necessary: CPU-fair weighted fair queues at receivers, receiver-driven congestion control for backpressure, and sender-side admission control with shaping.
Abstract: Network virtualization stacks are the linchpins of public clouds. A key goal is to provide performance isolation so that workloads on one Virtual Machine (VM) do not adversely impact the network experience of another VM. Using data from a major public cloud provider, we systematically characterize how performance isolation can break in current virtualization stacks and find a fundamental tradeoff between isolation and resource multiplexing for efficiency. In order to provide predictable performance, we propose a new system called PicNIC that shares resources efficiently in the common case while rapidly reacting to ensure isolation. PicNIC builds on three constructs to quickly detect isolation breakdown and to enforce it when necessary: CPU-fair weighted fair queues at receivers, receiver-driven congestion control for backpressure, and sender-side admission control with shaping. Based on an extensive evaluation, we show that this combination ensures isolation for VMs at sub-millisecond timescales with negligible overhead.

37 citations
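As a rough illustration of the weighted-fair-queueing construct the abstract mentions, here is a generic virtual-finish-time sketch; it omits the virtual-clock catch-up for idle flows and is not PicNIC's implementation. The per-packet cost stands in for receiver-side CPU cost.

```python
# Simplified weighted fair queue: lower virtual finish time is served first.
import heapq

class WeightedFairQueue:
    def __init__(self, weights):
        self.weights = weights                 # vm_id -> weight (assumed configuration)
        self.finish = {vm: 0.0 for vm in weights}
        self.heap = []                         # (virtual finish time, seq, vm_id, packet)
        self.seq = 0

    def enqueue(self, vm_id, packet, cost):
        # Finish time advances by cost scaled inversely with the VM's weight.
        self.finish[vm_id] += cost / self.weights[vm_id]
        heapq.heappush(self.heap, (self.finish[vm_id], self.seq, vm_id, packet))
        self.seq += 1

    def dequeue(self):
        _, _, vm_id, packet = heapq.heappop(self.heap)
        return vm_id, packet

q = WeightedFairQueue({"vm_a": 2.0, "vm_b": 1.0})
q.enqueue("vm_a", "pkt1", cost=1.0)
q.enqueue("vm_b", "pkt2", cost=1.0)
print(q.dequeue())   # vm_a is served first because of its larger weight
```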


Proceedings ArticleDOI
25 Mar 2019
TL;DR: The evaluation results show that the proposed architecture can achieve data integrity, performance isolation, data privacy, configuration flexibility, availability, cost efficiency and scalability.
Abstract: Blockchain has attracted a broad range of interests from start-ups, enterprises and governments to build next generation applications in a decentralized manner. Similar to cloud platforms, a single blockchain-based system may need to serve multiple tenants simultaneously. However, the design of multi-tenant blockchain-based systems is challenging for architects in terms of data and performance isolation, as well as scalability. First, tenants must not be able to read other tenants' data, and tenants with potentially higher workload should not affect the read/write performance of other tenants. Second, multi-tenant blockchain-based systems usually require both scalability for each individual tenant and scalability with the number of tenants. Therefore, in this paper, we propose a scalable platform architecture for multi-tenant blockchain-based systems to ensure data integrity while maintaining data privacy and performance isolation. In the proposed architecture, each tenant has an individual permissioned blockchain to maintain their own data and smart contracts. All tenant chains are anchored into a main chain, in a way that minimizes cost and load overheads. The proposed architecture has been implemented in a proof-of-concept prototype with our industry partner, Laava ID Pty Ltd (Laava). We evaluate our proposal in a three-fold way: fulfilment of the identified requirements, qualitative comparison with design alternatives, and quantitative analysis. The evaluation results show that the proposed architecture can achieve data integrity, performance isolation, data privacy, configuration flexibility, availability, cost efficiency and scalability.

32 citations
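A toy sketch of the anchoring idea, under the assumption that only a hash of each tenant block is committed to the main chain; the real design, consensus protocol, and smart-contract layers are not modeled here.

```python
# Toy anchoring: tenant data stays on the tenant chain, only integrity evidence reaches the main chain.
import hashlib, json, time

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, payload: dict) -> dict:
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"prev": prev, "time": time.time(), "payload": payload}
    block["hash"] = block_hash(block)          # hash covers prev-link, timestamp, payload
    chain.append(block)
    return block

tenant_chain, main_chain = [], []
b = append_block(tenant_chain, {"order": 42, "note": "tenant-private data"})
# Anchor only the tenant block hash (no payload) into the main chain.
append_block(main_chain, {"tenant": "tenant-A", "anchored_hash": b["hash"]})
print(main_chain[-1]["payload"])
```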


Proceedings ArticleDOI
01 Apr 2019
TL;DR: This paper analytically establishes that the proposed MAC scheduler achieves optimal overall throughput utility subject to the various slicing constraints and provides transparency with respect to other scheduling modules, such as link adaptation and beam-forming.
Abstract: Network slicing provides a key functionality in emerging 5G networks, and offers flexibility in creating customized virtual networks and supporting different services on a common physical infrastructure. This capability critically relies on a MAC scheduler to deliver performance targets in terms of aggregate rates or resource shares for the various slices. A crucial challenge is to enforce such guarantees and performance isolation while allowing flexible sharing to avoid resource fragmentation and fully harness channel variations. In the present paper we propose a MAC scheduler which meets these objectives and preserves the basic structure of utility-based schedulers such as the Proportional Fair algorithm in terms of per-user scheduling metrics. Specifically, the proposed scheme involves counters tracking the aggregate rate or resource allocations for the various slices against pre-specified targets, and computes offsets to the scheduling metrics accordingly. This design provides transparency with respect to other scheduling modules, such as link adaptation and beam-forming. We analytically establish that the proposed scheme achieves optimal overall throughput utility subject to the various slicing constraints. In addition, extensive 3GPP-compliant simulation experiments are conducted to assess the impact on best-effort applications and demonstrate substantial gains in overall throughput utility over baseline approaches.

30 citations
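The counter-plus-offset idea can be sketched as follows; the offset rule and gain are assumptions for illustration, not the paper's analytically derived scheme.

```python
# Proportional-fair metric plus a per-slice offset driven by the gap to the slice's target share.

def pf_metric(inst_rate, avg_rate):
    return inst_rate / max(avg_rate, 1e-9)

def slice_offset(allocated_share, target_share, gain=10.0):
    # Positive when the slice is behind its target share, negative when ahead (assumed rule).
    return gain * (target_share - allocated_share)

def pick_user(users, slice_state):
    """users: list of (user_id, slice_id, inst_rate, avg_rate)."""
    def score(u):
        uid, sid, inst, avg = u
        st = slice_state[sid]
        return pf_metric(inst, avg) + slice_offset(st["share"], st["target"])
    return max(users, key=score)[0]

slices = {"urllc": {"share": 0.10, "target": 0.30}, "embb": {"share": 0.90, "target": 0.70}}
users = [("u1", "urllc", 2.0, 1.0), ("u2", "embb", 4.0, 1.0)]
print(pick_user(users, slices))   # the lagging URLLC slice gets a boost, so u1 wins
```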


Posted Content
TL;DR: In this paper, the authors proposed a scalable platform architecture for multi-tenant blockchain-based systems to ensure data integrity while maintaining data privacy and performance isolation, in which each tenant has an individual permissioned blockchain to maintain their own data and smart contracts.
Abstract: Blockchain has attracted a broad range of interests from start-ups, enterprises and governments to build next generation applications in a decentralized manner. Similar to cloud platforms, a single blockchain-based system may need to serve multiple tenants simultaneously. However, the design of multi-tenant blockchain-based systems is challenging for architects in terms of data and performance isolation, as well as scalability. First, tenants must not be able to read other tenants' data, and tenants with potentially higher workload should not affect the read/write performance of other tenants. Second, multi-tenant blockchain-based systems usually require both scalability for each individual tenant and scalability with the number of tenants. Therefore, in this paper, we propose a scalable platform architecture for multi-tenant blockchain-based systems to ensure data integrity while maintaining data privacy and performance isolation. In the proposed architecture, each tenant has an individual permissioned blockchain to maintain their own data and smart contracts. All tenant chains are anchored into a main chain, in a way that minimizes cost and load overheads. The proposed architecture has been implemented in a proof-of-concept prototype with our industry partner, Laava ID Pty Ltd (Laava). We evaluate our proposal in a three-fold way: fulfilment of the identified requirements, qualitative comparison with design alternatives, and quantitative analysis. The evaluation results show that the proposed architecture can achieve data integrity, performance isolation, data privacy, configuration flexibility, availability, cost efficiency and scalability.

20 citations


Proceedings ArticleDOI
15 Jul 2019
TL;DR: The authors' measurements show that the Jailhouse hypervisor provides performance isolation of local computing resources such as the CPU, but not of shared resources such as the system-wide cache and memory bus, and introduces only a small overhead, which implies that running Jailhouse in a memory-saturated system will not be harmful.
Abstract: In this paper we present a methodology to be used for quantifying the level of performance isolation for a multi-core system. We have devised a test that can be applied to breaches of isolation in different computing resources that may be shared between different cores. We use this test to determine the level of isolation gained by using the Jailhouse hypervisor compared to a regular Linux system in terms of CPU isolation, cache isolation and memory bus isolation. Our measurements show that the Jailhouse hypervisor provides performance isolation of local computing resources such as the CPU. We have also evaluated whether any isolation could be gained for shared computing resources such as the system-wide cache and the memory bus controller. Our tests show no measurable difference in partitioning between a regular Linux system and a Jailhouse-partitioned system for shared resources. Using the Jailhouse hypervisor introduces only a small noticeable overhead when executing multiple shared-resource intensive tasks on multiple cores, which implies that running Jailhouse in a memory-saturated system will not be harmful. However, contention still exists in the memory bus and in the system-wide cache.

15 citations


Proceedings ArticleDOI
08 Apr 2019
TL;DR: It is found that Docker engine deployments that run in host mode exhibit negligible performance overhead in comparison to native OpenStack deployments, and that virtual IP networking introduces a substantial overhead in Docker Swarm and Kubernetes due to virtual network bridges when compared to Docker engine deployments.
Abstract: The most preferred approach in the literature on service-level objectives for multi-tenant databases is to group tenants according to their SLA class in separate database processes and find optimal co-placement of tenants across a cluster of nodes. To implement performance isolation between co-located database processes, request scheduling is preferred over hypervisor-based virtualization, which introduces a significant performance overhead. A relevant question is whether the more light-weight container technology such as Docker is a viable alternative for running high-end performance database workloads. Moreover, the recent uprise and industry adoption of container orchestration (CO) frameworks for the purpose of automated placement of cloud-based applications raises the question what the additional performance overhead of CO frameworks is in this context. In this paper, we evaluate the performance overhead introduced by Docker engine and two representative CO frameworks, Docker Swarm and Kubernetes, when running and managing a CPU-bound Cassandra workload in OpenStack. Firstly, we have found that Docker engine deployments that run in host mode exhibit negligible performance overhead in comparison to native OpenStack deployments. Secondly, we have found that virtual IP networking introduces a substantial overhead in Docker Swarm and Kubernetes due to virtual network bridges when compared to Docker engine deployments. This calls for service networking approaches that run in true host mode but offer support for network isolation between containers. Thirdly, volume plugins for persistent storage have a large impact on the overall resource model of a database workload; more specifically, we show that a CPU-bound Cassandra workload changes into an I/O-bound workload in both Docker Swarm and Kubernetes because their local volume plugins introduce a disk I/O performance bottleneck that does not appear in Docker engine deployments. These findings imply that placement decisions solved for native or Docker engine deployments cannot be reused for Docker Swarm and Kubernetes.

14 citations


Journal ArticleDOI
TL;DR: PINE is proposed, a performance isolation optimization solution for container environments, which can adaptively allocate storage resources to each service according to its performance behavior through dynamic resource management and I/O concurrency configuration.
Abstract: With the development of virtualization technologies, containers are widely used to provide a light-weight isolated runtime environment. Compared with virtual machines, containers can achieve high resource utilization and provide a more convenient way of sharing, but there are significant security challenges due to potential resource contention among services. When the regular services that share the storage system with the key services are over-used, a lot of resources may be preempted, which breaks down the whole system and delays other services. This paper addresses performance isolation in fierce resource competition situations, that is, when different services (latency-sensitive services and throughput-first services) are deployed on a host machine using container technology. We propose PINE, a performance isolation optimization solution for container environments, which can adaptively allocate storage resources to each service according to its performance behavior through dynamic resource management and I/O concurrency configuration. The experimental results show that PINE can effectively optimize the performance of the running services and achieve the optimal result in a relatively short time.

13 citations
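A hedged sketch of the kind of dynamic storage-resource management described above, using the Linux cgroup v2 io controller as the actuation knob; the paths, cgroup names, thresholds, and weights are assumptions for illustration, and PINE's actual mechanism may differ.

```python
# Observe per-service latency, then rebalance cgroup I/O weights so a latency-sensitive
# service gets a larger share of the shared storage. Assumes cgroup v2 with the io
# controller enabled and cgroups already created for the services; requires root.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")          # assumption: cgroup v2 mount point

def set_io_weight(cgroup_name: str, weight: int) -> None:
    """Write an io.weight value (1-10000) for the given cgroup."""
    (CGROUP_ROOT / cgroup_name / "io.weight").write_text(f"default {weight}\n")

def rebalance(latency_ms: dict, slo_ms: dict) -> None:
    for name, lat in latency_ms.items():
        if lat > slo_ms[name]:
            set_io_weight(name, 800)          # boost the service missing its SLO
        else:
            set_io_weight(name, 100)          # shrink services comfortably within SLO

# Example (requires root and existing cgroups named after the services):
# rebalance({"latency-svc": 12.0, "batch-svc": 50.0}, {"latency-svc": 5.0, "batch-svc": 500.0})
```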


Proceedings ArticleDOI
20 May 2019
TL;DR: An implementation on Intel Xeon E5 v4 processor shows that combining LLC partitioning and prefetch throttling provides a significant improvement in performance and fairness.
Abstract: Modern commercial multi-core processors are equipped with multiple hardware prefetchers on each core. The prefetchers can significantly improve application performance. However, shared resources, such as last-level cache (LLC) and off-chip memory bandwidth and controller, can lead to prefetch interference. Multiple techniques have been proposed to reduce such interference and improve the performance isolation across cores, such as coordinated control among prefetchers and cache partitioning (CP). Each of them has its advantages and disadvantages. This paper proposes combining these two techniques in a coordinated way. Prefetchers and LLC are treated as separate resources and a multi-resource management mechanism is proposed to control prefetching and cache partitioning. This control mechanism is implemented as a Linux kernel module and can be applied to a wide variety of prefetch architectures. An implementation on Intel Xeon E5 v4 processor shows that combining LLC partitioning and prefetch throttling provides a significant improvement in performance and fairness.
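The two knobs the paper coordinates can be illustrated with a user-space sketch: per-core prefetcher control through MSR 0x1A4 and LLC way partitioning through the Linux resctrl interface (Intel CAT). The bit meanings and mask values are illustrative for recent Intel Xeons and should be verified for a given CPU; this is not the authors' kernel module.

```python
# Requires root, the msr kernel module, and a mounted /sys/fs/resctrl.
import os, struct

PREFETCH_MSR = 0x1A4   # MSR_MISC_FEATURE_CONTROL: bits 0-3 disable the four prefetchers (assumption)

def set_prefetchers(cpu: int, disable_mask: int) -> None:
    # Note: a production tool would read-modify-write instead of overwriting the whole register.
    with open(f"/dev/cpu/{cpu}/msr", "r+b", buffering=0) as f:
        f.seek(PREFETCH_MSR)
        f.write(struct.pack("<Q", disable_mask))   # e.g. 0xF disables all four prefetchers

def create_cat_group(name: str, llc_mask: str, cpus: str) -> None:
    path = f"/sys/fs/resctrl/{name}"
    os.makedirs(path, exist_ok=True)               # mkdir creates a new resource group
    with open(f"{path}/schemata", "w") as f:
        f.write(f"L3:0={llc_mask}\n")              # ways granted on cache domain 0
    with open(f"{path}/cpus_list", "w") as f:
        f.write(cpus)

# Example: give cores 0-3 ten LLC ways (mask 0x3ff) and throttle two of their prefetchers.
# create_cat_group("latency_group", llc_mask="3ff", cpus="0-3")
# for cpu in range(4):
#     set_prefetchers(cpu, disable_mask=0x5)       # disable L2 HW and DCU prefetchers
```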

Proceedings ArticleDOI
05 Aug 2019
TL;DR: CostPI is presented, which provides an SLO-aware arbitration mechanism that fetches requests from NVMe queues at different granularities according to workload SLOs, and can increase resource utilization and reduce wear imbalance for the shared NVMe SSD.
Abstract: NVMe SSDs have been widely adopted to provide storage services in cloud platforms where diverse workloads (including latency-sensitive, throughput-oriented and capacity-oriented workloads) are colocated. To achieve performance isolation, existing solutions partition the shared SSD into multiple isolated regions and assign each workload a separate region. However, these isolation solutions could result in inefficient resource utilization and imbalanced wear. More importantly, they cannot reduce the interference caused by embedded cache contention. In this paper, we present CostPI to improve isolation and resource utilization by providing latency-sensitive workloads with dedicated resources (including data cache, mapping table cache and NAND flash), and providing throughput-oriented and capacity-oriented workloads with shared resources. Specifically, at the NVMe queue level, we present an SLO-aware arbitration mechanism which fetches requests from NVMe queues at different granularities according to workload SLOs. At the embedded cache level, we use an asymmetric allocation scheme to partition the cache (including data cache and mapping table cache). For different data cache partitions, we adopt different cache policies to meet diverse workload requirements while reducing the imbalanced wear. At the NAND flash level, we partition the hardware resources at the channel granularity to enable the strongest isolation. Our experiments show that CostPI can reduce the average response time by up to 44.2%, the 99% response time by up to 89.5%, and the 99.9% response time by up to 88.5% for latency-sensitive workloads. Meanwhile, CostPI can increase resource utilization and reduce wear imbalance for the shared NVMe SSD.
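A minimal weighted round-robin arbiter conveys the flavor of SLO-aware fetching at different granularities; the SLO classes and per-round budgets are assumptions, not CostPI's parameters.

```python
# Fetch more commands per round from queues backing stricter SLOs.
from collections import deque

FETCH_BUDGET = {"latency": 8, "throughput": 4, "capacity": 1}   # commands per round (assumed)

def arbitrate(queues: dict) -> list:
    """queues: SLO class -> deque of pending NVMe commands. Returns one round of fetches."""
    fetched = []
    for slo_class, q in queues.items():
        for _ in range(min(FETCH_BUDGET[slo_class], len(q))):
            fetched.append(q.popleft())
    return fetched

queues = {"latency": deque(f"L{i}" for i in range(10)),
          "throughput": deque(f"T{i}" for i in range(10)),
          "capacity": deque(f"C{i}" for i in range(10))}
print(arbitrate(queues))   # 8 latency commands, 4 throughput, 1 capacity
```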

Proceedings ArticleDOI
20 Nov 2019
TL;DR: Perphon is presented, a runtime agent on a per-node basis that decouples ML-based performance prediction and resource inference from the centralized scheduler, and it is shown that the throughput of a Kafka data-streaming application under Perphon is 2.0x and 1.82x that of the isolation schemes in native YARN and the pure cgroup cpu subsystem, respectively.
Abstract: Cluster administrators are facing great pressure to improve cluster utilization through workload co-location. Guaranteeing performance of long-running applications (LRAs), however, is far from settled as unpredictable interference across applications is catastrophic to QoS [2]. Current solutions such as [1] usually employ sandboxed and offline profiling for different workload combinations and leverage them to predict incoming interference. However, the time complexity restricts the applicability to complex co-locations. Hence, this issue entails a new framework to harness runtime performance and mitigate the time cost with machine intelligence: i) It is desirable to explore a quantitative relationship between allocated resources and consequent workload performance, not relying on analyzing interference derived from different workload combinations. The majority of works, however, depend on offline profiling and training, which may lead to the model aging problem. Moreover, multi-resource dimensions (e.g., LLC contention) that are not completely included by existing works but have an impact on performance interference need to be considered [3]. ii) Workload co-location also necessitates fine-grained isolation and access control mechanisms. Once performance degradation is detected, dynamic resource adjustment will be enforced and the application will be assigned access to specific slices of each resource. Inferring a "just enough" amount of resource adjustment ensures the application performance can be secured whilst improving cluster utilization. We present Perphon, a runtime agent on a per-node basis that decouples ML-based performance prediction and resource inference from the centralized scheduler. Figure 1 outlines the proposed architecture. We initially exploit the sensitivity of applications to multiple resources to establish performance prediction. To achieve this, the Metric Monitor aggregates application fingerprints and system-level performance metrics including CPU, memory, Last Level Cache (LLC), memory bandwidth (MBW) and number of running threads, etc. They are enabled by Intel-RDT and precisely obtained from the resource group manager. Perphon employs an Online Gradient Boost Regression Tree (OGBRT) approach to resolve the model aging problem. The Res-Perf Model warms up via offline learning that merely relies on a small volume of profiling in the early stage, but evolves with the arrival of workloads. Consequently, parameters will be automatically updated and synchronized among agents. The Anomaly Detector can timely pinpoint a performance degradation via LSTM time-series analysis and determine when and which application needs to be re-allocated resources. Once an abnormal performance counter or load is detected, the Resource Inferer conducts a gradient-ascent-based inference to work out a proper slice of resources, towards dynamically recovering the targeted performance. Upon receiving an updated re-allocation, the Access Controller re-assigns a specific portion of the node resources to the affected application. Eventually, the Isolation Executor enforces resource manipulation and ensures performance isolation across applications. Specifically, we use the cgroup cpuset and memory subsystems to control usage of CPU and memory while leveraging Intel-RDT technology to underpin the manipulation of LLC and MBW. For fine-granularity management, we create different groups for LRAs and batch jobs when the agent starts.
Our prototype integration with the Node Manager of Apache YARN shows that the throughput of a Kafka data-streaming application under Perphon is 2.0x and 1.82x that of the isolation execution schemes in native YARN and the pure cgroup cpu subsystem, respectively.
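The "just enough" resource-inference step can be sketched as gradient ascent on a learned performance predictor; the toy concave model below stands in for the OGBRT predictor and is purely illustrative, as are the resource dimensions and target.

```python
import numpy as np

def predicted_perf(alloc: np.ndarray) -> float:
    # alloc = [cpu_cores, memory_gb, llc_ways, mem_bw_gbps]; toy concave model (assumption),
    # standing in for a learned resource-to-performance predictor.
    weights = np.array([2.0, 0.5, 1.0, 0.8])
    return float(np.sum(weights * np.sqrt(alloc)))

def infer_allocation(alloc, target, lr=1.0, eps=1e-3, max_steps=100):
    """Gradient-ascent search for an allocation that just reaches the target performance."""
    alloc = np.asarray(alloc, dtype=float)
    for _ in range(max_steps):
        if predicted_perf(alloc) >= target:
            break
        # Numerical gradient of the predictor with respect to each resource dimension.
        grad = np.array([(predicted_perf(alloc + eps * e) - predicted_perf(alloc)) / eps
                         for e in np.eye(len(alloc))])
        alloc += lr * grad      # step along the steepest predicted improvement
    return alloc

print(infer_allocation([1.0, 2.0, 2.0, 1.0], target=7.0))
```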

Book ChapterDOI
28 Oct 2019
TL;DR: The proposed algorithm utilizes the predictive capabilities of a Machine Learning (ML) model in an attempt to classify VM workloads to make more informed consolidation decisions and demonstrates how it improves energy efficiency by 31% while also reducing service violations by 69%.
Abstract: Inefficient resource usage is one of the greatest causes of high energy consumption in cloud data centers. Virtual Machine (VM) consolidation is an effective method for improving energy related costs and environmental sustainability for modern data centers. While dynamic VM consolidation can improve energy efficiency, virtualisation technologies cannot guarantee performance isolation between co-located VMs, resulting in interference issues. We address the problem by introducing an energy- and interference-aware VM consolidation algorithm. The proposed algorithm utilizes the predictive capabilities of a Machine Learning (ML) model in an attempt to classify VM workloads to make more informed consolidation decisions. Furthermore, using recent workload data from Microsoft Azure we present a comparative study of two popular classification algorithms and select the model with the best performance to incorporate into our proposed approach. Our empirical results demonstrate how our approach improves energy efficiency by 31% while also reducing service violations by 69%.
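A toy sketch of classifier-informed consolidation; the features, labels, training data, and placement rule are illustrative, not the paper's Azure-trained models.

```python
# Label VM workloads with a classifier, then avoid packing VMs of the same
# resource-intensive class on one host to reduce expected interference.
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data: [cpu_util, mem_util, disk_iops_norm] -> workload class (assumed)
X = [[0.9, 0.2, 0.1], [0.85, 0.3, 0.2], [0.2, 0.8, 0.1], [0.3, 0.9, 0.2], [0.1, 0.2, 0.9]]
y = ["cpu-bound", "cpu-bound", "mem-bound", "mem-bound", "io-bound"]
clf = DecisionTreeClassifier().fit(X, y)

def place(vm_features, hosts):
    """hosts: host_id -> list of workload classes already placed there."""
    vm_class = clf.predict([vm_features])[0]
    # Prefer the host with the fewest VMs of the same class.
    return min(hosts, key=lambda h: hosts[h].count(vm_class))

hosts = {"h1": ["cpu-bound", "cpu-bound"], "h2": ["io-bound"]}
print(place([0.88, 0.25, 0.15], hosts))   # a cpu-bound VM goes to h2
```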

Proceedings ArticleDOI
01 Dec 2019
TL;DR: The stochastic analysis enables a service operator to provision CPU resources for aperiodic services to achieve a desired tail latency on computing platforms that schedule time-critical services as deferrable servers.
Abstract: There is increasing interest in supporting time-critical services in cloud computing environments. Those cloud services differ from traditional hard real-time systems in three aspects. First, cloud services usually involve latency requirements in terms of probabilistic tail latency instead of hard deadlines. Second, some cloud services need to handle aperiodic requests for stochastic arrival processes instead of traditional periodic or sporadic models. Finally, the computing platform must provide performance isolation between time-critical services and other workloads. It is therefore essential to provision resources to meet different tail latency requirements. As a step towards cloud services with stochastic latency guarantees, this paper presents a stochastic response time analysis for aperiodic services following a Poisson arrival process on computing platforms that schedule time-critical services as deferrable servers. The stochastic analysis enables a service operator to provision CPU resources for aperiodic services to achieve a desired tail latency. We evaluated the method in two case studies, one involving a synthetic service and another involving a Redis service, both on a testbed based on Xen 4.10. The results demonstrate the validity and efficacy of our method in a practical setting.
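A small Monte Carlo sketch of the modeled setting, Poisson arrivals served FIFO by a deferrable server with budget Q per period P, can be used to sanity-check a provisioning choice; the paper derives the tail analytically, and all parameters below are placeholders.

```python
# Discrete-time simulation: the server may serve whenever budget remains in the current period.
import random

def simulate(rate, service, period, budget, horizon, dt=0.001):
    steps_per_period = round(period / dt)
    next_arrival = random.expovariate(rate)
    queue, remaining, resp = [], budget, []
    for step in range(int(horizon / dt)):
        t = step * dt
        if step % steps_per_period == 0:          # budget replenished at each period boundary
            remaining = budget
        while next_arrival <= t:
            queue.append([next_arrival, service]) # [arrival time, remaining work]
            next_arrival += random.expovariate(rate)
        if queue and remaining > 0:               # serve head-of-line request with leftover budget
            done = min(dt, queue[0][1], remaining)
            remaining -= done
            queue[0][1] -= done
            if queue[0][1] <= 1e-12:
                resp.append(t + dt - queue[0][0])
                queue.pop(0)
    return sorted(resp)

resp = simulate(rate=20.0, service=0.01, period=0.1, budget=0.05, horizon=200.0)
print("p99 response time:", resp[int(0.99 * len(resp))])
```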

Posted Content
TL;DR: Justitia is a software-only, host-based, and easy-to-deploy solution that maximizes RNIC utilization while guaranteeing performance isolation via shaping, rate limiting, and pacing at senders, and it significantly improves latency and throughput of real-world RDMA-based applications without compromising low CPU usage or modifying the applications.
Abstract: Despite its increasing popularity, most of RDMA's benefits such as ultra-low latency can be achieved only when running an application in isolation. Using microbenchmarks and real open-source RDMA applications, we identify a series of performance anomalies when multiple applications coexist and show that such anomalies are pervasive across InfiniBand, RoCEv2, and iWARP. They arise due to a fundamental tradeoff between performance isolation and work conservation, which the state-of-the-art RDMA congestion control protocols such as DCQCN cannot resolve. We present Justitia to address these performance anomalies. Justitia is a software-only, host-based, and easy-to-deploy solution that maximizes RNIC utilization while guaranteeing performance isolation via shaping, rate limiting, and pacing at senders. Our evaluation of Justitia on multiple RDMA implementations shows that Justitia effectively isolates different types of traffic and significantly improves latency (by up to 56.9x) and throughput (by up to 9.7x) of real-world RDMA-based applications without compromising low CPU usage or modifying the applications.
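Sender-side pacing of the kind mentioned above can be sketched as chunking a large message and spacing the chunk sends to a target rate; this is an illustration only, not Justitia's code, and the chunk size and rate are assumptions.

```python
import time

def paced_send(send_fn, message: bytes, rate_bytes_per_s: float, chunk: int = 64 * 1024):
    next_time = time.monotonic()
    for off in range(0, len(message), chunk):
        now = time.monotonic()
        if now < next_time:
            time.sleep(next_time - now)        # pace: wait until this chunk's slot
        send_fn(message[off:off + chunk])
        next_time += chunk / rate_bytes_per_s  # schedule the next chunk's earliest send time

sent = []
paced_send(sent.append, b"x" * (1 << 20), rate_bytes_per_s=8 * 1024 * 1024)
print(len(sent), "chunks sent")               # 16 chunks of 64 KiB, paced at ~8 MiB/s
```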

Proceedings ArticleDOI
07 Jul 2019
TL;DR: A Load-Aware Cache Sharing scheme (LACS) is proposed to enforce isolation between users; it achieves performance isolation in the presence of elephants while improving the mean read latency by up to 80.4% over the state-of-the-art load balancing technique.
Abstract: Cluster caching has been increasingly deployed in front of cloud storage to improve I/O performance. In shared, multi-tenant environments such as cloud datacenters, cluster caches are constantly contended by many users. Enforcing performance isolation between users hence becomes imperative to cluster caching. A user's caching performance critically depends on two factors: (1) the amount of cache allocation and (2) the load of servers in which its files are cached. However, existing cache sharing policies only provide guarantees on the amount of cache allocation, while remaining agnostic to the load of cache servers. Consequently, "mice" users having files co-located with "elephants" contributing heavy data accesses may experience extremely long latency, hence receiving no isolation. In this paper, we propose a Load-Aware Cache Sharing scheme (LACS) to enforce isolation between users. LACS keeps track of the load contributed by each user and reins in the congestion caused by elephant users by throttling their cache usage and network bandwidth. We have implemented LACS atop Alluxio, a popular cluster caching system. EC2 deployment shows that LACS achieves performance isolation in the presence of elephants, while improving the mean read latency by up to 80.4% (25.3% on average) over the state-of-the-art load balancing technique.
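An illustrative sketch of the load-aware idea: track per-user load and cap the bandwidth of heavy hitters. The elephant threshold and the cap are assumptions, not LACS's policy.

```python
from collections import defaultdict

class LoadTracker:
    def __init__(self, elephant_fraction=0.5):
        self.bytes_read = defaultdict(int)
        self.elephant_fraction = elephant_fraction   # assumed threshold for "elephant" users

    def record(self, user: str, nbytes: int) -> None:
        self.bytes_read[user] += nbytes

    def bandwidth_cap(self, user: str, server_capacity: float) -> float:
        total = sum(self.bytes_read.values()) or 1
        share = self.bytes_read[user] / total
        # Users responsible for most of the load get their bandwidth capped (assumed 30% cap).
        return 0.3 * server_capacity if share > self.elephant_fraction else server_capacity

tracker = LoadTracker()
tracker.record("elephant", 9_000_000)
tracker.record("mouse", 1_000_000)
print(tracker.bandwidth_cap("elephant", server_capacity=10e9))   # capped
print(tracker.bandwidth_cap("mouse", server_capacity=10e9))      # unthrottled
```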


Journal ArticleDOI
TL;DR: This paper proposes an efficient user-based resource scheduling scheme for multi-user surface computing systems, called URS, and confirms that URS effectively allocates system resources to multiple users by providing the aforementioned three features.
Abstract: Multi-user surface computing systems are promising for the next generation consumer electronics devices due to their convenience and excellent usability. However, conventional resource scheduling schemes can cause severe performance issues in surface computing systems, especially when they are adopted in multi-user environments, because they do not consider the characteristics of multi-user surface computing systems. In this paper, we propose an efficient user-based resource scheduling scheme for multi-user surface computing systems, called URS. URS provides three different features to effectively support multi-user surface computing systems. First, URS distributes system resources to users, according to their real-world priorities, rather than processes or tasks. Second, URS provides performance isolation among multiple users to prevent resource monopoly by a single user. Finally, URS prioritizes a foreground application of each user via retaining pages used by the application in the page cache, in order to enhance multi-user experience. Our experimental results confirm that URS effectively allocates system resources to multiple users by providing the aforementioned three features.

Proceedings ArticleDOI
01 Aug 2019
TL;DR: The host-level WA-BC (hWA-BC) scheduler is proposed, which aims to achieve performance isolation between multiple processes sharing an open-channel SSD.
Abstract: In datacenters and cloud computing, Quality of Service (QoS) is an essential concept as access to shared resources, including solid state drives (SSDs), must be ensured. The previously proposed workload-aware budget compensation (WA-BC) scheduling algorithm is a device I/O scheduler for guaranteeing performance isolation among multiple virtual machines sharing an SSD. This paper aims to resolve the following three shortcomings of WA-BC: (1) it is applicable to only SR-IOV supporting SSDs, (2) it is unfit for various types of workloads, and (3) it manages flash memory blocks separately in an inappropriate manner. We propose the host-level WA-BC (hWA-BC) scheduler, which aims to achieve performance isolation between multiple processes sharing an open-channel SSD.

Journal ArticleDOI
TL;DR: An optimized memory bandwidth management approach is proposed for ensuring quality of service (QoS) and high server utilization; it is experimentally found that the proposed approach can achieve up to 99% SLO assurance and improve server utilization by up to 6.5×.
Abstract: Latency-critical workloads such as web search engines, social networks and finance market applications are sensitive to tail latencies for meeting service level objectives (SLOs). Since unexpected tail latencies are caused by sharing hardware resources with other co-executing workloads, a service provider executes the latency-critical workload alone. Thus, the data center for the latency-critical workloads has exceedingly low hardware resource utilization. For improving hardware resource utilization, the service provider has to co-locate the latency-critical workloads and other batch processing ones. However, because the memory bandwidth cannot be provided in isolation, unlike the cores and cache memory, the latency-critical workloads experience poor performance isolation even though the cores and cache memory are allocated in isolation to the workloads. To solve this problem, we propose an optimized memory bandwidth management approach for ensuring quality of service (QoS) and high server utilization. To provide isolated shared resources, including the memory bandwidth, to the latency-critical workload and the co-executing batch processing ones, our proposed approach first performs a few pre-profilings under the assumption that memory bandwidth contention is at its worst, using a divide and conquer method. Second, we predict the memory bandwidth needed to meet the SLO for all queries per second (QPS) levels based on the results of the pre-profilings. Then, our approach allocates the amount of isolated memory bandwidth that guarantees the SLO to the latency-critical workload and the rest of the memory bandwidth to the co-executing batch processing ones. It is experimentally found that our proposed approach can achieve up to 99% SLO assurance and improve server utilization by up to 6.5×.
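The divide-and-conquer profiling step can be read as a binary search over the isolated memory-bandwidth allocation. The sketch below assumes a placeholder measurement routine in place of a real run under worst-case contention (for example with Intel MBA as the enforcement knob); the numbers are illustrative.

```python
def meets_slo(bandwidth_gbps: float) -> bool:
    # Placeholder for running the workload with `bandwidth_gbps` reserved, under a
    # worst-case memory-bandwidth antagonist, and checking tail latency against the SLO.
    return bandwidth_gbps >= 17.5    # assumed ground truth for the illustration

def min_bandwidth(lo=0.0, hi=80.0, tol=0.5):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if meets_slo(mid):
            hi = mid                 # SLO met: try reserving less bandwidth
        else:
            lo = mid                 # SLO missed: need more bandwidth
    return hi

print(f"reserve about {min_bandwidth():.1f} GB/s for the latency-critical workload")
```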

Proceedings ArticleDOI
20 May 2019
TL;DR: It is shown that cloud application performance may appear unpredictable if the network hypervisor is not accounted for, and hypervisors should be included in performance models, and their performance benchmarked and compared similarly to other crucial software components such as the SDN controller.
Abstract: Virtualization and multi-tenancy are attractive paradigms to improve the utilization of computing infrastructures and hence to reduce costs. In order to provide a high degree of resource sharing without sacrificing predictable cloud application performance, strict performance isolation needs to be ensured. This is non-trivial and requires models which account for all components where applications may interfere: similarly to security, the predictability of cloud application performance can only be as good as the least predictable component in the model. This paper identifies a new source of potential performance interference that has been overlooked so far: the network hypervisor - a critical component in any multi-tenant network. We present a first measurement study of the performance implications of the network hypervisor in Software-Defined Networks (SDNs). For the purpose of our study, we developed a new open-source benchmarking tool for OpenFlow control and data planes. We show that cloud application performance may appear unpredictable if the network hypervisor is not accounted for: the performance does not only depend on the specific hypervisor implementation and workload (e.g., OpenFlow message types), but also on the number of tenants and the size of the network. Hence, our results suggest that hypervisors should be included in our performance models, and their performance benchmarked and compared similarly to other crucial software components such as the SDN controller.

Posted Content
TL;DR: XOS is presented, an application-defined OS for modern DC servers that leverages modern hardware support for virtualization to move resource management functionality out of the conventional kernel and into user space, which lets applications achieve near bare-metal performance.
Abstract: Rapid growth of datacenter (DC) scale, urgency of cost control, increasing workload diversity, and huge software investment protection place unprecedented demands on the operating system (OS) efficiency, scalability, performance isolation, and backward-compatibility. The traditional OSes are not built to work with deep-hierarchy software stacks, large numbers of cores, tail latency guarantee, and increasingly rich variety of applications seen in modern DCs, and thus they struggle to meet the demands of such workloads. This paper presents XOS, an application-defined OS for modern DC servers. Our design moves resource management out of the OS kernel, supports customizable kernel subsystems in user space, and enables elastic partitioning of hardware resources. Specifically, XOS leverages modern hardware support for virtualization to move resource management functionality out of the conventional kernel and into user space, which lets applications achieve near bare-metal performance. We implement XOS on top of Linux to provide backward compatibility. XOS speeds up a set of DC workloads by up to 1.6X over our baseline Linux on a 24-core server, and outperforms the state-of-the-art Dune by up to 3.3X in terms of virtual memory management. In addition, XOS demonstrates good scalability and strong performance isolation.

Patent
17 Dec 2019
TL;DR: In this paper, a resource management method and system based on multi-tenant cloud storage is presented, and the method comprises the steps: obtaining the tenant performance demands of each tenant, and correspondingly recording the tenant performance demands in metadata of a virtual machine mirror image file used by the tenant; adding a token bucket to the metadata of each VM mirror image file; in a page cache layer of the IO stack, obtaining an index node object of the accessed VM mirror file, and scheduling memory resources by utilizing a token bucket algorithm; converting the file IO request into a corresponding block IO request
Abstract: The invention discloses a resource management method and system based on multi-tenant cloud storage, and the method comprises the steps: obtaining the tenant performance demands of each tenant, and correspondingly recording the tenant performance demands in metadata of a virtual machine mirror image file used by the tenant; adding a token bucket to the metadata of each virtual machine mirror image file; in a page cache layer of the IO stack, obtaining an index node object of the accessed virtual machine mirror image file, and scheduling memory resources by utilizing a token bucket algorithm; converting the file IO request into a corresponding block IO request in a file system layer of the IO stack; for each file IO request, obtaining an index node object of the virtual machine mirror image file accessed by the file IO request, obtaining a tenant performance demand from the index node object, and adding the tenant performance demand to each block IO request obtained by converting the file IO request; and in a block layer of the IO stack, obtaining tenant performance requirements from the block IO request, and scheduling hard disk resources by utilizing a token bucket algorithm. According to the invention, performance isolation between tenants can be effectively realized.
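A minimal token-bucket sketch of the per-tenant throttling the patent describes; the rates, burst sizes, and call site are illustrative only, and the real method attaches buckets to image-file metadata at both the page cache and block layers.

```python
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst      # tokens/second, max bucket depth (assumed units: bytes)
        self.tokens, self.last = burst, time.monotonic()

    def consume(self, amount: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True                          # request may proceed at this layer
        return False                             # request must wait: tenant is over its budget

tenant_buckets = {"tenant-A": TokenBucket(rate=100e6, burst=10e6),    # ~100 MB/s IO budget
                  "tenant-B": TokenBucket(rate=20e6, burst=2e6)}

def submit_block_io(tenant: str, nbytes: int) -> bool:
    return tenant_buckets[tenant].consume(nbytes)

print(submit_block_io("tenant-A", 4096), submit_block_io("tenant-B", 4 * 1024 * 1024))
```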

Proceedings ArticleDOI
01 Nov 2019
TL;DR: A low overhead method is proposed to predict the worst execution time degradation caused by replacement policy adaptation, so that the quality of service (QoS) for a program can be guaranteed no matter how the cache replacement policies alternate.
Abstract: As the shared last level cache (LLC) in multicore processors has been shown to be a critical resource for system performance, much work has been proposed for improving the quality of service (QoS) and throughput of the LLC. Cache Allocation Technology (CAT) and Adaptive Cache Replacement Policies (ACRP) are two of the techniques that are featured in recent Intel processors. CAT implements way partitioning and provides the ability to control the cache space allocation among cores. ACRP works with multiple replacement policies and enables the cache to adapt to the cache replacement policy with fewer cache misses. In this paper, we first show an interesting finding that the ACRP technique can violate the performance isolation provided by CAT. We find that the cause of this problem is that ACRP chooses the cache replacement policy based on global information even though cache space partitioning is enabled by CAT. As a result, the cache/performance isolation can be impaired by interference on the cache replacement policy. To deal with this problem, we propose a low overhead method to predict the worst execution time degradation caused by the replacement policy adaptation. Thus, in the partitioned cache space, if the worst execution time estimated by our method is not beyond the response time required for the program, the QoS for this program can be guaranteed no matter how the cache replacement policies alternate.