
Showing papers on "Temporal isolation among virtual machines" published in 2018


Proceedings Article
09 Apr 2018
TL;DR: This paper presents the design and experience with Andromeda, Google Cloud Platform's network virtualization stack, and demonstrates that the Andromeda datapath achieves performance that is competitive with hardware while maintaining the flexibility and velocity of a software-based architecture.
Abstract: This paper presents our design and experience with Andromeda, Google Cloud Platform’s network virtualization stack. Our production deployment poses several challenging requirements, including performance isolation among customer virtual networks, scalability, rapid provisioning of large numbers of virtual hosts, bandwidth and latency largely indistinguishable from the underlying hardware, and high feature velocity combined with high availability. Andromeda is designed around a flexible hierarchy of flow processing paths. Flows are mapped to a programming path dynamically based on feature and performance requirements. We introduce the Hoverboard programming model, which uses gateways for the long tail of low bandwidth flows, and enables the control plane to program network connectivity for tens of thousands of VMs in seconds. The on-host dataplane is based around a high-performance OS bypass software packet processing path. CPU-intensive per-packet operations with higher latency targets are executed on coprocessor threads. This architecture allows Andromeda to decouple feature growth from fast path performance, as many features can be implemented solely on the coprocessor path. We demonstrate that the Andromeda datapath achieves performance that is competitive with hardware while maintaining the flexibility and velocity of a software-based architecture.

99 citations
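The Hoverboard idea summarized above (program direct host-to-host rules only for high-bandwidth flows, leaving the long tail on default gateway paths) can be illustrated with a small control-loop sketch. The names, thresholds, and rule-installation interface below are illustrative assumptions, not Andromeda's actual API.

```python
# Hedged sketch of a Hoverboard-style control loop: only "elephant" flows get
# dedicated host-to-host forwarding rules; the long tail of low-bandwidth flows
# keeps using preinstalled default routes via gateways. All names are illustrative.
from dataclasses import dataclass

OFFLOAD_THRESHOLD_MBPS = 20.0   # assumed promotion threshold

@dataclass
class FlowStats:
    src_vm: str
    dst_vm: str
    rate_mbps: float            # rate observed at the gateway

def program_flows(flow_stats, host_rules):
    """Promote high-rate flows to direct host-to-host rules; demote idle ones."""
    for f in flow_stats:
        key = (f.src_vm, f.dst_vm)
        if f.rate_mbps >= OFFLOAD_THRESHOLD_MBPS and key not in host_rules:
            host_rules[key] = "direct"      # control plane installs a host route
        elif f.rate_mbps < OFFLOAD_THRESHOLD_MBPS / 4 and key in host_rules:
            del host_rules[key]             # fall back to the default gateway path

rules = {}
program_flows([FlowStats("vm-a", "vm-b", 950.0),
               FlowStats("vm-a", "vm-c", 0.3)], rules)
print(rules)   # only the vm-a -> vm-b flow gets a direct rule
```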


Proceedings ArticleDOI
01 Feb 2018
TL;DR: KPart is presented, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems, and achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.
Abstract: Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of way-partitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism. We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.

96 citations
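The two-step structure described above (cluster applications, then way-partition the cache among clusters) can be sketched as follows. The slowdown estimate and the proportional way assignment are simplified placeholders, not KPart's actual profiling-based heuristics.

```python
# Simplified cluster-then-partition sketch: applications whose estimated
# slowdown when sharing a partition stays below a bound are grouped, and the
# available cache ways are then split across clusters in proportion to demand.
# The cost model here is a placeholder, not KPart's real estimator.
TOTAL_WAYS = 12

def estimated_slowdown(a, b):
    # Placeholder: overlap of desired capacity as a proxy for the sharing penalty.
    return min(a["ways_wanted"], b["ways_wanted"]) / TOTAL_WAYS

def cluster_apps(apps, max_slowdown=0.2):
    clusters = []
    for app in apps:
        for cl in clusters:
            if all(estimated_slowdown(app, other) <= max_slowdown for other in cl):
                cl.append(app)
                break
        else:
            clusters.append([app])
    return clusters

def partition_ways(clusters):
    demand = [sum(a["ways_wanted"] for a in cl) for cl in clusters]
    total = sum(demand)
    return [max(1, round(TOTAL_WAYS * d / total)) for d in demand]

apps = [{"name": "mcf", "ways_wanted": 8},
        {"name": "lbm", "ways_wanted": 2},
        {"name": "milc", "ways_wanted": 2}]
clusters = cluster_apps(apps)
print([[a["name"] for a in cl] for cl in clusters], partition_ways(clusters))
```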


Proceedings Article
09 Apr 2018
TL;DR: ResQ is presented, a resource manager for NFV that enforces performance SLOs for multi-tenant NFV clusters in a resource efficient manner and achieves 60%-236% better resource efficiency for enforcing SLOs that contain contention-sensitive NFs compared to previous work.
Abstract: Network Function Virtualization is allowing carriers to replace dedicated middleboxes with Network Functions (NFs) consolidated on shared servers, but the question of how (and even whether) one can achieve performance SLOs with software packet processing remains open. A key challenge is the high variability and unpredictability in throughput and latency introduced when NFs are consolidated. We show that, using processor cache isolation and with careful sizing of I/O buffers, we can directly enforce a high degree of performance isolation among consolidated NFs - for a wide range of NFs, our technique caps the maximum throughput degradation to 2.9% (compared to 44.3%), and the 95th percentile latency degradation to 2.5% (compared to 24.5%). Building on this, we present ResQ, a resource manager for NFV that enforces performance SLOs for multi-tenant NFV clusters in a resource efficient manner. ResQ achieves 60%-236% better resource efficiency for enforcing SLOs that contain contention-sensitive NFs compared to previous work.

96 citations


Journal ArticleDOI
TL;DR: This paper focuses on the efficient online live migration of multiple correlated VMs in VDC requests, and proposes an efficient VDC migration algorithm (VDC-M), which uses the US-wide National Science Foundation (NSF) network as the substrate network to conduct extensive simulation experiments.
Abstract: With the development of cloud computing, virtual machine migration is emerging as a promising technique to save energy, enhance resource utilization, and guarantee Quality of Service (QoS) in cloud datacenters. Most existing studies on virtual machine migration, however, are based on migrating a single virtual machine. Although there is some research on migrating multiple virtual machines, it usually does not consider the correlation among these virtual machines. In practice, in order to save energy and maintain system performance, cloud providers usually need to migrate multiple correlated virtual machines or migrate an entire virtual datacenter (VDC) request. In this paper, we focus on the efficient online live migration of multiple correlated VMs in VDC requests, optimizing the migration performance. To solve this problem, we propose an efficient VDC migration algorithm (VDC-M). We use the US-wide National Science Foundation (NSF) network as the substrate network to conduct extensive simulation experiments. Simulation results show that the performance of the proposed algorithm is promising in terms of the total VDC remapping cost, the blocking ratio, the average migration time, and the average downtime.

83 citations


Proceedings Article
11 Jul 2018
TL;DR: It is shown that colocating CPU-intensive jobs with latency-sensitive services increases average CPU utilization from 21% to 66% for off-peak load without impacting tail latency.
Abstract: Large commercial latency-sensitive services, such as web search, run on dedicated clusters provisioned for peak load to ensure responsiveness and tolerate data center outages. As a result, the average load is far lower than the peak load used for provisioning, leading to resource under-utilization. The idle resources can be used to run batch jobs, completing useful work and reducing overall data center provisioning costs. However, this is challenging in practice due to the complexity and stringent tail-latency requirements of latency-sensitive services. Left unmanaged, the competition for machine resources can lead to severe response-time degradation and unmet service-level objectives (SLOs). This work describes PerfIso, a performance isolation framework which has been used for nearly three years in Microsoft Bing, a major search engine, to colocate batch jobs with production latency-sensitive services on over 90,000 servers. We discuss the design and implementation of PerfIso, and conduct an experimental evaluation in a production environment. We show that colocating CPU-intensive jobs with latency-sensitive services increases average CPU utilization from 21% to 66% for off-peak load without impacting tail latency.

80 citations


Journal ArticleDOI
TL;DR: The simulation results demonstrate that the EC-VMC method effectively overcomes the deficiencies of some existing heuristic algorithms and is highly effective in reducing VM migrations and energy consumption of data centers and in improving QoS.

49 citations


Proceedings ArticleDOI
23 Apr 2018
TL;DR: dCat is proposed, a new dynamic cache management technology to provide strong cache isolation with better performance and requires no modifications to applications so that it can be applied to all cloud workloads.
Abstract: In the modern multi-tenant cloud, resource sharing increases utilization but causes performance interference between tenants. More generally, performance isolation is also relevant in any multi-workload scenario involving shared resources. The last-level cache (LLC) is shared by all CPU cores on x86 processors; thus, cloud tenants inevitably suffer from cache evictions caused by noisy neighbors running on the same socket. Intel Cache Allocation Technology (CAT) provides a mechanism to assign cache ways to cores to enable cache isolation, but its static configuration can result in underutilized cache when a workload cannot benefit from its allocated cache capacity, and/or lead to sub-optimal performance for workloads that do not have enough assigned capacity to fit their working set. In this work, we propose a new dynamic cache management technology (dCat) to provide strong cache isolation with better performance. For each workload, we target a consistent, minimum performance bound irrespective of others on the socket and dependent only on its rightful share of the LLC capacity. In addition, when there is spare capacity on the socket, or when some workloads are not obtaining beneficial performance from their cache allocation, dCat dynamically reallocates cache space to cache-intensive workloads. We have implemented dCat in Linux on top of CAT to dynamically adjust cache mappings. dCat requires no modifications to applications, so it can be applied to all cloud workloads. Based on our evaluation, we see an average of 25% improvement over shared cache and 15.7% over static CAT for selected memory-intensive SPEC CPU2006 workloads. For typical cloud workloads, with Redis we see 57.6% improvement (over a shared LLC) and 26.6% improvement (over static partitioning), and with ElasticSearch we see 11.9% improvement over both.

44 citations
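One common way to drive Intel CAT from user space, and hence one way a dCat-like manager could remap cache ways at runtime, is the Linux resctrl filesystem. The group names, bitmasks, and reallocation trigger below are illustrative assumptions, not dCat's implementation.

```python
# Minimal sketch of programming Intel CAT via Linux resctrl (requires root and
# a mounted /sys/fs/resctrl). Group names, masks, and the growth policy are
# illustrative only, not taken from dCat.
import os

RESCTRL = "/sys/fs/resctrl"

def create_group(name, pids, l3_mask):
    """Create a resctrl control group, assign tasks, and program its L3 ways."""
    path = os.path.join(RESCTRL, name)
    os.makedirs(path, exist_ok=True)
    for pid in pids:
        with open(os.path.join(path, "tasks"), "w") as f:
            f.write(str(pid))               # one PID per write
    set_ways(name, l3_mask)

def set_ways(name, l3_mask):
    """Rewrite the group's L3 way bitmask (socket 0 only, for simplicity)."""
    with open(os.path.join(RESCTRL, name, "schemata"), "w") as f:
        f.write("L3:0=%x\n" % l3_mask)

# A dCat-like manager could start from a static split and then grow the
# cache-hungry group's mask when it detects spare or unprofitable ways:
# create_group("redis", [1234], 0x00f)      # 4 ways for a latency-sensitive tenant
# create_group("batch", [5678], 0x0f0)      # 4 ways for a batch tenant
# set_ways("redis", 0x00f | 0xf00)          # grant spare ways to the hungry tenant
```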


Posted Content
TL;DR: Containers, enabling lightweight environment and performance isolation, fast and flexible deployment, and fine-grained resource sharing, have gained popularity in better application management and deployment in addition to hardware virtualization as discussed by the authors.
Abstract: Containers, enabling lightweight environment and performance isolation, fast and flexible deployment, and fine-grained resource sharing, have gained popularity in better application management and deployment in addition to hardware virtualization. They are being widely used by organizations to deploy their increasingly diverse workloads derived from modern-day applications such as web services, big data, and IoT in either proprietary clusters or private and public cloud data centers. This has led to the emergence of container orchestration platforms, which are designed to manage the deployment of containerized applications in large-scale clusters. These systems are capable of running hundreds of thousands of jobs across thousands of machines. To do so efficiently, they must address several important challenges including scalability, fault-tolerance and availability, efficient resource utilization, and request throughput maximization among others. This paper studies these management systems and proposes a taxonomy that identifies different mechanisms that can be used to meet the aforementioned challenges. The proposed classification is then applied to various state-of-the-art systems leading to the identification of open research challenges and gaps in the literature intended as future directions for researchers.

34 citations


Proceedings ArticleDOI
16 Apr 2018
TL;DR: NeuroViNE is a novel approach to speed up and improve a wide range of existing VNE algorithms: it is based on a search space reduction mechanism and preprocesses a problem instance by extracting relevant subgraphs, i.e., good combinations of substrate nodes and links.
Abstract: Network virtualization enables increasingly diverse network services to cohabit and share a given physical infrastructure and its resources, with the possibility to rely on different network architectures and protocols optimized towards specific requirements. In order to ensure a predictable performance despite shared resources, network virtualization requires a strict performance isolation and hence, resource reservations. Moreover, the creation of virtual networks should be fast and efficient. The underlying NP-hard algorithmic problem is known as the Virtual Network Embedding (VNE) problem and has been studied intensively over the last years. This paper presents NeuroViNE, a novel approach to speed up and improve a wide range of existing VNE algorithms: NeuroViNE is based on a search space reduction mechanism and preprocesses a problem instance by extracting relevant subgraphs, i.e., good combinations of substrate nodes and links. These subgraphs can then be fed to an existing algorithm for faster and more resource-efficient embeddings. NeuroViNE relies on a Hopfield network, and its performance benefits are investigated in simulations for random networks, real substrate networks, and data center networks.

33 citations


Journal ArticleDOI
TL;DR: An autonomic resource slicing (virtualization) scheme is introduced, which realizes autonomic management and configuration of virtual APs in a LiFi attocell access network, based on SPs' and their users' service requirements.
Abstract: LiFi attocell access networks will be deployed everywhere to support diverse applications and service provisioning to various end-users. LiFi infrastructure providers will need to offer LiFi access point (AP) resources as a service. This, however, requires solving the research challenge of dynamically and effectively allocating resources among service providers (SPs) while guaranteeing performance isolation among them and their respective users. This paper introduces an autonomic resource slicing (virtualization) scheme, which realizes autonomic management and configuration of virtual APs in a LiFi attocell access network, based on SPs' and their users' service requirements. The developed scheme comprises traffic analysis and classification, a local AP controller, a downlink and uplink slice resource manager, traffic measurement, and information collection modules. It also contains a hybrid medium access protocol and an extended token bucket fair queueing algorithm to support uplink access virtualization and spectrum slicing. The proposed resource slicing scheme collects and analyzes the traffic statistics of the different applications supported on the slices defined in each LiFi AP and distributes the available resources fairly and proportionally among them. It uses a control algorithm to adjust the minimum contention window of user devices to achieve the target throughput and ensure airtime fairness among SPs and their users. The developed scheme has been extensively evaluated using OMNeT++. The obtained results show various resource slicing capabilities to support differentiated services and performance isolation.

32 citations
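The per-slice rate enforcement that the abstract's "extended token bucket fair queueing algorithm" builds on can be illustrated with a plain token bucket; the extension itself is not reproduced here, and the rates, burst sizes, and slice names are assumptions.

```python
# Plain token-bucket sketch of per-slice rate enforcement, the building block
# behind the extended token bucket fair queueing mentioned in the abstract.
# Rates, burst sizes, and slice names are illustrative assumptions.
import time

class TokenBucket:
    def __init__(self, rate_bps, burst_bits):
        self.rate = rate_bps
        self.burst = burst_bits
        self.tokens = burst_bits
        self.last = time.monotonic()

    def allow(self, frame_bits):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= frame_bits:
            self.tokens -= frame_bits
            return True           # frame fits within the slice's share
        return False              # frame must wait; airtime goes to other slices

slices = {"sp_video": TokenBucket(50e6, 1e6), "sp_iot": TokenBucket(5e6, 1e5)}
print(slices["sp_video"].allow(12_000), slices["sp_iot"].allow(12_000))
```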


Journal ArticleDOI
25 Aug 2018
TL;DR: Four system properties are addressed: elasticity of the cloud service, to accommodate large variations in the amount of service requested; performance isolation between the tenants of shared cloud systems and the resulting performance variability; availability of cloud services and systems; and the operational risk of running a production system in a cloud environment.
Abstract: In only a decade, cloud computing has emerged from a pursuit for a service-driven information and communication technology (ICT), becoming a significant fraction of the ICT market. Responding to the growth of the market, many alternative cloud services and their underlying systems are currently vying for the attention of cloud users and providers. To make informed choices between competing cloud service providers, permit the cost-benefit analysis of cloud-based systems, and enable system DevOps to evaluate and tune the performance of these complex ecosystems, appropriate performance metrics, benchmarks, tools, and methodologies are necessary. This requires re-examining old system properties and considering new system properties, possibly leading to the re-design of classic benchmarking metrics such as expressing performance as throughput and latency (response time). In this work, we address these requirements by focusing on four system properties: (i) elasticity of the cloud service, to accommodate large variations in the amount of service requested, (ii) performance isolation between the tenants of shared cloud systems and resulting performance variability, (iii) availability of cloud services and systems, and (iv) the operational risk of running a production system in a cloud environment. Focusing on key metrics for each of these properties, we review the state-of-the-art, then select or propose new metrics together with measurement approaches. We see the presented metrics as a foundation toward upcoming, future industry-standard cloud benchmarks.
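One simple way to quantify the performance-isolation property discussed above is to compare a tenant's latency with and without a disruptive co-tenant. The formulation below is one plausible metric for illustration, not necessarily a metric this paper reviews or proposes.

```python
# One plausible isolation metric (not necessarily the paper's): the relative
# latency degradation a well-behaved tenant suffers when a disruptive tenant
# ramps up its load. 1.0 means perfect isolation; 0.0 means the victim's
# latency doubled (or worse) under interference.
def isolation_score(latency_alone_ms, latency_contended_ms):
    degradation = (latency_contended_ms - latency_alone_ms) / latency_alone_ms
    return max(0.0, 1.0 - degradation)

print(isolation_score(10.0, 11.0))   # 0.9 -> mild interference
print(isolation_score(10.0, 25.0))   # 0.0 -> isolation effectively broken
```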

Proceedings ArticleDOI
12 Jun 2018
TL;DR: Lasagna is introduced, a novel end-to-end solution that enables flexible management of slices encompassing both the wired and the wireless segments of an Enterprise WLAN and can ensure both functional and performance isolation between the different slices and efficient radio resource utilization.
Abstract: Current 802.11-based WLANs are asked to support an ever increasing number of services and applications, each of them characterized by a diverse set of requirements in terms of bitrate, latency, and reliability. Network virtualization and programmability are two emerging trends that can support the realization of such a vision in a cost-effective fashion. In this paper we introduce Lasagna, a novel end-to-end solution that enables flexible management of slices encompassing both the wired and the wireless segments of an Enterprise WLAN. Lasagna allows flexible management of network slices to meet their respective service requirements. An experimental evaluation carried out over a real-world testbed shows that Lasagna can ensure both functional and performance isolation between the different slices and efficient radio resource utilization. We release the entire implementation including the controller and the datapath under a permissive license for academic use.

Proceedings ArticleDOI
01 Aug 2018
TL;DR: It is demonstrated that, by following a single-pipeline design principle, it is possible to control each tenant's share of network bandwidth and computational resources even for complex, distributed operations.
Abstract: FPGAs can be used to speed up computation and data management tasks in various application domains. In cloud settings, however, high utilization is as important as high performance. In software it is common to co-locate different tenants' workloads on the same servers to increase utilization. Sharing an FPGA is more complex because applications take up physical space on the chip. Even though it is possible to physically partition the FPGA, tenants can have widely different requirements and their needs can also fluctuate over time. In this paper, we take a different approach and provide flexibility to the tenants who are interested in the same type of application but have different workloads and quality of service requirements. We demonstrate our approach of multi-tenant design using a key-value store service but the ideas generalize to other network-facing services as well. A key challenge of multi-tenancy is to efficiently share the underlying hardware while enforcing strict data and performance isolation between tenants. In this paper we demonstrate that, by following a single-pipeline design principle, it is possible to control each tenant's share of network bandwidth and computational resources even for complex, distributed operations. Furthermore, we show how state-machine based logic on the FPGA can be made tenant-aware without introducing significant context-switching overhead. Finally, our hardware design provides flexibility for changing per-tenant shares, allowing the same circuit to be used by one or multiple tenants without performance loss.

Proceedings ArticleDOI
10 Dec 2018
TL;DR: This work proposes NBWGuard, a design for network bandwidth management and evaluates its implementation, which lets Kubernetes manage network bandwidth as a resource while still using plug-ins for realizing the network specification desired by users.
Abstract: Kubernetes is a very popular and fast-growing container orchestration platform that automates the process of deploying and managing multi-container applications at scale. Users can specify required and maximum values of the resources they need for their containers, and Kubernetes realizes them by interfacing with lower levels of the stack (the container runtime, which in turn can use OS capabilities) to enforce them. Kubernetes supports differentiated QoS classes - Guaranteed, Burstable, and Best-effort - in order of decreasing priority based on the resource size specifications for CPU and memory capacity. This allows many applications to obtain a desired level of QoS (performance isolation and throughput) when CPU or memory capacity management can provide them. However, when workloads are critically dependent for their performance on another resource, namely network bandwidth, Kubernetes has no means to meet their QoS needs. Networking between pods in Kubernetes is supported with plug-ins, and the network resource is not managed directly. In this work, we propose NBWGuard, a design for network bandwidth management, and evaluate its implementation. NBWGuard lets Kubernetes manage network bandwidth as a resource (like CPU or memory capacity) while still using plug-ins for realizing the network specification desired by users. Consistent with the Kubernetes approach to application QoS based on resource allocation, NBWGuard also supports the 3 QoS classes: Guaranteed, Burstable, and Best-effort with respect to network bandwidth. NBWGuard is evaluated with the iperf benchmark in a real cloud environment, and the evaluation results demonstrate that it is able to provide network bandwidth isolation without impact on overall throughput.
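To make the idea concrete, a bandwidth enforcer in this spirit might map the three QoS classes to egress caps on each pod's host-side veth using standard Linux traffic shaping. The interface lookup, class-to-rate mapping, and use of the tbf qdisc below are assumptions for illustration; they are not taken from NBWGuard's implementation.

```python
# Hedged sketch: map Kubernetes-style QoS classes to egress caps on a pod's
# host-side veth using the standard Linux tbf qdisc. The class-to-rate policy
# and the veth name are illustrative assumptions, not NBWGuard's design.
import subprocess

QOS_RATES = {
    "Guaranteed": None,          # assumed policy: no cap for guaranteed pods
    "Burstable": "500mbit",
    "BestEffort": "100mbit",
}

def shape_pod_egress(veth_iface, qos_class):
    rate = QOS_RATES[qos_class]
    if rate is None:
        return
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", veth_iface, "root",
         "tbf", "rate", rate, "burst", "32kbit", "latency", "400ms"],
        check=True,
    )

# shape_pod_egress("veth1a2b3c", "BestEffort")   # requires root on the node
```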

Proceedings ArticleDOI
21 May 2018
TL;DR: Experimental results with Hadoop MapReduce and Spark benchmarks show that PerfCloud effectively reduces their job completion time, decreases performance variability, and improves resource utilization efficiency while minimizing the performance degradation of other colocated VMs.
Abstract: Data-intensive applications often suffer from performance variability and degradation in the cloud due to the intrinsically complex problem of performance interference that arises from multi-tenancy. Although application-level approaches to straggler mitigation for scale-out data processing frameworks such as MapReduce and Spark address the issue to some extent, they incur extra resources and often react only after tasks have already slowed down. In this paper, we present PerfCloud, a novel system software that utilizes system-level performance metrics for early detection of performance interference in a multi-tenant cloud and provides non-invasive performance isolation through fine-grained resource control. Unlike existing works, PerfCloud does not require time-consuming workload profiling or intrusive modification of the application framework and the operating system. We implemented PerfCloud on NSF Cloud's Chameleon testbed using KVM for virtualization, and OpenStack for cloud management. Experimental results with Hadoop MapReduce and Spark benchmarks show that PerfCloud effectively reduces their job completion time, decreases performance variability, and improves resource utilization efficiency while minimizing the performance degradation of other colocated VMs.

Proceedings ArticleDOI
01 Apr 2018
TL;DR: Proctor, a real-time, lightweight and scalable analytics fabric that detects performance-intrusive VMs and identifies their root causes from among the arbitrary VMs running in shared datacenters across 4 key hardware resources – network, I/O, cache, and CPU, is introduced.
Abstract: Cloud-scale datacenter management systems utilize virtualization to provide performance isolation while maximizing the utilization of the underlying hardware infrastructure. However, virtualization does not provide complete performance isolation as Virtual Machines (VMs) still compete for non-reservable shared resources (like caches, network, and I/O bandwidth). This becomes highly challenging to address in datacenter environments housing tens of thousands of VMs, causing degradation in application performance. Addressing this problem for production datacenters requires a non-intrusive, scalable solution that 1) detects performance intrusion and 2) investigates both the intrusive VMs causing interference and the resource(s) for which the VMs are competing. To address this problem, this paper introduces Proctor, a real-time, lightweight and scalable analytics fabric that detects performance-intrusive VMs and identifies their root causes from among the arbitrary VMs running in shared datacenters across 4 key hardware resources – network, I/O, cache, and CPU. Proctor is based on a robust statistical approach that requires no special profiling phases, standing in stark contrast to a wide body of prior work that assumes pre-acquisition of application-level information prior to its execution. By detecting performance degradation and identifying the root cause VMs and their metrics, Proctor can be utilized to dramatically improve the performance outcomes of applications executing in large-scale datacenters. From our experiments, we are able to show that when we deploy Proctor in a datacenter housing a mix of I/O, network, compute and cache-sensitive applications, it is able to effectively pinpoint performance-intrusive VMs. Further, we observe that when Proctor is applied with migration, the application-level Quality-of-Service improves by an average of 2.2× as compared to systems which are unable to detect, identify and pinpoint performance intrusion and their root causes.
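A minimal version of this kind of profiling-free root-cause analysis can be sketched by correlating the victim's degradation signal with co-resident VMs' per-resource usage. The data layout and the use of plain Pearson correlation are simplifying assumptions, not Proctor's actual statistical machinery.

```python
# Minimal correlation-based root-cause sketch: find which neighbor VM's usage
# of which resource tracks the victim's latency most closely. Pearson
# correlation and the data layout are simplifying assumptions.
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def pinpoint(victim_latency, neighbors):
    """neighbors: {vm: {resource: usage_timeseries}} sampled alongside latency."""
    best = max(((pearson(victim_latency, series), vm, res)
                for vm, resources in neighbors.items()
                for res, series in resources.items()))
    return best   # (correlation, suspect VM, contended resource)

latency = [1.0, 1.1, 2.3, 2.4, 1.0, 2.5]
neighbors = {"vm7": {"cache_misses": [5, 6, 40, 42, 5, 44], "net_mbps": [9] * 6},
             "vm9": {"io_ops": [3, 3, 4, 3, 3, 4]}}
print(pinpoint(latency, neighbors))   # vm7's cache usage best explains the spikes
```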

Journal ArticleDOI
TL;DR: A service-curve-based QoS algorithm is proposed to support latency-guarantee applications, IOPS-guarantee applications and best-effort applications on the same storage system; it not only provides a QoS guarantee for applications but also pursues better system utilization.
Abstract: With the growing popularity of cloud storage, more and more diverse applications with diverse service level agreements (SLAs) are being accommodated in it. Quality of service (QoS) support for applications in shared cloud storage therefore becomes important. However, performance isolation, diverse performance requirements (especially harsh latency guarantees), and high system utilization are all challenging and desirable goals for QoS design. In this paper, we propose a service-curve-based QoS algorithm to support latency-guarantee applications, IOPS-guarantee applications and best-effort applications on the same storage system, which not only provides QoS guarantees for applications but also pursues better system utilization. Three priority queues are exploited and different service curves are applied for different types of applications. I/O requests from different applications are scheduled and dispatched among the three queues according to their service curves and I/O urgency status, so that the QoS requirements of all applications can be guaranteed on the shared storage system. Our experimental results show that our algorithm not only simultaneously guarantees the QoS targets of latency and throughput (IOPS), but also improves the utilization of storage resources.
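The three-queue dispatching idea can be sketched with coarse urgency rules: latency-class requests carry deadlines and jump ahead when close to them, the IOPS class is served while it is behind its reserved rate, and best-effort gets leftover capacity. These rules are a simplification standing in for real service curves, and all thresholds are illustrative.

```python
# Simplified three-queue dispatcher in the spirit of the abstract. Real service
# curves are replaced by coarse urgency rules; thresholds are illustrative.
import heapq
import time

class ThreeClassDispatcher:
    def __init__(self, iops_reserved):
        self.latency_q = []                  # min-heap of (deadline, seq, request)
        self.iops_q = []
        self.best_effort_q = []
        self.iops_reserved = iops_reserved   # reserved requests per second
        self.iops_done = 0
        self.start = time.monotonic()
        self.seq = 0

    def submit_latency(self, req, target_ms):
        self.seq += 1
        heapq.heappush(self.latency_q,
                       (time.monotonic() + target_ms / 1e3, self.seq, req))

    def next_request(self, slack_ms=2.0):
        now = time.monotonic()
        urgent = self.latency_q and self.latency_q[0][0] - now <= slack_ms / 1e3
        if urgent or (self.latency_q and not self.iops_q and not self.best_effort_q):
            return heapq.heappop(self.latency_q)[2]
        elapsed = max(now - self.start, 1e-6)
        if self.iops_q and self.iops_done / elapsed < self.iops_reserved:
            self.iops_done += 1
            return self.iops_q.pop(0)
        if self.best_effort_q:
            return self.best_effort_q.pop(0)
        if self.latency_q:
            return heapq.heappop(self.latency_q)[2]
        return None

d = ThreeClassDispatcher(iops_reserved=500)
d.submit_latency("db-read", target_ms=5)
d.iops_q.append("scan-1")
d.best_effort_q.append("backup-1")
# Other classes are served while the latency request is not yet urgent.
print(d.next_request(), d.next_request(), d.next_request())
```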

Journal ArticleDOI
TL;DR: Simulation results showed that the proposed self-adaptive network-aware virtual machine clustering and consolidation algorithm considerably reduced the number of high-delay jobs, lowered the average traffic passing through aggregate switches, and improved the communication ability among virtual machines.
Abstract: A modern data center consists of thousands of servers, racks, and switches. This complicated structure requires well-designed algorithms to utilize data center resources efficiently. Current virtual machine scheduling algorithms mainly focus on the initial allocation of virtual machines based on CPU, memory, and network bandwidth requirements. However, when tasks finish or leases expire, the related virtual machines are deleted from the system, which generates resource fragments. Such fragments lead to unbalanced resource utilization and a decline in communication performance. This paper investigates the network influence on typical applications in data centers and proposes a self-adaptive network-aware virtual machine clustering and consolidation algorithm to maintain an optimal system-wide status. Our consolidation algorithm periodically checks whether consolidation is necessary and then clusters and consolidates virtual machines to lower communication cost with an online heuristic. We used two benchmarks in a real environment to examine the network influence on different tasks. To evaluate the advantages of the proposed algorithm, we also built a cloud computing testbed. Real workload trace-driven simulations and testbed-based experiments showed that our algorithm greatly shortened the average finish time of map-reduce tasks and reduced the time delay of web applications. Simulation results showed that our algorithm considerably reduced the number of high-delay jobs, lowered the average traffic passing through aggregate switches, and improved the communication ability among virtual machines.

Proceedings Article
09 Jul 2018
TL;DR: This paper proposes a utilitarian performance isolation (UPI) scheme for shared SSD settings that exploits SSD's abundant parallelism to maximize the utility of all tenants while providing performance isolation.
Abstract: This paper proposes a utilitarian performance isolation (UPI) scheme for shared SSD settings. UPI exploits SSD's abundant parallelism to maximize the utility of all tenants while providing performance isolation. Our approach is in contrast to static resource partitioning techniques that bind parallelism, isolation, and capacity altogether. We demonstrate that our proposed scheme reduces the 99th percentile response time by 38.5% for a latency-critical workload, and the average response time by 16.1% for a high-throughput workload compared to the static approaches.

Proceedings ArticleDOI
19 Mar 2018
TL;DR: Sugar is presented, a novel operating system solution that enhances the security of GPU acceleration for web apps by design and achieves high performance despite GPU virtualization overhead.
Abstract: Modern personal computers have embraced increasingly powerful Graphics Processing Units (GPUs). Recently, GPU-based graphics acceleration in web apps (i.e., applications running inside a web browser) has become popular. WebGL is the main effort to provide OpenGL-like graphics for web apps and it is currently used in 53% of the top-100 websites. Unfortunately, WebGL has posed serious security concerns as several attack vectors have been demonstrated through WebGL. Web browsers' solutions to these attacks have been reactive: discovered vulnerabilities have been patched and new runtime security checks have been added. Unfortunately, this approach leaves the system vulnerable to zero-day vulnerability exploits, especially given the large size of the Trusted Computing Base of the graphics plane. We present Sugar, a novel operating system solution that enhances the security of GPU acceleration for web apps by design. The key idea behind Sugar is using a dedicated virtual graphics plane for a web app by leveraging modern GPU virtualization solutions. A virtual graphics plane consists of a dedicated virtual GPU (or vGPU) as well as all the software graphics stack (including the device driver). Sugar enhances the system security since a virtual graphics plane is fully isolated from the rest of the system. Despite GPU virtualization overhead, we show that Sugar achieves high performance. Moreover, unlike current systems, Sugar is able to use two underlying physical GPUs, when available, to co-render the User Interface (UI): one GPU is used to provide virtual graphics planes for web apps and the other to provide the primary graphics plane for the rest of the system. Such a design not only provides strong security guarantees, it also provides enhanced performance isolation.

Proceedings ArticleDOI
01 Sep 2018
TL;DR: PerfGreen is presented, a dynamic auto-tuning resource management system for improving energy efficiency with minimal performance impact in heterogeneous clouds through a combination of admission control, scheduling, and online resource allocation methods with performance isolation and application priority techniques.
Abstract: Improving energy efficiency in a cloud environment is challenging because of poor energy proportionality, low resource utilization, interference as well as workload, application, and hardware dynamism. In this paper we present PerfGreen, a dynamic auto-tuning resource management system for improving energy efficiency with minimal performance impact in heterogeneous clouds. PerfGreen achieves this through a combination of admission control, scheduling, and online resource allocation methods with performance isolation and application priority techniques. Scheduling in PerfGreen is energy aware and power management capabilities such as CPU frequency adaptation and hard CPU power limiting are exploited. CPU scaling is combined with performance isolation techniques, including CPU pinning and quota enforcement, for prioritized virtual machines to improve energy efficiency. An evaluation based on our prototype implementation shows that PerfGreen with its energy-aware scheduler and resource allocator on average reduces energy usage by 53%, improves performance per watt by 64%, and server density by 25% while keeping performance deviations to a minimum.

Journal ArticleDOI
TL;DR: This paper presents a framework for a self-adaptive network architecture for HPC clouds based on lossless interconnection networks, demonstrated by means of the implemented IB prototype, and presents IBAdapt, a simplified rule-based language for the service providers to specify adaptation strategies used by the framework.
Abstract: Clouds offer flexible and economically attractive compute and storage solutions for enterprises. However, the effectiveness of cloud computing for high-performance computing (HPC) systems still remains questionable. When clouds are deployed on lossless interconnection networks, like InfiniBand (IB), challenges related to load-balancing, low-overhead virtualization, and performance isolation hinder full potential utilization of the underlying interconnect. Moreover, cloud data centers incorporate a highly dynamic environment rendering static network reconfigurations, typically used in IB systems, infeasible. In this paper, we present a framework for a self-adaptive network architecture for HPC clouds based on lossless interconnection networks, demonstrated by means of our implemented IB prototype. Our solution, based on a feedback control and optimization loop, enables the lossless HPC network to dynamically adapt to the varying traffic patterns, current resource availability, workload distributions, and also in accordance with the service provider-defined policies. Furthermore, we present IBAdapt, a simplified rule-based language for the service providers to specify adaptation strategies used by the framework. Our developed self-adaptive IB network prototype is demonstrated using state-of-the-art industry software. The results obtained on a test cluster demonstrate the feasibility and effectiveness of the framework when it comes to improving Quality-of-Service compliance in HPC clouds.

Journal ArticleDOI
TL;DR: The proposed approach employs a novel architectural framework, named DPIM, which enables service providers to realize different isolation methods and enforces performance isolation transparently; experiments demonstrate the practicality and effectiveness of the approach and the related framework for performance isolation management in different service environments, with different operating entities.
Abstract: Unmanaged resource contention in cloud computing environments can cause problems such as performance interference, service quality degradation, and, consequently, violation of service agreements. Performance isolation is an indispensable remedy for these challenges. Dynamic analysis and monolithic management of performance isolation from the perspective of cloud computing services with different operating entities is a challenging problem. This issue has not been addressed in previous studies, despite its significance. Most previous research has focused on particular algorithms and methods for specific application scenarios and lacks sufficient description of the analysis and management aspects of performance isolation. Due to the importance of this issue, this paper makes an in-depth investigation of this problem and proposes a novel approach for dynamic analysis and management of performance isolation for cloud computing services. The proposed approach employs a novel architectural framework, named DPIM, which enables service providers to realize different isolation methods and enforces performance isolation transparently. The experimental results demonstrate the practicality and effectiveness of the proposed approach and the related framework for performance isolation management in different service environments, with different operating entities.

Journal ArticleDOI
TL;DR: This paper proposes a method to mitigate the performance degradation of other VMs by dynamically allocating the resource usage time of the VM and preventing the priority preemption of the GPGPU-intensive VM.
Abstract: As the size of data increases and computation becomes more complicated in fog computing environments, the need for high-performance computation is increasing. One of the most popular ways to improve the performance of a virtual machine (VM) is to allocate a graphics processing unit (GPU) to the VM to support general-purpose computing on graphics processing units (GPGPU) operations. Direct pass-through, often used for GPUs in VMs, is popular in the cloud because VMs can use the full functionality of the GPU and experience virtually no performance degradation owing to virtualization. Direct pass-through is very useful for improving the performance of VMs. However, since the GPU usage time is not considered in the VM scheduler that operates based on the central processing unit (CPU) usage time of the VM, the VM performing the GPGPU operation degrades the performance of other VMs. In this paper, we analyze the effect of the VM performing the GPGPU operation (GPGPU-intensive VM) on other VMs through experiments. Then, we propose a method to mitigate the performance degradation of other VMs by dynamically allocating the resource usage time of the VM and preventing the priority preemption of the GPGPU-intensive VM.
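The idea of charging GPU time back into the VM scheduler can be sketched as a periodic controller that lowers the CPU-time cap of a VM whose GPGPU usage exceeds its share. The GPU-time source, the cap interface, and the proportional policy below are assumptions, not the paper's mechanism.

```python
# Hedged sketch: periodically read each VM's GPU busy time and shrink the CPU
# cap of VMs that consumed more than their fair share of GPU time, so a
# GPGPU-heavy VM cannot also dominate the CPU scheduler. gpu_busy_ms() and
# set_cpu_cap() are assumed hooks, not a real hypervisor API.
def rebalance(vms, period_ms, gpu_busy_ms, set_cpu_cap):
    fair_share = period_ms / len(vms)
    for vm in vms:
        overuse = max(0.0, gpu_busy_ms(vm) - fair_share)
        # Reduce the CPU cap proportionally to GPU overuse (floor at 20%).
        cap_percent = max(20, 100 - int(100 * overuse / period_ms))
        set_cpu_cap(vm, cap_percent)

# Example with stubbed-out measurement and enforcement hooks:
usage = {"vm1": 900.0, "vm2": 50.0, "vm3": 50.0}   # GPU busy ms in the last second
rebalance(list(usage), 1000.0, usage.get,
          lambda vm, cap: print(f"{vm}: cpu cap {cap}%"))
```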

Proceedings ArticleDOI
28 Mar 2018
TL;DR: This paper shows that an operating system-like management layer for modules in a function-based data plane can offer OS-like constructs such as performance and memory isolation and uses new Intel CPU extensions to create coarse-grained heap and stack protection.
Abstract: Existing software dataplanes that run network functions inside VMs or containers can provide either performance (by dedicating CPU cores) or multiplexing (by context switching), but not both at once. Function-based dataplane architectures, by replacing VMs and containers with function calls, promise to achieve multiplexing and performance at the same time. However, they compromise memory isolation between tenants by forcing them to use a shared memory address space. In this paper, we show that an operating system-like management layer for modules in a function-based data plane can offer OS-like constructs such as performance and memory isolation. To provide memory isolation, we leverage new Intel CPU extensions (MPX) to create coarse-grained heap and stack protection even for legacy code written in unsafe native languages such as C. In addition, we use programmable NIC offloads to distribute load across cores as well as to prevent batch fragmentation when processing complex service graphs. Our preliminary evaluation shows the limitations of existing techniques that require heavyweight memory isolation or incur cross-core overheads.

Proceedings ArticleDOI
01 Dec 2018
TL;DR: XOS is presented, an application-defined OS for modern DC servers that leverages modern hardware support for virtualization to move resource management functionality out of the conventional kernel and into user space, which lets applications achieve near bare-metal performance.
Abstract: The rapid growth of datacenter (DC) scale, the urgency of cost control, increasing workload diversity, and the protection of huge software investments place unprecedented demands on operating system (OS) efficiency, scalability, performance isolation, and backward compatibility. Traditional OSes are not built to work with the deep-hierarchy software stacks, large numbers of cores, tail latency guarantees, and increasingly rich variety of applications seen in modern DCs, and thus they struggle to meet the demands of such workloads. This paper presents XOS, an application-defined OS for modern DC servers. Our design moves resource management out of the OS kernel, supports customizable kernel subsystems in user space, and enables elastic partitioning of hardware resources. Specifically, XOS leverages modern hardware support for virtualization to move resource management functionality out of the conventional kernel and into user space, which lets applications achieve near bare-metal performance. We implement XOS on top of Linux to provide backward compatibility. XOS speeds up a set of DC workloads by up to 1.6× over our baseline Linux on a 24-core server, and outperforms the state-of-the-art Dune by up to 3.3× in terms of virtual memory management. In addition, XOS demonstrates good scalability and strong performance isolation.

Proceedings ArticleDOI
11 Jun 2018
TL;DR: The scheduler adds hard real-time threads both in their classic, individual form, and in a group form in which a group of parallel threads execute in near lock-step using only scalable, per-hardware-thread scheduling.
Abstract: High performance parallel computing demands careful synchronization, timing, performance isolation and control, as well as the avoidance of OS and other types of noise. The employment of soft real-time systems toward these ends has already shown considerable promise, particularly for distributed memory machines. As processor core counts grow rapidly, a natural question is whether similar promise extends to the node. To address this question, we present the design, implementation, and performance evaluation of a hard real-time scheduler specifically for high performance parallel computing on shared memory nodes built on x64 processors, such as the Xeon Phi. Our scheduler is embedded in a kernel framework that is already specialized for high performance parallel run-times and applications, and that meets the basic requirements needed for a real-time OS (RTOS). The scheduler adds hard real-time threads both in their classic, individual form, and in a group form in which a group of parallel threads execute in near lock-step using only scalable, per-hardware-thread scheduling. On a current generation Intel Xeon Phi, the scheduler is able to handle timing constraints down to resolution of ∼13,000 cycles (∼10 μs), with synchronization to within ∼4,000 cycles (∼3 μs) among 255 parallel threads. The scheduler isolates a parallel group and is able to provide resource throttling with commensurate application performance. We also show that in some cases such fine-grain control over time allows us to eliminate barrier synchronization, leading to performance gains, particularly for fine-grain BSP workloads.

Proceedings ArticleDOI
06 Apr 2018
TL;DR: This scheme allocates resources efficiently and reduces response time compared to static allocation; predicted values are padded with a proper margin and resource caps are raised immediately to avoid underestimation errors.
Abstract: Virtualization is the main technology in large-scale data centers, with which resources are shared among different applications running on different VMs. Virtualization through a virtual machine monitor (VMM) like Xen only provides resource isolation among co-located VMs. However, it has been shown that resource isolation does not imply performance isolation between VMs. Hence, on-demand allocation of the shared physical resources to individual VMs as per their dynamic requirements is necessary to satisfy the SLA between customer and cloud provider. To do this efficiently, future resource utilization is predicted using fuzzy-logic-based prediction. To avoid underestimation errors due to spikes in the workload, the predicted values are padded with a proper margin and resource caps are raised immediately. A resource conflict is resolved locally if resources are available; otherwise, migration is triggered. This scheme allocates resources efficiently and reduces response time as compared to static allocation. The resource saving with the proposed method is around 30-40%, with around 10-20% performance improvement in terms of application response time.
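The predict-pad-raise-cap loop described above can be sketched as follows; the fuzzy predictor itself is replaced by a moving-average placeholder, and the padding factor, cap interface, and migration trigger are illustrative assumptions.

```python
# Sketch of the predict-pad-raise-cap loop from the abstract. The fuzzy-logic
# predictor is replaced by a moving-average placeholder; the 15% pad, the cap
# interface, and the migration trigger are illustrative assumptions.
PAD = 0.15                       # safety margin against underestimation

def predict_next(history):
    # Placeholder for the fuzzy-logic predictor: average of recent samples.
    recent = history[-3:]
    return sum(recent) / len(recent)

def control_step(vm, history, host_free, set_cap, trigger_migration):
    predicted = predict_next(history)
    cap = predicted * (1 + PAD)          # pad, then raise the cap immediately
    if cap - history[-1] <= host_free:   # extra demand fits on this host
        set_cap(vm, cap)
        return cap
    trigger_migration(vm)                # conflict cannot be resolved locally
    return None

cpu_history = [30.0, 42.0, 55.0]         # % CPU utilization samples
control_step("vm-12", cpu_history, host_free=40.0,
             set_cap=lambda vm, c: print(f"{vm}: cap -> {c:.1f}%"),
             trigger_migration=lambda vm: print(f"{vm}: migrate"))
```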

Proceedings ArticleDOI
01 Sep 2018
TL;DR: This paper aims to propose a model for Radio Resource Management among multiple Virtual Network Operators (VNOs) to address all their Service Level Agreements (SLAs) independently, as well as satisfying their customised service requirements to the highest achievable level.
Abstract: One of the major potential benefits in wireless network virtualisation is enabling on-demand sharing of resources among different tenants in an isolated manner. This paper aims to propose a model for Radio Resource Management (RRM) among multiple Virtual Network Operators (VNOs) to address all their Service Level Agreements (SLAs) independently, as well as satisfying their customised service requirements to the highest achievable level. Performance isolation is attained by realising a centralised virtualisation platform called Virtual-RRM. This entity is responsible for sharing the total capacity obtained by aggregation of all the radio resource units from different access technologies, according to the service demands of the VNOs and considering service priorities. In order to evaluate model performance, a practical scenario with 3 types of SLA contracts is proposed, while all VNOs are in charge of providing the 4 service classes to their users. Results under different traffic loads and network parameters show that all VNOs' SLAs are satisfied to the maximum feasible point, 100% of the available capacity is utilised independently from the variation of network parameters, and when possible capacity is shared among the different services according to the concept of proportional fairness.
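The capacity-sharing step described above (serve each VNO's guaranteed demand first, then split the remainder in proportion to weights, capped by demand) can be sketched as a small water-filling routine. The weights and SLA fields are illustrative assumptions, not the paper's exact model.

```python
# Sketch of sharing an aggregated capacity pool among VNOs: guaranteed demand
# is served first, the remainder is split in proportion to each VNO's weight
# and capped by its residual demand, and freed capacity is redistributed.
def share_capacity(total_mbps, vnos):
    alloc = {v["name"]: min(v["guaranteed"], v["demand"]) for v in vnos}
    remaining = total_mbps - sum(alloc.values())
    active = [v for v in vnos if v["demand"] > alloc[v["name"]]]
    while remaining > 1e-9 and active:
        weight_sum = sum(v["weight"] for v in active)
        leftover = 0.0
        for v in active:
            share = remaining * v["weight"] / weight_sum
            grant = min(share, v["demand"] - alloc[v["name"]])
            alloc[v["name"]] += grant
            leftover += share - grant
        remaining = leftover
        active = [v for v in active if v["demand"] > alloc[v["name"]] + 1e-9]
    return alloc

vnos = [{"name": "GB", "guaranteed": 100, "weight": 3, "demand": 400},
        {"name": "BG", "guaranteed": 50,  "weight": 2, "demand": 120},
        {"name": "BE", "guaranteed": 0,   "weight": 1, "demand": 300}]
print(share_capacity(600, vnos))   # all 600 Mbps end up allocated
```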

Proceedings ArticleDOI
01 Nov 2018
TL;DR: OC-Cache is proposed, an open-channel SSD cache framework which utilizes the SSD's internal parallelism to adaptively allocate cache to tenants for both good performance isolation and high SSD utilization.
Abstract: In a multi-tenant cloud environment, tenants are usually hosted by virtual machines. Cloud providers deploy multiple virtual machines on a physical server to better utilize physical resources including CPU, memory, and storage devices. SSDs are often used as an I/O cache shared among tenants in large storage systems that use hard disk drives (HDDs) as their main storage devices, obtaining much of the SSD's performance benefit and the HDD's cost advantage. A key challenge in the use of the shared cache is to ensure strong performance isolation and maintain its high utilization at the same time. However, conventional SSD cache management approaches cannot effectively address this challenge. In this paper, we propose OC-Cache, an open-channel SSD cache framework which utilizes the SSD's internal parallelism to adaptively allocate cache to tenants for both good performance isolation and high SSD utilization. In particular, OC-Cache uses a tenant's miss ratio curve to determine the amount of cache space to allocate and where to allocate it (in dedicated or shared SSD channels), and dynamically manages cache space according to workload characteristics. Experiments show that OC-Cache significantly reduces interference among tenants, and maintains high utilization of the SSD cache.
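The miss-ratio-curve-driven allocation step can be sketched as a greedy loop that gives the next cache unit to whichever tenant's MRC predicts the largest miss-ratio drop. The MRC data, unit granularity, and omission of channel placement are illustrative simplifications.

```python
# Greedy miss-ratio-curve (MRC) allocation sketch: each cache unit goes to the
# tenant whose MRC predicts the biggest marginal drop in miss ratio. Dedicated
# vs. shared channel placement is ignored here; MRC values are made up.
def allocate_cache(total_units, mrcs):
    """mrcs: {tenant: [miss_ratio_with_0_units, with_1, with_2, ...]}"""
    alloc = {t: 0 for t in mrcs}
    for _ in range(total_units):
        def gain(t):
            cur = alloc[t]
            if cur + 1 >= len(mrcs[t]):
                return 0.0
            return mrcs[t][cur] - mrcs[t][cur + 1]
        best = max(mrcs, key=gain)
        if gain(best) <= 0.0:
            break                      # no tenant benefits from more cache
        alloc[best] += 1
    return alloc

mrcs = {"tenant_a": [0.9, 0.5, 0.3, 0.28, 0.28],
        "tenant_b": [0.6, 0.55, 0.50, 0.45, 0.40]}
print(allocate_cache(4, mrcs))        # tenant_a gets the first units
```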