
Showing papers on "Service level objective published in 2021"


Journal ArticleDOI
TL;DR: In this article, a hybrid simulation and container orchestration framework is proposed to optimize Quality of Service (QoS) parameters in large-scale fog platforms, driven by a gradient-based optimization strategy that back-propagates gradients with respect to the input.
Abstract: Intelligent task placement and management of tasks in large-scale fog platforms is challenging due to the highly volatile nature of modern workload applications and sensitive user requirements of low energy consumption and response time. Container orchestration platforms have emerged to alleviate this problem with prior art either using heuristics to quickly reach scheduling decisions or AI driven methods like reinforcement learning and evolutionary approaches to adapt to dynamic scenarios. The former often fail to quickly adapt in highly dynamic environments, whereas the latter have run-times that are slow enough to negatively impact response time. Therefore, there is a need for scheduling policies that are both reactive to work efficiently in volatile environments and have low scheduling overheads. To achieve this, we propose a Gradient Based Optimization Strategy using Back-propagation of gradients with respect to Input (GOBI). Further, we leverage the accuracy of predictive digital-twin models and simulation capabilities by developing a Coupled Simulation and Container Orchestration Framework (COSCO). Using this, we create a hybrid simulation driven decision approach, GOBI*, to optimize Quality of Service (QoS) parameters. Co-simulation and the back-propagation approaches allow these methods to adapt quickly in volatile environments. Experiments conducted using real-world data on fog applications using the GOBI and GOBI* methods, show a significant improvement in terms of energy consumption, response time, Service Level Objective and scheduling time by up to 15, 40, 4, and 82 percent respectively when compared to the state-of-the-art algorithms.
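As a rough illustration of the back-propagation-to-input idea behind GOBI, the sketch below optimizes a soft task-to-host placement by descending the gradient of a differentiable QoS surrogate with respect to the placement itself; the surrogate network, its dimensions, and the discretization step are hypothetical stand-ins, not the trained digital-twin model from the paper.

```python
import torch

n_tasks, n_hosts = 4, 3

# Differentiable QoS surrogate: maps a flattened placement matrix to a scalar cost
# (e.g., a weighted mix of predicted energy and response time). Untrained here.
surrogate = torch.nn.Sequential(
    torch.nn.Linear(n_tasks * n_hosts, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)

# Soft task-to-host placement, optimized directly through gradients w.r.t. the input.
logits = torch.zeros(n_tasks, n_hosts, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(50):
    placement = torch.softmax(logits, dim=1)        # each task's host probabilities
    cost = surrogate(placement.flatten()).squeeze() # predicted QoS cost
    opt.zero_grad()
    cost.backward()                                 # back-propagate w.r.t. the decision
    opt.step()

schedule = torch.softmax(logits, dim=1).argmax(dim=1)  # discretize: one host per task
print(schedule.tolist())
```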

33 citations


Proceedings ArticleDOI
01 Sep 2021
TL;DR: In this paper, the authors present SLO Script, a language and accompanying framework, motivated by real-world, industrial needs, to define complex, high-level SLOs in an orchestrator-independent manner.
Abstract: Service Level Objectives (SLOs) allow defining expected performance of cloud services, such that cloud service providers know what they guarantee and service consumers know what to expect. Most approaches focus on low-level SLOs, closely related to resources, e.g., average CPU or memory usage, and are usually bound to specific elasticity controllers. We present SLO Script, a language and accompanying framework, motivated by real-world, industrial needs to allow service providers to define complex, high-level SLOs in an orchestrator-independent manner. The main features of SLO Script include: i) novel abstractions (StronglyTypedSLO) with type safety features, ensuring compatibility between SLOs and elasticity strategies, ii) abstractions that enable decoupling of SLOs from elasticity strategies, iii) a strongly typed metrics API, and iv) an orchestrator-independent object model that enables language extensibility. We present a case study about a real-world, cloud-native application and evaluate our language while implementing a realistic Cost Efficiency SLO.
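The StronglyTypedSLO idea can be pictured with a small typed sketch: an SLO produces a typed output, and only elasticity strategies declared to accept that output type can be bound to it. The Python names and types below are illustrative stand-ins, not the SLO Script API.

```python
from dataclasses import dataclass
from typing import Protocol, TypeVar

T = TypeVar("T", contravariant=True)

@dataclass
class SloCompliance:
    """Generic SLO output: 100 means exactly on target."""
    current_compliance_pct: float

class ElasticityStrategy(Protocol[T]):
    def apply(self, slo_output: T) -> None: ...

@dataclass
class CostEfficiencySlo:
    target_requests_per_dollar: float

    def evaluate(self, observed_requests_per_dollar: float) -> SloCompliance:
        ratio = observed_requests_per_dollar / self.target_requests_per_dollar
        return SloCompliance(current_compliance_pct=100 * ratio)

class HorizontalScaleStrategy:
    def apply(self, slo_output: SloCompliance) -> None:
        # A strategy only sees the SLO's typed output, not the SLO itself.
        print("scale out" if slo_output.current_compliance_pct < 100 else "scale in or hold")

def bind(slo_output: SloCompliance, strategy: ElasticityStrategy[SloCompliance]) -> None:
    strategy.apply(slo_output)  # a type checker rejects strategies with incompatible inputs

bind(CostEfficiencySlo(target_requests_per_dollar=5.0).evaluate(4.2), HorizontalScaleStrategy())
```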

22 citations


Proceedings ArticleDOI
01 Sep 2021
TL;DR: In this paper, the authors present a middleware that provides an orchestrator-independent SLO controller for periodically evaluating SLOs and triggering elasticity strategies, while decoupling SLOs from elasticity strategies to increase flexibility, and provider-independent services for obtaining low-level metrics and composing them into higher-level metrics.
Abstract: Service Level Objectives (SLOs) guide the elasticity of cloud applications, e.g., by deciding when and how much the resources provisioned to an application should be changed. Evaluating SLOs requires metrics, which can be directly measured on the application or system, or, more elaborately, be composed from multiple low-level metrics. The implementation of such metrics and SLOs, the triggering of elasticity strategies, and allowing configurability by the user deploying an application, requires a flexible middleware. In this paper, we present a middleware that provides an orchestrator-independent SLO controller for periodically evaluating SLOs and triggering elasticity strategies, while decoupling SLOs from the elasticity strategies to increase flexibility, and provider-independent services for obtaining low-level metrics and composing them into higher-level metrics. We evaluate our middleware by implementing a motivating use case, featuring a cost efficiency SLO for an application deployed on Kubernetes.
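A minimal sketch of such a periodic SLO control loop is shown below, assuming hypothetical metric sources and a placeholder elasticity trigger; it only illustrates the evaluate-then-trigger cycle, not the middleware's actual services.

```python
import random
import time

def cpu_cost_per_hour() -> float:          # stand-in for a provider metrics API
    return random.uniform(0.5, 2.0)

def requests_per_second() -> float:        # stand-in for an application-level metric
    return random.uniform(50, 300)

def cost_efficiency() -> float:
    """Composed higher-level metric: requests served per dollar per hour."""
    return requests_per_second() * 3600 / cpu_cost_per_hour()

def scale_out() -> None:
    print("triggering elasticity strategy: add one replica")

def slo_controller(target_efficiency: float, interval_s: float, iterations: int) -> None:
    for _ in range(iterations):
        observed = cost_efficiency()
        if observed < target_efficiency:
            scale_out()                     # SLO violated: hand off to the elasticity strategy
        else:
            print(f"SLO satisfied ({observed:,.0f} req/$)")
        time.sleep(interval_s)

slo_controller(target_efficiency=400_000, interval_s=0.1, iterations=5)
```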

19 citations


Journal ArticleDOI
TL;DR: An abstract model of CGs is described, together with a technique that leverages constraint satisfaction problem solvers to validate them automatically, paving the way for using CGs in a safer and more reliable way.
Abstract: A Service Level Agreement (SLA) regulates the provisioning of a service by defining a set of guarantees. Each guarantee sets a Service Level Objective (SLO) on some service metrics, and optionally a compensation that is applied when the SLO is unfulfilled or overfulfilled. Currently, there are software tools and research proposals that use the information about compensations to automate and optimise certain parts of the service management. However, they assume that compensations are well defined, which is too optimistic in some circumstances and can lead to undesirable situations. In this article we discuss the notion of validity of guarantees with a compensation, which we refer to as compensable guarantees (CG). We describe an abstract model of CGs and we provide a technique that leverages constraint satisfaction problem solvers to automatically validate them. We also present a materialisation of the model of CGs in iAgree, a language to specify SLAs, and tooling support that implements our whole approach. An assessment over 319 CGs taken from 24 real-world SLAs suggests that the expressiveness and effectiveness of our proposal can pave the way for using CGs in a safer and more reliable way.
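The validation idea can be approximated with a brute-force check over a discretized metric domain, as in the sketch below; the guarantee (an availability SLO with piecewise penalties) is invented for illustration, and a real constraint solver would replace the exhaustive loop.

```python
SLO_THRESHOLD = 99.0            # availability (%) the provider promises

def compensation(availability: float) -> float | None:
    """Penalty (% of monthly fee) owed for a given observed availability."""
    if availability >= 99.0:
        return 0.0
    if 95.0 <= availability < 99.0:
        return 10.0
    if availability < 95.0:
        return 25.0
    return None                  # unreachable here; a gap in the definition would show up as None

def validate(domain: range) -> list[str]:
    problems = []
    for value in domain:         # exhaustive check over the discretized metric domain
        c = compensation(float(value))
        if c is None:
            problems.append(f"no compensation defined for availability={value}")
        elif value >= SLO_THRESHOLD and c > 0:
            problems.append(f"penalty charged although SLO fulfilled at {value}")
    return problems

print(validate(range(0, 101)) or "guarantee looks consistent")
```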

13 citations


Journal ArticleDOI
TL;DR: A witness model implemented with smart contracts is proposed to solve the trust issue between cloud customer and provider; three types of malicious behavior are defined, along with quantitative indicators to audit and detect them.
Abstract: Trust is lacking between the cloud customer and provider when enforcing a traditional cloud SLA (Service Level Agreement), and the blockchain technique seems a promising solution. However, current explorations still face challenges in proving that off-chain SLO (Service Level Objective) violations really happened before they are recorded in on-chain transactions. In this paper, a witness model implemented with smart contracts is proposed to solve this trust issue. The introduced role, “Witness”, gains rewards as an incentive for reporting SLO violations, and the payoff function is carefully designed so that the witness has to tell the truth to maximize its rewards. The fact that the witness has to be honest is analyzed and proved using the Nash Equilibrium principle of game theory. To ensure the chosen witnesses are random and independent, an unbiased selection algorithm is proposed to avoid possible collusion. An auditing mechanism is also introduced to detect potentially malicious witnesses. Specifically, we define three types of malicious behavior and propose quantitative indicators to audit and detect these behaviors. Moreover, experimental studies based on the Ethereum blockchain demonstrate that the proposed model is feasible, and indicate that the performance, i.e., the transaction fee, of each interface follows the design expectations.
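The truth-telling incentive can be illustrated with a toy payoff calculation, as sketched below; the reward, penalty, and probability values are invented, and the payoff shape is only a simplified stand-in for the paper's carefully designed function.

```python
REWARD_MATCH = 10        # reward when a witness's report matches the majority verdict
PENALTY_MISMATCH = -12   # deposit lost when it does not
P_VIOLATION = 0.3        # probability that a violation truly occurred

def payoff(my_report: bool, others_truthful: int, violation: bool) -> float:
    # With at least one other truthful witness, the majority verdict equals the truth.
    majority_says_violation = violation if others_truthful >= 1 else my_report
    return REWARD_MATCH if my_report == majority_says_violation else PENALTY_MISMATCH

def expected_payoff(always_report_violation) -> float:
    """None = report truthfully; True/False = always report that value regardless of truth."""
    total = 0.0
    for violation in (True, False):
        p = P_VIOLATION if violation else 1 - P_VIOLATION
        report = violation if always_report_violation is None else always_report_violation
        total += p * payoff(report, others_truthful=2, violation=violation)
    return total

for strategy, label in [(None, "truthful"), (True, "always report"), (False, "never report")]:
    print(f"{label:>13}: expected payoff = {expected_payoff(strategy):+.1f}")
```

Under these toy numbers, the truthful strategy dominates both constant-reporting strategies, which is the intuition the Nash Equilibrium analysis formalizes.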

12 citations


Journal ArticleDOI
TL;DR: A novel scheduling algorithm called Bottleneck and Cost Value Scheduling (BCVS) is proposed, coupled with a novel dynamic data replication strategy called Correlation and Economic Model-based Replication (CEMR), to improve data access effectiveness and meet service level objectives.
Abstract: Task scheduling and data replication are highly coupled resource management techniques that are widely used by cloud providers to improve the overall system performance and ensure service level agreement (SLA) compliance while preserving their own economic profit. However, balancing the trade-off between system performance and provider profit is very challenging. In this paper, we propose a novel scheduling algorithm called Bottleneck and Cost Value Scheduling (BCVS) coupled with a novel dynamic data replication strategy called Correlation and Economic Model-based Replication (CEMR). The main goal is to improve data access effectiveness in order to meet service level objectives in terms of response time (SLO_RT) and minimum availability (SLO_MA), while preserving the provider profit. The BCVS algorithm focuses on reducing system bottleneck situations caused by data transfer, while the CEMR focuses on preventing future SLA violations and guaranteeing a minimum availability. An economic model is also proposed to estimate the cloud provider profit. Simulation results indicate that the proposed combination of scheduling and replication algorithms offers higher monetary profit for the cloud provider, by up to 30% compared to existing strategies. Moreover, it delivers better performance.

12 citations


Journal ArticleDOI
TL;DR: An in-depth empirical investigation into the scalability of the proposed system is carried out in order to address the challenge of transparently enforcing real-time monitoring of cloud-hosted services leveraging blockchain technology.
Abstract: Cloud computing is an important technology for businesses and individual users to obtain computing resources over the Internet on demand and flexibly. Although cloud computing has been adopted across diverse applications, the owners of time- and performance-critical applications require cloud service providers' guarantees about their services, such as availability and response times. Service Level Agreements (SLAs) are a mechanism to communicate and enforce such guarantees, typically represented as service level objectives (SLOs), and financial penalties are imposed on SLO violations. Due to delays and inaccuracies caused by manual processing, an automatic method to periodically verify SLA terms in a transparent and trustworthy manner is fundamental to effective SLA monitoring, leading to the acceptance and credibility of such a service by the customers of cloud services. This paper presents a blockchain-based distributed infrastructure that leverages fundamental blockchain properties to achieve immutable and trustworthy SLA monitoring within cloud services. The paper carries out an in-depth empirical investigation into the scalability of the proposed system in order to address the challenge of transparently enforcing real-time monitoring of cloud-hosted services leveraging blockchain technology. This will enable all the stakeholders to enforce accurate execution of the SLA without any imprecision or delay by maintaining an immutable ledger publicly across the blockchain network. The experimentation takes into consideration several attributes of blockchain which are critical in achieving optimum performance. The paper also investigates key characteristics of these factors and their impact on the behaviour of the system when scaling it up further under various cases of increased service utilization.

10 citations


Journal ArticleDOI
TL;DR: CEDULE+ is a data-driven framework that enables efficient resource management for burstable cloud instances by analyzing the system workload and latency data and is evaluated on Amazon EC2, where its efficiency and high accuracy are assessed through real-case scenarios.
Abstract: Nearly all principal cloud providers now provide burstable instances in their offerings. The main attraction of this type of instance is that it can boost its performance for a limited time to cope with workload variations. Although burstable instances are widely adopted, it is not clear how to efficiently manage them to avoid waste of resources. In this article, we use predictive data analytics to optimize the management of burstable instances. We design CEDULE+, a data-driven framework that enables efficient resource management for burstable cloud instances by analyzing the system workload and latency data. CEDULE+ selects the most profitable instance type to process incoming requests and controls CPU, I/O, and network usage to minimize the resource waste without violating Service Level Objectives (SLOs). CEDULE+ uses lightweight profiling and quantile regression to build a data-driven prediction model that estimates system performance for all combinations of instance type, resource type, and system workload. CEDULE+ is evaluated on Amazon EC2, and its efficiency and high accuracy are assessed through real-case scenarios. CEDULE+ predicts application latency with errors less than 10%, extends the maximum performance period of a burstable instance up to 2.4 times, and decreases deployment costs by more than 50%.
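The prediction step can be sketched with an off-the-shelf quantile regressor, as below; the synthetic workload data and the 95th-percentile latency target are illustrative assumptions, whereas CEDULE+ profiles live systems.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
instance_type = rng.integers(0, 3, n)          # e.g. three burstable instance sizes
cpu_quota = rng.uniform(0.2, 1.0, n)           # fraction of a vCPU available
req_rate = rng.uniform(10, 500, n)             # requests per second
latency_ms = (5
              + req_rate / (cpu_quota * (instance_type + 1) * 120) * 40
              + rng.exponential(3, n))         # synthetic latency with a heavy tail

X = np.column_stack([instance_type, cpu_quota, req_rate])
# Quantile regression at the 95th percentile: predicts tail latency, not the mean.
p95_model = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, latency_ms)

# Check a candidate configuration against a latency SLO before committing to it.
candidate = np.array([[1, 0.5, 300.0]])        # instance type 1, half a vCPU, 300 req/s
predicted_p95 = p95_model.predict(candidate)[0]
print(f"predicted p95 latency: {predicted_p95:.1f} ms",
      "OK" if predicted_p95 < 60 else "violates SLO")
```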

9 citations


Proceedings ArticleDOI
01 Nov 2021
TL;DR: Parslo as discussed by the authors proposes a Gradient Descent-based approach to assign partial SLOs among nodes in a microservice graph under an end-to-end latency SLO.
Abstract: Modern cloud services are implemented as graphs of loosely-coupled microservices to improve programmability, reliability, and scalability. Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction. In such environments, each microservice is independently deployed and (auto-)scaled. However, it is unclear how to optimally scale individual microservices when end-to-end SLOs are violated or underutilized, and how to size each microservice to meet the end-to-end SLO at minimal total cost. In this paper, we propose Parslo---a Gradient Descent-based approach to assign partial SLOs among nodes in a microservice graph under an end-to-end latency SLO. At a high level, the Parslo algorithm breaks the end-to-end SLO budget into small incremental "SLO units", and iteratively allocates one marginal SLO unit to the best candidate microservice to achieve the highest total cost savings until the entire end-to-end SLO budget is exhausted. Parslo achieves a near-optimal solution, seeking to minimize the total cost for the entire service deployment, and is applicable to general microservice graphs that comprise patterns like dynamic branching, parallel fan-out, and microservice dependencies. Parslo reduces service deployment costs by more than 6x in real microservice-based applications, compared to a state-of-the-art partial SLO assignment scheme.
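The incremental allocation loop can be sketched as follows; the per-service cost curves are hypothetical, while in Parslo they would come from measured scaling behaviour.

```python
def cost(service: str, budget_ms: float) -> float:
    """Deployment cost ($/h) as a function of the latency budget granted to a service."""
    base = {"frontend": 40.0, "search": 90.0, "checkout": 60.0}[service]
    return base / max(budget_ms, 1.0)           # more budget -> fewer replicas -> cheaper

def assign_partial_slos(e2e_slo_ms: float, unit_ms: float = 1.0) -> dict[str, float]:
    services = ["frontend", "search", "checkout"]
    budgets = {s: unit_ms for s in services}     # start every service with a minimal budget
    remaining = e2e_slo_ms - sum(budgets.values())
    while remaining >= unit_ms:
        # marginal cost saving of granting one more SLO unit to each candidate
        savings = {s: cost(s, budgets[s]) - cost(s, budgets[s] + unit_ms) for s in services}
        best = max(savings, key=savings.get)     # give the unit where it saves the most
        budgets[best] += unit_ms
        remaining -= unit_ms
    return budgets

print(assign_partial_slos(e2e_slo_ms=100.0))
```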

9 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a QoE-oriented cloud service orchestration algorithm that can guide ASPs on how to plan their budget to deliver satisfactory QoE to end-users.
Abstract: New virtualization technologies allow Infrastructure Providers (InPs) to lease their resources to Application Service Providers (ASPs) for highly scalable delivery of cloud services to end-users. However, existing literature lacks knowledge on Quality of Experience (QoE)-oriented cloud service orchestration algorithms that can guide ASPs on how to plan their budget to enhance satisfactory QoE delivery to end-users. In contrast to the InP’s cloud service orchestration, the ASP’s orchestration should not rely on expensive infrastructure control mechanisms such as Software-Defined Networking (SDN), or require a priori knowledge of the number of services to be instantiated and their anticipated placement location within the InP’s infrastructure. In this paper, we address this issue of delivering satisfactory user QoE by synergistically optimizing both the ASP’s management and data planes. The optimization within the ASP management plane first maximizes Service Level Objective (SLO) coverage of users when application services are being deployed and are not yet operational. The optimization of the ASP data plane then enhances satisfactory user QoE delivery when application services are operational with real user access. Our evaluation of QoE-oriented algorithms using realistic numerical simulations, real-world cloud testbed experiments with actual users, and ASP case studies shows notably improved performance over existing cloud service orchestration solutions.

7 citations


Proceedings ArticleDOI
01 Nov 2021
Abstract: Resource management for geo-distributed infrastructures is challenging due to the scarcity and non-uniformity of edge resources, as well as the high client mobility and workload surges inherent to situation awareness applications. Due to their centralized nature, state-of-the-art schedulers that work well in datacenters lack the performance and feature requirements of such applications. We present OneEdge, a hybrid control plane that enables autonomous decision-making at edge sites for localized, rapid single-site application deployment. Edge sites handle mobility, churn, and load spikes by cooperating with a centralized controller that allows coordinated multi-site scheduling and dynamic reconfiguration. OneEdge's scheduling decisions are driven by each application's end-to-end service level objective (E2E SLO) as well as the specific requirements of situation awareness applications. OneEdge's novel distributed state management combines autonomous decision-making at the edge sites for rapid localized resource allocations with decision-making at the central controller when multi-site application deployment is needed. Using a mix of applications on multi-region Azure instances, we show that, in contrast to centralized or fully distributed control planes, OneEdge caters to the unique requirements of situation awareness applications. Compared to a centralized control plane, OneEdge reduces deployment latency by 66% for single-site applications, without compromising E2E SLOs.

Journal ArticleDOI
TL;DR: In this article, the authors discuss the challenges encountered by network orchestrators in allocating resources to disparate 5G network slices, and propose the use of artificial intelligence to make core placement and scaling decisions that meet the requirements of network slices deployed on shared infrastructure.
Abstract: Network slicing enables communication service providers to partition physical infrastructure into logically independent networks. Network slices must be provisioned to meet the service-level objectives (SLOs) of disparate offerings, such as enhanced mobile broadband, ultrareliable low-latency communications, and massive machine-type communications. Network orchestrators must customize service placement and scaling to achieve the SLO of each network slice. In this article, we discuss the challenges encountered by network orchestrators in allocating resources to disparate 5G network slices, and propose the use of artificial intelligence to make core placement and scaling decisions that meet the requirements of network slices deployed on shared infrastructure. We explore how artificial intelligence-driven scaling algorithms, coupled with functionality-aware placement, can enable providers to design closed-loop solutions to meet the disparate SLOs of future network slices.

Proceedings ArticleDOI
01 Sep 2021
TL;DR: In this article, a lightweight fault localization system is described that establishes causal relationships among golden-signal service errors and error logs, and leverages PageRank centrality of the derived causal graph to generate a ranked list of faulty microservices.
Abstract: In cloud-native applications, a large fraction of operational failures, known as outages, result in violations of Service Level Objectives (SLOs). SLOs are defined around specific measurable characteristics: availability, throughput, frequency, response time, and quality. Four metrics, latency, traffic, errors, and saturation, ensure coverage for most outages of an application. These are often called golden signals. The dynamicity and complexity of cloud-native applications complicate Site Reliability Engineers’ (SREs) efforts in problem determination, in particular in fault localization. Fault localization is often a trial-and-error process in which SREs rely on their domain knowledge and experience. It is laborious and frequently results in long Mean Time To Resolution (MTTR) for outages. This paper describes a lightweight fault localization system that establishes causal relationships among the golden-signal service errors and error logs, and further leverages PageRank centrality of the derived causal graph for generating a ranked list of faulty microservices.
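The ranking step can be sketched with a small causal graph and PageRank centrality, as below; the services, edges, and edge-direction convention are invented for illustration rather than derived from real error logs.

```python
import networkx as nx

# Edge A -> B read as "errors observed at A are causally explained by B".
causal_graph = nx.DiGraph([
    ("frontend", "cart"),
    ("frontend", "catalog"),
    ("cart", "database"),
    ("catalog", "database"),
    ("checkout", "database"),
])

scores = nx.pagerank(causal_graph, alpha=0.85)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for service, score in ranked:
    print(f"{service:>10}: {score:.3f}")   # 'database' ranks first as the likely root cause
```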

Journal ArticleDOI
22 Mar 2021
TL;DR: In this article, the authors examine how human resources in the Maltese public service adopt new work practices in response to COVID-19 public health measures during the first wave of the pandemic.
Abstract: This study examines how human resources in the Maltese Public Service adopt new work practices in response to COVID-19 public health measures during the first wave of the pandemic. We analyze the data we collected through seven focus group discussions and ten in-depth interviews with Public Service employees and managers in a diversity of ministries and roles. Our study reveals that Public Service policies promoting remote working relied exclusively on the service's IT infrastructure. However, the ability to respond to customer needs effectively in a time of surging demand relied entirely on employees' effective access to responsive and efficient ICT support, as well as on employees' prior experience with remote work modes and their predisposition to change to remote working. Adopting remote working modes uncovered inherent weaknesses in the Public Service IT infrastructure that put additional strain on the Government's centralized IT support function, especially when Public Service employees adopted tools not supported by the centralized IT support. In circumstances where centralized IT support was ineffective, Public Service employees relied on their own knowledge resources, which they informally shared in groups of practice, or employed operant resources (or tacit knowledge) to achieve service level objectives. These observations suggest that in times when organizations respond to immediate and unprecedented change, human resources seek to adapt by relying on tacit knowledge that is shared among people in known (often informal) groups of people with a common interest or role.

Posted Content
TL;DR: In this article, a composable Just in Time Architecture for Data Science (DS) Pipelines named JITA-4DS is proposed, along with associated resource management techniques for configuring disaggregated data centers (DCs).
Abstract: This paper proposes a composable "Just in Time Architecture" for Data Science (DS) Pipelines named JITA-4DS and associated resource management techniques for configuring disaggregated data centers (DCs). DCs under our approach are composable based on vertical integration of the application, middleware/operating system, and hardware layers, customized dynamically to meet application Service Level Objectives (SLO-application-aware management). Thereby, pipelines utilize a set of flexible building blocks that can be dynamically and automatically assembled and reassembled to meet the dynamic changes in the workload's SLOs. To assess disaggregated DCs, we study how to model and validate their performance in large-scale settings.

Journal ArticleDOI
TL;DR: This article proposes an approach that continuously makes elastic deployment plans aimed at optimizing cost and performance, even during adaptation processes, to meet service level objectives (SLOs) at lower costs.
Abstract: Containers such as Docker provide a lightweight virtualization technology. They have gained popularity in developing, deploying and managing applications in and across Cloud platforms. Container management and orchestration platforms such as Kubernetes run application containers in virtual clusters that abstract the overheads in managing the underlying infrastructures to simplify the deployment of container solutions. These platforms are well suited for modern web applications that can give rise to geographic fluctuations in use based on the location of users. Such fluctuations often require dynamic global deployment solutions. A key issue is to decide how to adapt the number and placement of clusters to maintain performance, whilst incurring minimum operating and adaptation costs. Manual decisions are naive and can give rise to: over-provisioning and hence cost issues; improper placement and performance issues, and/or unnecessary relocations resulting in adaptation issues. Elastic deployment solutions are essential to support automated and intelligent adaptation of container clusters in geographically distributed Clouds. In this article, we propose an approach that continuously makes elastic deployment plans aimed at optimizing cost and performance, even during adaptation processes, to meet service level objectives (SLOs) at lower costs. Meta-heuristics are used for cluster placement and adjustment. We conduct experiments on the Australia-wide National eResearch Collaboration Tools and Resources Research Cloud using Docker and Kubernetes. Results show that with only a 0.5 ms sacrifice in SLO for the 95th percentile of response times we are able to achieve up to 44.44% improvement (reduction) in cost compared to a naive over-provisioning deployment approach.
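A toy flavour of the meta-heuristic placement step is sketched below as a simple hill climb over which regions host a cluster; the region costs, latency matrix, and objective weights are made-up values, not the paper's model or experimental setup.

```python
import random

REGIONS = ["melbourne", "sydney", "perth", "brisbane"]
COST_PER_CLUSTER = {"melbourne": 9.0, "sydney": 10.0, "perth": 8.0, "brisbane": 8.5}
# Round-trip latency (ms) from each user population to each candidate region.
LATENCY = {
    "vic_users": {"melbourne": 5, "sydney": 15, "perth": 45, "brisbane": 22},
    "nsw_users": {"melbourne": 15, "sydney": 5, "perth": 50, "brisbane": 14},
    "wa_users":  {"melbourne": 45, "sydney": 50, "perth": 5, "brisbane": 55},
}
LATENCY_WEIGHT = 0.4

def objective(placement: frozenset) -> float:
    """Operating cost plus a weighted penalty for users' latency to their nearest cluster."""
    if not placement:
        return float("inf")
    cost = sum(COST_PER_CLUSTER[r] for r in placement)
    latency = sum(min(lat[r] for r in placement) for lat in LATENCY.values())
    return cost + LATENCY_WEIGHT * latency

def hill_climb(steps: int = 200) -> frozenset:
    current = frozenset(random.sample(REGIONS, 2))
    for _ in range(steps):
        region = random.choice(REGIONS)       # toggle one region in or out of the plan
        candidate = current ^ {region}
        if objective(candidate) < objective(current):
            current = candidate
    return current

best = hill_climb()
print(sorted(best), round(objective(best), 1))
```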

Proceedings ArticleDOI
01 Sep 2021
TL;DR: In this article, a real-time messaging system motivated by Internet-of-Things (IoT) applications is designed and implemented in a cloud environment, and a solution capable of realizing an effective compromise between load distribution decisions and rate limiting is presented.
Abstract: The cloud's flexibility and promise of seamless auto-scaling notwithstanding, its ability to meet service level objectives (SLOs) typically calls for some form of control in resource usage. This seemingly traditional problem gives rise to new challenges in a cloud setting, and in particular a subtle yet significant trade-off involving load-distribution decisions (the distribution of workload across available cloud resources to optimize performance), and rate limiting (the capping of individual workloads to prevent global over-commitment). This paper investigates that trade-off through the design and implementation of a real-time messaging system motivated by Internet-of-Things (IoT) applications, and demonstrates a solution capable of realizing an effective compromise. The paper's contributions are in both explicating the source of this trade-off, and in demonstrating a possible solution.

Journal ArticleDOI
TL;DR: In this paper, a QoS optimization method is designed to obtain a near-optimal QoS solution that balances user satisfaction and provider profit, enabling win-win service applications between service providers and users.
Abstract: Cloud services incur lower costs than traditional self-purchased software and infrastructure thanks to on-demand resource provisioning and the pay-as-you-go model. In the cloud services market, service providers attempt to make more profit from their services, while users hope to choose low-cost services with high quality. The conflict of interest between users and service providers is an important challenge for the booming cloud service market. This paper characterizes this application problem formally based on a utility game model of service providers and users. In the model, QoS is considered as the basis for determining the utilities of both parties. By analyzing the behaviors of users and service providers, we introduce the concept of reputation cost for the first time in the model and find a QoS solution that balances the utilities of users and service providers in service transactions. In such a balance, any change in either party's strategy will result in a loss of utility. A QoS optimization method is then designed to obtain a near-optimal QoS solution for a trade-off between user satisfaction and provider profit. Extensive simulation experiments are conducted to substantiate the effectiveness of our method. The results are applicable to win-win service applications between service providers and users.
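A toy version of the utility-balancing idea is sketched below: user utility grows with QoS and shrinks with price, provider utility earns revenue but pays a quality-dependent operating cost plus a reputation cost for poor QoS, and a grid search finds the QoS level where the two utilities meet. All functional forms and constants are invented for illustration, not the paper's game model.

```python
import numpy as np

PRICE = 10.0

def user_utility(qos: float) -> float:
    return 20.0 * qos - PRICE                       # value of quality minus price paid

def provider_utility(qos: float) -> float:
    operating_cost = 8.0 * qos ** 2                 # better QoS is increasingly expensive
    reputation_cost = 6.0 * max(0.0, 0.5 - qos)     # poor QoS erodes future demand
    return PRICE - operating_cost - reputation_cost

qos_grid = np.linspace(0.1, 1.0, 91)
gaps = [abs(user_utility(q) - provider_utility(q)) for q in qos_grid]
balanced_qos = qos_grid[int(np.argmin(gaps))]       # QoS level where utilities are closest
print(f"balanced QoS level: {balanced_qos:.2f}",
      f"user={user_utility(balanced_qos):.2f}",
      f"provider={provider_utility(balanced_qos):.2f}")
```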

Journal ArticleDOI
TL;DR: A mathematical model is built to derive the upper bound of the acceptable request arrival rate on each server, and a Deadline Guaranteed storage service incorporating three basic algorithms is proposed; experiments show the superior performance of DGCloud compared with previous methods in terms of deadline guarantees and system resource utilization, and the effectiveness of its individual algorithms.
Abstract: More and more organizations move their data and workload to commercial cloud storage systems. However, the multiplexing and sharing of the resources in a cloud storage system present unpredictable data access latency to tenants, which may make online data-intensive applications unable to satisfy their deadline requirements. Thus, it is important for cloud storage systems to provide deadline guaranteed services. In this paper, to meet a current form of service level objective (SLO) that constrains the percentage of each tenant’s data access requests failing to meet its required deadline below a given threshold, we build a mathematical model to derive the upper bound of acceptable request arrival rate on each server. We then propose a Deadline Guaranteed storage service (called DGCloud) that incorporates three basic algorithms. Its deadline-aware load balancing scheme redirects requests and creates replicas to release the excess load of each server beyond the derived upper bound. Its workload consolidation algorithm tries to maximally reduce servers while still satisfying the SLO to maximize the resource utilization. Its data placement optimization algorithm re-schedules the data placement to minimize the transmission cost of data replication. We further propose three enhancement methods to further improve the performance of DGCloud. A dynamic load balancing method allows an overloaded server to quickly offload its excess workload. A data request queue improvement method sets different priorities to the data responses in a server’s queue so that more requests can satisfy the SLO requirement. A wakeup server selection method selects a sleeping server that stores more popular data to wake up, which allows it to handle more data requests. Our trace-driven experiments in simulation and Amazon EC2 show the superior performance of DGCloud compared with previous methods in terms of deadline guarantees and system resource utilization, and the effectiveness of its individual algorithms.
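The flavour of such an arrival-rate bound can be shown with a back-of-the-envelope M/M/1 calculation, sketched below; the exponential tail model is an assumption made here for illustration, not the bound actually derived in the paper.

```python
import math

# Under an M/M/1 assumption, P(response time > d) = exp(-(mu - lambda) * d), so
# requiring that at most `miss_fraction` of requests miss deadline d gives
# lambda <= mu + ln(miss_fraction) / d.
def max_arrival_rate(service_rate: float, deadline_s: float, miss_fraction: float) -> float:
    """Largest arrival rate (req/s) keeping P(response > deadline) <= miss_fraction."""
    bound = service_rate + math.log(miss_fraction) / deadline_s
    return max(0.0, bound)

mu = 200.0        # server can process 200 req/s
d = 0.05          # 50 ms deadline
eps = 0.05        # at most 5% of requests may miss the deadline
print(f"acceptable arrival rate <= {max_arrival_rate(mu, d, eps):.1f} req/s")
```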

Proceedings ArticleDOI
09 Apr 2021
TL;DR: In this paper, a multi-loop control approach is proposed to allocate resources to VMs based on the service level agreement (SLA) requirements and the run-time conditions, which can meet applications' performance goals by assigning the resources required by cloud-based applications.
Abstract: In the cloud computing model, resource sharing introduces major benefits for improving resource utilization and total cost of ownership, but it can create technical challenges for run-time performance. In practice, orchestrators are required to allocate sufficient physical resources to each Virtual Machine (VM) to meet a set of predefined performance goals. To ensure a specific service level objective, the orchestrator needs to be equipped with a dynamic tool for assigning computing resources to each VM based on the run-time state of the target environment. To this end, we present LOOPS, a multi-loop control approach, to allocate resources to VMs based on the service level agreement (SLA) requirements and the run-time conditions. LOOPS is mainly composed of one essential unit to monitor VMs, and three control levels to allocate resources to VMs based on requests from the essential node. A tailor-made controller is proposed for each level to regulate contention among collocated VMs, to reallocate resources if required, and to migrate VMs from one host to another. The three levels work together to meet the required SLA. The experimental results have shown that the proposed approach can meet applications' performance goals by assigning the resources required by cloud-based applications.

Proceedings ArticleDOI
09 Apr 2021
TL;DR: In this paper, the authors proposed Courier, a model that selects a batch size based on the type of machine learning job such that the response time adheres to the Service Level Objectives (SLOs) specified, while also rendering the highest possible accuracy.
Abstract: Distributed machine learning has seen immense rise in popularity in recent years. Many companies and universities are utilizing computational clusters to train and run machine learning models. Unfortunately, operating such a cluster imposes large costs. It is therefore crucial to attain as high system utilization as possible. Moreover, those who offer computational clusters as a service, apart from keeping high utilization, also have to meet the required Service Level Agreements (SLAs) for the system response time. This becomes increasingly more complex in multitenant scenarios, where the time dedicated to each task has to be limited to achieve fairness. In this work, we analyze how different parameters of the machine learning job influence the response time as well as system utilization and propose Courier. Courier is a model that, based on the type of machine learning job, can select a batch size such that the response time adheres to the Service Level Objectives (SLOs) specified, while also rendering the highest possible accuracy. We gather the data by conducting real-world experiments on a BigDL cluster. Later on, we study the influence of the factors and build several predictive models which lead us to the proposed Courier model.
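The selection step can be sketched as follows, using two hypothetical predictors for response time and accuracy in place of the models Courier fits from cluster measurements.

```python
def predicted_response_time_s(job_type: str, batch_size: int) -> float:
    per_sample = {"cnn": 0.004, "lstm": 0.009}[job_type]
    return 1.5 + per_sample * batch_size             # fixed overhead + per-sample cost

def predicted_accuracy(job_type: str, batch_size: int) -> float:
    best = {"cnn": 128, "lstm": 64}[job_type]        # accuracy peaks near a sweet spot
    return 0.95 - 0.0005 * abs(batch_size - best)

def choose_batch_size(job_type: str, slo_seconds: float):
    candidates = [16, 32, 64, 128, 256, 512]
    # Keep only batch sizes whose predicted response time meets the SLO...
    feasible = [b for b in candidates if predicted_response_time_s(job_type, b) <= slo_seconds]
    if not feasible:
        return None                                  # no batch size can meet this SLO
    # ...then pick the feasible one with the highest predicted accuracy.
    return max(feasible, key=lambda b: predicted_accuracy(job_type, b))

print(choose_batch_size("cnn", slo_seconds=2.0))     # feasible sizes are 16/32/64 -> picks 64
```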

Journal ArticleDOI
28 Oct 2021
TL;DR: In this article, a deep neural network-based multi-label classification methodology is proposed to identify and predict multiple categories of SLO breaches associated with an application state, and its performance is compared against a set of machine learning classifiers.
Abstract: Recent advancements in the domain of Network Function Virtualization (NFV), and rollout of next-generation networks have necessitated the requirement for the upkeep of latency-critical application architectures in future networks and communications. While Cloud service providers recognize the evolving mission-critical requirements in latency sensitive verticals such as autonomous driving, multimedia, gaming, telecommunications, and virtual reality, there is a wide gap to bridge the Quality of Service (QoS) constraints for the end-user experience. Most latency-critical services are over-provisioned on all fronts to offer reliability, which is inefficient towards scalability in the long run. To address this, we propose a strategy to model frequent violations on the application level as a multi-output target to enable more complex decision-making in the management of virtualised communication networks. In this work, we utilize data from a real-world deployment to configure and draft a realistic set of Service Level Objectives (SLOs) for a voice based NFV application, and develop a deep neural network based multi-label classification methodology to identify and predict multiple categories of SLO breaches associated with an application state. With this, we aim to gain granular SLA and SLO violation insights, enabling us to study and mitigate their impact and inform precision in drafting proactive scaling policies. We further compare the performance against a set of multi-label compatible machine learning classifiers, and address class imbalance in a multi-label setup. We perform a comprehensive evaluation to assess the performance on example-based, label-based and ranking-based measures, and demonstrate the suitability of deep learning in such a use-case.
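A compact sketch of multi-label SLO-breach prediction is shown below on synthetic data; the features, breach categories, and thresholds are invented, and a scikit-learn MLP stands in for the deep network trained in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
n = 3000
X = rng.uniform(0, 1, size=(n, 4))                 # e.g. CPU, memory, call rate, queue depth
# One application state can trigger several breach categories at once,
# so the target is a binary indicator matrix (one column per SLO).
y = np.column_stack([
    (0.6 * X[:, 0] + 0.4 * X[:, 2] > 0.7),         # latency SLO breach
    (X[:, 3] > 0.8),                               # jitter SLO breach
    (0.5 * X[:, 1] + 0.5 * X[:, 3] > 0.75),        # packet-loss SLO breach
]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("micro-F1:", round(f1_score(y_te, pred, average="micro"), 3))
```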

Proceedings Article
17 May 2021
TL;DR: In this paper, the authors proposed a Kubernetes scheduler extension and resource rescheduling that incorporates QoE metrics into SLO, and evaluated the architecture using the ITU P.1203 standard in the context of video streaming services co-located with other services.
Abstract: Cloud management has traditionally considered Service Level Objectives (SLO) based on QoS metrics. However, QoS-focused metrics have a limited effect on the Quality of Experience (QoE) experienced by the clients. This paper proposes a Kubernetes scheduler extension and resource rescheduling that incorporates QoE metrics into SLOs. As a proof of concept, this work evaluates the architecture using the QoE metric proposed in the ITU P.1203 standard, in the context of video streaming services co-located with other services. Experimental results show that our scheduler improves the average QoE by 50% compared to other schedulers, while resource rescheduling improved the average QoE by 135%. In addition, our architecture eliminated over-provisioning altogether.

Posted Content
TL;DR: In this paper, the authors proposed and evaluated three dynamic placement strategies, two heuristic (a greedy approximation based on set cover, and an integer programming based optimization) and one learning-based algorithm, which satisfy the application constraints and minimize infrastructure deployment cost while ensuring availability of services to all clients and User Equipment (UE) in the network coverage area.
Abstract: Edge computing hosts applications close to the end users and enables low-latency real-time applications. Modern applications have in turn adopted the microservices architecture, which composes applications as loosely coupled smaller components, or services. This complements edge computing infrastructures, which are often resource constrained and may not handle monolithic applications. Instead, edge servers can independently deploy application service components, although at the cost of communication overheads. Consistently meeting application service level objectives while also optimizing application deployment (placement and migration of services) cost and communication overheads in a mobile edge cloud environment is non-trivial. In this paper, we propose and evaluate three dynamic placement strategies, two heuristic (a greedy approximation based on set cover, and an integer programming based optimization) and one learning-based algorithm. Their goal is to satisfy the application constraints, minimize infrastructure deployment cost, while ensuring availability of services to all clients and User Equipment (UE) in the network coverage area. The algorithms can be extended to any network topology and microservice-based edge computing applications. For the experiments, we use drone swarm navigation as a representative application for edge computing use cases. Since access to a real-world physical testbed for such an application is difficult, we demonstrate the efficacy of our algorithms in simulation. We also contrast these algorithms with respect to placement quality, utilization of clusters, and level of determinism. Our evaluation not only shows that the learning-based algorithm provides solutions of better quality; it also provides interesting conclusions regarding when the (more traditional) heuristic algorithms are actually better suited.
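The greedy set-cover heuristic mentioned above can be sketched in a few lines, as below; the edge sites, coverage sets, and per-site costs are invented for illustration.

```python
def greedy_set_cover(clients: set, site_coverage: dict, site_cost: dict) -> list:
    """Repeatedly deploy on the site with the best newly-covered-clients-per-cost ratio."""
    uncovered = set(clients)
    chosen = []
    while uncovered:
        best = max(site_coverage,
                   key=lambda s: len(site_coverage[s] & uncovered) / site_cost[s])
        newly = site_coverage[best] & uncovered
        if not newly:
            raise ValueError("some clients cannot be covered by any site")
        chosen.append(best)
        uncovered -= newly
    return chosen

clients = {f"ue{i}" for i in range(1, 7)}
coverage = {
    "edge-A": {"ue1", "ue2", "ue3"},
    "edge-B": {"ue3", "ue4"},
    "edge-C": {"ue4", "ue5", "ue6"},
}
cost = {"edge-A": 2.0, "edge-B": 1.0, "edge-C": 2.5}
print(greedy_set_cover(clients, coverage, cost))   # -> ['edge-B', 'edge-A', 'edge-C']
```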