
Showing papers in "IEEE Data(base) Engineering Bulletin in 2019"


Journal Article
TL;DR: The vision for a Mixed-Initiative machine Learning Environment (MILE) is outlined, by rethinking the role that automation and human supervision play across the ML development lifecycle, to enable a better user experience and benefits from system optimizations that both leverage human input and are tailored to the fact that MILE interacts with a human in the loop.
Abstract: Machine learning (ML) has gained widespread adoption in a variety of real-world problem domains, ranging from business, to healthcare, to agriculture. However, the development of effective ML solutions requires highly specialized experts well-versed in both statistics and programming. This high barrier to entry stems from the current process of crafting a customized ML solution, which often involves numerous manual iterative changes to the ML workflow, guided by knowledge or intuition of how those changes impact eventual performance. This cumbersome process is a major pain point for machine learning practitioners [4, 53] and has motivated our prior work on Helix, a declarative ML framework [52] targeted at supporting efficient iteration. To make ML more accessible and effortless, there has been recent interest in AutoML systems, both in industry [2, 1, 21] and in academia [15, 37], that automatically search over a predefined space of ML models for some high-level goal, such as prediction of a target variable. For certain tasks, these systems have been shown to generate models with comparable or better performance than those generated by human ML experts in the same amount of time [35, 26]. However, our preliminary study of ML workflows on OpenML [48] (an online platform for experimenting with and sharing ML workflows and results) shows that AutoML is not widely adopted in practice—accounting for fewer than 2% of all users and workflows. While this may be due to a lack of awareness of these tools, we believe that this sparse usage stems from a more fundamental issue: a lack of usability. Our main observation is that the fully-automated setting that current AutoML systems operate in may not be a one-size-fits-all solution for many users and problem domains. Recent work echoes our sentiment that AutoML’s complete automation over model choices may be inadequate in certain problem contexts [18, 50].
The lack of human control and interpretability is particularly problematic when the user’s domain knowledge may influence the choice of workflow [18], in high-stakes decision-making scenarios where trust and transparency are essential [50], and in exploratory situations where the problem is not well-defined [11]. This trade-off between control and automation has been a century-long debate in HCI [23, 22, 44, 5], with modern reincarnations arising in conversational agents, interactive visual analytics, and autonomous driving. A common interaction paradigm to reconcile these two approaches is a mixed-initiative approach, where “intelligent services and users...collaborate efficiently to achieve the user’s goals” [23]. Following in the footsteps of these seminal papers, we outline here our vision for a Mixed-Initiative machine Learning Environment (MILE), rethinking the role that automation and human supervision play across the ML development lifecycle. MILE enables a better user experience, and benefits from system optimizations that both leverage human input and are tailored to the fact that MILE interacts with a human in the loop. For example, our earlier work HELIX [52] leveraged the fact that workflow development happens iteratively, to intelligently materialize and reuse intermediate data products to speed up subsequent iterations. Similarly, as discussed later in this paper, leveraging user domain knowledge has the potential to drastically narrow down the exhaustive search space typically employed by existing AutoML systems.

45 citations


Journal Article
TL;DR: This paper proposes to develop interpretability and transparency tools based on the concept of a nutritional label, drawing an analogy to the food industry, where simple, standard labels convey information about the ingredients and production processes.
Abstract: An essential ingredient of successful machine-assisted decision-making, particularly in high-stakes decisions, is interpretability –– allowing humans to understand, trust and, if necessary, contest, the computational process and its outcomes. These decision-making processes are typically complex: carried out in multiple steps, employing models with many hidden assumptions, and relying on datasets that are often used outside of the original context for which they were intended. In response, humans need to be able to determine the “fitness for use” of a given model or dataset, and to assess the methodology that was used to produce it. To address this need, we propose to develop interpretability and transparency tools based on the concept of a nutritional label, drawing an analogy to the food industry, where simple, standard labels convey information about the ingredients and production processes. Nutritional labels are derived automatically or semi-automatically as part of the complex process that gave rise to the data or model they describe, embodying the paradigm of interpretability-by-design. In this paper we further motivate nutritional labels, describe our instantiation of this paradigm for algorithmic rankers, and give a vision for developing nutritional labels that are appropriate for different contexts and stakeholders.
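To make the analogy concrete, a nutritional label for an algorithmic ranker could be a small structured summary emitted alongside the ranking itself. The sketch below is purely illustrative (the field names and the group-representation statistic are invented here, not the authors' schema): it reports the scoring "recipe" and how a protected group is represented in the top-k.

```python
# Illustrative "nutritional label" for a score-based ranker.
# items: list of feature tuples; weights: scoring weight vector;
# group: 1 if the item belongs to a protected group, else 0.
def nutritional_label(items, weights, attr_names, group, k=3):
    scores = [sum(w * f for w, f in zip(weights, it)) for it in items]
    order = sorted(range(len(items)), key=lambda i: -scores[i])
    topk = order[:k]
    return {
        "ingredients": dict(zip(attr_names, weights)),  # the scoring recipe
        "top_k": topk,                                  # indices of top-k items
        "group_share_top_k": sum(group[i] for i in topk) / k,
    }
```

Because the label is computed inside the ranking process itself, it embodies the interpretability-by-design paradigm: the label is a by-product, not a post-hoc audit.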

34 citations


Journal Article
TL;DR: It is argued that Optimistic Lock Coupling, rather than a complex and error-prone custom synchronization protocol, should be the default choice for performance-critical data structures.
Abstract: As the number of cores on commodity processors continues to increase, scalability becomes more and more crucial for overall performance. Scalable and efficient concurrent data structures are particularly important, as these are often the building blocks of parallel algorithms. Unfortunately, traditional synchronization techniques based on fine-grained locking have been shown to be unscalable on modern multi-core CPUs. Lock-free data structures, on the other hand, are extremely difficult to design and often incur significant overhead. In this work, we make the case for Optimistic Lock Coupling as a practical alternative to both traditional locking and the lock-free approach. We show that Optimistic Lock Coupling is highly scalable and almost as simple to implement as traditional lock coupling. Another important advantage is that it is easily applicable to most tree-like data structures. We therefore argue that Optimistic Lock Coupling, rather than a complex and error-prone custom synchronization protocol, should be the default choice for performance-critical data structures.
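The core protocol is simple enough to sketch in a few lines. The following is an illustrative Python rendering (the paper's setting is C/C++ tree structures with atomic version counters): writers take a lock and bump a version counter on both acquire and release, so the version is odd while a write is in flight; readers proceed without locking and restart if the version changed underneath them.

```python
import threading

class OptLock:
    """Minimal optimistic lock sketch: even version = unlocked,
    odd version = a writer holds the lock."""
    def __init__(self):
        self.version = 0
        self._mutex = threading.Lock()

    def read_lock_or_restart(self):
        v = self.version
        return None if v % 2 == 1 else v   # None -> writer active, restart

    def read_unlock_or_restart(self, v):
        return self.version == v           # False -> concurrent write, restart

    def write_lock(self):
        self._mutex.acquire()
        self.version += 1                  # now odd: readers will restart

    def write_unlock(self):
        self.version += 1                  # even again, but changed
        self._mutex.release()

class Node:
    def __init__(self, keys):
        self.lock = OptLock()
        self.keys = list(keys)

def optimistic_lookup(node, key):
    """Read without acquiring a lock; validate the version afterwards."""
    while True:
        v = node.lock.read_lock_or_restart()
        if v is None:
            continue                       # writer active, retry
        found = key in node.keys           # speculative read
        if node.lock.read_unlock_or_restart(v):
            return found                   # version unchanged: read is consistent

def locked_insert(node, key):
    node.lock.write_lock()
    node.keys.append(key)
    node.lock.write_unlock()
```

In a tree, the same validate-then-descend step is applied at each level (the "coupling"), which is why the technique carries over to most tree-like structures with little extra machinery.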

30 citations


Journal Article
TL;DR: Two engineering approaches for integrating ML agents natively in the DBMS’s architecture are discussed and the trade-offs of these approaches are considered in the context of two projects from Carnegie Mellon University (CMU).
Abstract: The limitless number of possible ways to configure database management systems (DBMSs) has rightfully earned them the reputation of being difficult to manage and tune. Optimizing a DBMS to meet the needs of an application has surpassed the abilities of humans. This is because the correct configuration of a DBMS is highly dependent on a number of factors that are beyond what humans can reason about. The problem is further exacerbated in large-scale deployments with thousands or even millions of individual DBMS installations that each have their own tuning requirements. To overcome this problem, recent research has explored using machine learning-based (ML) agents for automated tuning of DBMSs. These agents extract performance metrics and behavioral information from the DBMS and then train models with this data to select tuning actions that they predict will have the most benefit. They then observe how these actions affect the DBMS and update their models to further improve their efficacy. In this paper, we discuss two engineering approaches for integrating ML agents in a DBMS. The first is to build an external tuning controller that treats the DBMS as a black-box. The second is to integrate the ML agents natively in the DBMS’s architecture. We consider the trade-offs of these approaches in the context of two projects from Carnegie Mellon University (CMU).
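As a toy illustration of the external, black-box approach, the agent below sees only an observed metric (here a stand-in benchmark function) when choosing among candidate knob settings; the knob values and throughput numbers are invented for the sketch, and real agents use far richer models than this epsilon-greedy loop.

```python
import random

KNOB_SETTINGS = [64, 128, 256, 512]   # hypothetical buffer pool sizes (MB)

def run_benchmark(setting):
    # Placeholder for measuring DBMS throughput under a configuration;
    # a real controller would run a workload against the live system.
    return {64: 90.0, 128: 120.0, 256: 150.0, 512: 140.0}[setting]

def tune(iterations=100, epsilon=0.1, seed=0):
    """Observe -> act -> update loop: try a knob setting, record the
    metric, keep a running estimate per setting, exploit the best."""
    rng = random.Random(seed)
    estimates, counts = {}, {}
    for s in KNOB_SETTINGS:               # warm start: try every setting once
        estimates[s], counts[s] = run_benchmark(s), 1
    for _ in range(iterations):
        if rng.random() < epsilon:
            setting = rng.choice(KNOB_SETTINGS)          # explore
        else:
            setting = max(estimates, key=estimates.get)  # exploit
        reward = run_benchmark(setting)
        counts[setting] += 1
        estimates[setting] += (reward - estimates[setting]) / counts[setting]
    return max(estimates, key=estimates.get)
```

The black-box framing is exactly what makes this loop portable across DBMSs, at the cost of the agent relearning behavior the system already knows about itself, which is the trade-off against native integration that the paper examines.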

21 citations


Journal Article
TL;DR: Five levels of AI-native databases are introduced, several open challenges of designing an AI-native database are provided, and autonomous database knob tuning, deep reinforcement learning based optimizers, machine-learning based cardinality estimation, and autonomous index/view advisors are taken as examples to showcase the superiority of AI-native databases.
Abstract: In the big data era, database systems face three challenges. Firstly, the traditional empirical optimization techniques (e.g., cost estimation, join order selection, knob tuning) cannot meet the high-performance requirements of large-scale data, various applications and diversified users. We need to design learning-based techniques to make databases more intelligent. Secondly, many database applications require the use of AI algorithms, e.g., image search in databases. We can embed AI algorithms into databases, utilize database techniques to accelerate AI algorithms, and provide AI capability inside databases. Thirdly, traditional databases focus on using general hardware (e.g., CPU), but cannot fully utilize new hardware (e.g., ARM, GPU, AI chips). Moreover, besides the relational model, we can utilize the tensor model to accelerate AI operations. Thus, we need to design new techniques to make full use of new hardware. To address these challenges, we design an AI-native database. On one hand, we integrate AI techniques into databases to provide self-configuring, self-optimizing, self-monitoring, self-diagnosis, self-healing, self-assembling, and self-security capabilities. On the other hand, we enable databases to provide AI capabilities using declarative languages in order to lower the barrier of using AI. In this paper, we introduce five levels of AI-native databases and provide several open challenges of designing an AI-native database. We also take autonomous database knob tuning, deep reinforcement learning based optimizers, machine-learning based cardinality estimation, and autonomous index/view advisors as examples to showcase the superiority of AI-native databases.
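To make one of these examples concrete, here is a deliberately tiny sketch of learning-based cardinality estimation: a one-parameter model is fit to feedback from past range queries and then used to estimate the cardinality of a new predicate. The uniform synthetic column and the linear model are assumptions made purely for illustration; real learned estimators use much richer features and models.

```python
# Toy learned cardinality estimator for predicates "col BETWEEN lo AND hi".
TABLE = list(range(1000))  # hypothetical column values, uniform 0..999

def true_card(lo, hi):
    """Ground-truth row count, i.e., the feedback observed after execution."""
    return sum(lo <= v <= hi for v in TABLE)

def fit(training_queries):
    """Least-squares fit of cardinality ~= slope * range_width,
    mimicking a model trained on observed query feedback."""
    num = sum((hi - lo) * true_card(lo, hi) for lo, hi in training_queries)
    den = sum((hi - lo) ** 2 for lo, hi in training_queries)
    return num / den

def estimate(slope, lo, hi):
    """Estimated cardinality for an unseen range predicate."""
    return slope * (hi - lo)
```

Even this toy version shows the appeal: the estimator improves automatically as query feedback accumulates, with no hand-tuned histograms.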

20 citations


Journal Article
TL;DR: This work proposes the construction of an engine, a Data Alchemist, which learns how to blend fine-grained data structure design principles to automatically synthesize brand new data structures.
Abstract: We propose a solution based on first principles and AI to the decades-old problem of data structure design. Instead of working on individual designs that each can only be helpful in a small set of environments, we propose the construction of an engine, a Data Alchemist, which learns how to blend fine-grained data structure design principles to automatically synthesize brand new data structures. [Figure: the read/memory/update performance trade-offs that shape data structure designs (B-tree, LSM, hash) across access patterns, hardware, and cloud costs.]

19 citations


Journal Article
TL;DR: This first version of doppioDB provides a platform for extending traditional relational processing with customizable hardware to support stochastic gradient descent and decision tree ensembles and shows examples of how they could be included into SQL and embedded as part of conventional components of a relational database engine.
Abstract: Advances in hardware are a challenge but also a new opportunity. In particular, devices like FPGAs and GPUs are a chance to extend and customize relational engines with new operations that would be difficult to support otherwise. Doing so would offer database users the possibility of conducting, e.g., complete data analyses involving machine learning inside the database instead of having to take the data out, process it in a different platform, and then store the results back in the database as it is often done today. In this paper we present doppioDB 1.0, an FPGA-enabled database engine incorporating FPGA-based machine learning operators into a main memory, columnar DBMS (MonetDB). This first version of doppioDB provides a platform for extending traditional relational processing with customizable hardware to support stochastic gradient descent and decision tree ensembles. Using these operators, we show examples of how they could be included into SQL and embedded as part of conventional components of a relational database engine. While these results are still a preliminary, exploratory step, they illustrate the challenges to be tackled and the advantages of using hardware accelerators as a way to extend database functionality in a non-disruptive manner.

17 citations


Journal Article
TL;DR: It is argued that the concept of fairness requires causal reasoning, and existing works and future opportunities for applying data management techniques to causal algorithmic fairness are identified.
Abstract: Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a data management problem. In this paper, we first make a distinction between associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning. We then review existing works and identify future opportunities for applying data management techniques to causal algorithmic fairness.
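For contrast, an associational fairness measure is easy to state and compute. The sketch below implements demographic parity, the kind of purely correlational definition the paper argues is insufficient, because a rate gap alone cannot distinguish discrimination from legitimate causal influences on the outcome (all names here are hypothetical).

```python
# Associational fairness check: compare positive-outcome rates across groups.
def positive_rate(decisions, group, g):
    """Fraction of positive decisions among members of group g."""
    sel = [d for d, a in zip(decisions, group) if a == g]
    return sum(sel) / len(sel)

def demographic_parity_gap(decisions, group):
    """Max difference in positive-outcome rate between any two groups;
    0.0 means perfect demographic parity under this (associational) lens."""
    rates = {g: positive_rate(decisions, group, g) for g in set(group)}
    return max(rates.values()) - min(rates.values())
```

A causal definition would instead ask whether the group attribute influenced the decision along an impermissible path, which requires a causal model of the data rather than just the observed rates above.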

14 citations


Journal Article
TL;DR: It is argued here that the authors need to acknowledge the subjective aspects of data and decision making and broaden their agenda to incorporate them into the systems they build and understand how to model the emotional aspect of decision making in their systems.
Abstract: Data has become an integral part of our lives. We use data to make a wide spectrum of decisions that affect our well-being, such as what to wear today, where to go for dinner and on vacation, which charity to support, and who to vote for in the elections. Ideally, we’d like to think that if we just had all the facts, decisions would be easy and disagreements would be quickly settled. As a research community, our goal has been to fuel such decision making by focusing on extracting, managing, querying and visualizing data and facts. I argue here that we need to acknowledge the subjective aspects of data and decision making and broaden our agenda to incorporate them into the systems we build. Subjectivity is prevalent in at least three levels. First, the data itself may be subjective: there is no ground truth about whether a restaurant is romantic or a travel destination is relaxing. We need to develop techniques to extract, manage and query subjective data. Second, presentation of the data can be subjective either by introducing bias (perhaps intentionally or even maliciously), or by tailoring the presentation to the frame of mind of the recipient. Third, human decision making is inherently subjective and based on emotions. We need to understand how to model the emotional aspect of decision making in our systems. The following sections will expand on each of these topics. I have already done some research on the first topic [5] and so my comments on it will be more concrete. However, I believe all three areas are equally important.

8 citations



Journal Article
TL;DR: This paper presents the vision on how ethical issues related to fairness, transparency, and bias regarding these new forms of work can be combated in the future of work, and how this will impact the data management research community and future work platforms.
Abstract: The rise of self-employment empowered by platforms such as Amazon Mechanical Turk and Uber has drastically changed our perception of work. The possibility to link requesters and workers from all over the world in a scalable manner has resulted in advancements in the work world that would not have been possible otherwise. However, many ethical concerns related to fairness, transparency, and bias regarding these new forms of work have also been raised. In this paper, we present our vision on these ethical issues, how they can be combated in the future of work, and how this will impact the data management research community and future work platforms.

Journal Article
TL;DR: This article addresses important extensions to the problem of allocating indivisible items to a population of agents by treating each group of agents as a single entity receiving a bundle of items whose valuation is the maximum total utility of matching agents in that group to items in that bundle.
Abstract: In this article, we address important extensions to the problem of allocating indivisible items to a population of agents: The agents are partitioned into disjoint groups on the basis of attributes (e.g., ethnicity) and we want the overall utility of the allocation to respect some notion of diversity and/or fairness with respect to these groups. We study two specific incarnations of this general problem. First, we address a constrained optimization problem, inspired by diversity quotas in some real-world allocation problems, where the items are also partitioned into blocks and there is an upper bound on the number of items from each block that can be assigned to agents in each group. We theoretically analyze the price of diversity – a measure of the overall welfare loss due to these capacity constraints – and report experiments based on two real-world data sets (Singapore public housing and Chicago public school admissions) comparing this constrained optimization-based approach with a lottery mechanism with similar quotas. Next, instead of imposing hard constraints, we cast the problem as a variant of fair allocation of indivisible goods – we treat each group of agents as a single entity receiving a bundle of items whose valuation is the maximum total utility of matching agents in that group to items in that bundle; we present algorithms that achieve a standard relaxation of envy-freeness in conjunction with specific efficiency criteria.

Journal Article
TL;DR: This article surveys the existing scheduling mechanisms targeting the utilization of a multicore server with uniform processing units and revisits them in the context of emerging server hardware composed of many diverse cores and identifies the main challenges.
Abstract: Scheduling various data-intensive tasks over the processing units of a server has been a heavily studied but still challenging effort. In order to utilize modern multicore servers well, a good scheduling mechanism has to be conscious of different dimensions of parallelism offered by these servers. This requires being aware of the micro-architectural features of processors, the hardware topology connecting the processing units of a server, and the characteristics of these units as well as the data-intensive tasks. The increasing levels of parallelism and heterogeneity in emerging server hardware amplify these challenges in addition to the increasing variety of data-intensive applications. This article first surveys the existing scheduling mechanisms targeting the utilization of a multicore server with uniform processing units. Then, it revisits them in the context of emerging server hardware composed of many diverse cores and identifies the main challenges. Finally, it concludes with the description of a preliminary framework targeting these challenges. Even though this article focuses on data-intensive applications on a single server, many of the challenges and opportunities identified here are not unique to such a setup, and would be relevant to other complex software systems as well as resource-constrained or large-scale hardware platforms.

Journal Article
TL;DR: Technical methods to assist human experts in designing fair and stable score-based rankings and to assess and enhance the coverage of a training dataset for machine learning tasks such as classification are presented.
Abstract: Human decision makers often receive assistance from data-driven algorithmic systems that provide a score for evaluating the quality of items such as products, services, or individuals. These scores can be obtained by combining different features either through a process learned by ML models, or using a weight vector designed by human experts, with their past experience and notions of what constitutes item quality. The scores can be used for different evaluation purposes such as ranking or classification. In this paper, we view the design of these scores through the lens of responsibility. We present technical methods (i) to assist human experts in designing fair and stable score-based rankings and (ii) to assess and (if needed) enhance the coverage of a training dataset for machine learning tasks such as classification.
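A minimal sketch of the score-based setting: items are scored by a weight vector over features, and one crude probe of ranking stability is whether small perturbations of the weights reorder the result. This illustrates the problem setting only; it is not the authors' actual method, and the perturbation size is an arbitrary choice here.

```python
# Items are feature tuples; a weight vector induces scores and a ranking.
def score(item, weights):
    return sum(w * f for w, f in zip(weights, item))

def rank(items, weights):
    """Indices of items, best score first."""
    return sorted(range(len(items)), key=lambda i: -score(items[i], weights))

def is_stable(items, weights, delta=0.05):
    """Crude stability probe: does nudging any single weight by +/- delta
    change the induced ranking?"""
    base = rank(items, weights)
    for j in range(len(weights)):
        for sign in (+1, -1):
            w = list(weights)
            w[j] += sign * delta
            if rank(items, w) != base:
                return False
    return True
```

An unstable ranking is a warning sign for the expert designing the weights: the output order reflects an essentially arbitrary choice within the designer's margin of uncertainty.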



Journal Article
TL;DR: The analysis in [7] shows how modern data caching systems can produce better cost/performance; their exploitation of a storage hierarchy hence lets them serve a greater diversity of data management needs at lower cost.



Journal Article
TL;DR: This article proposes solutions that—by sharing semantics between the application, the database system, the operating system, and the hardware—make it possible to manage complex and resource-intensive workloads in an efficient and holistic way.
Abstract: With their capability to perform both high-speed transactional processing and complex analytical workloads on the same dataset and at the same time, Operational Analytics Database Management Systems give enormous flexibility to application developers. In particular, they allow for the development of new classes of enterprise applications by giving analytical insights into operational data sets in real time. From a database system point of view, though, these applications are very demanding, as they exhibit a highly diverse combination of different query workloads with inhomogeneous performance and latency requirements. In this article, we discuss the practical implications and challenges for database architects and system designers. We propose solutions that—by sharing semantics between the application, the database system, the operating system, and the hardware—make it possible to manage complex and resource-intensive workloads in an efficient and holistic way.

Journal Article
TL;DR: This work eliminates the transform-and-load cost using in-situ query processing approaches which adapt to any data format and facilitate querying diverse datasets and adapt access paths on-the-fly to minimize response times.
Abstract: Data preparation is a crucial phase for data analysis applications. Data scientists spend most of their time on collecting and preparing data in order to efficiently and accurately extract valuable insights. Data preparation involves multiple steps of transformations until data is ready for analysis. Users often need to integrate heterogeneous data; to query data of various formats, one has to transform the data to a common format. To accurately execute queries over the transformed data, users have to remove any inconsistencies by applying cleaning operations. To efficiently execute queries, they need to tune access paths over the data. Data preparation, however, is (i) time-consuming, since it involves expensive operations, and (ii) workload-oblivious: a lot of preparation effort is wasted on data never meant to be used. To address the functionality and performance requirements of data analysis, we re-design data preparation in a way that is weaved into data analysis. We eliminate the transform-and-load cost using in-situ query processing approaches which adapt to any data format and facilitate querying diverse datasets. To address the scalability issues of cleaning and tuning tasks, we inject cleaning operations into query processing, and adapt access paths on-the-fly. By integrating the aforementioned tasks into data analysis, we adapt data preparation to each workload and thereby minimize response times.

Journal Article
TL;DR: A path-centric paradigm, where costs are associated with arbitrary paths in a road network graph, and a cost-oblivious paradigm, where the objective is to return routes that match the preferences of local, or expert, drivers without formalizing costs, are envisioned.
Abstract: Vehicular transportation will undergo profound change over the next decades, due to developments such as increasing mobility demands and increasingly autonomous driving. At the same time, rapidly increasing, massive volumes of data that capture the movements of vehicles are becoming available. In this setting, the current vehicle routing paradigm falls short, and we need new data-intensive paradigms. In a data-rich setting, travel costs such as travel time are modeled as time-varying distributions: at a single point in time, the time needed to traverse a road segment is given by a distribution. How can we best build, maintain, and use such distributions? The travel cost of a route is obtained by convolving distributions that model the costs of the segments that make up the route. This process is expensive and yields inaccurate results when dependencies exist among the distributions. To avoid these problems, we need a path-centric paradigm, where costs are associated with arbitrary paths in a road network graph, not just with edges. This paradigm thrives on data: more data is expected to improve accuracy, but also efficiency. Next, massive trajectory data makes it possible to compute different travel costs in different contexts, e.g., for different drivers, by using different subsets of trajectories depending on the context. It is then no longer appropriate to assume that costs are available when routing starts; rather, we need an on-the-fly paradigm, where costs can be computed during routing. Key challenges include how to achieve efficiency and accuracy with sparse data. Finally, the above paradigms assume that the benefit, or cost, of a path is quantified. As an alternative, we envision a cost-oblivious paradigm, where the objective is to return routes that match the preferences of local, or expert, drivers without formalizing costs.
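The convolution step mentioned above is easy to illustrate for discrete travel-time histograms. The sketch below assumes independent edge costs, which is exactly the assumption that makes edge-based convolution inaccurate in practice and motivates the path-centric paradigm.

```python
# Travel-time distributions as {travel_time: probability} histograms.
def convolve(d1, d2):
    """Distribution of the sum of two independent edge travel times,
    i.e., the travel-time distribution of the two-edge path."""
    out = {}
    for t1, p1 in d1.items():
        for t2, p2 in d2.items():
            out[t1 + t2] = out.get(t1 + t2, 0.0) + p1 * p2
    return out
```

Repeating this per edge along a route is quadratic in the histogram sizes and compounds any dependence between edges (e.g., congestion affecting consecutive segments), which is why storing distributions for whole paths can be both more accurate and cheaper at query time.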

Journal Article
TL;DR: It is argued that distributed transactions on middle-tier servers need to return to the mainstream after a 15-year decline, especially for applications targeted for cloud computing.
Abstract: Over the years, platforms and application requirements change. As they do, technologies come, go, and return again as the preferred solution to certain system problems. In each of its incarnations, the technology’s details change but the principles remain the same. One such technology is distributed transactions on middle-tier servers. Here, we argue that after a 15-year decline, they need to return to the mainstream.

Journal Article
TL;DR: This paper explores how mechanisms like hyperupcalls could be used from the perspective of an application, and demonstrates two use cases from an application perspective: one to trace events in both the guest and hypervisor simultaneously and another simple use case where a database installs a hyperupcall so the hypervisor can prioritize certain traffic and improve response latency.
Abstract: Hyperupcalls are a mechanism which we recently proposed to bridge the semantic gap between a hypervisor and its guest virtual machines (VMs) by allowing the guest VM to provide the hypervisor safe, verifiable code to transfer information. With a hyperupcall, a hypervisor can safely read and update data structures inside the guest, such as page tables. A hypervisor could use such a hyperupcall, for example, to determine which pages are free and can be reclaimed in the VM without invoking it. In this paper, we describe hyperupcalls and how they have been used to improve and gain additional insight into virtualized workloads. We also observe that complex applications such as databases hold a wealth of semantic information which the systems they run on top of are unaware of. For example, a database may store records, but the operating system can only observe bytes written into a file, and the hypervisor beneath it only blocks written to a disk, limiting the optimizations the system may make: for instance, if the operating system understood that the database wished to fetch a database record, it could prefetch related records. We explore how mechanisms like hyperupcalls could be used from the perspective of an application, and demonstrate two use cases from an application perspective: one to trace events in both the guest and hypervisor simultaneously, and another simple use case where a database installs a hyperupcall so the hypervisor can prioritize certain traffic and improve response latency.

Journal Article
TL;DR: The impact of 5G on both traditional and emerging technologies are examined, and research challenges and opportunities are discussed.
Abstract: The fifth-generation (5G) mobile communication technologies are on the way to be adopted as the next standard for mobile networking. It is therefore timely to analyze the impact of 5G on the landscape of computing, in particular, data management and data-driven technologies. With a predicted 10-100× increase in bandwidth and a 5-10× decrease in latency, 5G is expected to be the main enabler for edge computing, which includes accessing cloud-like services, as well as conducting machine learning at the edge. In this paper, we examine the impact of 5G on both traditional and emerging technologies, and discuss research challenges and opportunities.

Journal Article
TL;DR: The goal of this article is to propose an optimization framework that acknowledges human factors to enable label acquisition through active learning, investigating tasks such as providing and validating labels, or comparing data.
Abstract: The goal of this article is to propose an optimization framework by acknowledging human factors to enable label acquisition through active learning. In particular, we are interested in investigating tasks such as providing (collecting or acquiring) and validating labels, or comparing data, using active learning techniques. Our basic approach is to take a set of existing active learning techniques for a few well-known supervised and unsupervised algorithms, but study them in the context of crowdsourcing, especially considering worker-centric optimization (i.e., human factors). Our innovation lies in designing optimization functions that appropriately capture these two fundamental yet complementary facets, performing systematic investigation to understand the complexity of such optimization problems, and designing efficient solutions with theoretical guarantees.
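A minimal sketch of what a worker-centric objective can look like: classic uncertainty sampling scores each unlabeled item by model uncertainty, and the worker-centric twist divides by a per-item labeling cost (a hypothetical human factor; the scoring rule here is an invented illustration, not the authors' optimization function).

```python
# Worker-centric active learning sketch: pick the unlabeled item that
# maximizes model uncertainty per unit of worker effort.
def uncertainty(p):
    """Uncertainty of a binary prediction probability; peaks at p = 0.5."""
    return 1.0 - abs(2.0 * p - 1.0)

def next_query(items):
    """items: list of (item_id, predicted_prob, worker_cost).
    Returns the id of the most informative item per unit cost."""
    return max(items, key=lambda x: uncertainty(x[1]) / x[2])[0]
```

Coupling the two facets in one objective is the point: an item the model is unsure about may still be a poor query if labeling it is expensive or error-prone for the worker.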

Journal Article
TL;DR: This article advocates developing benchmarks for crowdsourcing research, initiates some discussion of this important problem, and issues a call-to-arms for the community to work on this important initiative.
Abstract: Online crowdsourcing platforms have proliferated over the last few years and cover a number of important domains; these platforms range from worker-task platforms such as Amazon Mechanical Turk and worker-for-hire platforms such as TaskRabbit to specialized platforms for specific tasks such as ridesharing (e.g., Uber, Lyft, Ola). An increasing proportion of the human workforce will be employed by these platforms in the near future. The crowdsourcing community has done yeoman’s work in designing effective algorithms for various key components, such as incentive design, task assignment and quality control. Given the increasing importance of these crowdsourcing platforms, it is now time to design mechanisms so that it is easier to evaluate their effectiveness. Specifically, we advocate developing benchmarks for crowdsourcing research. Benchmarks often identify important issues for the community to focus and improve upon. This has played a key role in the development of research domains as diverse as databases and deep learning. We believe that developing appropriate benchmarks for crowdsourcing will ignite further innovations. However, crowdsourcing – and the future of work, in general – is a very diverse field, which makes developing benchmarks much more challenging. Substantial effort is needed that spans developing benchmarks for datasets, metrics, algorithms, platforms and so on. In this article, we initiate some discussion of this important problem and issue a call-to-arms for the community to work on this important initiative.