Proceedings ArticleDOI

Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities

TL;DR: Helix as discussed by the authors is a human-in-the-loop ML system that accelerates the development of machine learning workflows by intelligently tracking changes and intermediate results over time, enabling rapid iteration, quick responsive feedback, introspection and debugging, and background execution and automation.
Abstract: Development of machine learning (ML) workflows is a tedious process of iterative experimentation: developers repeatedly make changes to workflows until the desired accuracy is attained. We describe our vision for a "human-in-the-loop" ML system that accelerates this process: by intelligently tracking changes and intermediate results over time, such a system can enable rapid iteration, quick responsive feedback, introspection and debugging, and background execution and automation. Finally, we describe Helix, our preliminary attempt at such a system, which has already led to speedups of up to 10x on typical iterative workflows against competing systems.

Summary (5 min read)

Introduction

  • To do so, they process published papers to extract entity mentions (genes and diseases), compute embeddings using an approach like word2vec [4], and finally cluster the embeddings to find related entities.
  • By optimizing across iterations, Helix allows data scientists to avoid wasting time running the workflow from scratch every time they make a change and instead run their workflows in time proportional to the complexity of the change made.
  • Unfortunately, this approach is not only wasteful in storage but also potentially very time-consuming due to materialization overhead.
  • The authors provide a brief overview of machine learning workflows, describe the Helix system architecture and present a sample workflow in Helix that will serve as a running example.

2.1 A BRIEF OVERVIEW OF WORKFLOWS

  • A machine learning (ML) workflow accomplishes a specific ML task, ranging from simple ones like classification or clustering, to complex ones like entity resolution or image captioning.
  • The more complex tasks are often broken down into smaller subtasks; e.g., image captioning is broken down into identifying objects or actions via classification, followed by generating sentences using a language model [15].
  • Let R be the raw input data for the ML workflow.
  • The transformation from R to D can involve a variety of operations, such as fine-grained feature definition from individual attributes (e.g., number of vowels in a word), joining in other data sources (e.g., user information into log data), parsing (e.g., a document to words), and aggregation (e.g., aggregating ad clicks).
  • This could include model evaluation, visualizations, or other application-specific activities.
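
The DPR operations listed above (fine-grained feature definition, joins, parsing, aggregation) can be sketched in plain Python. This is an illustrative toy, not Helix's API; all data shapes and function names here are assumptions.

```python
# Toy stand-ins for common DPR operations; record formats are illustrative.

def count_vowels(word):
    # Fine-grained feature defined on an individual attribute.
    return sum(ch in "aeiou" for ch in word.lower())

def join_user_info(log_records, users):
    # Join a second data source (user info) into log data.
    by_id = {u["id"]: u for u in users}
    return [{**r, **by_id.get(r["user_id"], {})} for r in log_records]

def parse_document(doc):
    # Parse a document into word tokens.
    return doc.lower().split()

def aggregate_clicks(click_events):
    # Aggregate ad clicks per ad id.
    counts = {}
    for e in click_events:
        counts[e["ad"]] = counts.get(e["ad"], 0) + 1
    return counts
```

Each of these maps raw records R toward the prepared dataset D; a real workflow chains many such operators.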

2.2 COMMON PRACTICES IN ITERATION

  • The authors collected statistics from papers from five application domains: computer vision (CV), natural language processing (NLP), web applications (WWW), natural sciences (NS), and social sciences (SS).
  • The statistics collected pertain to the frequency of operations in the three workflow components introduced above.
  • First, WWW and SS are much more likely to incorporate multiple data sources in creating an ML model.

2.3 SYSTEM ARCHITECTURE

  • The Helix system consists of a domain specific language (DSL) in Scala as the programming interface, a compiler for the DSL, and an execution engine, as shown in Figure 2.2.
  • The DAG Optimizer uses this information to produce an optimal physical execution plan that minimizes the one-shot runtime of the workflow, by selectively loading previous results via a Max-Flow-based algorithm (Section 5.1–5.2).
  • The execution engine uses Spark [19] for data processing and domain-specific libraries such as CoreNLP [20] and Deeplearning4j [21] for custom needs.
  • Helix defers operator pipelining and scheduling for asynchronous execution to Spark.
  • The authors discuss optimizations for streaming in Chapter 5.

2.4 THE WORKFLOW LIFECYCLE

  • Given the system components described in the previous section, Figure 2.3 illustrates how they fit into the lifecycle of ML workflows.
  • Starting with W0, an initial version of the workflow, the lifecycle includes the following stages: 1. DAG Compilation.
  • The DAG optimizer creates a physical plan GWt_OPT to be executed by pruning and ordering the nodes in GWt and deciding whether any computation can be replaced with loading previous results from disk.
  • Upon execution completion, the user may modify the workflow from Wt to Wt+1 based on the results.
  • The updated workflow Wt+1 fed back to Helix marks the beginning of a new iteration, and the cycle repeats.
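
The lifecycle stages above (compile, optimize, execute, modify, repeat) can be sketched as a loop. This is a hypothetical stand-in, not Helix's implementation; the DAG encoding, the cache, and every function name here are assumptions for illustration.

```python
# Minimal sketch of the workflow lifecycle; all names are illustrative.

def compile_to_dag(workflow):
    # Represent the Workflow DAG as {node: [parent nodes]}.
    return {op: deps for op, deps in workflow.items()}

def optimize(dag, cache):
    # Reuse a cached result when one exists; Helix additionally checks
    # operator equivalence before reusing (Section 4.2).
    return {n: ("load" if n in cache else "compute") for n in dag}

def execute(dag, plan, cache):
    results = {}
    for n in dag:  # assume declaration order is topological for this toy
        results[n] = cache[n] if plan[n] == "load" else f"result({n})"
    return results

def run_iteration(workflow, cache):
    dag = compile_to_dag(workflow)      # 1. DAG compilation
    plan = optimize(dag, cache)         # 2. physical plan: load vs. compute
    results = execute(dag, plan, cache) # 3. execution
    cache.update(results)               # materialize for the next iteration
    return plan, results                # 4. the user then edits W_t -> W_{t+1}
```

Running the same workflow twice makes every node a cache hit on the second iteration, which is the reuse effect the lifecycle is designed to exploit.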

2.5 EXAMPLE WORKFLOW

  • The authors demonstrate the usage of Helix with a simple example ML workflow for predicting income using census data from Kohavi [22], shown in Figure 2.4(a); this workflow will serve as a running example throughout the paper.
  • The authors overlay the original workflow with an iterative update, with additions annotated with + and deletions annotated with −, while the rest of the lines are retained as is.
  • In lines 5-10, the user declares simple features that are values from specific named columns.
  • Helix reuses results safely, deprecating old results when changes are detected (e.g., predictions is not reused because of the model change).
  • JVM-based libraries can be imported directly into HML to support application-specific needs.

3.1 OPERATIONS IN ML WORKFLOWS

  • The authors first introduce F and then enumerate its mapping onto operations in Scikit-learn [25], one of the most comprehensive ML libraries, thereby demonstrating coverage.
  • DPR includes transforming records from one or more data sources from one format to another or into feature vectors (FVs) in R^d, as well as feature transformations (from R^d to R^d').
  • Scikit-learn Operations for DPR and L/I. Scikit-learn objects for DPR and L/I implement one or more of the following interfaces [29]: Estimator, used to indicate that an operation has data-dependent behavior via a fit(X[, y]) method, where X contains FVs or raw records, and y contains labels if the operation represents a supervised model.
  • A useful data-dependent feature transformation for a Naive Bayes classifier maps word tokens to positions in a sparse vector and tracks word counts.
  • For model selection, the typical strategy is to define a class that implements methods fit and score.
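
The data-dependent feature transformation described above (mapping word tokens to sparse-vector positions while tracking counts) can be sketched in a scikit-learn-style fit/transform shape. This is a hand-rolled toy following that convention, not scikit-learn's actual CountVectorizer; the class name and sparse encoding are assumptions.

```python
# Toy Estimator-style transform: fit() has data-dependent behavior
# (it learns the vocabulary), transform() maps docs to sparse count vectors.

class TokenCounter:
    def fit(self, docs):
        # Learn a token -> position mapping from the training data.
        vocab = sorted({tok for doc in docs for tok in doc.split()})
        self.position = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, docs):
        # Represent each document as {position: count}, a sparse vector.
        rows = []
        for doc in docs:
            counts = {}
            for tok in doc.split():
                if tok in self.position:  # tokens unseen at fit time are dropped
                    j = self.position[tok]
                    counts[j] = counts.get(j, 0) + 1
            rows.append(counts)
        return rows
```

The fit/transform split is exactly what makes the operation "data-dependent": the same transform code produces different vectors depending on what fit saw.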

3.2 HML

  • The basic building blocks of HML are Helix objects, which correspond to the nodes in the DAG.
  • The combination of SUs and examples affords Helix a great deal of flexibility in the physical representation of features.
  • HML provides unified support for training and test data by treating them as a single DC, as done in Line 4 in Figure 2.4(a).
  • The authors describe the relationships between operator interfaces in HML and F enumerated in Section 3.1 below.
  • When f is empty, L learns a model using input data designated for model training; when f is populated, L performs inference on the input data using f and outputs the inference results into a DCE.
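
The dual-mode behavior of L described above (learn a model when f is empty, run inference when f is populated) can be sketched as a single function. The "model" here is a toy threshold chosen purely for illustration; nothing about this stands in for Helix's actual learning operators.

```python
# Hypothetical sketch of the dual-mode operator L: trains when no model
# is supplied, performs inference when one is.

def L(data, f=None):
    if f is None:
        # Training mode: data is (value, label) pairs; learn a threshold
        # midway between the classes (a deliberately trivial "model").
        pos = [x for x, y in data if y == 1]
        neg = [x for x, y in data if y == 0]
        return (min(pos) + max(neg)) / 2
    # Inference mode: data is raw values; apply the model f and emit
    # predictions (playing the role of the output DCE).
    return [1 if x >= f else 0 for x in data]
```

The same operator node can therefore appear in both the training and inference portions of a workflow, with the presence of f deciding which role it plays.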

3.3 SCOPE AND LIMITATIONS

  • In Section 3.1, the authors described how the set of basis operations F they propose covers all major operations in Scikit-learn, one of the most comprehensive ML libraries.
  • While HML’s interfaces are general enough to support all the common use cases, users can additionally manually plug into their interfaces external implementations, such as from MLLib [26] and Weka [30], of missing operations.
  • The authors demonstrate in Chapter 6 that the current set of implemented operations is sufficient for supporting applications across different domains.
  • Since Helix currently relies on its Scala DSL for workflow specification, popular non-JVM libraries, such as TensorFlow [31] and Pytorch [32], cannot be imported easily without significantly degrading performance compared to their native runtime environment.
  • Thus, work on optimizing learning, e.g., [33, 34], is orthogonal to (and can therefore be combined with) their work, which operates at a coarser granularity.

4.1 THE WORKFLOW DAG

  • At compile time, Helix’s intermediate code generator constructs a Workflow DAG from HML declarations, with nodes corresponding to operator outputs (DCs, scalars, or ML models) and edges corresponding to input-output relationships between operators.
  • Nodes for operators involved in DPR are colored purple whereas those involved in L/I and PPR are colored orange.
  • This transformation is straightforward, creating a node for each declared operator and adding edges between nodes based on the linking expressions, e.g., A results from B creates an edge (B,A).
  • These edges connect extractors to downstream DCs in order to automatically aggregate all features for learning.
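
The DAG construction described above is mechanical: each declared operator becomes a node, and a linking expression like "A results from B" contributes the edge (B, A). A minimal sketch, with the declaration format assumed for illustration:

```python
# Build a Workflow DAG from (output, [inputs]) declarations.
# "A results from B" is encoded as ("A", ["B"]) and yields edge (B, A).

def build_dag(declarations):
    nodes = set()
    edges = set()
    for out, inputs in declarations:
        nodes.add(out)
        for inp in inputs:
            nodes.add(inp)
            edges.add((inp, out))  # edge runs from input to output
    return nodes, edges
```

With the edges oriented input-to-output, downstream analyses (pruning, reuse decisions) become standard graph traversals.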

4.2 TRACKING CHANGES

  • To describe the changes between Wt and Wt+1, the authors introduce the notion of equivalence.
  • The programming language community has a large body of work on verifying operational equivalence for specific classes of programs [36, 37, 38].
  • Suppose for contradiction that the results at iteration t are correct but the results at iteration t+1 are incorrect, i.e., there exists a node ni whose results from iteration t are reused at iteration t+1 even though ni is original at t+1.
  • Since Helix detects all code changes, it identifies all original operators.
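
One simple, conservative way to detect "original" (changed) operators, sketched below, is to compare a hash of each operator's source text across iterations: a false positive merely forces recomputation, never incorrect reuse. This is an illustrative stand-in, not Helix's actual equivalence check, and the workflow encoding is assumed.

```python
# Conservative change detector: an operator is "original" if it is new
# or its source signature differs from the previous iteration.

import hashlib

def signature(op_source):
    return hashlib.sha256(op_source.encode()).hexdigest()

def original_operators(prev_workflow, new_workflow):
    # Both arguments: {operator_name: source_code_string}.
    changed = set()
    for name, src in new_workflow.items():
        if name not in prev_workflow or signature(prev_workflow[name]) != signature(src):
            changed.add(name)
    return changed
```

Text-level hashing cannot recognize semantically equivalent rewrites (which is where the program-verification literature cited above comes in), but it errs only on the safe side.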

5.1 PRELIMINARIES

  • Root nodes in the Workflow DAG, which correspond to data sources, have li = ci.
  • To ensure that nodes in the Compute state have their inputs available, i.e., not pruned, the states in a Workflow DAG GW = (N,E) must satisfy the execution state constraint given in Constraint 5.1.
  • Clearly, setting all nodes to the pruned state Sp trivially minimizes Equation 5.1.
  • Recall that Constraint 4.1 requires all original operators to be rerun.
  • Deciding whether to load or compute the parents can have a cascading effect on the states of their ancestors.

5.2 OPTIMAL EXECUTION PLAN

  • The Optimal Execution Plan (OEP) problem is the core problem solved by Helix’s DAG optimizer, which determines at compile time the optimal execution plan given results and statistics from previous iterations.
  • On the other hand, Constraint 5.1 disallows the parents of computed nodes to be pruned.
  • Problem 5.1 can be solved optimally in polynomial time through a linear time reduction to the project-selection problem (PSP), which is an application of Max-Flow [8].
  • The authors first show that satisfying the prerequisite constraint in PSP leads to satisfying Constraint 5.1 in Opt-Exec-Plan.
  • Impact of change detection precision and recall.
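
To make the load-vs-compute trade-off concrete, the sketch below brute-forces the state assignment on a tiny DAG: it enumerates prune/load/compute states, keeps only assignments satisfying the execution state constraint (no computed node has a pruned parent) and the requirement that original operators are computed, and returns the cheapest. This is emphatically not the paper's polynomial-time Max-Flow reduction, just a small checkable model of the same objective; the cost dictionaries are toy inputs.

```python
# Brute-force model of Opt-Exec-Plan on a tiny DAG (for intuition only;
# the real optimizer reduces the problem to project selection / Max-Flow).

from itertools import product

def optimal_plan(parents, l, c, must_compute):
    # parents: {node: [parent nodes]}; l/c: load/compute costs;
    # must_compute: the original operators, which must be rerun.
    nodes = list(parents)
    best, best_cost = None, float("inf")
    for states in product(["prune", "load", "compute"], repeat=len(nodes)):
        s = dict(zip(nodes, states))
        if any(s[n] != "compute" for n in must_compute):
            continue
        # Execution state constraint: parents of computed nodes are not pruned.
        if any(s[p] == "prune" for n in nodes if s[n] == "compute" for p in parents[n]):
            continue
        cost = sum({"prune": 0, "load": l[n], "compute": c[n]}[s[n]] for n in nodes)
        if cost < best_cost:
            best, best_cost = s, cost
    return best, best_cost
```

On a DAG where recomputing a parent is cheaper than loading it, the cascading effect mentioned above is visible: forcing a child to be computed drags the cheaper of the parent's two non-pruned options into the plan.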

5.3 OPTIMAL MATERIALIZATION PLAN

  • The Opt-Mat-Plan (OMP) problem is tackled by Helix’s materialization optimizer while the workflow is running. For simplicity, the authors also assume the time to write ni to disk is the same as the time to load it from disk, i.e., li.
  • The authors now describe the heuristic employed by Helix to approximate OMP while satisfying Constraint 5.2.
  • In essence, Algorithm 5.2 decides to materialize if twice the load cost is less than the cumulative run time for a node.
  • This approach avoids dataset fragmentation that complicates reuse for different workflow versions.
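
The heuristic above reduces to a one-line comparison per node: since writing is assumed to cost about the same as loading, materializing pays off roughly when write-plus-future-load (about twice the load cost) beats recomputation. A minimal sketch, with the cost inputs assumed for illustration:

```python
# Sketch of the Algorithm 5.2 decision rule: materialize a node when twice
# its load cost is below its cumulative run time.

def should_materialize(load_cost, cumulative_run_time):
    return 2 * load_cost < cumulative_run_time

def materialization_plan(nodes):
    # nodes: {name: (load_cost, cumulative_run_time)}
    return {n for n, (l, t) in nodes.items() if should_materialize(l, t)}
```

Cheap-to-recompute nodes are skipped, so storage is spent only where reloading will actually save future iteration time.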

5.4 WORKFLOW DAG PRUNING

  • In addition to optimizations involving intermediate result reuse, Helix further reduces overall workflow execution time by pruning extraneous operators from the Workflow DAG.
  • Helix provides two additional mechanisms for pruning operators other than using the lack of output dependency, described next.
  • Data-driven pruning is a powerful technique that can be extended to unlock the possibilities for many more impactful automatic workflow optimizations.
  • Once an operator has finished running, Helix analyzes the DAG to uncache newly out-of-scope nodes.
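
The output-dependency form of pruning mentioned above amounts to a reachability check: keep only operators from which some requested output is reachable, and discard the rest. A simplified sketch (not Helix's full mechanism, which also includes the two additional pruning modes described in this section):

```python
# Keep only operators on which some requested output depends; everything
# else in the Workflow DAG is extraneous and can be pruned.

def prune(parents, outputs):
    # parents: {node: [parent nodes]}; outputs: nodes whose results are needed.
    keep = set()
    stack = list(outputs)
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, []))
    return keep
```

A dead visualization branch, for example, disappears from the plan entirely as soon as its output is no longer requested.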

6.1 SYSTEMS AND BASELINES FOR COMPARISON

  • The authors compare the optimized version of Helix, Helix Opt, against two state-of-the-art ML workflow systems: KeystoneML [6], and DeepDive [7].
  • KeystoneML specializes in classification tasks on structured input data.
  • No intermediate results are materialized in KeystoneML, as it does not optimize execution across iterations.
  • DeepDive. DeepDive [7, 44] is a system, written using Bash scripts and Scala for the main engine, with a database backend, for the construction of end-to-end information extraction pipelines.
  • A version of Helix that uses the same reuse strategy as Helix Opt and always materializes all intermediate results.

6.2 WORKFLOWS

  • The authors conduct their experiments using four real-world ML workflows spanning a range of application domains.
  • Table 6.1 summarizes the characteristics of the four workflows, described next.
  • The authors include a workflow with unsupervised learning and multiple learning steps to verify that the system is able to accommodate variability in the learning task.
  • Each input article contains ≥ 0 spouse pairs, hence creating a one-to-many relationship between input records and learning examples.
  • Each workflow was implemented in Helix and, where supported, in DeepDive and KeystoneML; X* in Table 6.1 indicates that the authors used an existing implementation by the developers of DeepDive or KeystoneML, available at:
    Census (DeepDive): https://github.com/HazyResearch/deepdive/blob/master/examples/census/app.ddlog
    IE (DeepDive): https://github.com/HazyResearch/deepdive/blob/master/examples/spouse/app.ddlog
    MNIST (KeystoneML): https://github.com/amplab/keystone/blob/.

6.3 RUNNING EXPERIMENTS

  • Instead of arbitrarily choosing operators to modify in each iteration, the authors use the iteration frequency in Figure 3 from their literature study [18] to determine the type of modifications to make in each iteration, for the specific domain of each workflow.
  • The authors convert the iteration counts into fractions that represent the likelihood of a certain type of change.
  • Helix is not designed to suggest modifications; the modifications chosen in their experiments are only for evaluating system run time and storage use.
  • The authors use Postgres as the database backend for DeepDive.

6.4 METRICS

  • The authors evaluate each system’s ability to support diverse ML tasks by qualitative characterization of the workflows and use-cases supported by each system.
  • The authors measure with wall-clock time because it is the latency experienced by the user.
  • Note that the per-iteration time measures both the time to execute the workflow and any time spent to materialize intermediate results.
  • The authors also measure memory usage to analyze the effect of batch processing, and measure storage size to compare the run time reduction to storage ratio of time-efficient approaches.
  • Storage is compared only for variants of Helix since other systems do not support automatic reuse.

6.5 EVALUATION VS. STATE-OF-THE-ART SYSTEMS

  • Use Case Support. Recall that the four workflows used in their experiments are in social sciences, NLP, computer vision, and natural sciences, respectively.
  • Helix Opt, on the other hand, shows only a slight increase in runtime over KeystoneML for DPR and L/I iterations, because on these iterations it materializes only the L/I results, not the non-reusable, large DPR intermediates.
  • Systems that support declarative ML algorithms, such as TensorFlow [31], SystemML [47], OptiML [48], ScalOps [49], and SciDB [50], allow ML experts to program new ML algorithms, by declaratively specifying linear algebra and statistical operations at higher levels of abstraction.
  • The optimization techniques employed by all systems discussed leverage reuse in a simpler manner than does Helix, since the workflows are coarser-grained and computation-heavy, so that the cost of loading cached intermediate results can be considered negligible.
  • Mistique [70], Nectar [84], and ReStore [82] share the goal of efficiently reusing ML workflow intermediates with Helix.


Citations
Journal ArticleDOI
TL;DR: A vision for a Disaster City Digital Twin paradigm that can enable interdisciplinary convergence in the field of crisis informatics and information and communication technology in disaster management and integrate artificial intelligence algorithms and approaches to improve situation assessment, decision making, and coordination among various stakeholders is presented.

160 citations

Proceedings ArticleDOI
25 Jun 2019
TL;DR: Alpine Meadow significantly outperforms the other AutoML systems while, in contrast to them, providing interactive latencies; it also outperforms expert solutions in 80% of cases on data sets the authors had never seen before.
Abstract: Statistical knowledge and domain expertise are key to extract actionable insights out of data, yet such skills rarely coexist together. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and model selection. Domain experts are often overwhelmed by such complexity, de-facto inhibiting a wider adoption of ML techniques in other fields. Existing libraries that claim to solve this problem, still require well-trained practitioners. Those frameworks involve heavy data preparation steps and are often too slow for interactive feedback from the user, severely limiting the scope of such systems. In this paper we present Alpine Meadow, a first Interactive Automated Machine Learning tool. What makes our system unique is not only the focus on interactivity, but also the combined systemic and algorithmic design approach; on one hand we leverage ideas from query optimization, on the other we devise novel selection and pruning strategies combining cost-based Multi-Armed Bandits and Bayesian Optimization. We evaluate our system on over 300 datasets and compare against other AutoML tools, including the current NIPS winner, as well as expert solutions. Not only is Alpine Meadow able to significantly outperform the other AutoML systems while --- in contrast to the other systems --- providing interactive latencies, but also outperforms in 80% of the cases expert solutions over data sets we have never seen before.

71 citations

Journal ArticleDOI
TL;DR: This work conceptualizes the decision-making process in organizations augmented with DL algorithm outcomes (such as predictions or robust patterns from unstructured data) as deep learning–augmented decision- making (DLADM).

66 citations

Journal ArticleDOI
TL;DR: The role of machine learning in accelerating technology development and boosting scientific innovation in multiple fields, including computer vision, medical diagnosis, life sciences, molecular design, and instrumental development is highlighted.
Abstract: Machine learning has provided a huge wave of innovation in multiple fields, including computer vision, medical diagnosis, life sciences, molecular design, and instrumental development. This perspective focuses on the implementation of machine learning in dealing with light-matter interaction, which governs those fields involving materials discovery, optical characterizations, and photonics technologies. We highlight the role of machine learning in accelerating technology development and boosting scientific innovation in the aforementioned aspects. We provide future directions for advanced computing techniques via multidisciplinary efforts that can help to transform optical materials into imaging probes, information carriers and photonics devices.

60 citations

Proceedings ArticleDOI
Jialin Jiao1
23 Jul 2018
TL;DR: This paper first introduces the characteristics and layers of HD Maps; then a formal summary of the workflow of HD Map creation is provided; and most importantly, the machine learning techniques being used by the industry to minimize the amount of manual work in the process ofHD Map creation are presented.
Abstract: In recent years, autonomous driving technologies have attracted broad and enormous interests from both academia and industry and are under rapid development. High-Definition (HD) Maps are widely used as an indispensable component of an autonomous vehicle system by researchers and practitioners. HD Maps are digital maps that contain highly precise, fresh and comprehensive geometric information as well as semantics of the road network and surrounding environment. They provide critical inputs to almost all other components of autonomous vehicle systems, including localization, perception, prediction, motion planning, vehicle control etc. Traditionally, it is very laborious and costly to build HD Maps, requiring a significant amount of manual annotation work. In this paper, we first introduce the characteristics and layers of HD Maps; then we provide a formal summary of the workflow of HD Map creation; and most importantly, we present the machine learning techniques being used by the industry to minimize the amount of manual work in the process of HD Map creation.

54 citations

References
Posted Content
TL;DR: Scikit-learn as mentioned in this paper is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from this http URL.

28,898 citations

Proceedings ArticleDOI
02 Nov 2016
TL;DR: TensorFlow as mentioned in this paper is a machine learning system that operates at large scale and in heterogeneous environments, using dataflow graphs to represent computation, shared state, and the operations that mutate that state.
Abstract: TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.

10,913 citations

Proceedings Article
25 Apr 2012
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Abstract: We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

4,151 citations

Journal Article
TL;DR: MLlib as mentioned in this paper is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
Abstract: Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLLIB provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLLIB supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLLIB has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

1,551 citations

Proceedings ArticleDOI
27 May 2015
TL;DR: Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API, and includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language.
Abstract: Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

1,230 citations