Author

Alexandru A. Ormenisan

Bio: Alexandru A. Ormenisan is an academic researcher from the Royal Institute of Technology. The author has contributed to research in topics: Random access & Communications protocol. The author has an h-index of 2 and has co-authored 5 publications receiving 6 citations.

Papers
01 Jan 2020
TL;DR: A bottom-up method for capturing provenance information regarding the processing steps and artifacts produced in ML pipelines is used, based on replacing traditional intrusive hooks in application code with standardized change-data-capture support in the systems involved in ML pipelines: the distributed file system, feature store, resource manager, and applications themselves.
Abstract: Machine learning pipelines have become the de facto paradigm for productionizing machine learning applications, as they clearly abstract the processing steps involved in transforming raw data into engineered features that are then used to train models. In this paper, we use a bottom-up method for capturing provenance information regarding the processing steps and artifacts produced in ML pipelines. Our approach is based on replacing traditional intrusive hooks in application code (to capture ML pipeline events) with standardized change-data-capture support in the systems involved in ML pipelines: the distributed file system, feature store, resource manager, and applications themselves. In particular, we leverage data versioning and time-travel capabilities in our feature store to show how provenance can enable model reproducibility and debugging.

1 From Data Parallel to Stateful ML Pipelines

Bulk synchronous parallel processing frameworks, such as Apache Spark [4], are used to build data processing pipelines that use idempotence to enable transparent handling of failures by re-executing failed tasks. As data pipelines are typically stateless, they need lineage support to identify the stages that need to be recomputed when a failure occurs. Caching temporary results at stages means recovery can be optimized to only re-run pipelines from the most recent cached stage. In contrast, database technology uses stateful protocols (like 2-phase commit and agreement protocols like Paxos [10]) to provide ACID properties to build reliable data processing systems. Recently, new data parallel processing frameworks have been extended with the ability to make atomic updates to tabular data stored in columnar file formats (like Parquet [3]) while providing isolation guarantees for concurrent clients. Examples of such frameworks are Delta Lake [8], Apache Hudi [1], and Apache Iceberg [2]. These ACID data lake platforms are important for ML pipelines as they provide the ability to query the value of rows at specific points in time in the past (time-travel queries). The Hopsworks Feature Store is built on the Hudi framework, where data files are stored in HopsFS [11] as Parquet files and are available as external tables in a modified version of Apache Hive [6] that shares the same metadata layer as HopsFS. HopsFS and Hive have a unified metadata layer, where Hive tables and feature store metadata are extended metadata for HopsFS directories. Foreign keys and transactions in our metadata layer ensure the consistency of extended metadata through Change-Data-Capture (CDC) events. Just like data pipelines, ML pipelines should be able to handle partial failures, but they should also be able to reproducibly train a model even if there are updates to the data lake. The Hopsworks feature store with Hudi enables this by storing both the features used to train a model and the Hudi commits (updates) for the feature data, see Figure 1. In contrast to data pipelines, ML pipelines are stateful. This state is maintained in the metadata store through a series of CDC events, as can be seen in Figure 1. For example, after a model has been trained and validated, we need state (from the metadata store) to check whether the new model has better performance than an existing model running in production. Other systems like TFX [5] and MLFlow [14] also provide a metadata store to enable ML pipelines to make stateful decisions.
However, they do so in an obtrusive way: they make developers re-write the code at each of the stages with their specific component models. In Hopsworks [9], however, we provide an unobtrusive metadata model based on implicit provenance [12], where change capture APIs in the platform enable metadata about artifacts and executions to be implicitly saved in a metadata store, with minimal changes needed to the user code that makes up the ML pipeline stages.

2 Versioning Code, Data, and Infrastructure

The de facto approach for versioning code is git, and many solutions try to apply the same process to the versioning of data. Tools such as DVC [7] and Pachyderm [13] version data with git-like semantics and track immutable files, instead of changes to files. An alternative to git-like versioning, and the one that we chose, is to use an ACID data lake with time-travel queries.

Figure 1: Hopsworks ML Pipelines with Implicit Metadata.
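The reproducibility claim above rests on being able to re-read feature data exactly as it was at training time. The sketch below shows the general shape of such a time-travel read with Spark and Hudi; the path and commit timestamp are hypothetical, and the "as.of.instant" read option follows recent Apache Hudi releases rather than the Hopsworks Feature Store's own API.

```python
# A minimal sketch, assuming a Hudi-backed feature table readable through Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reproducible-training").getOrCreate()

feature_path = "hdfs:///Projects/demo/featurestore/customer_features"  # hypothetical path
commit_time = "20200101120000"  # Hudi commit recorded when the model was first trained

# Time-travel read: return the table exactly as it was at that commit.
train_df = (
    spark.read.format("hudi")
    .option("as.of.instant", commit_time)
    .load(feature_path)
)

# Re-running training with the same commit_time reads identical feature rows,
# which is what makes the trained model reproducible and debuggable.
```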

4 citations

Proceedings ArticleDOI
01 Jun 2017
TL;DR: KompicsMessaging is presented, a messaging middleware that allows for fine-grained control of the network protocol used on a per-message basis and provides an online reinforcement learner that optimises the selection of the network protocol for the current network environment.
Abstract: Distributed applications deployed in multi-datacenter environments need to deal with network connections of varying quality, including high bandwidth and low latency within a datacenter and, more recently, high bandwidth and high latency between datacenters. In principle, for a given network connection, each message should be sent over the best available network protocol, but existing middlewares do not provide this functionality. In this paper, we present KompicsMessaging, a messaging middleware that allows for fine-grained control of the network protocol used on a per-message basis. Rather than always requiring application developers to specify the appropriate protocol for each message, we also provide an online reinforcement learner that optimises the selection of the network protocol for the current network environment. In experiments, we show how connection properties, such as the varying round-trip time, influence the performance of the application and we show how throughput and latency can be improved by picking the right protocol at the right time.
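As a rough illustration of the learning component described above, the sketch below implements a simple epsilon-greedy selector that picks, per message, the transport with the lowest observed round-trip time. It is a generic bandit sketch, not KompicsMessaging's actual learner or API, and the protocol names are placeholders.

```python
import random

# Epsilon-greedy: mostly exploit the transport with the lowest running-average
# RTT, occasionally explore another one.
class ProtocolSelector:
    def __init__(self, protocols=("tcp", "udt"), epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {p: {"count": 0, "mean_rtt": float("inf")} for p in protocols}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.stats))                            # explore
        return min(self.stats, key=lambda p: self.stats[p]["mean_rtt"])       # exploit

    def record(self, protocol, rtt_ms):
        s = self.stats[protocol]
        s["count"] += 1
        if s["count"] == 1:
            s["mean_rtt"] = rtt_ms
        else:
            s["mean_rtt"] += (rtt_ms - s["mean_rtt"]) / s["count"]  # running average

selector = ProtocolSelector()
proto = selector.choose()            # pick a transport for the next message
selector.record(proto, rtt_ms=42.0)  # feed back the measured round-trip time
```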

2 citations

Proceedings ArticleDOI
01 Jun 2017
TL;DR: Dela is the first step for the Hadoop platform towards creating an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of large datasets.
Abstract: Big data has, in recent years, revolutionised an ever-growing number of fields, from machine learning to climate science to genomics. The current state-of-the-art for storing large datasets is either object stores or distributed filesystems, with Hadoop being the dominant open-source platform for managing ‘Big Data’. Existing large-scale storage platforms, however, lack support for the efficient sharing of large datasets over the Internet. Those systems that are widely used for the dissemination of large files, like BitTorrent, need to be adapted to handle challenges such as network links with both high latency and high bandwidth, and scalable storage backends that are optimised for streaming and not random access. In this paper, we introduce Dela, a peer-to-peer data-sharing service integrated into the Hops Hadoop platform that provides an end-to-end solution for dataset sharing. Dela is designed for large-scale storage backends and for data transfers that are non-intrusive to existing TCP network traffic and that provide higher network throughput than TCP on high latency, high bandwidth network links, such as transatlantic network links. Dela provides a pluggable storage layer, implementing two alternative ways for clients to access shared data: stream processing of data as it arrives with Kafka, and traditional offline access to data using the Hadoop Distributed Filesystem. Dela is the first step for the Hadoop platform towards creating an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of large datasets.
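To make the pluggable storage layer idea concrete, here is a minimal sketch of how a transfer engine might write incoming chunks through a small sink interface, with one sink streaming data onward as it arrives and another staging it for offline access. All class and method names are hypothetical and do not reflect Dela's actual interfaces; the producer and filesystem objects are assumed to be supplied by the caller.

```python
from abc import ABC, abstractmethod

# Hypothetical sink interface: the transfer engine only knows write_chunk/close.
class StorageSink(ABC):
    @abstractmethod
    def write_chunk(self, offset: int, data: bytes) -> None: ...

    @abstractmethod
    def close(self) -> None: ...

class StreamingSink(StorageSink):
    """Hands chunks to a stream processor as they arrive (e.g. via a Kafka producer)."""
    def __init__(self, producer, topic: str):
        self.producer, self.topic = producer, topic

    def write_chunk(self, offset: int, data: bytes) -> None:
        self.producer.send(self.topic, data)  # consumers can process data mid-transfer

    def close(self) -> None:
        self.producer.flush()

class FilesystemSink(StorageSink):
    """Stages the whole dataset for traditional offline access (e.g. in HDFS)."""
    def __init__(self, fs, path: str):
        self.file = fs.open(path, "wb")

    def write_chunk(self, offset: int, data: bytes) -> None:
        self.file.seek(offset)
        self.file.write(data)

    def close(self) -> None:
        self.file.close()
```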

2 citations

Posted Content
TL;DR: KompicsTesting is presented, a framework for unit testing components in the Kompics component model that can be used to perform both black box and white box testing of components, and its feasibility is illustrated through the design and implementation of a prototype based on this approach.
Abstract: In this paper we present KompicsTesting, a framework for unit testing components in the Kompics component model. Components in Kompics are event-driven entities which communicate asynchronously solely by message passing. Similar to actors in the actor model, they do not share their internal state in message-passing, making them less prone to errors compared to other models of concurrency using shared state. However, they are neither immune to simpler logical and specification errors nor to errors such as data races that stem from nondeterminism. As a result, there exists a need for tools that enable rapid and iterative development and testing of message-passing components in general, in a manner similar to the xUnit frameworks for functions and modular segments of code. These frameworks work in an imperative manner, ill suited for testing message-passing components, given that the behavior of such components is encoded in the streams of messages that they send and receive. In this work, we present a theoretical framework for describing and verifying the behavior of message-passing components, independent of the model and framework implementation, in a manner similar to describing a stream of characters using regular expressions. We show how this approach can be used to perform both black box and white box testing of components and illustrate its feasibility through the design and implementation of a prototype based on this approach, KompicsTesting.
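The regular-expression analogy can be illustrated with a small sketch: the events a component sends and receives are reduced to tokens, and the recorded stream is checked against an expected pattern. The token format and pattern below are hypothetical, chosen only to show the idea, and are not KompicsTesting's API.

```python
import re

# Each recorded event is reduced to a token: direction plus message type.
recorded = ["out:Connect", "in:ConnectAck", "out:Ping", "in:Pong", "in:Pong"]

# Expected behaviour: a handshake, then one Ping answered by one or more Pongs.
expected = r"out:Connect in:ConnectAck out:Ping (in:Pong )+"

def matches(events, pattern):
    stream = " ".join(events) + " "
    return re.fullmatch(pattern, stream) is not None

assert matches(recorded, expected), "component behaviour deviates from its specification"
```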

1 citation

Book
01 Jan 2013
TL;DR: Search or social media giants are no longer the only ones that face the problems of managing Big Data; many of today’s applications and services experience sudden bursts in growth.
Abstract: Search or social media giants are no longer the only individuals that face the problems of managing Big Data. Many of today’s applications and services experience sudden bursts in growth, with incr ...

Cited by
Journal ArticleDOI


08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at first: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i, the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

17 Dec 2010
TL;DR: The authors survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000, using a corpus of digitized texts containing about 4% of all books ever printed.
Abstract: The article, published in Science, on one of the first analytical uses of Google Books, based on n-grams (Google Ngrams). We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can ...
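As a toy illustration of the kind of analysis such a corpus enables, the sketch below normalises the yearly count of one n-gram by the total number of n-grams printed that year to obtain a relative-frequency trend; all counts are invented purely for illustration.

```python
# Toy yearly counts for one 2-gram and for all 2-grams printed that year.
yearly_ngram_counts = {1900: 120, 1950: 950, 2000: 4300}
yearly_total_ngrams = {1900: 2.1e9, 1950: 6.4e9, 2000: 2.8e10}

# Normalising by the yearly total turns raw counts into a usage trend that
# can be compared across years with very different publishing volumes.
relative_frequency = {
    year: yearly_ngram_counts[year] / yearly_total_ngrams[year]
    for year in yearly_ngram_counts
}

for year, freq in sorted(relative_frequency.items()):
    print(f"{year}: {freq:.2e}")
```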

735 citations

01 Aug 2000
TL;DR: In this paper, the authors identify a source of self-similarity previously ignored, a source that is readily controllable, and examine the effects of the TCP stack on network traffic using different implementations of TCP.
Abstract: Distributed computational grids depend on TCP to ensure reliable end-to-end communication between nodes across the wide-area network (WAN). Unfortunately, TCP performance can be abysmal even when buffers on the end hosts are manually optimized. Recent studies blame the self-similar nature of aggregate network traffic for TCP's poor performance because such traffic is not readily amenable to statistical multiplexing in the Internet, and hence in computational grids. In this paper we identify a source of self-similarity previously ignored, a source that is readily controllable--TCP. Via an experimental study, we examine the effects of the TCP stack on network traffic using different implementations of TCP. We show that even when aggregate application traffic ought to smooth out as more applications' traffic is multiplexed, TCP induces burstiness into the aggregate traffic load, thus adversely impacting network performance. Furthermore, our results indicate that TCP performance will worsen as WAN speeds continue to increase.
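A small simulation can illustrate the smoothing argument the paper tests: if sources were independent, the relative variation of the aggregate should shrink as more of them are multiplexed, whereas correlated bursts keep it high. The sketch below is synthetic and only mimics that effect; it does not reproduce the paper's experiments or traffic model.

```python
import random
import statistics

def aggregate_cv(num_sources, correlated_bursts, slots=2000):
    """Coefficient of variation of the aggregate traffic over many time slots."""
    totals = []
    for _ in range(slots):
        # With correlated bursts, every source bursts in the same slots (a crude
        # stand-in for burstiness induced across flows); otherwise each slot has
        # the same mean rate per source.
        rate = 10.0 if (correlated_bursts and random.random() < 0.1) else 1.0
        totals.append(sum(random.expovariate(1.0 / rate) for _ in range(num_sources)))
    return statistics.stdev(totals) / statistics.mean(totals)

print("independent sources:", round(aggregate_cv(50, correlated_bursts=False), 3))  # small -> smooth
print("correlated bursts:  ", round(aggregate_cv(50, correlated_bursts=True), 3))   # larger -> bursty
```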

101 citations

Journal ArticleDOI
TL;DR: An overview of systems and platforms which support the management of ML lifecycle artifacts is given based on a systematic literature review, and assessment criteria are derived and applied to a representative selection of more than 60 systems and platforms.
Abstract: The explorative and iterative nature of developing and operating machine learning (ML) applications leads to a variety of artifacts, such as datasets, features, models, hyperparameters, metrics, software, configurations, and logs. In order to enable comparability, reproducibility, and traceability of these artifacts across the ML lifecycle steps and iterations, systems and tools have been developed to support their collection, storage, and management. It is often not obvious what precise functional scope such systems offer, so comparing candidates and estimating synergy effects between them is quite challenging. In this paper, we aim to give an overview of systems and platforms which support the management of ML lifecycle artifacts. Based on a systematic literature review, we derive assessment criteria and apply them to a representative selection of more than 60 systems and platforms.
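As a minimal illustration of what managing such artifacts involves, the sketch below records each artifact with its type, version, lineage, and metadata so that later steps can be traced back. It is a generic example of the surveyed functionality, not the schema of any particular system; all names and values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    kind: str                 # "dataset", "features", "model", "metrics", ...
    version: str
    parents: list = field(default_factory=list)   # lineage: artifacts this one was derived from
    metadata: dict = field(default_factory=dict)  # hyperparameters, schema, git commit, ...

raw = Artifact("clicks_raw", "dataset", "v3")
feats = Artifact("click_features", "features", "v7", parents=[raw])
model = Artifact("ctr_model", "model", "v12", parents=[feats],
                 metadata={"learning_rate": 0.01, "framework": "tensorflow"})

def lineage(artifact):
    """Walk back through parents to answer 'what was this derived from?'."""
    for parent in artifact.parents:
        yield parent
        yield from lineage(parent)

print([a.name for a in lineage(model)])   # ['click_features', 'clicks_raw']
```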

3 citations