Open Access | Proceedings Article

Goods: Organizing Google's Datasets

TLDR
GOODS is a project to rethink how structured datasets are organized at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them.
Abstract
Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use their datasets, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.

Citations
Journal Article

A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective

TL;DR: This survey performs a comprehensive study of data collection from a data management point of view, providing a research landscape of these operations and guidelines on which technique to use when, and identifying interesting research challenges.
Proceedings Article

Google Dataset Search: Building a search engine for datasets in an open Web ecosystem

TL;DR: Google Dataset Search is a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. It relies on an open ecosystem in which dataset owners and providers publish semantically enhanced metadata on their own sites.
Proceedings Article

Data Management Challenges in Production Machine Learning

TL;DR: The goal of the tutorial is to bring forth data-management issues that arise in the context of machine learning pipelines deployed in production, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art.
Journal Article

Data Lifecycle Challenges in Production Machine Learning: A Survey

TL;DR: Challenges in data understanding, data validation and cleaning, and data preparation are explored, along with how different constraints are imposed on the solutions depending on where in the lifecycle of a model the problems are encountered and who encounters them.
Journal Article

Dataset search: a survey

TL;DR: This work surveys the state of the art of research and commercial systems, discusses what makes dataset search a field in its own right, with unique challenges and open questions, and looks at approaches and implementations from the related areas that dataset search draws upon.
References
Proceedings Article

Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!).

TL;DR: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Google projects that store data in Bigtable include web indexing, Google Earth, and Google Finance.
Journal Article

Bigtable: A Distributed Storage System for Structured Data

TL;DR: The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.
Journal Article

From databases to dataspaces: a new abstraction for information management

TL;DR: This paper proposes dataspaces and their support systems as a new agenda for data management, which encompasses much of the work going on in data management today, while posing additional research objectives.
Journal Article

WebTables: exploring the power of tables on the web

TL;DR: The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Journal Article

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

TL;DR: This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles, and makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.