Goods: Organizing Google's Datasets

doi:10.1145/2882903.2903730

Open Access, Proceedings Article

- pp 795-806

TLDR

GOODS is a project to rethink how structured datasets are organized at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them.

Abstract:

Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use their datasets, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.
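
The kinds of metadata the abstract lists (owners, timestamps, schema, provenance, similarity, annotations) can be pictured with a small in-memory catalog. The following is only an illustrative sketch, not the GOODS design: the names `DatasetEntry`, `Catalog`, `schema_similarity`, and `build_example_catalog`, the dataset paths, and the use of Jaccard overlap on column names as a similarity measure are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetEntry:
    """Hypothetical catalog record; field names are illustrative only."""
    path: str
    owners: list[str]
    last_modified: datetime
    schema: dict[str, str]  # column name -> type name
    derived_from: list[str] = field(default_factory=list)  # provenance (upstream paths)
    annotations: list[str] = field(default_factory=list)   # free-text notes from users


class Catalog:
    """Toy in-memory metadata catalog (assumed structure, not the paper's)."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.path] = entry

    def find(self, keyword: str) -> list[str]:
        """Return sorted paths whose path, column names, or annotations contain keyword."""
        kw = keyword.lower()
        hits = []
        for e in self._entries.values():
            texts = [e.path, *e.schema.keys(), *e.annotations]
            if any(kw in t.lower() for t in texts):
                hits.append(e.path)
        return sorted(hits)

    def upstream(self, path: str) -> set[str]:
        """Return every dataset that `path` transitively derives from."""
        seen: set[str] = set()
        stack = list(self._entries[path].derived_from)
        while stack:
            p = stack.pop()
            if p in seen:
                continue
            seen.add(p)
            if p in self._entries:
                stack.extend(self._entries[p].derived_from)
        return seen


def schema_similarity(a: DatasetEntry, b: DatasetEntry) -> float:
    """Jaccard overlap of column names; an assumed stand-in for 'similarity'."""
    cols_a, cols_b = set(a.schema), set(b.schema)
    union = cols_a | cols_b
    return len(cols_a & cols_b) / len(union) if union else 0.0


def build_example_catalog() -> Catalog:
    """Populate a catalog with three made-up datasets forming a provenance chain."""
    ts = datetime(2016, 1, 1, tzinfo=timezone.utc)
    catalog = Catalog()
    catalog.register(DatasetEntry("/logs/raw_clicks", ["team-a"], ts,
                                  {"user_id": "int64", "url": "string"}))
    catalog.register(DatasetEntry("/agg/daily_clicks", ["team-b"], ts,
                                  {"user_id": "int64", "clicks": "int64"},
                                  derived_from=["/logs/raw_clicks"]))
    catalog.register(DatasetEntry("/reports/weekly", ["team-b"], ts,
                                  {"week": "string", "clicks": "int64"},
                                  derived_from=["/agg/daily_clicks"],
                                  annotations=["weekly traffic summary"]))
    return catalog
```

With this toy data, `build_example_catalog().find("traffic")` matches only the annotated weekly report, and `upstream("/reports/weekly")` walks the provenance chain back to the raw click logs. A system operating over billions of datasets would need crawling, inference, and consistency machinery that this sketch does not attempt.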

Citations

Journal Article

A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective

Yuji Roh, +2 more

- 01 Apr 2021 -

IEEE Transactions on Knowledge and Data ...

TL;DR: This survey performs a comprehensive study of data collection from a data management point of view, providing a research landscape of these operations and guidelines on which technique to use when, and identifying interesting research challenges.

Proceedings Article

Google Dataset Search: Building a search engine for datasets in an open Web ecosystem

Dan Brickley, +2 more

TL;DR: Google Dataset Search is a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. It relies on an open ecosystem in which dataset owners and providers publish semantically enhanced metadata on their own sites.

Proceedings Article

Data Management Challenges in Production Machine Learning

Neoklis Polyzotis, +3 more

TL;DR: The goal of the tutorial is to bring forth data-management issues that arise in the context of machine learning pipelines deployed in production, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art.

Journal Article

Data Lifecycle Challenges in Production Machine Learning: A Survey

Neoklis Polyzotis, +3 more

TL;DR: This survey explores challenges in data understanding, data validation and cleaning, and data preparation, and how different constraints are imposed on the solutions depending on where in the lifecycle of a model the problems are encountered and who encounters them.

Journal Article

Dataset search: a survey

Adriane Chapman, +6 more

TL;DR: This work surveys the state of the art of research and commercial systems and discusses what makes dataset search a field in its own right, with unique challenges and open questions, and looks at approaches and implementations from related areas dataset search is drawing upon.

References

Proceedings Article

Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!).

Fay W. Chang, +8 more

TL;DR: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance.

Journal Article

Bigtable: A Distributed Storage System for Structured Data

Fay W. Chang, +8 more

- 01 Jun 2008 -

ACM Transactions on Computer Systems

TL;DR: This paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, as well as the design and implementation of Bigtable.

Journal Article

From databases to dataspaces: a new abstraction for information management

Michael J. Franklin, +2 more

TL;DR: This paper proposes dataspaces and their support systems as a new agenda for data management, which encompasses much of the work going on in data management today, while posing additional research objectives.

Journal Article

WebTables: exploring the power of tables on the web

Michael Cafarella, +4 more

TL;DR: The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.

Journal Article

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

Philippe Flajolet, +3 more

- 17 Jun 2007 -

Discrete Mathematics & Theoretical Compu...

TL;DR: This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles, and makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.

Related Papers (5)

InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

Constance: An Intelligent Data Lake System

Data integration for the relational web

Data Wrangling: The Challenging Yourney from the Wild to the Lake.

Finding related tables