Steven Euijong Whang

Researcher at KAIST

Publications -  60
Citations -  3691

Steven Euijong Whang is an academic researcher from KAIST. The author has contributed to research in topics: Computer science & Big data. The author has an h-index of 22 and has co-authored 53 publications receiving 2793 citations. Previous affiliations of Steven Euijong Whang include Stanford University & Google.

Papers
Journal Article

Swoosh: a generic approach to entity resolution

TL;DR: This work formalizes the generic ER problem, treating the functions for comparing and merging records as black boxes, and identifies four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms.
Journal Article

A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective

TL;DR: This survey performs a comprehensive study of data collection from a data management point of view, providing a research landscape of these operations, guidelines on which technique to use when, and identifying interesting research challenges.
Proceedings Article

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

TL;DR: TensorFlow Extended (TFX) is presented, a TensorFlow-based general-purpose machine learning platform implemented at Google that was able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions.
Proceedings Article

Entity resolution with iterative blocking

TL;DR: A scalable iterative blocking system is implemented in which the ER results of blocks are reflected to other blocks, which may generate additional record matches; the system can be more accurate and efficient than blocking for large datasets.
Proceedings Article

Goods: Organizing Google's Datasets

TL;DR: Goods is a project to rethink how structured datasets are organized at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them.