Steven Euijong Whang
Researcher at KAIST
Publications - 60
Citations - 3691
Steven Euijong Whang is an academic researcher from KAIST. The author has contributed to research in topics: Computer science & Big data. The author has an h-index of 22 and has co-authored 53 publications receiving 2793 citations. Previous affiliations of Steven Euijong Whang include Stanford University & Google.
Papers
Journal ArticleDOI
Swoosh: a generic approach to entity resolution
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom
TL;DR: This work formalizes the generic ER problem, treating the functions for comparing and merging records as black-boxes, and identifies four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms.
Journal ArticleDOI
A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective
TL;DR: This survey performs a comprehensive study of data collection from a data management point of view, providing a research landscape of these operations and guidelines on which technique to use when, and identifying interesting research challenges.
Proceedings ArticleDOI
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew, Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, Martin Zinkevich
TL;DR: TensorFlow Extended (TFX) is presented, a TensorFlow-based general-purpose machine learning platform implemented at Google that was able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions.
Proceedings ArticleDOI
Entity resolution with iterative blocking
TL;DR: This work implements a scalable iterative blocking system that can be more accurate and efficient than blocking for large datasets, because reflecting the ER results of one block to other blocks may generate additional record matches.
Proceedings ArticleDOI
Goods: Organizing Google's Datasets
Alon Halevy, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang
TL;DR: Goods is a project to rethink how structured datasets are organized at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them.