scispace - formally typeset
Author

Byung Suk Lee

Bio: Byung Suk Lee is an academic researcher from the University of Vermont. The author has contributed to research in topics: Query optimization & Tuple. The author has an h-index of 15, has co-authored 82 publications receiving 758 citations. Previous affiliations of Byung Suk Lee include KAIST & University of St. Thomas (Minnesota).


Papers
Proceedings ArticleDOI
09 Jul 2007
TL;DR: This paper presents a novel algorithm for maintaining the reservoir sample after the reservoir size is adjusted such that the resulting uniformity confidence exceeds a given threshold.
Abstract: Reservoir sampling is a well-known technique for sequential random sampling over data streams. Conventional reservoir sampling assumes a fixed-size reservoir. There are situations, however, in which it is necessary and/or advantageous to adaptively adjust the size of a reservoir in the middle of sampling due to changes in data characteristics and/or application behavior. This paper studies adaptive-size reservoir sampling over data streams considering two main factors: reservoir size and sample uniformity. First, the paper conducts a theoretical study on the effects of adjusting the size of a reservoir while sampling is in progress. The theoretical results show that such an adjustment may negatively impact the probability of the sample being uniform (called uniformity confidence herein). Second, the paper presents a novel algorithm for maintaining the reservoir sample after the reservoir size is adjusted such that the resulting uniformity confidence exceeds a given threshold. Third, the paper extends the proposed algorithm to an adaptive multi-reservoir sampling algorithm for a practical application in which samples are collected from memory-limited wireless sensor networks using a mobile sink. Finally, the paper empirically examines the adaptivity of the multi-reservoir sampling algorithm with regard to reservoir size and sample uniformity using real sensor network data sets.
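For context, the fixed-size reservoir sampling that the paper generalizes is the classic Algorithm R. The sketch below shows that baseline only, not the paper's adaptive-size or multi-reservoir algorithms; the function name and signature are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Classic fixed-size reservoir sampling (Algorithm R).

    Maintains a uniform random sample of k items over a stream of
    unknown length. The paper's contribution is allowing k to change
    mid-stream while bounding the loss of uniformity (not shown here).
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item ends up in the reservoir with equal probability k/n, which is exactly the uniformity property that naive mid-stream resizing can break.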

60 citations

Book ChapterDOI
24 Jul 2003
TL;DR: This work is the first comprehensive performance study of main-memory R-tree variants, and provides a useful guideline in selecting the most suitable index structure in various cases.
Abstract: There have been several techniques proposed for improving the performance of main-memory spatial indexes, but there has not been a comparative study of their performance. In this paper we compare the performance of six main-memory R-tree variants: R-tree, R*-tree, Hilbert R-tree, CR-tree, CR*-tree, and Hilbert CR-tree. CR*-trees and Hilbert CR-trees are respectively natural extensions of R*-trees and Hilbert R-trees, incorporating the CR-tree's quantized relative minimum bounding rectangle (QRMBR) technique. Additionally, we apply the optimistic, latch-free index traversal (OLFIT) concurrency control mechanism for B-trees to the R-tree variants while using the GiST-link technique. We perform extensive experiments in the two categories of sequential accesses and concurrent accesses, and pick the following best trees. In sequential accesses, CR*-trees are the best for search, Hilbert R-trees for update, and Hilbert CR-trees for a mixture of them. In concurrent accesses, Hilbert CR-trees are best for search if data is uniformly distributed, CR*-trees for search if data is skewed, Hilbert R-trees for update, and Hilbert CR-trees for a mixture of them. We also provide detailed observations of the experimental results and rationalize them based on the characteristics of the individual trees. As far as we know, our work is the first comprehensive performance study of main-memory R-tree variants. The results of our study provide a useful guideline for selecting the most suitable index structure in various cases.
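The QRMBR technique mentioned above stores a child MBR relative to its parent MBR on a small integer grid, so more entries fit in a cache line. A toy sketch of that idea, assuming 2-D boxes as (xlo, ylo, xhi, yhi) tuples; this illustrates the quantization principle only, not the CR-tree's exact encoding.

```python
import math

def quantize_rmbr(child, parent, bits=8):
    """Quantize a child MBR relative to its parent MBR onto a
    2^bits grid, rounding outward so the quantized box still
    encloses the child (a conservative, lossy compression).
    Boxes are (xlo, ylo, xhi, yhi) in the same coordinate space.
    """
    grid = (1 << bits) - 1
    pxlo, pylo, pxhi, pyhi = parent
    cxlo, cylo, cxhi, cyhi = child
    sx = grid / (pxhi - pxlo)
    sy = grid / (pyhi - pylo)
    return (
        math.floor((cxlo - pxlo) * sx),  # round lower bounds down
        math.floor((cylo - pylo) * sy),
        math.ceil((cxhi - pxlo) * sx),   # round upper bounds up
        math.ceil((cyhi - pylo) * sy),
    )
```

Outward rounding means a search can get false positives (which are re-checked against exact keys) but never false negatives, which is what makes the compression safe for index traversal.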

58 citations

Journal ArticleDOI
TL;DR: A rigorous system model is developed to facilitate the mapping between an object-oriented model and the relational model and reduces the number of left outer joins and the filters so that the query can be processed more efficiently.
Abstract: One of the approaches for integrating object-oriented programs with databases is to instantiate objects from relational databases by evaluating view queries. In that approach, it is often necessary to evaluate some joins of the query by left outer joins to prevent information loss caused by the tuples discarded by inner joins. It is also necessary to filter some relations with selection conditions to prevent the retrieval of unwanted nulls. The system should automatically prescribe joins as inner or left outer joins and generate the filters, rather than letting them be specified manually for every view definition. We develop such a mechanism in this paper. We first develop a rigorous system model to facilitate the mapping between an object-oriented model and the relational model. The system model provides a well-defined context for developing a simple mechanism. The mechanism requires only one piece of information from users: null options on an object attribute. The semantics of these options are mapped to non-null constraints on the query result. Then the system prescribes joins and generates filters accordingly. We also address reducing the number of left outer joins and the filters so that the query can be processed more efficiently.
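The information-loss problem that motivates the mechanism is easy to see concretely: an inner join silently drops tuples with no match, while a left outer join preserves them with nulls. A minimal demonstration using Python's built-in sqlite3; the schema and data are hypothetical, not from the paper.

```python
import sqlite3

# Hypothetical schema: an employee may lack a department. An inner
# join drops such employees; a LEFT OUTER JOIN keeps them with NULL,
# which is the choice the paper's mechanism prescribes automatically
# from per-attribute null options.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE dept (id INTEGER, name TEXT);
    INSERT INTO emp VALUES (1, 'Ann', 10), (2, 'Bob', NULL);
    INSERT INTO dept VALUES (10, 'Sales');
""")
inner = conn.execute(
    "SELECT emp.name, dept.name FROM emp "
    "JOIN dept ON emp.dept_id = dept.id"
).fetchall()
outer = conn.execute(
    "SELECT emp.name, dept.name FROM emp "
    "LEFT OUTER JOIN dept ON emp.dept_id = dept.id"
).fetchall()
print(inner)  # Bob is lost by the inner join
print(outer)  # Bob survives with a NULL department
```

Conversely, when the object attribute is declared non-null, the system generates a filter (here, `WHERE dept.name IS NOT NULL`) so unwanted nulls never reach the application.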

56 citations

Journal ArticleDOI
TL;DR: An aggregation protocol and related algorithms are presented for reaching a quality-of-service (QoS) goal that combines lifetime and error objectives; the key idea is to periodically adjust a filter threshold for each sensor in a way that is optimal with respect to the user objective.
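The per-sensor filter threshold mentioned above works by suppressing transmissions whose readings have not moved far enough from the last reported value: a wider threshold saves energy (lifetime) at the cost of aggregate error. A toy sketch of that trade-off; the function, readings, and threshold value are illustrative, not the paper's protocol.

```python
def should_transmit(last_sent, new_value, threshold):
    """In-network filtering: transmit only when the new reading
    deviates from the last reported value by more than the sensor's
    current filter threshold."""
    return abs(new_value - last_sent) > threshold

# Simulate one sensor with a threshold of 1.0: small fluctuations
# are suppressed, large changes are reported.
readings = [20.0, 20.2, 20.1, 23.0, 23.1, 19.0]
last = readings[0]
sent = [last]
for r in readings[1:]:
    if should_transmit(last, r, threshold=1.0):
        sent.append(r)
        last = r
print(sent)  # only the initial reading and changes larger than 1.0
```

The paper's contribution is deciding, periodically and per sensor, how wide each threshold should be so the combined lifetime/error objective is optimized.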

46 citations

Journal ArticleDOI
01 Jul 2019
TL;DR: It is asserted that NETS opens a new possibility to real-time data stream outlier detection by realizing set-based early identification of outliers or inliers and taking advantage of the "net effect" between expired and new data points.
Abstract: This paper addresses the problem of efficiently detecting outliers from a data stream as old data points expire from and new data points enter the window incrementally. The proposed method is based on a newly discovered characteristic of a data stream that the change in the locations of data points in the data space is typically very insignificant. This observation has led to the finding that the existing distance-based outlier detection algorithms perform excessive unnecessary computations that are repetitive and/or canceling out the effects. Thus, in this paper, we propose a novel set-based approach to detecting outliers, whereby data points at similar locations are grouped and the detection of outliers or inliers is handled at the group level. Specifically, a new algorithm NETS is proposed to achieve a remarkable performance improvement by realizing set-based early identification of outliers or inliers and taking advantage of the "net effect" between expired and new data points. Additionally, NETS is capable of achieving the same efficiency even for a high-dimensional data stream through two-level dimensional filtering. Comprehensive experiments using six real-world data streams show 5 to 25 times faster processing time than state-of-the-art algorithms with comparable memory consumption. We assert that NETS opens a new possibility to real-time data stream outlier detection.
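The distance-based outlier definition that NETS accelerates can be stated in a few lines: a point is an outlier if it has fewer than k neighbors within distance r. The naive baseline below (1-D for brevity) recomputes everything per window, which is exactly the repeated pairwise work NETS avoids via grouping and the "net effect"; the sketch is the baseline only, not NETS.

```python
def distance_outliers(window, r, k):
    """Naive distance-based outlier detection over one window.

    A point is an outlier if fewer than k other points lie within
    distance r of it. O(n^2) per window -- the cost NETS reduces by
    handling groups of nearby points at once (not shown here).
    """
    outliers = []
    for i, p in enumerate(window):
        neighbors = sum(
            1 for j, q in enumerate(window)
            if i != j and abs(p - q) <= r
        )
        if neighbors < k:
            outliers.append(p)
    return outliers
```

When the window slides, most points keep their neighbor counts, and expirations and arrivals in the same region largely cancel; that observation is what makes the set-based incremental approach pay off.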

40 citations


Cited by
01 Jan 2002

9,314 citations

Journal ArticleDOI
TL;DR: A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications; mediation simplifies, abstracts, reduces, merges, and explains data.
Abstract: For single databases, primary hindrances for end-user access are the volume of data that is becoming available, the lack of abstraction, and the need to understand the representation of the data. When information is combined from multiple databases, the major concern is the mismatch encountered in information representation and structure. Intelligent and active use of information requires a class of software modules that mediate between the workstation applications and the databases. It is shown that mediation simplifies, abstracts, reduces, merges, and explains data. A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications. A model of information processing and information system components is described. The mediator architecture, including mediator interfaces, sharing of mediator modules, distribution of mediators, and triggers for knowledge maintenance, are discussed.
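A minimal sketch of the mediator idea described above: a module between applications and heterogeneous sources that uses encoded knowledge (here, a unit conversion) to merge their data into one abstraction. The source names, schema, and conversion are illustrative assumptions, not from the paper.

```python
class Mediator:
    """Toy mediator: merges rows from heterogeneous sources into a
    single normalized view, applying per-source encoded knowledge."""

    def __init__(self, sources, converters):
        self.sources = sources        # name -> callable returning rows
        self.converters = converters  # name -> row-normalizing function

    def query(self):
        merged = []
        for name, fetch in self.sources.items():
            normalize = self.converters.get(name, lambda row: row)
            merged.extend(normalize(row) for row in fetch())
        return merged

# Two hypothetical sources report temperature in different units;
# the mediator's encoded knowledge reconciles the mismatch.
sources = {
    "celsius_db": lambda: [{"city": "Oslo", "temp_c": 5}],
    "fahrenheit_db": lambda: [{"city": "Austin", "temp_f": 77}],
}
converters = {
    "fahrenheit_db": lambda r: {"city": r["city"],
                                "temp_c": (r["temp_f"] - 32) * 5 / 9},
}
m = Mediator(sources, converters)
print(m.query())
```

The application layer sees only the merged, unit-consistent view; knowledge about each source's representation lives in the mediator, which is the layering the paper argues for.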

2,441 citations

Journal ArticleDOI
TL;DR: The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art and aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Abstract: Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to concept drift adaptation for researchers, industry analysts, and practitioners.
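One family of strategies the survey categorizes detects drift from the learner's error stream: if the recent error rate rises well above the long-run rate, the input/target relation has likely changed. A deliberately simple sketch of that signal; the window size and tolerance are arbitrary assumptions, and real detectors (e.g., DDM, ADWIN) use statistically grounded tests instead.

```python
from collections import deque

def drift_alarms(stream_errors, window=30, tol=0.15):
    """Toy drift signal: flag time steps where the error rate over
    the last `window` predictions exceeds the overall error rate by
    more than `tol`. Errors are 0 (correct) or 1 (mistake)."""
    recent = deque(maxlen=window)
    total_err = 0
    alarms = []
    for i, err in enumerate(stream_errors, 1):
        recent.append(err)
        total_err += err
        overall = total_err / i
        recent_rate = sum(recent) / len(recent)
        if i >= window and recent_rate - overall > tol:
            alarms.append(i)
    return alarms
```

On an error stream that is clean for 100 steps and then all wrong (an abrupt drift), the alarms begin shortly after step 100; an adaptive learner would react by retraining or reweighting at that point.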

2,374 citations

Book ChapterDOI
01 Jan 1996
TL;DR: Exploring and identifying structure is even more important for multivariate data than univariate data, given the difficulties in graphically presenting multivariateData and the comparative lack of parametric models to represent it.
Abstract: Exploring and identifying structure is even more important for multivariate data than univariate data, given the difficulties in graphically presenting multivariate data and the comparative lack of parametric models to represent it. Unfortunately, such exploration is also inherently more difficult.

920 citations