Home
/
Authors
/
Lukasz Golab

Author

Lukasz Golab

Other affiliations: AT&T, AT&T Labs, University of Windsor

Bio: Lukasz Golab is an academic researcher from University of Waterloo. The author has contributed to research in topics: Data stream mining & Data warehouse. The author has an hindex of 32, co-authored 148 publications receiving 4573 citations. Previous affiliations of Lukasz Golab include AT&T & AT&T Labs.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2006
2005
2004
2003

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Issues in data stream management

[...]

Lukasz Golab¹, M. Tamer Özsu¹•Institutions (1)

University of Waterloo¹

01 Jun 2003

TL;DR: The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.

...read moreread less

Abstract: Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for on-line analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.

...read moreread less

1,068 citations

Book Chapter•DOI•

Processing sliding window multi-joins in continuous queries over data streams

[...]

Lukasz Golab¹, M. Tamer Özsu¹•Institutions (1)

University of Waterloo¹

09 Sep 2003

TL;DR: It is shown that hash joins are faster than NLJs for performing equi-joins, and that the overall processing cost is influenced by the strategies used to remove expired tuples from the sliding windows.

...read moreread less

Abstract: We study sliding window multi-join processing in continuous queries over data streams. Several algorithms are reported for performing continuous, incremental joins, under the assumption that all the sliding windows fit in main memory. The algorithms include multiway incremental nested loop joins (NLJs) and multi-way incremental hash joins. We also propose join ordering heuristics to minimize the processing cost per unit time. We test a possible implementation of these algorithms and show that, as expected, hash joins are faster than NLJs for performing equi-joins, and that the overall processing cost is influenced by the strategies used to remove expired tuples from the sliding windows.

...read moreread less

298 citations

Journal Article•DOI•

Profiling relational data: a survey

[...]

Ziawasch Abedjan¹, Lukasz Golab², Felix Naumann³•Institutions (3)

Massachusetts Institute of Technology¹, University of Waterloo², Hasso Plattner Institute³

01 Aug 2015

TL;DR: Data profiling as mentioned in this paper is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases, and encompasses a vast array of methods to examine datasets and produce metadata, including statistics such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values.

...read moreread less

Abstract: Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.

...read moreread less

247 citations

Proceedings Article•DOI•

FastFabric: Scaling Hyperledger Fabric to 20,000 Transactions per Second

[...]

Christian Gorenflo¹, Stephen Lee², Lukasz Golab¹, Srinivasan Keshav¹•Institutions (2)

University of Waterloo¹, University of Massachusetts Amherst²

14 May 2019

TL;DR: This paper re-architects a modern permissioned blockchain system, Hyperledger Fabric, to increase transaction throughput from 3,000 to 20,000 transactions per second, and proposes architectural changes that reduce computation and I/O overhead during transaction ordering and validation to greatly improve throughput.

...read moreread less

Abstract: Blockchain technologies are expected to make a significant impact on a variety of industries. However, one issue holding them back is their limited transaction throughput, especially compared to established solutions such as distributed database systems. In this paper, we re-architect a modern permissioned blockchain system, Hyperledger Fabric, to increase transaction throughput from 3,000 to 20,000 transactions per second. We focus on performance bottlenecks beyond the consensus mechanism, and we propose architectural changes that reduce computation and I/O overhead during transaction ordering and validation to greatly improve throughput. Notably, our optimizations are fully plug-and-play and do not require any interface changes to Hyperledger Fabric.

...read moreread less

191 citations

Journal Article•DOI•

On generating near-optimal tableaux for conditional functional dependencies

[...]

Lukasz Golab¹, Howard Karloff¹, Flip Korn¹, Divesh Srivastava¹, Bei Yu² - Show less +1 more•Institutions (2)

AT&T Labs¹, National University of Singapore²

01 Aug 2008

TL;DR: This paper is the first to formally characterize a "good" pattern tableau, based on naturally desirable properties of support, confidence and parsimony, and shows that the problem of generating an optimal tableau for a given FD is NP-complete but can be approximated in polynomial time via a greedy algorithm.

...read moreread less

Abstract: Conditional functional dependencies (CFDs) have recently been proposed as a useful integrity constraint to summarize data semantics and identify data inconsistencies. A CFD augments a functional dependency (FD) with a pattern tableau that defines the context (i.e., the subset of tuples) in which the underlying FD holds. While many aspects of CFDs have been studied, including static analysis and detecting and repairing violations, there has not been prior work on generating pattern tableaux, which is critical to realize the full potential of CFDs.This paper is the first to formally characterize a "good" pattern tableau, based on naturally desirable properties of support, confidence and parsimony. We show that the problem of generating an optimal tableau for a given FD is NP-complete but can be approximated in polynomial time via a greedy algorithm. For large data sets, we propose an "on-demand" algorithm providing the same approximation bound, that outperforms the basic greedy algorithm in running time by an order of magnitude. For ordered attributes, we propose the range tableau as a generalization of a pattern tableau, which can achieve even more parsimony. The effectiveness and efficiency of our techniques are experimentally demonstrated on real data.

...read moreread less

187 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•

Data Mining Practical Machine Learning Tools and Techniques

[...]

อนิรุธ สืบสิงห์

01 Jan 2014-Journal of management science

9,185 citations

Proceedings Article•DOI•

Mining time-changing data streams

[...]

Geoff Hulten¹, Laurie Spencer, Pedro Domingos¹•Institutions (1)

University of Washington¹

26 Aug 2001

TL;DR: An efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner is proposed, called CVFDT, which stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate.

...read moreread less

Abstract: Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.

...read moreread less

1,790 citations

Journal Article•DOI•

WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs.

[...]

Yuxing Liao¹, Jing Wang¹, Eric J. Jaehnig¹, Zhiao Shi¹, Bing Zhang¹ - Show less +1 more•Institutions (1)

Baylor College of Medicine¹

02 Jul 2019-Nucleic Acids Research

TL;DR: In the 2019 update, WebGestalt supports 12 organisms, 342 gene identifiers and 155 175 functional categories, as well as user-uploaded functional databases and has completely redesigned result visualizations and user interfaces to improve user-friendliness and to provide multiple types of interactive and publication-ready figures.

...read moreread less

Abstract: WebGestalt is a popular tool for the interpretation of gene lists derived from large scale -omics studies. In the 2019 update, WebGestalt supports 12 organisms, 342 gene identifiers and 155 175 functional categories, as well as user-uploaded functional databases. To address the growing and unique need for phosphoproteomics data interpretation, we have implemented phosphosite set analysis to identify important kinases from phosphoproteomics data. We have completely redesigned result visualizations and user interfaces to improve user-friendliness and to provide multiple types of interactive and publication-ready figures. To facilitate comprehension of the enrichment results, we have implemented two methods to reduce redundancy between enriched gene sets. We introduced a web API for other applications to get data programmatically from the WebGestalt server or pass data to WebGestalt for analysis. We also wrapped the core computation into an R package called WebGestaltR for users to perform analysis locally or in third party workflows. WebGestalt can be freely accessed at http://www.webgestalt.org.

...read moreread less

1,789 citations

Journal Article•DOI•

Data streams: algorithms and applications

[...]

S. Muthukrishnan¹•Institutions (1)

Rutgers University¹

01 Aug 2005-Foundations and Trends in Theoretical Computer Science

TL;DR: Data Streams: Algorithms and Applications surveys the emerging area of algorithms for processing data streams and associated applications, which rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity.

...read moreread less

Abstract: In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [1].

...read moreread less

1,598 citations

Book•

Data Streams: Algorithms and Applications

[...]

S. Muthukrishnan¹•Institutions (1)

Rutgers University¹

01 Jan 2005

TL;DR: In this paper, the authors present a survey of basic mathematical foundations for data streaming systems, including basic mathematical ideas, basic algorithms, and basic algorithms and algorithms for data stream processing.

...read moreread less

Abstract: 1 Introduction 2 Map 3 The Data Stream Phenomenon 4 Data Streaming: Formal Aspects 5 Foundations: Basic Mathematical Ideas 6 Foundations: Basic Algorithmic Techniques 7 Foundations: Summary 8 Streaming Systems 9 New Directions 10 Historic Notes 11 Concluding Remarks Acknowledgements References

...read moreread less

1,506 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse