Author

Christian Sohler

Bio: Christian Sohler is an academic researcher from the University of Cologne. He has contributed to research in the areas of property testing and approximation algorithms. He has an h-index of 36 and has co-authored 142 publications receiving 5,104 citations. Previous affiliations of Christian Sohler include the Technical University of Dortmund and the University of Paderborn.


Papers
Proceedings ArticleDOI
06 Jan 2013
TL;DR: Combining the authors' coresets with the merge-and-reduce approach yields embarrassingly parallel streaming algorithms for problems such as k-means, PCA, and projective clustering; for other cost functions, a simple recursive construction produces small coresets.
Abstract: We prove that the sum of squared distances from the n rows of an n × d matrix A to any compact set spanned by k vectors in R^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε^2)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the first O(k/ε^2) right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^O(jk) for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size which is linearly or even exponentially dependent on d, which makes them useless when d ~ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA, and projective clustering. These algorithms use update time per point and memory that are polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances, we suggest a simple recursive coreset construction that produces small coresets.

353 citations
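A minimal Python sketch (assuming numpy and scikit-learn) of the projection idea described in the abstract above: the rows of a matrix A are projected onto their first m ≈ O(k/ε^2) right singular vectors and k-means is run in that reduced space. The names A, m, eps and the synthetic data are illustrative, not taken from the paper.

```python
# Hypothetical sketch: keep only the first m ~ O(k/eps^2) right singular
# vectors of A and run k-means in that low-dimensional space.
import numpy as np
from sklearn.cluster import KMeans

def project_to_top_singular_vectors(A, m):
    """Project the rows of A onto the span of the first m right singular
    vectors (principal components) of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(S) Vt
    return A @ Vt[:m].T                                # n x m projected rows

def kmeans_cost(X, centers):
    """Sum of squared distances from each row of X to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k, eps = 2000, 500, 5, 0.5
    # Synthetic data: k shifted Gaussian clusters in d dimensions.
    A = rng.normal(size=(n, d)) + rng.integers(0, k, size=n)[:, None]

    m = int(np.ceil(k / eps ** 2))          # reduced dimension, O(k/eps^2)
    A_proj = project_to_top_singular_vectors(A, m)

    # Cluster in the reduced space; per the abstract, an optimal clustering
    # of the projected rows, together with a data-dependent constant,
    # (1+eps)-approximates the optimal k-means cost of the original rows.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A_proj)
    print("k-means cost in projected space:",
          kmeans_cost(A_proj, km.cluster_centers_))
```

The snippet only illustrates the mechanics of the reduction; the approximation guarantee itself is the subject of the paper.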

Posted Content
TL;DR: In this paper, the authors developed and analyzed a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set.
Abstract: We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set, or computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets (this property has recently also been called composability, see Indyk et al., PODS 2014). It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation, the coreset sizes are also independent of the number of input points. Our method is based on projecting the points on a low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis, and subspace clustering. The main conceptual contribution is a new coreset definition that allows one to charge costs that appear for every solution to an additive constant.

318 citations
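The mergeability (composability) property described above is what makes the standard merge-and-reduce technique applicable. Below is a minimal Python sketch of that streaming scheme, assuming numpy; the build_coreset placeholder is a plain weighted subsample and stands in for the paper's projection-based, dimension-independent construction. The names MergeAndReduce, chunk_size, and coreset_size are assumptions made for this sketch.

```python
# Merge-and-reduce sketch: buffer a chunk of raw points, reduce it to a
# coreset, and repeatedly merge equal-level coresets (like carries in a
# binary counter) so only O(log n) buckets are kept at any time.
import numpy as np

def build_coreset(points, weights, size, rng):
    """Placeholder reduction step: importance-weighted random subsample that
    preserves total weight in expectation."""
    if len(points) <= size:
        return points, weights
    probs = weights / weights.sum()
    idx = rng.choice(len(points), size=size, replace=True, p=probs)
    return points[idx], np.full(size, weights.sum() / size)

class MergeAndReduce:
    def __init__(self, chunk_size, coreset_size, seed=0):
        self.chunk_size = chunk_size
        self.coreset_size = coreset_size
        self.rng = np.random.default_rng(seed)
        self.buffer = []      # raw points of the current chunk
        self.buckets = {}     # level -> (points, weights)

    def insert(self, point):
        self.buffer.append(point)
        if len(self.buffer) == self.chunk_size:
            pts = np.asarray(self.buffer, dtype=float)
            self.buffer = []
            bucket = build_coreset(pts, np.ones(len(pts)),
                                   self.coreset_size, self.rng)
            self._insert_bucket(0, bucket)

    def _insert_bucket(self, level, bucket):
        while level in self.buckets:
            other = self.buckets.pop(level)
            merged_pts = np.vstack([bucket[0], other[0]])
            merged_w = np.concatenate([bucket[1], other[1]])
            # Composability: a coreset of the union of two coresets is a
            # (slightly weaker) coreset of the union of the original sets.
            bucket = build_coreset(merged_pts, merged_w,
                                   self.coreset_size, self.rng)
            level += 1
        self.buckets[level] = bucket

    def summary(self):
        """Weighted point set summarizing everything inserted so far."""
        parts = list(self.buckets.values())
        if self.buffer:
            pts = np.asarray(self.buffer, dtype=float)
            parts.append((pts, np.ones(len(pts))))
        points = np.vstack([p for p, _ in parts])
        weights = np.concatenate([w for _, w in parts])
        return points, weights
```

A k-means or PCA routine would then be run on the output of summary() in place of the full stream.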

Journal ArticleDOI
TL;DR: In this article, a new k-means clustering algorithm for data streams of points from a Euclidean space is proposed; it computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07).
Abstract: We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute the small sample, we propose two new techniques. First, we use an adaptive, nonuniform sampling approach similar to the k-means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a small dependency on the dimensionality of the data. Second, we propose a new data structure, which we call a coreset tree. The use of these coreset trees significantly speeds up the time necessary for the adaptive, nonuniform sampling during our coreset construction. We compare our algorithm experimentally with two well-known streaming implementations: BIRCH [Zhang et al. 1997] and StreamLS [Guha et al. 2003]. In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and significantly better than BIRCH (up to a factor of 2). Moreover, BIRCH requires significant effort to tune its parameters. In terms of running time, our algorithm is slower than BIRCH. Comparing the running time with StreamLS, it turns out that our algorithm scales much better with an increasing number of centers. We conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS, in particular if the number of cluster centers is large. We also give a theoretical justification of our approach by proving that our sample set is a small coreset in low-dimensional spaces.

285 citations
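A minimal Python sketch (assuming numpy) of the adaptive, nonuniform sampling idea described above: coreset points are drawn with probability proportional to their squared distance from the points chosen so far, in the style of k-means++ seeding, and each chosen point is weighted by the number of inputs nearest to it. This is an in-memory illustration only; the coreset tree that accelerates this sampling is sketched under the conference version of the paper further below. Function and variable names are illustrative, not the paper's.

```python
# D^2-style adaptive sampling: pick points far from the current sample and
# weight them by the size of their nearest-neighbor "cell".
import numpy as np

def d2_sample_coreset(points, m, seed=0):
    """Return (coreset_points, weights) with at most m weighted points."""
    rng = np.random.default_rng(seed)
    n = len(points)
    seeds = [int(rng.integers(n))]                    # first point uniformly
    dist2 = ((points - points[seeds[0]]) ** 2).sum(axis=1)
    for _ in range(m - 1):
        total = dist2.sum()
        if total == 0:                                # all points covered
            break
        s = int(rng.choice(n, p=dist2 / total))       # D^2-weighted pick
        seeds.append(s)
        dist2 = np.minimum(dist2, ((points - points[s]) ** 2).sum(axis=1))
    seed_pts = points[seeds]
    # Weight each chosen point by the number of inputs closest to it.
    assign = ((points[:, None, :] - seed_pts[None, :, :]) ** 2) \
        .sum(axis=2).argmin(axis=1)
    weights = np.bincount(assign, minlength=len(seeds)).astype(float)
    return seed_pts, weights
```

The weighted output would then be fed to a (weighted) k-means++ solver in place of the full stream.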

Proceedings ArticleDOI
26 Jun 2006
TL;DR: Two space-bounded random sampling algorithms are presented that approximate the number of triangles in an undirected graph given as a stream of edges; they provide a basic tool for analyzing the structure of large graphs.
Abstract: We present two space-bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm does not make any assumptions on the order of edges in the stream. It uses space that is inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space that is inversely related to the ratio between the number of triangles and the number of paths of length 2 in the graph, and expected update time O(log|V|⋅(1+s⋅|V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20, 8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size, and so they provide a basic tool to analyze the structure of large graphs. They have many applications, for example in the discovery of Web communities, the computation of clustering and transitivity coefficients, and the discovery of frequent patterns in large graphs. We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we ran experiments with parameter s = 1,000, 10,000, 100,000, and 1,000,000 to evaluate running time and approximation guarantee. Both algorithms appear to be time efficient for these sample sizes. The approximation quality of the first algorithm varied significantly, and even for s = 1,000,000 we had more than 10% deviation for more than half of the instances. The second algorithm performed much better, and even for s = 10,000 we had an average deviation of less than 6% (taken over all but the largest instance, for which we could not compute the number of triangles exactly).

277 citations
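A hedged Python sketch of a sampling-based triangle estimator in the spirit of the first algorithm above (arbitrary edge order): each of s independent samplers reservoir-samples one edge, picks a random third vertex, and watches the remainder of the stream for the two edges that would close the triangle. The function name, parameters, and the exact scaling of the estimate are illustrative; the paper's estimator and analysis may differ in detail.

```python
# One-pass, space-bounded triangle estimation by edge/vertex sampling.
import random

def estimate_triangles(edge_stream, num_vertices, s, seed=0):
    rng = random.Random(seed)
    samplers = [{"edge": None, "v": None, "seen": set()} for _ in range(s)]
    m = 0
    for a, b in edge_stream:
        m += 1
        for st in samplers:
            if rng.random() < 1.0 / m:
                # Reservoir step: keep the current edge with probability 1/m
                # and pick a fresh random third vertex.
                v = rng.randrange(num_vertices)
                while v == a or v == b:
                    v = rng.randrange(num_vertices)
                st["edge"], st["v"], st["seen"] = (a, b), v, set()
            elif st["edge"] is not None:
                ea, eb = st["edge"]
                if {a, b} == {ea, st["v"]} or {a, b} == {eb, st["v"]}:
                    st["seen"].add(frozenset((a, b)))
    hits = sum(1 for st in samplers if len(st["seen"]) == 2)
    # A fixed triangle is detected with probability about 1/(m*(n-2)), so the
    # empirical hit rate is scaled back up accordingly.
    return (hits / s) * m * (num_vertices - 2)
```

As the abstract notes, the accuracy of such an estimator depends on how frequent triangles are relative to other edge/vertex combinations, which is why the required number of samplers s depends on the structure of the graph.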

Proceedings Article
16 Jan 2010
TL;DR: A new k-means clustering algorithm for data streams of points from a Euclidean space that provides a good alternative to BIRCH and StreamLS, in particular, if the number of cluster centers is large.
Abstract: We develop a new k-means clustering algorithm for data streams, which we call StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm [1]. To compute the small sample, we propose two new techniques. First, we use a non-uniform sampling approach similar to the k-means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a low dependency on the dimensionality of the data. Second, we propose a new data structure, which we call a coreset tree. The use of these coreset trees significantly speeds up the time necessary for the non-uniform sampling during our coreset construction. We compare our algorithm experimentally with two well-known streaming implementations (BIRCH [16] and StreamLS [4, 9]). In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and significantly better than BIRCH (up to a factor of 2). In terms of running time, our algorithm is slower than BIRCH. Comparing the running time with StreamLS, it turns out that our algorithm scales much better with an increasing number of centers. We conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS, in particular if the number of cluster centers is large. We also give a theoretical justification of our approach by proving that our sample set is a small coreset in low-dimensional spaces.

257 citations
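An illustrative in-memory Python version (assuming numpy) of a coreset-tree-style construction for the data structure mentioned above: each leaf keeps a subset of points, a representative, and the sum of squared distances to that representative; a new coreset point is drawn by choosing a leaf proportionally to its cost, sampling inside it proportionally to squared distance, and splitting the leaf. The flat list of leaves below stands in for the tree descent, and all details are simplified assumptions rather than the paper's exact data structure.

```python
# Coreset-tree-style construction: leaves hold (points, representative, cost)
# and are repeatedly split to produce weighted representatives.
import numpy as np

class _Leaf:
    def __init__(self, points, rep):
        self.points = points
        self.rep = rep
        self.dist2 = ((points - rep) ** 2).sum(axis=1)
        self.cost = self.dist2.sum()

def coreset_via_tree(points, m, seed=0):
    """Return (coreset_points, weights) with m weighted representatives."""
    rng = np.random.default_rng(seed)
    first = points[rng.integers(len(points))]
    leaves = [_Leaf(points, first)]
    for _ in range(m - 1):
        costs = np.array([lf.cost for lf in leaves])
        if costs.sum() == 0:                       # degenerate input
            break
        # Choose a leaf proportionally to its cost (the "tree descent").
        leaf = leaves[int(rng.choice(len(leaves), p=costs / costs.sum()))]
        # Inside the leaf, pick the new representative proportionally to
        # squared distance from the current representative.
        idx = int(rng.choice(len(leaf.points), p=leaf.dist2 / leaf.cost))
        new_rep = leaf.points[idx]
        # Split: points closer to the new representative move to a new leaf.
        to_new = (((leaf.points - new_rep) ** 2).sum(axis=1)
                  < ((leaf.points - leaf.rep) ** 2).sum(axis=1))
        leaves.remove(leaf)
        leaves.append(_Leaf(leaf.points[~to_new], leaf.rep))
        leaves.append(_Leaf(leaf.points[to_new], new_rep))
    coreset = np.array([lf.rep for lf in leaves])
    weights = np.array([len(lf.points) for lf in leaves], dtype=float)
    return coreset, weights
```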


Cited by
Journal ArticleDOI
TL;DR: Data Streams: Algorithms and Applications surveys the emerging area of algorithms for processing data streams and their applications; the underlying methods rely on metric embeddings, pseudo-random computations, sparse approximation theory, and communication complexity.
Abstract: In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or a few passes over the data, space less than linear in the input size, or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory, and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams, and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking, and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [1].

1,598 citations

Book
01 Jan 2005
TL;DR: The book surveys data streaming, from the data stream phenomenon and its formal aspects through foundational mathematical ideas and algorithmic techniques to streaming systems and new research directions.
Abstract: Contents: 1. Introduction; 2. Map; 3. The Data Stream Phenomenon; 4. Data Streaming: Formal Aspects; 5. Foundations: Basic Mathematical Ideas; 6. Foundations: Basic Algorithmic Techniques; 7. Foundations: Summary; 8. Streaming Systems; 9. New Directions; 10. Historic Notes; 11. Concluding Remarks; Acknowledgements; References

1,506 citations

Book
02 Jan 1991

1,377 citations

Journal ArticleDOI
TL;DR: A survey of the visual place recognition research landscape is presented, introducing the concepts behind place recognition, how a “place” is defined in a robotics context, and the major components of a place recognition system.
Abstract: Visual place recognition is a challenging problem due to the vast range of ways in which the appearance of real-world places can vary. In recent years, improvements in visual sensing capabilities, an ever-increasing focus on long-term mobile robot autonomy, and the ability to draw on state-of-the-art research in other disciplines—particularly recognition in computer vision and animal navigation in neuroscience—have all contributed to significant advances in visual place recognition systems. This paper presents a survey of the visual place recognition research landscape. We start by introducing the concepts behind place recognition—the role of place recognition in the animal kingdom, how a “place” is defined in a robotics context, and the major components of a place recognition system. Long-term robot operations have revealed that changing appearance can be a significant factor in visual place recognition failure; therefore, we discuss how place recognition solutions can implicitly or explicitly account for appearance change within the environment. Finally, we close with a discussion on the future of visual place recognition, in particular with respect to the rapid advances being made in the related fields of deep learning, semantic scene understanding, and video description.

933 citations

Journal Article
TL;DR: In this paper, the authors consider the question of determining whether a function f has property P or is ε-far from any function with property P; in some cases, the testing algorithm is also allowed to query f on instances of its choice.
Abstract: In this paper, we consider the question of determining whether a function f has property P or is ε-far from any function with property P. A property testing algorithm is given a sample of the value of f on instances drawn according to some distribution. In some cases, it is also allowed to query f on instances of its choice. We study this question for different properties and establish some connections to problems in learning theory and approximation. In particular, we focus our attention on testing graph properties. Given access to a graph G in the form of being able to query whether an edge exists or not between a pair of vertices, we devise algorithms to test whether the underlying graph has properties such as being bipartite, being k-colorable, or containing a ρ-clique (a clique of density ρ with respect to the vertex set). Our graph property testing algorithms are probabilistic and make assertions that are correct with high probability, while making a number of queries that is independent of the size of the graph. Moreover, the property testing algorithms can be used to efficiently (i.e., in time linear in the number of vertices) construct partitions of the graph that correspond to the property being tested, if it holds for the input graph.

870 citations
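A minimal Python sketch of a dense-graph property tester of the kind described above: to test bipartiteness with adjacency (pair) queries, sample a small set of vertices, query all pairs among them, and accept iff the induced subgraph is bipartite. The edge_query callback and the concrete sample-size formula are illustrative assumptions, not the paper's exact bound.

```python
# Sample-and-check bipartiteness tester in the dense-graph (adjacency-query)
# model. Query count depends only on eps, not on the graph size.
import math
import random

def _is_bipartite(adj):
    """2-color a small graph given as {vertex: set(neighbors)}."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True

def test_bipartite(n, edge_query, eps, seed=0):
    """edge_query(u, v) -> bool answers adjacency queries on an n-vertex
    graph. Always accepts bipartite graphs; with a suitable poly(1/eps)
    sample size it rejects, with constant probability, graphs that are
    eps-far from bipartite."""
    rng = random.Random(seed)
    s = min(n, math.ceil(4 / eps ** 2 * math.log(2 / eps)))  # heuristic size
    sample = rng.sample(range(n), s)
    adj = {u: set() for u in sample}
    for i, u in enumerate(sample):
        for v in sample[i + 1:]:
            if edge_query(u, v):
                adj[u].add(v)
                adj[v].add(u)
    return _is_bipartite(adj)
```

The one-sided error is typical of such testers: a bipartite graph can never be rejected, since every induced subgraph of a bipartite graph is bipartite.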