Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior

Ville Satopää, Jeannie Albrecht, David Irwin, and Barath Raghavan
Williams College, Williamstown, MA
University of Massachusetts Amherst, Amherst, MA
International Computer Science Institute, Berkeley, CA
Abstract—Computer systems often reach a point at which the
relative cost to increase some tunable parameter is no longer
worth the corresponding performance benefit. These “knees” typ-
ically represent beneficial points that system designers have long
selected to best balance inherent trade-offs. While prior work
largely uses ad hoc, system-specific approaches to detect knees,
we present Kneedle, a general approach to online and offline
knee detection that is applicable to a wide range of systems.
We define a knee formally for continuous functions using the
mathematical concept of curvature and compare our definition
against alternatives. We then evaluate Kneedle’s accuracy against
existing algorithms on both synthetic and real data sets, and
evaluate its performance in two different applications.
I. INTRODUCTION
Selecting the “right” operating point for a given system is
often thought of as an art form, since the direct and indirect
costs and benefits of changing different system parameters
are difficult or even impossible to quantify. For example, an
important operating point in a large MapReduce job occurs
when the job should no longer wait for “slow” tasks to finish,
but instead speculatively re-execute work on other nodes in
hopes of finishing the job sooner [1]. Since MapReduce’s goal
is to finish all tasks as fast as possible, it must decide when the
cost, in terms of a job’s running time and cluster utilization,
is worth the corresponding performance benefit, in terms of
task completion percentage. Congestion-responsive network
protocols face a related challenge when setting a sending rate:
a protocol must decide a rate that maximizes performance
without exceeding its fair share and causing congestion.
In prior work, the issue has frequently been couched as
identifying one or more “knees”—operating points, based on
recent trends, where the perceived cost to alter a system param-
eter is no longer worth the expected performance benefit. For
MapReduce, triggering speculative execution after observing
a knee in the task completion percentage ensures that the
system re-executes tasks that are significantly slower than
other similar tasks that have finished execution. In the case
of a network protocol, successive increases to the sending
rate should cease if delay signals congestion by increasing
steeply, forming a knee. However, while the problem of
knee detection—finding “good” operating points in system
behavior—seems straightforward, to the best of our knowledge
there exists neither an accepted definition of a knee nor a
general systematic approach for detecting one.
Numerous researchers in widely disparate areas frequently
encounter knee detection problems similar to those we de-
scribe [1], [2], [3], [4], [5]. In these systems, researchers
either use ad hoc or system-specific approaches to detect
knees, or defer the problem to future work. While a finely-
crafted system-specific approach will perform better than a
general knee detection approach, a designer may not take
the time to design one. Thus, our aim is not to improve
or optimize a specific system or protocol, but to provide
system designers a general tool for improving the parts of
their system they generally do not take the time to optimize.
In network protocol and system design, rules-of-thumb often
serve researchers and operators well in the absence of an
optimal solution. We believe that a tool for knee detection
adds to their problem solving arsenal. Our hypothesis is that
a knee detection algorithm that does not require tuning for a
specific system or operational characteristics is applicable in a
wide range of settings where developers do not take the time
to design, test, and optimize a system-specific algorithm.
II. DEFINING AND DETECTING KNEES
While the notion of a knee is well-known, we are not
aware of a broadly accepted definition in prior literature.
The confusion stems from the fact that researchers, in many
cases unknowingly, use knees as a substitute for a more
comprehensive cost-benefit analysis that is either difficult
or impossible to perform. Performing a direct cost-benefit
analysis is often complex, since it is inherently system-,
platform-, and workload-specific. Further, many systems are
not predictable due to volatile operating conditions.
For example, unpredictable failure rates in large clusters,
which may change over time, are the root cause of stragglers in
MapReduce jobs [1]. Likewise, since multiple flows share net-
work links in the Internet, network protocols cannot predict in
advance the rapidly changing level of TCP-friendly bandwidth
available, but must instead continuously adapt to the indirect
signals of packet loss and delay [6]. In lieu of a complex
system-specific analysis, operators tend to select operating
points, or knees, that are “good enough” by observing where
performance improvements start to level off as a function of
one or more tunable system parameters. Note that we focus on
knee detection for complex systems that change their behavior
according to volatile, and potentially unpredictable, operating
conditions, and not for simple systems that permit standard
closed-form models, e.g., M/M/1 queues [7].

Fig. 1: CDF of a standard Gaussian distribution with mean = 0
and standard deviation = 1 (top panel), with the associated curvature K
plotted against x (bottom panel). Vertical bar indicates point of maximum
curvature. The inflection point of this curve occurs at x = 0.
A. Knee Definition
The difficulty with defining a knee formally is that “good
enough” in one system may not be “good enough” in another.
Since knees only serve as an approximation, operators interpret
them differently in different situations. Thus, knee detection is
an inherently heuristic process. However, to design a general
application-independent knee detection algorithm, we require
a consistent definition applicable to any system. In this work,
as in [8], we use the mathematical definition of curvature for
a continuous function as the basis for our knee definition. For
any continuous function f, there exists a standard closed-form
$K_f(x)$ that defines the curvature of f at any point as a function
of its first and second derivative:
$$K_f(x) = \frac{f''(x)}{\left(1 + f'(x)^2\right)^{1.5}}$$
The point of maximum curvature is well-matched to the ad
hoc methods operators use to select a knee, since curvature is
a mathematical measure of how much a function differs from
a straight line. As a result, maximum curvature captures the
leveling off effect operators use to identify knees. Importantly,
unlike other common definitions, curvature is application-
independent and (i) does not depend on the relationship
between system parameters and performance, or (ii) require
setting system-specific thresholds. Note that knee detection
does depend on the selection of proper adjustable system
parameters and performance metrics, as we show for our
examples in Section V.
It is important to realize why a knee definition based only on
the first derivative is not enough to identify a knee. Consider
the simple example in Figure 1, where the y-axis represents
some performance metric, the x-axis represents a tunable
system parameter, and the vertical bar represents the point
of maximum curvature. The maximum of the first derivative
is the inflection point of the curve, which occurs at x = 0
in Figure 1. The inflection point is not representative of the
knee since performance continues to improve significantly
beyond it. Instead, the inflection point only captures where the
rate of performance increase reaches a maximum. In contrast,
the curvature definition precisely matches the concept of a
knee. [8] includes a survey of a range of other knee defini-
tions from prior work, primarily in the context of clustering
algorithms [7], [9], [10], [11], [12]. We discuss alternative
definitions below.
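As a rough numerical check of the contrast between the inflection point and the knee, the following Python sketch evaluates $K_f(x)$ for a Gaussian CDF on a grid. The function names, the ±4σ grid, and the step count are illustrative assumptions and are not taken from the original work. Because the CDF has negative concavity to the right of the mean, the sketch takes the knee to be the point of most negative signed curvature.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """First derivative f'(x) of the Gaussian CDF f, i.e. the Gaussian PDF."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_cdf_curvature(x, mu=0.0, sigma=1.0):
    """Signed curvature K_f(x) = f''(x) / (1 + f'(x)^2)^1.5 of the Gaussian CDF."""
    first = gaussian_pdf(x, mu, sigma)
    second = -(x - mu) / (sigma * sigma) * first  # f''(x), derivative of the PDF
    return second / (1.0 + first * first) ** 1.5


def inflection_and_knee(mu=0.0, sigma=1.0, steps=8001):
    """Grid search (assumed range: mu +/- 4 sigma) for both candidate points."""
    lo, hi = mu - 4.0 * sigma, mu + 4.0 * sigma
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    # Inflection point: where the first derivative is largest.
    inflection = max(xs, key=lambda x: gaussian_pdf(x, mu, sigma))
    # Knee: largest curvature magnitude on the concave side (most negative K).
    knee = min(xs, key=lambda x: gaussian_cdf_curvature(x, mu, sigma))
    return inflection, knee
```

For the standard Gaussian of Figure 1, this sketch places the inflection point at x = 0 and the knee roughly one standard deviation above the mean, where the CDF visibly begins to level off.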
While curvature is well-defined for continuous functions,
it is not well-defined for discrete data sets. In the discrete
case, we could determine curvature by fitting a continuous
function to the data and using the function’s point of maximum
curvature. However, fitting a continuous function to a set of
arbitrary data points is difficult, especially if the data is noisy.
Further, determining the maximum curvature of the resulting
function may not be sufficient, since the curvature at any point
of a function is dependent on the entire function, including
points not in the relevant data set. Thus, maximum curvature
may fall outside the data’s valid range or be one of the set’s
end-points. Since an approximation of curvature requires at
least three points—the minimum number of points that define
a circle—end-points in a data set do not have curvature values
by definition. Thus, using the closed-form formulation as a
direct basis for knee detection on discrete data is not possible.
B. Knee Detection in Discrete Data Sets
Researchers have proposed multiple previous approaches
to detecting knees in discrete data. Before formulating our
curvature-inspired algorithm in Section III, we present two
existing approaches—Angle-based and EWMA—from prior
research for comparison, as well as another approach we
formulate based on Menger curvature, a direct discrete equiv-
alent of continuous curvature. Note that the Angle-based and
Menger algorithms are designed specifically for offline cases,
where the entire data set is known in advance, while EWMA is
designed to detect knees online as data points become known.
Angle-based. The geometric “angle-based” approach of
Zhao et al. [13] is an extension of the L-method for detecting
knees in clustering applications [8]. The Angle-based approach
first finds the local minima of the successive differences
$(y_1 + y_3 - 2y_2)$ for each consecutive triple of points. For
example, consider a straight line that goes through the consecutive
points $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$. Assuming x-values
are evenly spaced, then $y_1 + y_3 - 2y_2 = 0$ for any
straight segment. However, if these three points form a knee,
$(x_2, y_2)$ must be above the straight line that goes through
$(x_1, y_1)$ and $(x_3, y_3)$. In this case $y_1 + y_3 - 2y_2 < 0$. “Sharper”
knees have more negative difference values.
Next, since successive differences are local measures and
ignore the overall trend of the curve, the algorithm combines
the differences with an angle value. After obtaining the local
minima of the successive differences, the algorithm sorts the
minima, and, starting from the point with the largest difference
value, calculates the two angles formed by the y-axis and the
line going through each successive pair of points associated
with the corresponding difference value. The sum of these
two angles is the angle value. Knees are detected at the local
maxima of these angle values.
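The first stage of the Angle-based approach, the successive differences and their local minima, can be sketched in Python as follows. This is only an illustration under stated assumptions: x-values are evenly spaced, the function names are hypothetical, boundary differences are allowed to count as local minima, and the subsequent angle computation is omitted because the description above does not fully specify it.

```python
def successive_differences(ys):
    """Return y[i-1] + y[i+1] - 2*y[i] for every interior index i."""
    return [ys[i - 1] + ys[i + 1] - 2.0 * ys[i] for i in range(1, len(ys) - 1)]


def difference_local_minima(ys):
    """Indices (into ys) whose successive difference is a strict local minimum."""
    diffs = successive_differences(ys)
    minima = []
    for k, d in enumerate(diffs):
        left = diffs[k - 1] if k > 0 else float("inf")
        right = diffs[k + 1] if k < len(diffs) - 1 else float("inf")
        if d < left and d < right:
            minima.append(k + 1)  # diffs[k] belongs to the point ys[k + 1]
    return minima
```

A straight segment yields differences of zero, so only points that sit above the line through their neighbours produce the negative values that this stage looks for.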
Fig. 2: Kneedle algorithm for online knee detection. (a) depicts the smoothed and normalized data, with dashed bars indicating the
perpendicular distance from y = x with the maximum distance indicated. (b) shows the same data, but this time the dashed bars are
rotated 45 degrees. The magnitude of these bars corresponds to the difference values used in Kneedle. (c) shows the plot of these difference
values and the corresponding threshold values (with S = 1). The knee is found at x = 0.22 and is detected after receiving the point x = 0.55.
Menger Curvature. While curvature is not well-defined
for arbitrary discrete data sets, Menger curvature defines the
curvature for three discrete points as the curvature of the
circle circumscribed about those points [14]. Thus, we define
the Menger curvature for each point $p_i = (x_i, y_i)$ in an
n point data set as being equal to 1/r for the circle of
radius r circumscribed about $p_1$, $p_i$, and $p_n$. The curvature
of the circumscribed circle is straightforward to compute and
is simply a function of the lengths of the sides of the triangle
with the points as vertices. However, as we show in Section IV,
while Menger closely approximates curvature for offline data
drawn from ideal continuous functions, it does not work well
for the noisy online data sets typical of computing systems.
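To make the circumscribed-circle computation concrete, the sketch below evaluates the Menger curvature of three points using the standard identity 1/r = 4A / (abc), where a, b, and c are the side lengths of the triangle and A is its area. The function name is hypothetical, and the sketch deliberately leaves open which three points of a data set are passed in.

```python
import math


def menger_curvature(p, q, r):
    """Curvature 1/R of the circle through 2-D points p, q, r (0.0 if collinear)."""
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(p, r)
    if a == 0.0 or b == 0.0 or c == 0.0:
        raise ValueError("the three points must be distinct")
    # The cross product gives twice the signed triangle area, so 4A = 2 * |cross|.
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return 2.0 * abs(cross) / (a * b * c)
```

Three points on the unit circle give a curvature of 1, while collinear points give 0, matching the intuition that a straight line has no curvature.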
EWMA. The EWMA approach uses techniques similar
to those employed by Bollinger Bands [15] and Geometric
Moving Average algorithms for change detection [16]. The
algorithm that we use is based on the methodology described
by Albrecht et al. in their work on partial barriers [3], which
derives from previous work on MONET [17]. EWMA is an
online algorithm that uses two exponentially weighted moving
averages. The first EWMA, called arr, is used to smooth
the input data, which is viewed as host arrival times. The
second EWMA, arrvar, keeps track of the average deviation
from arr, and is an estimate of the variance in arrival times.
Finally, these two values are used to compute a maximum wait
threshold of arr + 4 · arrvar, which represents the maximum
amount of time to wait for the next point to arrive. If the
point arrives after this threshold, or the threshold is reached
without seeing the next arrival, EWMA declares a knee. One
important attribute of this algorithm is that EWMA does not
directly report where the knee point is—it only determines if
a knee has been passed. As a result, EWMA is only applicable
in an online setting.
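A minimal sketch of the EWMA detector, replayed over a recorded list of arrival times, might look as follows. Several details are assumptions made for illustration rather than values from the work cited above: the averages are applied to inter-arrival gaps, the gains alpha = 1/8 and beta = 1/4 and the initial deviation of half the first gap are borrowed from common TCP round-trip-time estimation practice, and the function name is hypothetical.

```python
def ewma_knee_index(arrival_times, alpha=0.125, beta=0.25, k=4.0):
    """Return the index of the first arrival that exceeds arr + k * arrvar, else None."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    if len(gaps) < 2:
        return None
    arr = gaps[0]           # smoothed inter-arrival gap (assumed initialisation)
    arrvar = gaps[0] / 2.0  # smoothed deviation from arr (assumed initialisation)
    for i, gap in enumerate(gaps[1:], start=2):
        if gap > arr + k * arrvar:
            return i  # arrival i came later than the maximum wait threshold
        # Update the deviation before the mean so it uses the previous arr.
        arrvar = (1.0 - beta) * arrvar + beta * abs(gap - arr)
        arr = (1.0 - alpha) * arr + alpha * gap
    return None
```

Consistent with the description above, the sketch only signals that a knee has been passed; it does not estimate where the knee lies. A live deployment would also need a timer to handle the case where the threshold expires before any further arrival is seen.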
III. KNEEDLE ALGORITHM
Kneedle is based on the notion that the points of maximum
curvature in a data set—the knees—are approximately the set
of points in a curve that are local maxima if the curve is rotated
θ degrees clockwise about $(x_{min}, y_{min})$ through the line formed
by the points $(x_{min}, y_{min})$ and $(x_{max}, y_{max})$. We choose this line
because we want to preserve the overall behavior of the data
set—using a line of best fit, for example, risks cutting off the
end points due to a higher concentration of points in the middle
of the curve. After rotating about this line, the local maxima—
and thus knees—are the points at which the curve differs most
from the straight line segment connecting the first and last data
point, thereby approximating the point of maximum curvature
for a discrete set of points. Since maximum curvature is an
inherent measure of the point where a continuous function
differs most from a straight line, Kneedle uses a literal measure
of the point that differs most from the straight line connecting
the set’s end-points.
Figure 2 depicts how Kneedle works for data points drawn
from the curve y = −1/x + 5 where x-values are between 0
and 1. Note that we assume that the curves under consideration
have negative concavity. For curves with consistently positive
concavity (e.g., forming “elbows” rather than knees) it is trivial
to invert the graph by replacing each $y_i$ with $y_{max} - y_i$ and $x_i$
with $x_{max} - x_i$.
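The inversion for elbow-shaped curves can be sketched in a few lines of Python. The function name is hypothetical, and the re-sorting step is an added assumption: flipping the x-values reverses the order of the points, so the sketch sorts them back into increasing x before further processing.

```python
def invert_elbow(xs, ys):
    """Map each (x, y) to (x_max - x, y_max - y), sorted by the new x."""
    x_max, y_max = max(xs), max(ys)
    flipped = sorted((x_max - x, y_max - y) for x, y in zip(xs, ys))
    return [p[0] for p in flipped], [p[1] for p in flipped]
```

A knee found at position x' in the inverted data corresponds to the original position x_max − x'.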
We summarize Kneedle below. Put simply, knees occur
when a curve becomes more “flat,” indicating a decrease in
curvature. The algorithm works as follows:
1. First we use a smoothing spline to preserve the shape of
the original data set as much as possible, although other
smoothing techniques, such as an exponentially weighted
moving average, could also be used. Let $D_s$ represent the
finite set of x- and y-values that define a smooth curve, i.e.,
one that has been fit to a smoothing spline.
$$D_s = \{(x_{s_i}, y_{s_i}) \in \mathbb{R}^2 \mid x_{s_i}, y_{s_i} \ge 0\}.$$
2. We want our algorithm to function in the same way
regardless of the magnitude of the values in the underlying
data. Thus, we next normalize the points of the smooth
curve to the unit square, as shown in Figure 2(a). This does
not change the shape or trends of the data set:
$$D_{sn} = \{(x_{sn_i}, y_{sn_i})\}, \text{ where}$$
$$x_{sn_i} = (x_{s_i} - \min\{x_s\}) / (\max\{x_s\} - \min\{x_s\}),$$
$$y_{sn_i} = (y_{s_i} - \min\{y_s\}) / (\max\{y_s\} - \min\{y_s\}).$$
3. Next, we let $D_d$ represent the set of differences between
the x- and y-values, i.e., the set of points (x, y − x) as
illustrated in Figure 2(b). The goal is to find out when
the difference curve changes from horizontal to sharply
decreasing, since this indicates the presence of a knee in the
original data set. Note that the actual values of the difference
points are irrelevant. We are only interested in observing the
trends of the difference curve, as seen in Figure 2(c).
$$D_d = \{(x_{d_i}, y_{d_i})\}, \text{ where}$$
$$x_{d_i} = x_{sn_i},$$
$$y_{d_i} = y_{sn_i} - x_{sn_i}.$$

Fig. 3: Kneedle, Menger, Angle-based, and EWMA for synthetic data set. Maximum curvature occurs at x = 60.
Fig. 4: Measured offline F-Score of knee detection algorithms (Kneedle, Menger, Angle-based) using NoisyGaussian data.
Fig. 5: Histogram showing measured offline distances (numbers of x-values) to “correct” knees for Kneedle, Menger, and Angle-based.
4. To find the knee points in the normalized curve, i.e., the
places where the curve flattens out, we calculate the local
maxima of the difference curve. These points indicate the
instances where the rate of increase of y begins to decrease.
Each of these local maximum points is a candidate knee
point in the original data curve:
$$D_{lmx} = \{(x_{lmx_i}, y_{lmx_i})\}, \text{ where}$$
$$x_{lmx_i} = x_{d_i},$$
$$y_{lmx_i} = y_{d_i} \mid y_{d_{i-1}} < y_{d_i},\ y_{d_{i+1}} < y_{d_i}.$$
5. For each local maximum $(x_{lmx_i}, y_{lmx_i})$ in the difference
curve, we define a unique threshold value, $T_{lmx_i}$, that is
based on the average difference between consecutive x-values
and a sensitivity parameter, S. The sensitivity parameter
allows us to adjust how aggressive we want Kneedle
to be when detecting knees. Smaller values for S detect
knees more quickly, while larger values are more conservative.
Put simply, S is a measure of how many “flat” points we
expect to see in the unmodified data curve before declaring
a knee. We explore the choice of S in Section IV. In
Figure 2(c), the threshold line is plotted with S = 1.
$$T_{lmx_i} = y_{lmx_i} - S \cdot \frac{\sum_{i=1}^{n-1} \left(x_{sn_{i+1}} - x_{sn_i}\right)}{n - 1}$$
6. If any difference value $(x_{d_j}, y_{d_j})$, where j > i, drops
below the threshold $y = T_{lmx_i}$ for $(x_{lmx_i}, y_{lmx_i})$ before the
next local maximum in the difference curve is reached,
Kneedle declares a knee at the x-value of the corresponding
local maximum $x = x_{lmx_i}$. If the difference values reach
a local minimum and start to increase before $y = T_{lmx_i}$
is reached, we reset the threshold value to 0 and wait for
another local maximum to be reached.
Note that Kneedle can be run offline or online. In the online
case, Kneedle can “correct” old knee values if necessary as
points are received. Kneedle’s online run time for any given
n pairs of x- and y-values is bounded by $\sum_{i=1}^{n} i = O(n^2)$.
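Steps 2 through 6 can be sketched offline in Python as shown below. This is a simplified illustration and not the authors' implementation: the function name is hypothetical, the input is assumed to be already smoothed (step 1 is skipped), sorted by increasing x, and of negative concavity, and the threshold reset of step 6 is modelled by discarding the pending candidate.

```python
def kneedle(xs, ys, sensitivity=1.0):
    """Return the x-values of detected knees in pre-smoothed, x-sorted data."""
    n = len(xs)
    if n < 3:
        return []
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    if x_hi == x_lo or y_hi == y_lo:
        return []
    # Step 2: normalize to the unit square.
    xn = [(x - x_lo) / (x_hi - x_lo) for x in xs]
    yn = [(y - y_lo) / (y_hi - y_lo) for y in ys]
    # Step 3: difference curve y - x.
    yd = [y - x for x, y in zip(xn, yn)]
    # Step 5: the threshold sits S times the mean x-spacing below each maximum.
    drop = sensitivity * sum(xn[i + 1] - xn[i] for i in range(n - 1)) / (n - 1)
    knees = []
    candidate = None  # index of the local maximum awaiting confirmation
    threshold = 0.0
    for j in range(1, n):
        # Step 6: a later difference value falls below the candidate's threshold.
        if candidate is not None and yd[j] < threshold:
            knees.append(xs[candidate])
            candidate = None
        if j < n - 1:
            if yd[j - 1] < yd[j] and yd[j + 1] < yd[j]:
                # Step 4: a local maximum becomes the new candidate knee.
                candidate, threshold = j, yd[j] - drop
            elif yd[j - 1] > yd[j] and yd[j + 1] > yd[j]:
                # Step 6: a local minimum discards the pending candidate.
                candidate = None
    return knees
```

For ten evenly spaced samples of y = −1/x + 5 on x = 0.1, ..., 1.0 with S = 1, this sketch reports a single knee at x = 0.3, which is approximately 0.22 in normalized coordinates. An online variant would rerun the procedure as each point arrives, which is where the quadratic bound above comes from.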
IV. EVALUATING KNEEDLE
We compare the performance of Kneedle to the offline
(Angle-based, Menger) and online (EWMA) algorithms sep-
arately, since their goals are different. In offline settings, our
aim is to determine a base-line accuracy for each algorithm
using synthetic data sets drawn from continuous functions
where the true knees are well-known. After showing that
Kneedle closely approximates the true knees, we then compare
its online behavior against EWMA to evaluate how quickly it
is able to detect knees once they “appear” in the data.
A. Detecting Knees in Synthetic Data Sets
To evaluate Kneedle, we developed a synthetic data source
which we call NoisyGaussian that yields data similar to many
of the real data sets of interest, but allows us to vary the overall
shape of the curve. To generate a NoisyGaussian, we start
with a Gaussian function with a randomly selected standard
deviation and mean. Then we generate the NoisyGaussian
data set using the cumulative count of the randomly generated
points whose value is less than x. The resulting curve is similar
to a Gaussian cumulative distribution function in overall shape.
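One way to read the NoisyGaussian construction is sketched below in Python. The number of random draws, the integer x grid from 0 to 100, the seed handling, and the function name are all assumptions chosen for illustration; the description above does not state them.

```python
import random


def noisy_gaussian(mu, sigma, samples=1000, x_values=range(0, 101), seed=None):
    """Return (xs, ys) where ys[i] counts the random draws with value below xs[i]."""
    rng = random.Random(seed)
    draws = sorted(rng.gauss(mu, sigma) for _ in range(samples))
    xs = sorted(x_values)
    ys = []
    count = 0
    for x in xs:
        # Advance past every draw strictly less than x (draws are sorted).
        while count < samples and draws[count] < x:
            count += 1
        ys.append(count)
    return xs, ys
```

Because the counts come from random draws, the resulting curve follows the shape of a scaled Gaussian CDF while retaining sampling noise.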
The benefit of evaluating the knee detection algorithms
using NoisyGaussian is that an approximate closed-form
solution exists for the point of maximum curvature. We derive
the point of maximum curvature by computing it for the
underlying Gaussian CDF in terms of standard deviation σ
and mean µ. Although we omit the details for brevity, the
point of maximum curvature is approximately x ≈ µ + σ with
a small bounded error. We use this closed-form expression to
represent the “correct” knee in our evaluation.
To illustrate the general behavior of each knee detector, we
plot the knees each algorithm detects in Figure 3 for a sample
NoisyGaussian data set with µ = 50 and σ = 10.
B. Offline Accuracy
To evaluate offline accuracy, we use three common statisti-
cal metrics: precision, recall, and F-Score. Precision measures
the correctness of each knee an algorithm detects. A low preci-
sion value indicates the presence of numerous false positives,
where a false positive is any detected knee that does not
align with maximum curvature. Recall measures completeness
by quantifying the percentage of correct knees an algorithm
detects out of the total number of correct knees. Note, however,
that recall does not penalize for incorrect detections. Our third
metric, F-Score, is the harmonic mean of precision and recall.
Since an ideal knee detection algorithm has both high recall
and high precision, we use F-Score to capture both measures
of accuracy in a single value. An F-Score value of 1 is best.
Fig. 6: Online detection latency for Kneedle and EWMA. Negative values indicate early detections.
Fig. 7: Measured offline F-Scores for varying sensitivity values in Kneedle.
Fig. 8: Measured online F-Scores for varying sensitivity values in Kneedle (legend: sensitivity 0.001, 1.0, 5.0).
To evaluate our algorithms, we generate 10,000 Noisy-
Gaussian data sets. Since none of the algorithms detect knees
at exactly the point of maximum curvature, we vary how
many data points we allow for error. For example, suppose
our data set includes points at x = 1, 2, 3, 4, 5, and the point
of maximum curvature is x = 4. With an allowable error of
1, we declare the algorithm as finding a “correct” knee if it
detects a knee at x = 3, 4, or 5. Figure 4 shows that Kneedle’s
F-Score is higher than that of either the Angle-based or Menger algorithm.
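The three metrics, together with the allowable error, can be sketched as follows. The function name is hypothetical, and the sketch assumes unit-spaced x-values so that a tolerance expressed in data points equals a tolerance in x; empty inputs are scored as 0 by convention.

```python
def knee_scores(detected, true_knees, tolerance=1):
    """Return (precision, recall, f_score) for detected versus true knee positions."""
    # A detection is correct if it lies within `tolerance` of some true knee.
    correct = [d for d in detected
               if any(abs(d - t) <= tolerance for t in true_knees)]
    # A true knee is found if some detection lies within `tolerance` of it.
    found = [t for t in true_knees
             if any(abs(d - t) <= tolerance for d in detected)]
    precision = len(correct) / len(detected) if detected else 0.0
    recall = len(found) / len(true_knees) if true_knees else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F-Score is the harmonic mean of precision and recall.
    return precision, recall, 2.0 * precision * recall / (precision + recall)
```

With a true knee at x = 4 and an allowable error of 1, a single detection at x = 3 scores 1 on all three metrics, while adding a spurious detection at x = 1 halves the precision and leaves recall unchanged.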
Using the closed-form approximation for the point of maximum
curvature in our NoisyGaussian data sets, we can
identify “true” knees in the data. This allows us to quantify
the accuracy of each algorithm by measuring the distance, in
terms of the number of x-values, between the true knees and
the detected knees. Figure 5 shows the results as a histogram.
Kneedle approximates the point of maximum curvature
much more closely than either Menger or Angle-based, since
its histogram’s density is highest between 0 and 25, while
Menger and Angle-based show a wider variation.
C. Online Detection Latency
In this section, we evaluate detection latency—the number
of data points beyond the knee required for detection—for
both EWMA and Kneedle. For online Kneedle, we execute
the knee detection algorithm after receiving each new data
point, in order of increasing x. For both EWMA and Kneedle,
we compute the detection latency as the number of data points
between when the algorithm detects a knee and the actual knee
point as determined by the point of maximum curvature. For
example, suppose the data set has points at x = 1, 2, 3, 4, and
5, with a true knee at x = 3. Now suppose that after receiving
the point at x = 5, the knee detection algorithm detects a
knee. In this case, we compute the latency as 5 − 3 = 2.
In Figure 6 we plot a histogram of the detection latency for
EWMA and Kneedle with S = 1. The experiment highlights
the fact that Kneedle rarely has a significant detection latency,
while EWMA often has high detection latencies.
D. Sensitivity
To better understand the importance of sensitivity, S, to
Kneedle’s performance, we again use F-Score. Figures 7 and 8
show the results of our sensitivity analysis in offline and online
settings respectively. In both graphs, we compute Kneedle’s F-
Score using a wide range of sensitivity values. We compare
the F-Score from 10,000 data sets for each value of S. In the
offline graph, we use the points of maximum curvature as the
true knees, and compute the F-Score based on those values. In
the online graph, our goal is to determine how quickly Kneedle
approaches the offline case, and thus we use the knees detected
by offline Kneedle as the correct knees. Not surprisingly, in
offline settings where Kneedle has perfect information, the
highest F-Score occurs when S = 0. In online settings, the
results vary depending on the number of points received, but
overall S = 1 has the best results.
V. APPLICATION RESULTS
This section demonstrates Kneedle’s usefulness in real ap-
plications. First, we identify knees in a data set from prior
work, and show that we find close to the same knees that
the authors found with system-specific techniques. Next we
evaluate Kneedle’s performance for two sample applications: a
MapReduce-like system and a TCP-friendly network protocol.
A. Using Kneedle in Existing Applications
Figure 9 applies knees to object replication, where the knees
represent the optimal degrees of replication for high avail-
ability given various object distributions (data from Figure 5
in [5]). The application requires the detection of multiple knees
in object popularity curves, each of which has considerable
noise. Unlike other knee detection algorithms, such as Menger,
Kneedle is capable of detecting multiple knees, where the
sensitivity of this detection depends on the selected value of
S. Note that we consider this knee detection application to
be offline, since Zhong et al. observe: “[w]e expect that the
replica adjustment overhead due to object request popularity
changes would not be excessive in practice...our analysis of
real system object request traces in Section 3.2 suggests that
the popularities of most data objects tend to remain stable
over multi-week periods.” The knees found by Kneedle in this
graph concur with those identified by the original authors.

References
MapReduce: simplified data processing on large clusters
A density-based algorithm for discovering clusters in large spatial databases with noise
Detection of abrupt changes: theory and application